Maximum-Margin Matrix Factorization

Assume Y is an n by m matrix containing n users' preferences about m movies, such that <math>y_{ij} = +1</math> if user i likes movie j, and <math>y_{ij} = -1</math> if user i dislikes it. Due to lack of knowledge about the users' opinions, Y is only partially observed: some of its cells are <math>\pm 1</math> and the other cells are unknown. The main goal is to find a matrix X that preserves the knowledge in Y and predicts the values of its unknown cells.
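As a toy illustration, such a partially observed Y might look like the following sketch (storing unknown cells as 0 and keeping a boolean mask is only a convenience for these examples, not part of the original formulation):

<syntaxhighlight lang="python">
import numpy as np

# Toy preference matrix: +1 = likes, -1 = dislikes, 0 marks an unknown cell
# (the zero-fill convention for unknowns is illustrative, not from the paper).
Y = np.array([[ 1, -1,  0],
              [ 0,  1, -1],
              [ 1,  0,  1]])
observed = (Y != 0)   # boolean mask of the known cells
</syntaxhighlight>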


Predicting the unknown values in this problem is possible because the rows and columns of Y are assumed to be related to each other. One can express this ''relation'' through the rank of X; in other words, the rank of X indicates the number of ''features'' affecting the values in Y. Therefore, minimizing the number of features, i.e. the rank of X, amounts to finding a ''simple relation'' in the given knowledge of Y. In addition, keeping the number of features low is one way to avoid over-fitting in the prediction process.


In the first step, assume we have Y with no unknowns, and the goal is to find X as a simple (low-rank) representation of Y. By choosing the sum-squared error (SSE) loss function, one can easily use the SVD technique to minimize the loss and find X. Although this solution is suitable for lowering the rank, it does not work when Y has unknowns (the resulting problem has several local minima). In addition, one may consider loss functions other than SSE (although SSE is ''convex'' and behaves nicely in optimization problems). For example, the '''Hinge loss''' works well in this problem, especially considering that the values in Y are restricted to be <math>\pm 1</math>. The hinge loss here is defined as:
 
Hinge Loss <math>(X \mid Y) = \displaystyle \sum_{(i,j) \in S} \max (0, 1 - x_{ij} \cdot y_{ij})</math>,
where <math>S = \{(i,j) \mid y_{ij} \text{ is known}\}</math>.
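
A minimal numpy sketch of this observed-entry hinge loss, reusing the mask convention from the toy example above (the function name is only illustrative):

<syntaxhighlight lang="python">
import numpy as np

def hinge_loss(X, Y, observed):
    """Sum of hinge losses over the observed entries of Y.

    X, Y: n-by-m arrays; observed: boolean mask of the known +/-1 entries of Y.
    """
    margins = 1.0 - X[observed] * Y[observed]   # 1 - x_ij * y_ij on the known cells
    return np.maximum(0.0, margins).sum()
</syntaxhighlight>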
 
Another way to tackle this problem is to use '''Matrix Factorization'''. In matrix factorization one tries to find a prediction X = UV' such that the rank of each factor (U and V) is low. In addition, in this problem U and V are meaningful: the i<sup>th</sup> row of U indicates the importance (weights) of the features for the i<sup>th</sup> user, and the j<sup>th</sup> row of V is the characterization of the j<sup>th</sup> movie based on its features. Here, lowering the rank of X is equivalent to lowering the rank of its factors, but unfortunately rank optimization is not ''convex''.
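
A small sketch of how such a factorization produces predictions (the sizes and random factors below are purely illustrative):

<syntaxhighlight lang="python">
import numpy as np

n, m, k = 100, 50, 5            # users, movies, number of features (illustrative)
rng = np.random.default_rng(0)
U = rng.normal(size=(n, k))     # row i: feature weights of user i
V = rng.normal(size=(m, k))     # row j: feature description of movie j
X = U @ V.T                     # prediction matrix, rank at most k
# the (i, j) prediction is the dot product of user i's weights and movie j's features
predictions = np.sign(X)
</syntaxhighlight>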
 
Instead of using the rank of each factor, the '''Frobenius Norm''' of each factor can be used; in fact, this norm is closely related to the rank of the matrix. Moreover, in this way the optimization problem becomes convex, and one can benefit from common optimization techniques. The Frobenius norm is defined as:
 
<math>\|X\|_{F}^2 = \sum_{i,j} x_{ij}^2 = \operatorname{Tr}(XX^T) = \sum_i \lambda_i^2</math>,
where <math>\operatorname{Tr}</math> denotes the trace and the <math>\lambda_i</math> are the singular values of X.
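
These three expressions can be checked numerically with a quick sketch:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))

entry_form = (X ** 2).sum()                                 # sum of squared entries
trace_form = np.trace(X @ X.T)                              # Tr(X X^T)
sv_form = (np.linalg.svd(X, compute_uv=False) ** 2).sum()   # sum of squared singular values

print(np.allclose(entry_form, trace_form), np.allclose(entry_form, sv_form))  # True True
</syntaxhighlight>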
 
Assume one of the factors is fixed and we want to predict the other one. This is the familiar problem of linear prediction with an SVM: if U is fixed, predicting each column of X amounts to solving an SVM whose inputs are the rows of U, in order to find the corresponding row of V. Recall that in an SVM, to maximize the margin we minimize the norm of the linear separator <math>\beta</math>. Therefore, predicting Y with the maximum margin leads to minimizing the norm of the factor V. In collaborative filtering, however, we are interested in predicting both U and V together; still, at each step one can imagine one factor fixed, and minimizing the norm of the other one gives the maximum margin for that prediction.
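
This alternating view can be sketched with simple subgradient steps on the hinge loss plus Frobenius-norm penalties, updating one factor while the other is held fixed. The step size, penalty weight, rank, and the zero-fill convention for unknown entries of Y are assumptions of this sketch, not details from the paper:

<syntaxhighlight lang="python">
import numpy as np

def alternating_mmmf(Y, k=5, lam=0.1, steps=500, lr=0.01, seed=0):
    """Hedged sketch: alternate subgradient updates of U (with V fixed)
    and V (with U fixed) on hinge loss + Frobenius-norm penalties.
    Y holds +1/-1 on known cells and 0 on unknown cells."""
    n, m = Y.shape
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n, k))
    V = rng.normal(scale=0.1, size=(m, k))
    for _ in range(steps):
        # subgradient of sum max(0, 1 - x_ij y_ij) over known cells, w.r.t. X = U V^T;
        # unknown cells contribute nothing because Y == 0 there
        G = -Y * ((1.0 - Y * (U @ V.T)) > 0)
        U -= lr * (G @ V + lam * U)    # update U, V held fixed
        G = -Y * ((1.0 - Y * (U @ V.T)) > 0)
        V -= lr * (G.T @ U + lam * V)  # update V, U held fixed
    return U, V

# usage: U, V = alternating_mmmf(Y); predictions = np.sign(U @ V.T)
</syntaxhighlight>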
