Maximum-Margin Matrix Factorization
Revision as of 14:32, 18 July 2009
Problem Definition
Assume Y is an n-by-m matrix containing n users' preferences about m movies, such that [math]\displaystyle{ y_{ij} = +1 }[/math] if user i likes movie j, and [math]\displaystyle{ y_{ij} = -1 }[/math] if he dislikes it. Due to lack of knowledge about the users' opinions, Y is only partially observed: some cells contain [math]\displaystyle{ \pm1 }[/math] and the other cells are unknown. The main goal is to find a matrix X that preserves the knowledge in Y and predicts the values of its unknown cells.
Predicting the unknown values in this problem is possible because the rows and columns of Y are assumed to be related to each other. One can express this relation through the rank of X; in other words, the rank of X indicates the number of features affecting the values in Y. Therefore, minimizing the number of features, i.e. the rank of X, amounts to finding a simple relation in the given knowledge of Y. In addition, keeping the number of features low is one way to avoid over-fitting in the prediction process.
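As a minimal illustration of why relatedness makes completion possible, the toy matrix below (a hypothetical example, with 0 used here as a placeholder for unknown cells) admits a rank-1 completion that is fully determined by the known entries:

```python
import numpy as np

# A toy 3x4 preference matrix: +1 like, -1 dislike, 0 for an unknown cell
# (0 is just an encoding chosen for this sketch; unknowns are simply "missing").
Y = np.array([[ 1, -1,  1,  0],
              [ 0,  1, -1, -1],
              [ 1, -1,  0,  1]])

# If preferences are driven by a single feature (rank 1), the known entries
# already pin down the unknowns: users 0 and 2 agree, user 1 is their opposite.
u = np.array([ 1, -1,  1])      # per-user sign of the feature weight
v = np.array([ 1, -1,  1,  1])  # per-movie sign of the feature
X = np.outer(u, v)              # a rank-1 completion of Y
assert (X[Y != 0] == Y[Y != 0]).all()   # X reproduces every known entry
```

The unknown cells of Y are then read off from the corresponding cells of X.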
Hinge Loss Function
If Y has been fully observed and there are no unknowns, the goal is to find X as a simple (low-rank) representation of Y. By choosing the sum-squared error (SSE) loss function, one can easily use the SVD technique to minimize the loss and find X. Although this solution is suitable for lowering the rank, it does not work when Y contains unknowns (the resulting problem has several local minima). In addition, one may consider other loss functions instead of SSE (although SSE is convex and well behaved in optimization problems). For example, the Hinge loss works well in this problem, especially considering that the values in Y are restricted to [math]\displaystyle{ \pm1 }[/math]. The Hinge loss here is defined as:
Hinge Loss [math]\displaystyle{ (X | Y) = \displaystyle \sum_{i,j \in S} \max (0, 1 - x_{ij} \cdot y_{ij}) }[/math] (where [math]\displaystyle{ S = \{(i, j) | y_{ij} }[/math] is known[math]\displaystyle{ \} }[/math])
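The definition above translates directly into code. The sketch below (with a hypothetical X and Y, and 0 encoding an unknown cell) sums the hinge penalty over the observed set S only:

```python
import numpy as np

# Observed ratings: +1 like, -1 dislike, 0 marks an unknown cell
# (a hypothetical encoding; the definition only involves the known set S).
Y = np.array([[ 1, -1,  0],
              [ 0,  1,  1],
              [-1,  0,  1]])

def hinge_loss(X, Y):
    """Sum of max(0, 1 - x_ij * y_ij) over the observed entries of Y."""
    observed = Y != 0                       # the set S
    margins = 1 - X[observed] * Y[observed]
    return np.maximum(0.0, margins).sum()

# A candidate prediction: entries with the right sign and magnitude >= 1
# incur no penalty; correct-sign entries inside the margin incur a small one.
X = np.array([[ 2.0, -0.5,  0.3],
              [ 0.1,  1.2,  0.9],
              [-1.0,  0.4,  2.0]])
print(hinge_loss(X, Y))
```

Note that unlike SSE, the hinge loss does not penalize predictions that exceed the margin (e.g. [math]\displaystyle{ x_{ij} = 2, y_{ij} = 1 }[/math] costs nothing).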
Prediction Using Matrix Factorization
Another way to tackle this problem is matrix factorization. In matrix factorization one tries to find a prediction [math]\displaystyle{ X = UV^T }[/math] such that the rank of each factor (U and V) is low. In addition, in this problem U and V are meaningful: the ith row of U gives the importance (weights) of the features for the ith user, and the jth row of V characterizes the jth movie in terms of those features. Here, lowering the rank of X is equivalent to lowering the rank of its factors; unfortunately, the main problem is that rank optimization is not convex.
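The factorization structure can be sketched as follows (dimensions and the rank bound are the only claims; the random factors are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 4, 3, 2                 # 4 users, 3 movies, k = 2 latent features
U = rng.standard_normal((n, k))   # row i: feature weights for user i
V = rng.standard_normal((m, k))   # row j: feature description of movie j
X = U @ V.T                       # predicted preference matrix

# rank(X) is at most k, however large n and m are
assert np.linalg.matrix_rank(X) <= k
Y_pred = np.sign(X)               # threshold to +/-1 like/dislike labels
```

Choosing k small caps the rank of X directly; the difficulty is that minimizing rank (or fixing k) as part of an optimization is a non-convex constraint.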
Frobenius Norm
Instead of the rank of each factor, the Frobenius norm of each factor can be used; in fact, this norm is closely related to the rank of a matrix. Moreover, in this way the optimization problem will be convex, and one can benefit from common optimization techniques. The Frobenius norm is defined as:
[math]\displaystyle{ \|X\|_{F}^2 = \sum x_{ij}^2 = Tr(XX^T) = \sum \lambda_i^2 }[/math] (where [math]\displaystyle{ Tr() }[/math] is Trace Function and [math]\displaystyle{ \lambda }[/math]s are singular values of X)
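The three expressions in this identity can be checked numerically on an arbitrary matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3))

frob_sq_entries = (X ** 2).sum()        # sum of x_ij^2
frob_sq_trace = np.trace(X @ X.T)       # Tr(X X^T)
# sum of lambda_i^2 over the singular values of X
frob_sq_singular = (np.linalg.svd(X, compute_uv=False) ** 2).sum()

assert np.allclose(frob_sq_entries, frob_sq_trace)
assert np.allclose(frob_sq_entries, frob_sq_singular)
```

The last form explains the connection to rank: the rank is the number of nonzero singular values, while the Frobenius norm (like the trace norm) aggregates their magnitudes, giving a convex surrogate.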
Notion of Margin in Collaborative Filtering
Assume one of the factors is fixed and the goal is to predict the other. This is the famous problem of linear prediction with an SVM; assume U is fixed, so predicting each column of X amounts to solving an SVM whose inputs are the rows of U, yielding the corresponding row of V. Recall that in an SVM, to maximize the margin, the squared norm of the linear separator [math]\displaystyle{ \|\beta\|^2 }[/math] should be minimized. Therefore, predicting Y with the maximum margin leads to minimizing the norm of the factor V, i.e. [math]\displaystyle{ \|V\|_{F}^2 }[/math]. However, the problem here is to predict both U and V together (this is called collaborative filtering); at each step one factor can be imagined fixed, and minimizing the norm of the other one gives the maximum margin for that prediction.
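The combined objective, norm regularization plus hinge loss on the observed entries, can be sketched with plain subgradient descent. This is only a toy sketch of the objective (the function name, step size, and trade-off constant C are choices made here, and the original formulation is solved differently), but it shows the two forces at work: the norms of U and V are shrunk while violated margins push predictions back past [math]\displaystyle{ \pm1 }[/math]:

```python
import numpy as np

def mmmf_sketch(Y, k=2, C=1.0, lr=0.05, iters=500, seed=0):
    """Subgradient descent on  (||U||_F^2 + ||V||_F^2)/2 + C * hinge(UV^T | Y).

    A toy illustration of the max-margin factorization objective,
    not the formulation used in practice.
    """
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    S = Y != 0                               # observed entries (0 = unknown)
    for _ in range(iters):
        X = U @ V.T
        # Subgradient of the hinge term: -y_ij wherever the margin is violated
        G = np.where(S & (Y * X < 1), -Y, 0.0)
        gU = U + C * G @ V                   # regularizer pull + hinge push
        gV = V + C * G.T @ U
        U -= lr * gU
        V -= lr * gV
    return U, V

Y = np.array([[ 1, -1,  0],
              [ 1, -1,  1],
              [-1,  1, -1],
              [ 0,  1, -1]])
U, V = mmmf_sketch(Y, k=2)
X = U @ V.T
S = Y != 0
print((np.sign(X)[S] == Y[S]).mean())   # fraction of observed signs recovered
```

Fixing U makes the objective in V exactly a set of independent SVM problems (one per column of X), which is the alternating view described above.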