Maximum-Margin Matrix Factorization
Problem Definition
Assume Y is an n-by-m matrix containing the preferences of n users about m movies, such that [math]\displaystyle{ y_{ij} = +1 }[/math] if user i likes movie j, and [math]\displaystyle{ y_{ij} = -1 }[/math] if the user dislikes it. Due to the lack of knowledge about the users' opinions, Y is only partially observed: some cells contain [math]\displaystyle{ \pm1 }[/math] and the remaining cells are unknown. The main goal is to find a matrix X that preserves the knowledge in Y and predicts the values of its unknown cells.
Predicting the unknown values is possible because the rows and columns of Y are assumed to be related to each other. This relation can be expressed through the rank of X; in other words, the rank of X indicates the number of features affecting the values in Y. Therefore, minimizing the number of features, i.e. the rank of X, amounts to finding a simple relation in the given knowledge of Y. In addition, keeping the number of features low is one way to avoid over-fitting in the prediction process.
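To make the setup concrete, the following is a minimal NumPy sketch; the matrix values, the use of np.nan to mark unknown cells, and the names Y and S are illustrative assumptions, not part of the original formulation.

import numpy as np

# A tiny 4-users-by-5-movies preference matrix: +1 = like, -1 = dislike,
# and np.nan marks cells whose value is unknown (illustrative values).
Y = np.array([[ 1.,    -1.,     np.nan,  1.,     np.nan],
              [np.nan,  1.,     1.,     -1.,     1.    ],
              [-1.,     np.nan, -1.,     np.nan, 1.    ],
              [ 1.,     1.,     np.nan, -1.,    -1.    ]])

# S = {(i, j) | y_ij is known}
S = [(i, j) for i in range(Y.shape[0]) for j in range(Y.shape[1])
     if not np.isnan(Y[i, j])]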
Hinge Loss Function
If Y has been fully observed and there are no unknowns, the goal is to find X as a simple (low-rank) representation of Y. With the sum-squared error (SSE) loss function, one can easily use the SVD to minimize the loss and find X. Although this solution is suitable for lowering the rank, it does not work when Y contains unknowns (the resulting problem has multiple local minima). In addition, one may consider loss functions other than SSE (although SSE is convex and behaves nicely in optimization problems). For example, the Hinge loss works well in this problem, especially since the values in Y are restricted to be [math]\displaystyle{ \pm1 }[/math]. The hinge loss here is defined as:
[math]\displaystyle{ Hinge(X | Y) = \displaystyle \sum_{(i,j) \in S} \max (0, 1 - x_{ij} \cdot y_{ij}) }[/math] (where [math]\displaystyle{ S = \{(i, j) | y_{ij} }[/math] is known[math]\displaystyle{ \} }[/math])
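As a sanity check, the hinge loss can be computed directly over the observed entries; a minimal sketch, reusing the Y and S assumed in the earlier snippet (the helper name hinge_loss is illustrative):

def hinge_loss(X, Y, S):
    # Sum of max(0, 1 - x_ij * y_ij) over the observed index set S:
    # entries of X that agree in sign with Y with margin >= 1 contribute zero,
    # wrong-sign or small-margin entries are penalized linearly.
    return sum(max(0.0, 1.0 - X[i, j] * Y[i, j]) for (i, j) in S)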
Prediction Using Matrix Factorization
Another way to tackle this problem is matrix factorization. In matrix factorization one tries to find a prediction [math]\displaystyle{ X = UV^T }[/math] such that the rank of each factor (U and V) is low. In addition, in this problem U and V are meaningful: the ith row of U indicates the importance (weights) of the features for the ith user, and the jth row of V is the characterization of the jth movie based on its features. Here, lowering the rank of X amounts to lowering the rank of its factors, but unfortunately rank optimization is not convex.
Frobenius Norm
Instead of the rank of each factor, the Frobenius norm of each factor can be used; in fact, this norm is closely related to the rank of the matrix. Moreover, in this way the optimization problem becomes convex, and one can benefit from common optimization techniques. The Frobenius norm is defined as:
[math]\displaystyle{ \|X\|_{F}^2 = \sum x_{ij}^2 = Tr(XX^T) = \sum \lambda_i^2 }[/math] (where [math]\displaystyle{ Tr() }[/math] is Trace Function and [math]\displaystyle{ \lambda }[/math]s are singular values of X)
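The three expressions above are easy to confirm numerically; a small sketch on a random matrix (illustrative only):

import numpy as np

X = np.random.randn(4, 6)
sum_sq   = np.sum(X ** 2)                      # sum of squared entries
trace_sq = np.trace(X @ X.T)                   # Tr(X X^T)
sv       = np.linalg.svd(X, compute_uv=False)  # singular values of X
print(np.isclose(sum_sq, trace_sq), np.isclose(sum_sq, np.sum(sv ** 2)))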
Notion of Margin in Collaborative Filtering
Assume one of the factors is fixed and the goal is to predict the other factor. This is the familiar problem of linear prediction, i.e. an SVM: if U is fixed, predicting each column of X amounts to solving an SVM over U to find a row of V. Recall that in an SVM, to maximize the margin, the squared norm of the linear separator [math]\displaystyle{ \|\beta\|^2 }[/math] is minimized. Therefore, predicting Y with maximum margin leads to minimizing the norm of the factor V, i.e. [math]\displaystyle{ \|V\|_{F}^2 }[/math]. The problem here, however, is to predict both U and V together (which is called collaborative filtering); at each step one factor can be regarded as fixed, and minimizing the norm of the other one then gives the maximum margin for that prediction.
The Optimization Problem and Trace Norm
So far it has been shown that the optimization problem is to find the factors with minimum norm such that the prediction has a low loss:
[math]\displaystyle{ \displaystyle \min_{X = UV^T} (\|U\|_{F}^2 + \|V\|_{F}^2) + c \cdot Hinge(X|Y) }[/math]
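For a given pair of factors this objective can be evaluated directly; a minimal sketch, assuming the hinge_loss helper and the Y, S from the earlier snippets (the name factored_objective is illustrative):

import numpy as np

def factored_objective(U, V, Y, S, c):
    X = U @ V.T                                                      # the prediction
    reg = np.linalg.norm(U, 'fro')**2 + np.linalg.norm(V, 'fro')**2  # norm penalty on the factors
    return reg + c * hinge_loss(X, Y, S)                             # plus the hinge loss on observed entries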
This optimization problem is difficult because neither the objective function nor the constraint is linear. The following lemma changes the shape of the optimization problem so that it becomes easier to solve:
Lemma 1:
[math]\displaystyle{ \displaystyle \min_{X = UV^T} \frac{1}{2}(\|U\|_{F}^2 + \|V\|_{F}^2) = \min_{X = UV^T} (\|U\|_{F} \cdot \|V\|_{F}) = \|X\|_{T} }[/math]
where [math]\displaystyle{ \|X\|_{T} }[/math] is the Trace Norm of X which is defined as:
[math]\displaystyle{ \|X\|_{T} = \sum |\lambda_i| = Tr(\sqrt{XX^T}) }[/math]
In addition, using the SVD [math]\displaystyle{ X = A \Sigma B^T }[/math], one can see that [math]\displaystyle{ U = A \sqrt{\Sigma} }[/math] and [math]\displaystyle{ V = B\sqrt{\Sigma} }[/math] satisfy this lemma.
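Lemma 1 and the SVD-based construction of U and V can be checked numerically; a minimal sketch on a random matrix (illustrative only):

import numpy as np

X = np.random.randn(5, 7)
A, s, Bt = np.linalg.svd(X, full_matrices=False)   # X = A diag(s) B^T
trace_norm = np.sum(s)                             # ||X||_T = sum of singular values

U = A @ np.diag(np.sqrt(s))                        # U = A sqrt(Sigma)
V = Bt.T @ np.diag(np.sqrt(s))                     # V = B sqrt(Sigma)
assert np.allclose(U @ V.T, X)                     # a valid factorization X = U V^T
half_sum = 0.5 * (np.linalg.norm(U, 'fro')**2 + np.linalg.norm(V, 'fro')**2)
assert np.isclose(half_sum, trace_norm)            # attains the minimum in Lemma 1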
Based on Lemma 1, the optimization problem can be reformulated as:
[math]\displaystyle{ \displaystyle \min_X \|X\|_{T} + c \cdot Hinge(X|Y) }[/math]
Relation Between Rank And Trace Norm
In the literature, the trace norm has been widely used as a surrogate for the rank of a matrix. The next theorem explains the relation between the rank and the trace norm:
Theorem 1:
The convex envelope (smallest convex bounding function) of the rank function, on matrices with unit spectral norm, is the trace norm function.
The spectral norm of a matrix is its largest singular value.
In addition, the following relation between the Frobenius norm, the trace norm, and the rank of a matrix can easily be shown:
[math]\displaystyle{ \forall X: \|X\|_{F} \leq \|X\|_{T} \leq \sqrt{Rank(X)} \cdot \|X\|_{F} }[/math]
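These inequalities can also be verified numerically; a short sketch on a random low-rank matrix (illustrative only):

import numpy as np

X = np.random.randn(6, 3) @ np.random.randn(3, 8)  # a rank-3 matrix
s = np.linalg.svd(X, compute_uv=False)
fro, tr = np.linalg.norm(X, 'fro'), np.sum(s)
rank = np.linalg.matrix_rank(X)
assert fro <= tr + 1e-9                            # ||X||_F <= ||X||_T
assert tr <= np.sqrt(rank) * fro + 1e-9            # ||X||_T <= sqrt(rank(X)) * ||X||_F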
Based on this relation, and the fact that the trace norm is a convex function, it can be shown that:
[math]\displaystyle{ \{X | \|X\|_{T} \leq \alpha\} = conv \{uv^T | u \in R^n , v \in R^m , |u| = |v| = \sqrt{\alpha}\} }[/math]
which shows that the set of matrices with a bounded trace norm is convex, with the rank-one matrices [math]\displaystyle{ uv^T }[/math] lying on its boundary (they are its extreme points). Thus, this optimization problem amounts to minimizing a convex function over a convex set.
Soft Margin Optimization
Just as in an SVM it may happen that no linear separator satisfies all the constraints (the classes are not linearly separable), there exist matrices Y for which no factorization preserves all the knowledge in Y. The same remedy as in the soft-margin SVM can be used here: slack variables are added to the optimization problem:
[math]\displaystyle{ \displaystyle \min_X \|X\|_{T} + c \cdot \sum_{(i,j) \in S} \xi_{ij} }[/math] s.t. [math]\displaystyle{ \forall (i,j) \in S: \xi_{ij} \geq 0 }[/math] and [math]\displaystyle{ x_{ij} \cdot y_{ij} \geq 1 - \xi_{ij} }[/math]
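Because penalizing the slack variables in this way is equivalent to penalizing the hinge loss directly, the whole soft-margin problem can be handed to a generic convex solver. Below is a minimal sketch using CVXPY, assuming it is installed and reusing the Y and S from the earlier snippets; normNuc is CVXPY's nuclear (trace) norm atom, and the function name mmmf is illustrative:

import cvxpy as cp

def mmmf(Y, S, c=1.0):
    n, m = Y.shape
    X = cp.Variable((n, m))
    # Hinge penalty over observed entries (equivalent to the slack formulation above).
    hinge = sum(cp.pos(1 - Y[i, j] * X[i, j]) for (i, j) in S)
    problem = cp.Problem(cp.Minimize(cp.normNuc(X) + c * hinge))
    problem.solve()
    return X.value

Internally the solver handles the trace norm through a semi-definite reformulation, which is essentially what the next lemma makes explicit.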
So far we have a convex optimization problem. To solve it, we should cast it as one of the standard classes of convex optimization problems. The next lemma shows how semi-definite programming (SDP) can be used for this purpose:
Lemma 2:
For any [math]\displaystyle{ X \in R^{n \times m} }[/math] and [math]\displaystyle{ t \in R }[/math]: [math]\displaystyle{ \|X\|_{T} \leq t }[/math] if and only if there exist A and B such that [math]\displaystyle{ M = \begin{bmatrix} A & X \\ X^T & B \end{bmatrix} \succeq 0 }[/math] and [math]\displaystyle{ Tr(M) \leq 2t }[/math].
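The existence direction of this lemma can be illustrated numerically: with the SVD [math]\displaystyle{ X = A \Sigma B^T }[/math], choosing the diagonal blocks [math]\displaystyle{ A \Sigma A^T }[/math] and [math]\displaystyle{ B \Sigma B^T }[/math] gives a feasible M with [math]\displaystyle{ Tr(M) = 2\|X\|_{T} }[/math]. A minimal sketch (the names P and Q stand for the matrices of singular vectors; illustrative only):

import numpy as np

X = np.random.randn(4, 6)
P, s, Qt = np.linalg.svd(X, full_matrices=False)
Q = Qt.T
# M = [[P diag(s) P^T, X], [X^T, Q diag(s) Q^T]] is one feasible witness.
M = np.block([[P @ np.diag(s) @ P.T, X],
              [X.T,                  Q @ np.diag(s) @ Q.T]])
assert np.all(np.linalg.eigvalsh(M) >= -1e-9)   # M is positive semi-definite
assert np.isclose(np.trace(M), 2 * np.sum(s))   # Tr(M) = 2 * ||X||_T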
Therefore, the last optimization problem can be formulated as an SDP:
[math]\displaystyle{ \displaystyle \min \frac{1}{2} Tr(M) + c \cdot \sum_{(i,j) \in S} \xi_{ij} }[/math] s.t. [math]\displaystyle{ M = \begin{bmatrix} A & X \\ X^T & B \end{bmatrix} \succeq 0 }[/math], [math]\displaystyle{ x_{ij} \cdot y_{ij} \geq 1 - \xi_{ij} }[/math] and [math]\displaystyle{ \xi_{ij} \geq 0 }[/math] for all [math]\displaystyle{ (i,j) \in S }[/math]