maximum-Margin Matrix Factorization: Difference between revisions

From statwiki
Jump to navigation Jump to search
Line 1: Line 1:
== Problem Definition ==
== Problem Definition ==


Assume Y as a <math>n \times m</math> matrix containing n user preferences about m movies, such that <math>y_{ij} = +1</math> if user i likes movie j, and <math>y_{ij} = -1</math> if he dislikes it. Due to lack of knowledge about the users’ opinions, Y is partially observable: it has some <math>\pm1</math> and the other cells are unknown. The main goal is to find matrix X such than it preserves the knowledge in Y, and predicts the value of its unknown cells.
Assume Y is an <math>n \times m</math> matrix containing n user preferences about m movies, such that <math>y_{ij} = +1</math> if user i likes movie j, and <math>y_{ij} = -1</math> if he/she dislikes it. Due to the lack of knowledge about the user's opinions, Y is partially observable: it has some <math>\pm1</math> and the other cells are unknown. The main goal is to find matrix X such than it preserves the knowledge in Y, and predicts the value of its unknown cells.


Predicting the unknown values in this problem is possible because the rows and columns of Y are assumed to be related to each other. One can translate this ''relation'' by the rank of X; in other word, the rank of X indicates the number of ''features'' affecting the values in Y. Therefore, minimizing the number of features or the rank of X is equal to finding a ''simple relation'' in the given knowledge of Y. In addition, keeping the number of features low is one way to avoid the problem of over-fitting in the prediction process.  If the rank of X is equal to the rank of Y, this gives a trivial solution of X = Y.
Predicting the unknown values in this problem is possible because the rows and columns of Y are assumed to be related to each other. One can translate this ''relation'' by the rank of X; in other word, the rank of X indicates the number of ''features'' affecting the values in Y. Therefore, minimizing the number of features or the rank of X is equal to finding a ''simple relation'' in the given knowledge of Y. In addition, keeping the number of features low is one way to avoid the problem of over-fitting in the prediction process.  If the rank of X is equal to the rank of Y, this gives a trivial solution of X = Y.

Revision as of 19:36, 31 July 2009

Problem Definition

Assume Y is an [math]\displaystyle{ n \times m }[/math] matrix containing n user preferences about m movies, such that [math]\displaystyle{ y_{ij} = +1 }[/math] if user i likes movie j, and [math]\displaystyle{ y_{ij} = -1 }[/math] if he/she dislikes it. Due to the lack of knowledge about the user's opinions, Y is partially observable: it has some [math]\displaystyle{ \pm1 }[/math] and the other cells are unknown. The main goal is to find matrix X such than it preserves the knowledge in Y, and predicts the value of its unknown cells.

Predicting the unknown values in this problem is possible because the rows and columns of Y are assumed to be related to each other. One can translate this relation by the rank of X; in other word, the rank of X indicates the number of features affecting the values in Y. Therefore, minimizing the number of features or the rank of X is equal to finding a simple relation in the given knowledge of Y. In addition, keeping the number of features low is one way to avoid the problem of over-fitting in the prediction process. If the rank of X is equal to the rank of Y, this gives a trivial solution of X = Y.

Hinge Loss Function

If Y has been fully observed and there is no unknown, the goal is to find X as a simple (low-rank) representation of Y. By choosing the sum-squared error loss function, one can easily use SVD technique to minimize the loss and find X. Despite the fact that this solution is suitable for lowering the rank; it does not work for the case when we have unknowns in Y (due to several local minima problem). In addition one may think about other loss functions instead of SSE (although SSE is convex and has a nice behavior in the optimization problems). For example, in this problem Hinge loss works fine, specially considering the values in Y are restricted to be [math]\displaystyle{ \pm1 }[/math]. Hinge loss here is defined as:

[math]\displaystyle{ \textrm{Hinge}(X | Y) = \displaystyle \sum_{(i,j) \in S} \max (0, 1 - x_{ij} \cdot y_{ij}) }[/math] (where [math]\displaystyle{ \, S = \{(i, j) | y_{ij} }[/math] is known[math]\displaystyle{ \} }[/math])

Prediction Using Matrix Factorization

Another way to tackle this problem is to use Matrix Factorization. In matrix factorization one tries to find a prediction [math]\displaystyle{ X = UV^T }[/math] such that the rank of each factor (U and V) is low. In addition, in this problem U and V are meaningful; the ith row of U indicates the importance (weights) of the features for ith user, and so the ith row of V is the characterization of the ith movie based on its features. Here, lowering the rank of X is equal to lowering the rank of its factors, but unfortunately the main problem here is that, rank optimization is not convex.

Frobenius Norm

Instead of using the rank of each factor, the Frobenius Norm of each factor can be used; in fact, this norm is closely related to the rank of matrix. Moreover, in this way the optimization problem will be convex, and one may get the benefit of using common optimization techniques. The Frobenius norm is defined as:

[math]\displaystyle{ \|X\|_{F}^2 = \sum x_{ij}^2 = Tr(XX^T) = \sum \lambda_i^2 }[/math] (where [math]\displaystyle{ Tr() }[/math] is Trace Function and [math]\displaystyle{ \lambda }[/math]s are singular values of X)

Notion of Margin in Collaborative Filtering

Assume one of the factors is fixed and the goal is to predict the other factor. This is the famous problem of linear prediction or SVM. Assume U is fixed, so predicting each column of X is, in fact, solving a SVM for U to find a row in V. Recall in SVM, to maximize the margin the norm of the linear separator [math]\displaystyle{ \|\beta\|^2 }[/math] should be minimized. Therefore, predicting Y with the maximum margin is equivalent to minimizing the norm of the factor V or [math]\displaystyle{ \|V\|_{F}^2 }[/math]. However, the problem here is to predict both U and V together (which is called collaborative filtering), but at each step, one factor can be thought of as fixed, so minimizing the norm of the other factor gives the maximum margin for this prediction.

The Optimization Problem and Trace Norm

So far it has been shown that the optimization problem is to find the factors with minimum norm such that the prediction has a low loss:

[math]\displaystyle{ \displaystyle \min_{X = UV^T} (\|U\|_{F}^2 + \|V\|_{F}^2) + c \cdot \textrm{Hinge}(X|Y) }[/math]

This optimization problem is difficult because both objective function and constraints are not linear. The following lemma helps to change the shape of the optimization problem in a way it becomes easier to solve:

Lemma 1:
[math]\displaystyle{ \displaystyle \min_{X = UV^T} \frac{1}{2}(\|U\|_{F}^2 + \|V\|_{F}^2) = \displaystyle \min_{X = UV^T} (\|U\|_{F} \cdot \|V\|_{F}) = \|X\|_{T} }[/math]

where [math]\displaystyle{ \|X\|_{T} }[/math] is the Trace Norm of X which is defined as:
[math]\displaystyle{ \|X\|_{T} = \sum |\lambda_i| = Tr(\sqrt{XX^T}) }[/math]
In addition, by using SVD of [math]\displaystyle{ X = A \Sigma B^T }[/math], one can see [math]\displaystyle{ U = A \sqrt{\Sigma} }[/math] and [math]\displaystyle{ V = B\sqrt{\Sigma} }[/math] satisfy this lemma.

Based on the lemma 1 the optimization problem can be reformulated as:

[math]\displaystyle{ \displaystyle \min_X \|X\|_{T} + c \cdot \textrm{Hinge}(X|Y) }[/math]

Relation Between Rank And Trace Norm

In the literature, the notion of trace norm has been widely used instead of dealing with the rank of matrices. The next theorem explains the relation between the rank and the trace norm:

Theorem 1:
The convex envelope (smallest convex bounding function) of the rank function, on matrices with unit spectral norm, is the trace norm function.
Spectral Norm of a matrix is the absolute value of its largest eigenvalue.

In addition it can be easily shown the relation between the norm, trace norm and rank of a matrix:

[math]\displaystyle{ \forall X: \|X\|_{F} \leq \|X\|_{T} \leq \sqrt{Rank(X)} \cdot \|X\|_{F} }[/math]

Based on this relation, and the fact that the trace norm is a convex function, it can be shown:

[math]\displaystyle{ \{X| \|X\|_{F} \leq \alpha\} = conv \{uv^T| u \in \Re^n , v \in \Re^m , |u| = |v| = \alpha\} }[/math]

which shows the set of matrices with a bounded trace norm is Convex, and it seems this set has the lowest rank matrices on its boundary. Thus, this optimization problem is in fact the process of searching in a convex set to optimize a convex function.

Soft Margin Optimization

As it is possible in SVM that no linear separator can be found to satisfy the constraints (classes are not linearly separable), there exists some Y for which no factorization preserves all the knowledge from Y. So the same solution as soft-margin SVM can be used here, and one can add slack variables to the loss function:

[math]\displaystyle{ \displaystyle \min_X \|X\|_{T} + c \cdot \displaystyle \sum_{(i,j) \in S} \xi_{ij} }[/math]
s.t. [math]\displaystyle{ \forall (i,j) \in S: \xi_{ij} \geq 0 }[/math] , [math]\displaystyle{ x_{ij} \cdot y_{ij} \geq 1 - \xi_{ij} }[/math]

Using Semi-Definite Programming

Now this optimization problem is easy to understand, and also it is convex with linear constrains. To solve, it should be reformulated to one of the known convex optimization problems. Next lemma shows how Semi-Definite Programming can be used to solve this optimization:

Lemma 2:
[math]\displaystyle{ \forall X \in \Re^{n \cdot m} }[/math] and [math]\displaystyle{ t \in \Re : \|X\|_{T} \leq t \Longleftrightarrow \exists A, B }[/math] s.t. [math]\displaystyle{ M = \left( \begin{array}{cc} A & X \\ X^T & B \\ \end{array} \right) \succeq 0 }[/math] and [math]\displaystyle{ Tr(M) \leq 2t }[/math]

Therefore the last optimization problem can be formulated as a linear, convex SDP optimization problem:

[math]\displaystyle{ \displaystyle \min_{M \succeq 0} Tr(M) + c \cdot \displaystyle \sum_{(i,j) \in S} \xi_{ij} }[/math]
s.t. [math]\displaystyle{ M = \left( \begin{array}{cc} A & X \\ X^T & B \\ \end{array} \right)\succeq 0 }[/math] and [math]\displaystyle{ \forall (i,j) \in S: \xi_{ij} \geq 0 }[/math] , [math]\displaystyle{ x_{ij} \cdot y_{ij} \geq 1 - \xi_{ij} }[/math]

which has a dual form with a simple structure. Also the prediction can be done using a solution to the dual problem directly.