A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

Introduction

Collaborative filtering (CF) aims to predict the preferences of a set of users for objects such as movies or music, based on the objects these users have previously purchased or rated. The goal is to recommend new objects to the users based on the predicted preferences.

Regularization-based CF methods have recently been attracting attention <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727. AAAI Press, 2003.</ref>. These methods only incorporate the users' preferences, which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. The problem is then to infer the unknown entries of this matrix from the known ones.

To solve the problem, a low-rank matrix is sought that approximates the partially observed preference matrix; this rank constraint acts as a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its trace norm <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336, Cambridge, MA, 2005. MIT Press.</ref> are two approaches to solving the resulting optimization problem.

Abernethy et al. <ref name="self">J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, 10(Mar):803–826, 2009.</ref> present a generalized regularization-based approach to CF which learns a linear operator mapping the space of users to the space of objects.

The advantage and novelty of this approach is its ability to incorporate additional information, such as attributes of the users or objects, by means of kernel methods. These attributes can help in predicting the unknown entries of the preference matrix. The authors also show that several multi-task learning methods are special cases of their general framework.

Learning Compact Operators with Spectral Regularization

Given [math]\displaystyle{ N }[/math] observations [math]\displaystyle{ ( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,\dots,N} }[/math] in [math]\displaystyle{ \mathcal{X}\times \mathcal{Y}\times \mathbb{R} }[/math], where [math]\displaystyle{ t_i }[/math] is the rating of user [math]\displaystyle{ \mathbf{x}_i }[/math] for object [math]\displaystyle{ \mathbf{y}_i }[/math], we intend to find a function [math]\displaystyle{ f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R} }[/math] that can be used to predict the rating of any user [math]\displaystyle{ \mathbf{x} }[/math] for any object [math]\displaystyle{ \mathbf{y} }[/math]. Here [math]\displaystyle{ \mathbf{x}_i }[/math] and [math]\displaystyle{ \mathbf{y}_i }[/math] correspond to the [math]\displaystyle{ i }[/math]-th available rating.

In this paper, learning preference functions [math]\displaystyle{ f(\cdot,\cdot) }[/math] that are bilinear, i.e. of the form [math]\displaystyle{ f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} }[/math] for some compact operator [math]\displaystyle{ F }[/math], is considered. The set of compact operators from [math]\displaystyle{ \mathcal{Y} }[/math] to [math]\displaystyle{ \mathcal{X} }[/math] is denoted by [math]\displaystyle{ \mathcal{B}_0(\mathcal{Y},\mathcal{X}) }[/math].

So, our problem is to answer the following question: "Given a training set of ratings, how may we estimate a good compact operator [math]\displaystyle{ F }[/math] to predict future ratings?" <ref name="self"> </ref>

The operator is defined as the solution to an optimization problem over [math]\displaystyle{ \mathcal{B}_0(\mathcal{Y,X}) }[/math]. The objective function of this optimization problem balances a data-fitting term [math]\displaystyle{ R_N(F) }[/math], the empirical risk, with a regularization term [math]\displaystyle{ \Omega(F) }[/math]:

[math]\displaystyle{ R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} }[/math]

[math]\displaystyle{ l(t',t) }[/math] is a loss function which quantifies the quality of a predicted rating [math]\displaystyle{ t' }[/math] compared to the true value [math]\displaystyle{ t }[/math]. The choice of the loss function depends on the nature of the variable to be predicted and on the problem.
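For instance, two standard choices, consistent with the square and hinge losses mentioned later in this summary, are the square loss for real-valued ratings and the hinge loss for binary preferences [math]\displaystyle{ t \in \{-1,+1\} }[/math]:

[math]\displaystyle{ l(t',t)=(t'-t)^2, \qquad l(t',t)=\max(0,\,1-t\,t'). }[/math]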

Also, the regularization term is defined as a spectral penalty function:

[math]\displaystyle{ \Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))} }[/math]

where the [math]\displaystyle{ s_i }[/math] are non-decreasing penalty functions with [math]\displaystyle{ s_i(0)=0 }[/math], and [math]\displaystyle{ (\sigma_i(F))_{i=1,...,d} }[/math] are the [math]\displaystyle{ d }[/math] singular values of [math]\displaystyle{ F }[/math] in decreasing order.
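For instance, standard penalties recovered by particular choices of [math]\displaystyle{ s_i }[/math] include the rank, the trace norm, and the squared Hilbert-Schmidt (Frobenius) norm:

[math]\displaystyle{ \mathrm{rank}(F)=\sum_i \mathbf{1}(\sigma_i(F)>0), \qquad \|F\|_{*}=\sum_i \sigma_i(F), \qquad \|F\|^2_{\mathrm{HS}}=\sum_i \sigma_i(F)^2. }[/math]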

So, an operator [math]\displaystyle{ \hat{F} }[/math] should be found that solves this optimization problem:

[math]\displaystyle{ \hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) }[/math]

where the parameter [math]\displaystyle{ \lambda \in \mathbb{R}_{+} }[/math] controls the trade-off between fitting and regularization.

The authors show that the above general problem yields a variety of specific algorithms by selecting an appropriate loss function, kernels, and spectral penalty term. The choice of loss function [math]\displaystyle{ l }[/math] defines the empirical risk and depends on the type of data and the final objective of the algorithm. The choice of [math]\displaystyle{ \Omega(F) }[/math] defines the constraint imposed on the operator, e.g. a rank or trace-norm constraint; this choice also affects the efficiency or feasibility of the learning algorithm. The choice of kernels determines the embeddings of the users and objects in their respective spaces, and should be made based on the available data and the problem to be solved. The authors use the Dirac kernel, which compares two users/objects by identity, and attribute kernels, which compare them through their attributes. They also state that the choice of kernel can lead to new methods for problems such as matrix completion, multi-task learning, and pairwise learning.

We assume that each user/object pair is observed at most once, so the data can be considered as an incomplete [math]\displaystyle{ n_{\mathcal{X}} \times n_{\mathcal{Y}} }[/math] matrix, where [math]\displaystyle{ n_{\mathcal{X}} }[/math] and [math]\displaystyle{ n_{\mathcal{Y}} }[/math] are the numbers of groups into which identical data points [math]\displaystyle{ \mathbf{x}_i }[/math] and [math]\displaystyle{ \mathbf{y}_i }[/math] (users and objects) can be organized, respectively. In the matrix completion setting, finding a low-rank approximation of the observed entries is a popular strategy, for which the square loss can be combined with a rank penalty. To overcome the non-convexity of this problem, a convex spectral function can be used instead. For instance, in the case of binary preferences, the hinge loss with a trace-norm penalty can be selected, and the resulting problem can be written as a semi-definite program (maximum margin matrix factorization (MMMF) <ref name="srebro05"></ref>).
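As an illustration of this matrix-completion special case, the following minimal sketch (not the authors' implementation) minimizes the square loss on the observed entries plus a trace-norm penalty by proximal gradient descent, using singular value soft-thresholding as the proximal step; the step size, regularization strength, and toy data are illustrative assumptions.

<pre>
import numpy as np

def svt(A, tau):
    """Singular value soft-thresholding: proximal operator of tau * (trace norm)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def complete_matrix(T, mask, lam=0.1, step=1.0, n_iter=200):
    """Minimize 0.5 * ||mask * (F - T)||_F^2 + lam * ||F||_* over F."""
    F = np.zeros_like(T)
    for _ in range(n_iter):
        grad = mask * (F - T)              # gradient of the square loss on observed entries
        F = svt(F - step * grad, step * lam)
    return F

# toy usage: a rank-1 preference matrix with roughly half of the entries observed
rng = np.random.default_rng(0)
T_true = np.outer(rng.normal(size=20), rng.normal(size=15))
mask = (rng.random(T_true.shape) < 0.5).astype(float)
F_hat = complete_matrix(mask * T_true, mask)
</pre>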

Suppose we have access to attributes only for the objects [math]\displaystyle{ \mathbf{y} }[/math]. For a finite number of users, the goal is then to estimate a separate function [math]\displaystyle{ f_{i}(\mathbf{y}) }[/math] for each user [math]\displaystyle{ i }[/math]. Each estimation is a learning task, so this becomes a multi-task learning problem. In this case, the attribute kernel is used for the objects and the Dirac kernel for the users, and the choice of loss function depends on the exact task being solved. Selecting a rank penalty and replacing it with the trace-norm penalty to achieve convexity results in an approach similar to the one proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we have access to attributes for both users and objects, we can apply the attribute kernels to each. Selecting the Hilbert-Schmidt penalty together with an appropriate loss function (e.g. the hinge loss) recovers classical machine learning algorithms (e.g. the SVM).

There are also cases in which we have access to attributes of users and objects and want to use them to guide predictions, but the attributes are not informative enough on their own. For instance, we may know the gender and age of users, but want predictions that capture the different preferences of users with the same age and gender. We may then use the attribute kernel to exploit the user attributes during inference, but we should also apply the Dirac kernel to distinguish different users with the same attributes. The authors use the following convex combination of the Dirac and attribute kernels <ref name="aber06">J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

[math]\displaystyle{ \left\{ \begin{array}{lr} k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\ k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D \end{array} \right. }[/math]

where [math]\displaystyle{ 0 \leq \eta \leq 1 }[/math] and [math]\displaystyle{ 0 \leq \zeta \leq 1 }[/math].
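A minimal sketch of this kernel combination, assuming precomputed attribute kernel matrices over the users and objects (when each user or object indexes one row/column, the Dirac kernel matrix is simply the identity):

<pre>
import numpy as np

def combined_kernels(Kx_attr, Ky_attr, eta=0.5, zeta=0.5):
    """Convex combinations of attribute and Dirac kernels for users (X) and objects (Y).

    Kx_attr, Ky_attr: attribute kernel matrices (e.g. linear or Gaussian kernels on
    user/object attribute vectors). The Dirac kernel equals 1 only for identical
    users/objects, hence the identity matrix."""
    Kx = eta * Kx_attr + (1.0 - eta) * np.eye(Kx_attr.shape[0])
    Ky = zeta * Ky_attr + (1.0 - zeta) * np.eye(Ky_attr.shape[0])
    return Kx, Ky

# toy usage with linear attribute kernels
rng = np.random.default_rng(0)
user_attrs, object_attrs = rng.normal(size=(30, 3)), rng.normal(size=(40, 5))
Kx, Ky = combined_kernels(user_attrs @ user_attrs.T, object_attrs @ object_attrs.T, eta=0.7, zeta=0.3)
</pre>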

Optimization Problem

Now the question is how to solve the general optimization problem proposed by the authors. The main difficulty is that the optimization problem can be infinite-dimensional. The authors therefore show that the proposed problem can be reduced to a finite-dimensional equivalent, for which algorithms exist.

The authors first show that when a Hilbert-Schmidt norm is used as the penalty function [math]\displaystyle{ \Omega(F) }[/math], the set [math]\displaystyle{ \{F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) \lt \infty \} }[/math] is the set of Hilbert-Schmidt operators, and the optimization problem can be written as:

[math]\displaystyle{ \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes }[/math]

It can be shown by the representer theorem that the solution to this problem lies in the linear span of the training data <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337–404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2111 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>.
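In particular, with the square loss this problem becomes kernel ridge regression in the tensor-product space, whose kernel between two rating pairs is the product [math]\displaystyle{ k_{\mathcal{X}}(\mathbf{x}_i,\mathbf{x}_j)\,k_{\mathcal{Y}}(\mathbf{y}_i,\mathbf{y}_j) }[/math]. A minimal sketch under these assumptions (square loss; precomputed kernel matrices between the training users and between the training objects):

<pre>
import numpy as np

def fit_product_kernel_ridge(Kx_pairs, Ky_pairs, t, lam=0.1):
    """Kernel ridge regression in the tensor-product RKHS.

    Kx_pairs[i, j] = k_X(x_i, x_j) and Ky_pairs[i, j] = k_Y(y_i, y_j) for the N
    observed ratings t. Returns the dual coefficients beta of the representer expansion."""
    N = len(t)
    K = Kx_pairs * Ky_pairs                 # elementwise product kernel on (user, object) pairs
    return np.linalg.solve(K + N * lam * np.eye(N), t)

def predict(beta, kx_new, ky_new):
    """Predict a rating for a new (user, object) pair from its kernel evaluations
    against the N training pairs."""
    return beta @ (kx_new * ky_new)
</pre>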

The authors state a generalized representer theorem which shows that the reformulation of the problem as a finite-dimensional one remains valid for a general spectral penalty function [math]\displaystyle{ \Omega(F) }[/math]. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the finite-dimensional space [math]\displaystyle{ \mathcal{X}_N \otimes \mathcal{Y}_N }[/math] (identified with the set of [math]\displaystyle{ m_{\mathcal{X}} \times m_{\mathcal{Y}} }[/math] matrices). In other words, it suffices to find [math]\displaystyle{ F }[/math] of the form [math]\displaystyle{ F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}} }[/math], where the [math]\displaystyle{ \mathbf{u}_i }[/math] and [math]\displaystyle{ \mathbf{v}_j }[/math] are orthonormal bases of [math]\displaystyle{ \mathcal{X}_N }[/math] and [math]\displaystyle{ \mathcal{Y}_N }[/math], respectively. The coefficients [math]\displaystyle{ \alpha }[/math] can be found by solving the following finite-dimensional optimization problem:

[math]\displaystyle{ \min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) }[/math]

where [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] are the [math]\displaystyle{ N \times m_{\mathcal{X}} }[/math] and [math]\displaystyle{ N \times m_{\mathcal{Y}} }[/math] matrices whose [math]\displaystyle{ i }[/math]-th rows contain the coordinates of [math]\displaystyle{ \mathbf{x}_i }[/math] and [math]\displaystyle{ \mathbf{y}_i }[/math] in the bases [math]\displaystyle{ (\mathbf{u}_k) }[/math] and [math]\displaystyle{ (\mathbf{v}_l) }[/math], so that [math]\displaystyle{ (X\alpha Y^{T})_{ii}=\langle \mathbf{x}_i, F\mathbf{y}_i\rangle_{\mathcal{X}} }[/math].
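A minimal sketch of this finite-dimensional problem under illustrative assumptions (square loss and the squared-Frobenius spectral penalty, so the objective is smooth and plain gradient descent applies); the matrices X and Y are obtained by factoring the kernel matrices between training users and between training objects:

<pre>
import numpy as np

def factor_kernel(K, tol=1e-10):
    """Return X with X @ X.T = K; the rows are coordinates of the training points
    in an orthonormal basis of the span of the data."""
    w, U = np.linalg.eigh(K)
    keep = w > tol
    return U[:, keep] * np.sqrt(w[keep])

def fit_alpha(Kx_pairs, Ky_pairs, t, lam=0.1, step=0.01, n_iter=500):
    """Gradient descent on (1/N) sum_i ((X a Y^T)_{ii} - t_i)^2 + lam * ||a||_F^2."""
    X, Y = factor_kernel(Kx_pairs), factor_kernel(Ky_pairs)
    N = len(t)
    alpha = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        pred = np.einsum('ik,kl,il->i', X, alpha, Y)   # diag(X alpha Y^T)
        grad = (2.0 / N) * X.T @ np.diag(pred - t) @ Y + 2.0 * lam * alpha
        alpha -= step * grad
    return alpha, X, Y
</pre>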

Assuming the loss function is convex, which usually holds in practice, the authors derive the convex dual of this optimization problem. Let [math]\displaystyle{ \psi_{i}(v_{i}) = l(v_i,t_i) }[/math] denote the loss of predicting [math]\displaystyle{ v_i }[/math] for the [math]\displaystyle{ i }[/math]-th data point. By applying the representer theorem, the primal form of the optimization problem becomes:

[math]\displaystyle{ \min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha) }[/math]

The dual problem can be obtained as maximizing the following expression:

[math]\displaystyle{ - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) }[/math]

where [math]\displaystyle{ \beta \in \mathbb{R}^N }[/math] is the vector of Lagrange multipliers, and [math]\displaystyle{ \psi^{*}_i }[/math] and [math]\displaystyle{ \Omega^{*} }[/math] are the Fenchel conjugates of [math]\displaystyle{ \psi_i }[/math] and [math]\displaystyle{ \Omega }[/math], respectively. To find [math]\displaystyle{ \alpha }[/math], the optimal dual variable [math]\displaystyle{ \beta }[/math] is found first, and then [math]\displaystyle{ \alpha }[/math] is recovered among the Fenchel duals of [math]\displaystyle{ -\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y }[/math].
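For instance (standard conjugates, stated here only for illustration), with the square loss [math]\displaystyle{ \psi_i(v) = (v - t_i)^2 }[/math] and the trace-norm penalty [math]\displaystyle{ \Omega(\alpha)=\sum_j \sigma_j(\alpha) }[/math], the conjugates are

[math]\displaystyle{ \psi_i^*(\beta_i)=t_i\beta_i+\frac{\beta_i^2}{4}, \qquad \Omega^{*}(B)=\begin{cases} 0 & \text{if } \sigma_{\max}(B)\leq 1,\\ +\infty & \text{otherwise,}\end{cases} }[/math]

so in this case the dual constrains the spectral norm of [math]\displaystyle{ \frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y }[/math] to be at most 1.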

Whether the primal or the dual formulation is preferable depends on the problem. For instance, in the CF problem, the numbers of variables in the primal and dual formulations are [math]\displaystyle{ m_{\mathcal{X}} m_{\mathcal{Y}} }[/math] and [math]\displaystyle{ N }[/math], respectively, so depending on how the number of available ratings [math]\displaystyle{ N }[/math] compares to [math]\displaystyle{ m_{\mathcal{X}} m_{\mathcal{Y}} }[/math], one of the two programs is chosen.

Experiments

Experiments have been performed to show that incorporating the extra information about users or objects into the CF approach can lead to more accurate predictions. The first series of experiments was done on synthetic data created as follows: first, i.i.d. multivariate features were sampled for [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math] (both of dimension 6); then a rating was sampled from a bilinear form in [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math] plus noise; finally, the observed features were restricted to only 3 for both [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math]. A comparison between the trace-norm and Frobenius-norm spectral penalties (on the same rank-constrained problem) showed that the trace norm performs slightly better. Also, the best performance in both cases is achieved when there is a balance between the Dirac and attribute kernels, i.e. in the interior of the square whose corners are [math]\displaystyle{ \eta=0,\zeta=0 }[/math], [math]\displaystyle{ \eta=0,\zeta=1 }[/math], [math]\displaystyle{ \eta=1,\zeta=0 }[/math], and [math]\displaystyle{ \eta=1,\zeta=1 }[/math].
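A minimal sketch of this synthetic setup, under the assumptions described above (the specific distributions and noise level are illustrative):

<pre>
import numpy as np

rng = np.random.default_rng(0)
n_pairs, d_full, d_obs = 1000, 6, 3

# full attributes for users (x) and objects (y), both of dimension 6
x = rng.normal(size=(n_pairs, d_full))
y = rng.normal(size=(n_pairs, d_full))

# ratings from a bilinear form in x and y, plus noise
F_true = rng.normal(size=(d_full, d_full))
t = np.einsum('ij,jk,ik->i', x, F_true, y) + 0.1 * rng.normal(size=n_pairs)

# the learner only observes the first 3 attributes of each
x_obs, y_obs = x[:, :d_obs], y[:, :d_obs]
</pre>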

The second series of experiments was performed on the MovieLens dataset, which includes ratings of 1682 movies by 943 users. Ratings take integer values from 1 to 5. The dataset also includes a genre attribute for movies and age, gender, and occupation attributes for users. Root mean squared error was measured with 10-fold cross-validation for the experiments on this dataset. Again, the best balance between the attribute and Dirac kernels is obtained inside the square mentioned above.
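For reference, a minimal sketch of this evaluation protocol (RMSE under 10-fold cross-validation); fit_and_predict stands for any hypothetical routine that fits a model on the training folds and returns predictions for the held-out ratings:

<pre>
import numpy as np

def rmse(pred, truth):
    """Root mean squared error between predicted and true ratings."""
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

def cv_rmse(fit_and_predict, ratings, n_folds=10, seed=0):
    """Average RMSE over the folds; fit_and_predict(train_idx, test_idx) must fit on
    the training ratings and return predictions for the ratings indexed by test_idx."""
    idx = np.random.default_rng(seed).permutation(len(ratings))
    folds = np.array_split(idx, n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(rmse(fit_and_predict(train_idx, test_idx), ratings[test_idx]))
    return float(np.mean(scores))
</pre>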


Conclusion

The authors presented a generalized formulation of the matrix completion problem. This formulation aims to find a compact linear operator between two Hilbert spaces. They also showed that their general framework includes several existing approaches as special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.


References

<references />