A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

Introduction

Collaborative filtering (CF) aims to predict the preferences of a set of users for objects such as movies or music, based on the objects these users have previously purchased or rated. The goal is to recommend new objects to the users based on the predicted preferences.

Regularization-based CF methods have recently been attracting attention <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727. AAAI Press, 2003.</ref>. These methods incorporate only the users' preferences, which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. The problem is then to infer the unknown entries from the known ones.

To solve this problem, a low-rank matrix is sought that approximates the partially observed preference matrix; the rank constraint acts as a regularizer. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its trace norm <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336, Cambridge, MA, 2005. MIT Press.</ref> are two approaches to solving the resulting optimization problem.
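
To make the low-rank idea concrete, here is a minimal sketch (in Python/NumPy) of a rank-k completion heuristic that repeatedly truncates the SVD of the filled-in matrix. It is only an illustration, not the weighted low-rank method cited above; the matrix, mask, and rank below are hypothetical.

<pre>
# Minimal sketch: rank-k approximation of a partially observed preference
# matrix via iterated truncated SVD ("hard impute"). All names and data
# are hypothetical; this is an illustration, not the methods cited above.
import numpy as np

def low_rank_complete(T, observed, k, n_iter=100):
    """T: preference matrix (values at unobserved entries are ignored),
    observed: boolean mask of known entries, k: target rank."""
    X = np.where(observed, T, 0.0)          # start with zeros at missing entries
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U[:, :k] * s[:k]) @ Vt[:k, :]  # best rank-k approximation
        X = np.where(observed, T, X)        # keep observed ratings fixed
    return X

rng = np.random.default_rng(0)
true_ratings = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 30))  # rank-2 ground truth
mask = rng.random((20, 30)) < 0.5
completed = low_rank_complete(true_ratings, mask, k=2)
print(np.abs(completed - true_ratings)[~mask].mean())  # error on held-out entries
</pre>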

Abernethy et al.<ref name="self">J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009.</ref> present a general regularization-based approach for CF that learns linear operators mapping the space of users to the space of their desired objects.

The advantage and novelty of this approach is its ability to incorporate additional information, such as attributes of the users/objects, by means of kernel methods. These attributes can help in predicting the unknown entries of the preference matrix. The authors also show that some multi-task learning methods are special cases of their general framework.

Learning Compact Operators with Spectral Regularization

Given [math]\displaystyle{ N }[/math] observations [math]\displaystyle{ (\mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,\ldots,N} }[/math] in [math]\displaystyle{ \mathcal{X}\times \mathcal{Y}\times \mathbb{R} }[/math], where [math]\displaystyle{ t_i }[/math] is the rating of user [math]\displaystyle{ \mathbf{x}_i }[/math] for object [math]\displaystyle{ \mathbf{y}_i }[/math], we want to find a function [math]\displaystyle{ f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R} }[/math] that can be used to predict the rating of any user [math]\displaystyle{ \mathbf{x} }[/math] for any object [math]\displaystyle{ \mathbf{y} }[/math]; [math]\displaystyle{ \mathbf{x}_i }[/math] and [math]\displaystyle{ \mathbf{y}_i }[/math] correspond to the [math]\displaystyle{ i }[/math]-th available rating.

In this paper, the authors consider learning bilinear preference functions of the form [math]\displaystyle{ f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} }[/math] for some compact operator [math]\displaystyle{ F }[/math]. The set of compact operators from [math]\displaystyle{ \mathcal{Y} }[/math] to [math]\displaystyle{ \mathcal{X} }[/math] is denoted by [math]\displaystyle{ \mathcal{B}_0(\mathcal{Y},\mathcal{X}) }[/math].
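
In the finite-dimensional case the operator [math]\displaystyle{ F }[/math] is simply a matrix and the preference function is a bilinear form, as in the following minimal sketch (dimensions and data are hypothetical):

<pre>
# Minimal sketch: in the finite-dimensional case the compact operator F is
# just a matrix, and the preference function is the bilinear form
# f(x, y) = <x, F y>. Names and dimensions are hypothetical.
import numpy as np

d_users, d_objects = 5, 4
F = np.random.default_rng(1).standard_normal((d_users, d_objects))  # operator from object space to user space

def f(x, y, F):
    """Predicted rating of a user with feature vector x for an object y."""
    return x @ (F @ y)

x = np.ones(d_users)
y = np.ones(d_objects)
print(f(x, y, F))
</pre>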

So, our problem is to answer the following question: "Given a training set of ratings, how may we estimate a good compact operator [math]\displaystyle{ F }[/math] to predict future ratings?"<ref name="self"> </ref>

The operator is defined as the solution to an optimization problem over [math]\displaystyle{ \mathcal{B}_0(\mathcal{Y,X}) }[/math]. The objective function of this optimization problem balances the data fitting term [math]\displaystyle{ R_N(F) }[/math] with a regularization term [math]\displaystyle{ \Omega(F) }[/math]:

[math]\displaystyle{ R_N(F)=\frac{1}{N} \sum_{i=1}^{N} l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i) }[/math]

[math]\displaystyle{ l(t',t) }[/math] is a loss function that quantifies the quality of a predicted value [math]\displaystyle{ t' }[/math] compared to the true value [math]\displaystyle{ t }[/math]. The choice of loss function depends on the nature of the variable to be predicted and on the problem at hand.
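
For instance, with a squared loss, the empirical risk over the observed triples can be computed as in the following sketch (the data and dimensions are hypothetical):

<pre>
# Minimal sketch: empirical risk R_N(F) over observed triples (x_i, y_i, t_i),
# here with a squared loss l(t', t) = (t' - t)^2. The data are hypothetical.
import numpy as np

def squared_loss(t_pred, t):
    return (t_pred - t) ** 2

def empirical_risk(F, X, Y, t, loss=squared_loss):
    """X: N x d_users, Y: N x d_objects, t: N observed ratings."""
    preds = np.einsum("nd,de,ne->n", X, F, Y)   # <x_i, F y_i> for each i
    return np.mean([loss(p, ti) for p, ti in zip(preds, t)])

rng = np.random.default_rng(2)
N, dx, dy = 50, 5, 4
X, Y = rng.standard_normal((N, dx)), rng.standard_normal((N, dy))
t = rng.standard_normal(N)
F = np.zeros((dx, dy))
print(empirical_risk(F, X, Y, t))
</pre>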

Also, the regularization term is defined as a spectral penalty function:

[math]\displaystyle{ \Omega(F)=\sum_{i=1}^{d} s_{i}(\sigma_i(F)) }[/math]

where the [math]\displaystyle{ s_i }[/math] are non-decreasing penalty functions satisfying [math]\displaystyle{ s_i(0)=0 }[/math], and [math]\displaystyle{ (\sigma_i(F))_{i=1,\ldots,d} }[/math] are the [math]\displaystyle{ d }[/math] singular values of [math]\displaystyle{ F }[/math] in decreasing order.
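
The following sketch illustrates how different choices of [math]\displaystyle{ s_i }[/math] recover familiar penalties (rank, trace norm, squared Hilbert–Schmidt/Frobenius norm) from the singular values of a hypothetical matrix:

<pre>
# Minimal sketch: spectral penalties computed from the singular values of F.
# Different scalar penalty functions recover the rank, the trace norm, and
# the squared Hilbert-Schmidt (Frobenius) norm. F is a hypothetical example.
import numpy as np

def spectral_penalty(F, s):
    """Omega(F) = sum_i s(sigma_i(F)) for a scalar penalty function s."""
    sigma = np.linalg.svd(F, compute_uv=False)
    return np.sum([s(si) for si in sigma])

F = np.diag([3.0, 1.0, 0.0])
print(spectral_penalty(F, lambda s: float(s > 1e-12)))  # rank(F) = 2
print(spectral_penalty(F, lambda s: s))                 # trace norm = 4
print(spectral_penalty(F, lambda s: s ** 2))            # squared Frobenius norm = 10
</pre>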

So, an operator [math]\displaystyle{ \hat{F} }[/math] should be found that solves this optimization problem:

[math]\displaystyle{ \hat{F}\in \operatorname{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y},\mathcal{X})} R_N(F)+\lambda\Omega(F) }[/math]

where the parameter [math]\displaystyle{ \lambda \in \mathbb{R}_{+} }[/math] controls the trade-off between data fitting and regularization.
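
As an illustration of this optimization problem in the finite-dimensional case, the following sketch minimizes a squared-loss empirical risk plus a trace-norm penalty by proximal gradient descent, where the proximal step soft-thresholds the singular values. The step size, data, and [math]\displaystyle{ \lambda }[/math] are hypothetical, and this is a generic solver sketch, not the authors' algorithm.

<pre>
# Minimal sketch: min_F R_N(F) + lambda * (trace norm of F) in the
# finite-dimensional case by proximal gradient descent. The proximal step
# for the trace norm is singular value soft-thresholding. All data,
# dimensions, and hyperparameters are hypothetical.
import numpy as np

def svt(F, tau):
    """Prox of tau * trace norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def fit_operator(X, Y, t, lam, step=0.05, n_iter=500):
    N, dx = X.shape
    dy = Y.shape[1]
    F = np.zeros((dx, dy))
    for _ in range(n_iter):
        resid = np.einsum("nd,de,ne->n", X, F, Y) - t      # <x_i, F y_i> - t_i
        grad = (2.0 / N) * np.einsum("n,nd,ne->de", resid, X, Y)
        F = svt(F - step * grad, step * lam)               # gradient step + prox
    return F

rng = np.random.default_rng(3)
N, dx, dy = 200, 6, 5
X, Y = rng.standard_normal((N, dx)), rng.standard_normal((N, dy))
F_true = np.outer(rng.standard_normal(dx), rng.standard_normal(dy))   # rank-1 target operator
t = np.einsum("nd,de,ne->n", X, F_true, Y)
F_hat = fit_operator(X, Y, t, lam=0.1)
print(np.linalg.matrix_rank(F_hat, tol=1e-2), np.linalg.norm(F_hat - F_true))
</pre>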

The authors show that the above general problem can yield a variety of algorithms by selecting an appropriate loss function, kernels, and spectral penalty. The choice of loss function [math]\displaystyle{ l }[/math] defines the empirical risk and depends on the type of data and the final objective of the algorithm. The choice of [math]\displaystyle{ \Omega(F) }[/math] defines the constraint imposed on the operator, e.g. a rank or trace-norm constraint; this choice also affects the efficiency or feasibility of the learning algorithm. The choice of kernels determines the embeddings of the users and objects in their respective spaces and should be based on the available data and the problem to be solved. The authors consider Dirac kernels, which compare the identities of two users/objects, and attribute kernels, which compare their attributes. They also state that the choice of kernels can lead to new methods for problems such as matrix completion, multi-task learning, and pairwise learning.
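
The following sketch contrasts the two kinds of kernels on a hypothetical set of users: the Dirac kernel only matches identical users, while an attribute kernel (here a Gaussian kernel on age and gender) also relates distinct users with similar attributes.

<pre>
# Minimal sketch: a Dirac kernel compares identities, an attribute kernel
# compares attribute vectors. User identities and attributes are hypothetical.
import numpy as np

user_ids = np.array([0, 1, 2, 3])
user_attrs = np.array([[25, 0],    # [age, gender]; users 0 and 1 share attributes
                       [25, 0],
                       [40, 1],
                       [32, 0]], dtype=float)

K_dirac = (user_ids[:, None] == user_ids[None, :]).astype(float)   # identity matrix here
K_attr = np.exp(-np.sum((user_attrs[:, None, :] - user_attrs[None, :, :]) ** 2, axis=-1)
                / (2 * 10.0 ** 2))                                  # Gaussian RBF kernel

print(K_dirac)
print(np.round(K_attr, 2))
</pre>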

We assume that each user/object pair is observed at most once, so the data can be represented as an incomplete [math]\displaystyle{ n_{\mathcal{X}} \times n_{\mathcal{Y}} }[/math] matrix, where [math]\displaystyle{ n_{\mathcal{X}} }[/math] and [math]\displaystyle{ n_{\mathcal{Y}} }[/math] are the numbers of groups into which identical data points [math]\displaystyle{ \mathbf{x}_i }[/math] and [math]\displaystyle{ \mathbf{y}_i }[/math] (users and objects) can be organized, respectively. For completing such a matrix (matrix completion), a popular strategy is to find a low-rank approximation of the observed entries, for which a squared-error loss can be combined with a rank penalty. To overcome the non-convexity of the resulting problem, a convex spectral penalty can be used instead. For instance, in the case of binary preferences, a hinge loss with a trace-norm penalty can be selected; this can be written as a semidefinite program (maximum-margin matrix factorization, MMMF <ref name="srebro05"></ref>).
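
As a small illustration of the MMMF-style formulation, the following sketch evaluates the hinge loss on the observed entries of a hypothetical binary preference matrix plus a trace-norm penalty (only the objective is computed; no solver is included):

<pre>
# Minimal sketch: MMMF-style objective for binary preferences -- hinge loss
# on observed entries of the predicted matrix plus a trace-norm penalty.
# The matrix, mask, and lambda are hypothetical.
import numpy as np

def mmmf_objective(F, T_binary, observed, lam):
    """T_binary: +1/-1 preferences, observed: boolean mask over entries."""
    margins = T_binary[observed] * F[observed]
    hinge = np.maximum(1.0 - margins, 0.0).mean()
    trace_norm = np.linalg.svd(F, compute_uv=False).sum()
    return hinge + lam * trace_norm

rng = np.random.default_rng(4)
T = np.sign(rng.standard_normal((10, 15)))
mask = rng.random((10, 15)) < 0.3
print(mmmf_objective(np.zeros((10, 15)), T, mask, lam=0.05))
</pre>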

Suppose we have access to attributes only for the objects [math]\displaystyle{ \mathbf{y} }[/math]. For a finite number of users, the goal is to estimate a separate function [math]\displaystyle{ f_{i}(\mathbf{y}) }[/math] for each user [math]\displaystyle{ i }[/math]. Each estimation is a learning task, so this becomes a multi-task learning problem. In this case, the attribute kernel is used for the objects and the Dirac kernel for the users, and the choice of loss function depends on the exact task being solved. Choosing a rank penalty and then replacing it with a trace-norm penalty to achieve convexity results in an approach similar to the one proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.
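
The multi-task view can be illustrated as follows: each row of [math]\displaystyle{ F }[/math] is the weight vector of one user's predictor, and a low-rank [math]\displaystyle{ F }[/math] means all users share a small set of latent object features (dimensions and data in this sketch are hypothetical).

<pre>
# Minimal sketch of the multi-task view: with a Dirac kernel on users and an
# attribute kernel on objects, row i of F is the weight vector of user i's
# predictor f_i(y), and a low-rank F means all users share a small set of
# latent object features. Dimensions and data are hypothetical.
import numpy as np

n_users, d_obj, k = 8, 6, 2
rng = np.random.default_rng(5)
U = rng.standard_normal((n_users, k))   # per-user weights on shared features
V = rng.standard_normal((d_obj, k))     # shared latent object features
F = U @ V.T                             # rank-k operator: one task (row) per user

def predict(user_index, y_attributes):
    return F[user_index] @ y_attributes  # f_i(y) = inner product of w_i and y

print(predict(0, np.ones(d_obj)))
</pre>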

If we have access to attributes for both users and objects, we can apply the attribute kernels to each. Choosing the Hilbert–Schmidt penalty together with an appropriate loss function (e.g. the hinge loss) recovers classical machine learning algorithms (e.g. the SVM) applied to user/object pairs.
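
As a sketch of this reduction, with linear attribute kernels on both sides and the squared Hilbert–Schmidt norm as penalty, the problem becomes a standard kernel method on user/object pairs whose kernel is the product of the user and object kernels. For simplicity the example below uses kernel ridge regression (squared loss) rather than the hinge loss/SVM mentioned above, and the data are hypothetical.

<pre>
# Minimal sketch: with attribute kernels on both sides and the squared
# Hilbert-Schmidt norm as penalty, the problem reduces to a standard kernel
# method on user/object pairs, with kernel equal to the product of the user
# and object kernels. Kernel ridge regression is used here; data are
# hypothetical.
import numpy as np

rng = np.random.default_rng(6)
N, dx, dy = 60, 5, 4
X, Y = rng.standard_normal((N, dx)), rng.standard_normal((N, dy))   # user/object attributes per rating
t = rng.standard_normal(N)
lam = 0.1

K = (X @ X.T) * (Y @ Y.T)                     # product kernel on the N observed pairs
alpha = np.linalg.solve(K + lam * N * np.eye(N), t)

def predict(x_new, y_new):
    k_new = (X @ x_new) * (Y @ y_new)         # kernel between the new pair and training pairs
    return k_new @ alpha

print(predict(X[0], Y[0]), t[0])
</pre>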

There are also cases where we have access to attributes of users and objects and want to use them to improve predictions, but the attributes alone are not informative enough. For instance, we may know the gender and age of users, but still want to capture the different preferences of users with the same age and gender. In that case we may use the attribute kernel to exploit the users' attributes, but we should also use the Dirac kernel to allow different predictions for distinct users with identical attributes.
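
A simple way to achieve this is to interpolate between the attribute kernel and the Dirac kernel, as in the following sketch (the mixing weight and the data are hypothetical):

<pre>
# Minimal sketch: interpolating between the attribute kernel and the Dirac
# kernel on users, so that shared attributes inform predictions while
# distinct users with identical attributes can still be told apart.
# The mixing weight eta and the data are hypothetical.
import numpy as np

user_ids = np.array([0, 1, 2])
user_attrs = np.array([[25.0, 0.0],   # users 0 and 1 have identical attributes
                       [25.0, 0.0],
                       [40.0, 1.0]])

K_attr = user_attrs @ user_attrs.T                                   # linear attribute kernel
K_dirac = (user_ids[:, None] == user_ids[None, :]).astype(float)     # identity kernel

eta = 0.7
K_users = eta * K_attr + (1.0 - eta) * K_dirac   # still a valid (PSD) kernel
print(np.round(K_users, 2))
</pre>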



References

<references />