statwiki - User contributions [US]

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-19T22:30:40Z

Y24Sun: /* Experiments */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the search space of the optimization problem, which is [[File: c5p3.jpg]], can be of infinite-dimensional space. So, they showed that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem-dependent.

The authors first showed that in the case of using a Hilbert-Schmidt norm as the penalty function (<math>\,\Omega(F)</math>), the set of <math> \{F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty \}</math> is the set of [http://en.wikipedia.org/wiki/Hilbert%E2%80%93Schmidt_operator Hilbert-Schmidt operators], and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by the representer theorem (details regarding it are available in Peter Bartlett's [http://www-users.cselabs.umn.edu/classes/Fall-2008/csci5525/notes/repthm.pdf lecture notes]) that the solution to this problem lives in the [http://en.wikipedia.org/wiki/Linear_span linear span] of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref> A concise proof of this property is on slide number 30 of Abernethy ''et al.'''s [http://cbio.ensmp.fr/~jvert/talks/091023rennes/rennes.pdf slides] titled '''Collaborative filtering in Hilbert spaces with spectral regularization'''.

The authors stated a generalized representer theorem (details regarding it are available in Scholkopf ''et al.'''s [http://users.cecs.anu.edu.au/~williams/papers/P139.pdf paper] titled '''A Generalized Representer Theorem''') which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\,\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the [http://en.wikipedia.org/wiki/Compact_operator compact operators], a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional (set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>\,F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and the <math>\mathbf{u}_i</math>s and the <math>\mathbf{v}_i</math>s are [http://mathworld.wolfram.com/OrthonormalBasis.html orthonormal bases] of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>, respectively. The coefficients <math>\,\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

The dimension of the space in this problem, however, might be prohibitively large for real-world applications where, for example, it is not uncommon to have tens of thousands of users and thousands of objects. Thus, to decrease the complexity at the expense of possibly losing convexity, it is convenient to constrain the [http://en.wikipedia.org/wiki/Rank_%28linear_algebra%29 rank] of the operator by using a well-chosen spectral penalty. Indeed, if we add a rank constraint <math>\,rank(F) \le r</math> to this optimization problem, then the representer theorem still holds whilst we obtain the benefit that the dimension of the parameter <math>\,\alpha</math> becomes significantly reduced from <math>{m_{\mathcal{X}}\times m_{\mathcal{Y}}}</math> to <math>\,r \times ({m_{\mathcal{X}} + m_{\mathcal{Y}}})</math>. However, some problems do in fact result from adding a rank constraint to the Hilbert-Schmidt norm penalty; one problem is that the ''classical'' representer theorem no longer holds, and a more important problem is that
doing so will likely lead to algorithmic / efficiency consequences.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, the authors derived the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\,\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>\,v_i</math> for the <math>\,i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\,\psi^{*}_i</math> and <math>\,\Omega^{*}</math> are the [http://www.wordiq.com/definition/Fenchel_conjugate Fenchel conjugates] of <math>\,\psi_i</math> and <math>\,\Omega</math>, respectively.
For finding <math>\,\alpha</math>, first the optimal dual variable <math>\,\beta</math> is found and then <math>\,\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>\,N</math>, respectively, and based on the available ratings <math>\,N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>\,x</math> and <math>\,y</math> (both having 6 dimensions). Next, they sampled <math>\,z</math> from a bilinear form in <math>\,x</math> and <math>\,y</math> plus noise, and finally, restricting the observed features to only <math>
\,3</math> for both <math>\,x</math> and <math>\,y</math>. The comparison results between two cases of applying the trace norm and the [http://mathworld.wolfram.com/FrobeniusNorm.html Frobenius norm] spectral penalties (in the same rank-constrained problem) showed that the trace norm penalty performed slightly better. Also, the best performance in both cases was achieved when there has been a balance between using the Dirac and the attribute kernels, where the pair of values for <math>\,\eta</math> and <math>\,\zeta</math>
that define this combined usage of the Dirac and the attributes kernels would be in the middle of the square for which the corners are <math>(\,\eta=0,\,\zeta=0)</math>, <math>(\,\eta=0,\,\,\zeta=1)</math>,<math>(\,\eta=1,\,\zeta=0)</math> , and <math>(\,\eta=1,\,\zeta=1)</math>. This is shown in the figure below taken from Abernathy ''et al''.'s paper.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the [http://www.grouplens.org/node/73 MovieLens] data-set which includes rating of <math>\,1,682</math> movies by <math>\,943</math> users. Movie Rating have been selected from the set of numbers <math>\,1</math> to <math>\,5</math>. The data-set also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold [http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29 cross-validation] for the experiments on this data-set. Again, the best balance between the attribute and the Dirac kernels was obtainable inside the mentioned square.

==Conclusion==
The authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert-Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

A future research could be to further explore the multi-task learning algorithms, such as to study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of the norm and the rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-19T04:15:29Z

Y24Sun: /* Conclusion */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the search space of the optimization problem, which is [[File: c5p3.jpg]], can be of infinite-dimensional space. So, they showed that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem-dependent.

The authors first showed that in the case of using a Hilbert-Schmidt norm as the penalty function (<math>\,\Omega(F)</math>), the set of <math> \{F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty \}</math> is the set of [http://en.wikipedia.org/wiki/Hilbert%E2%80%93Schmidt_operator Hilbert-Schmidt operators], and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by the representer theorem (details regarding it are available in Peter Bartlett's [http://www-users.cselabs.umn.edu/classes/Fall-2008/csci5525/notes/repthm.pdf lecture notes]) that the solution to this problem lives in the [http://en.wikipedia.org/wiki/Linear_span linear span] of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref> A concise proof of this property is on slide number 30 of Abernethy ''et al.'''s [http://cbio.ensmp.fr/~jvert/talks/091023rennes/rennes.pdf slides] titled '''Collaborative filtering in Hilbert spaces with spectral regularization'''.

The authors stated a generalized representer theorem (details regarding it are available in Scholkopf ''et al.'''s [http://users.cecs.anu.edu.au/~williams/papers/P139.pdf paper] titled '''A Generalized Representer Theorem''') which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\,\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the [http://en.wikipedia.org/wiki/Compact_operator compact operators], a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional (set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>\,F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and the <math>\mathbf{u}_i</math>s and the <math>\mathbf{v}_i</math>s are [http://mathworld.wolfram.com/OrthonormalBasis.html orthonormal bases] of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>, respectively. The coefficients <math>\,\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

The dimension of the space in this problem, however, might be prohibitively large for real-world applications where, for example, it is not uncommon to have tens of thousands of users and thousands of objects. Thus, to decrease the complexity at the expense of possibly losing convexity, it is convenient to constrain the [http://en.wikipedia.org/wiki/Rank_%28linear_algebra%29 rank] of the operator by using a well-chosen spectral penalty. Indeed, if we add a rank constraint <math>\,rank(F) \le r</math> to this optimization problem, then the representer theorem still holds whilst we obtain the benefit that the dimension of the parameter <math>\,\alpha</math> becomes significantly reduced from <math>{m_{\mathcal{X}}\times m_{\mathcal{Y}}}</math> to <math>\,r \times ({m_{\mathcal{X}} + m_{\mathcal{Y}}})</math>. However, some problems do in fact result from adding a rank constraint to the Hilbert-Schmidt norm penalty; one problem is that the ''classical'' representer theorem no longer holds, and a more important problem is that
doing so will likely lead to algorithmic / efficiency consequences.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, the authors derived the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\,\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>\,v_i</math> for the <math>\,i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\,\psi^{*}_i</math> and <math>\,\Omega^{*}</math> are the [http://www.wordiq.com/definition/Fenchel_conjugate Fenchel conjugates] of <math>\,\psi_i</math> and <math>\,\Omega</math>, respectively.
For finding <math>\,\alpha</math>, first the optimal dual variable <math>\,\beta</math> is found and then <math>\,\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>\,N</math>, respectively, and based on the available ratings <math>\,N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>\,x</math> and <math>\,y</math> (both having 6 dimensions). Next, they sampled <math>\,z</math> from a bilinear form in <math>\,x</math> and <math>\,y</math> plus noise, and finally, restricting the observed features to only <math>
\,3</math> for both <math>\,x</math> and <math>\,y</math>. The comparison results between two cases of applying the trace norm and the [http://mathworld.wolfram.com/FrobeniusNorm.html Frobenius norm] spectral penalties (in the same rank-constrained problem) showed that the trace norm penalty performed slightly better. Also, the best performance in both cases was achieved when there has been a balance between using the Dirac and the attribute kernels, where the pair of values for <math>\,\eta</math> and <math>\,\zeta</math>
that define this combined usage of the Dirac and the attributes kernels would be in the middle of the square for which the corners are <math>(\,\eta=0,\,\zeta=0)</math>, <math>(\,\eta=0,\,\,\zeta=1)</math>,<math>(\,\eta=1,\,\zeta=0)</math> , and <math>(\,\eta=1,\,\zeta=1)</math>. This is shown in the figure below taken from Abernathy ''et al''.'s paper.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the [http://www.grouplens.org/node/73 MovieLens data-set] which includes rating of <math>\,1,682</math> movies by <math>\,943</math> users. Movie Rating have been selected from the set of numbers <math>\,1</math> to <math>\,5</math>. The data-set also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold [http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29 cross-validation] for the experiments on this data-set. Again, the best balance between the attribute and the Dirac kernels was obtainable inside the mentioned square.

==Conclusion==
The authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert-Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

A future research could be to further explore the multi-task learning algorithms, such as to study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of the norm and the rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-19T03:21:59Z

Y24Sun: /* Experiments */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the search space of the optimization problem, which is [[File: c5p3.jpg]], can be of infinite-dimensional space. So, they showed that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem-dependent.

The authors first showed that in the case of using a Hilbert-Schmidt norm as the penalty function (<math>\,\Omega(F)</math>), the set of <math> \{F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty \}</math> is the set of [http://en.wikipedia.org/wiki/Hilbert%E2%80%93Schmidt_operator Hilbert-Schmidt operators], and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by the representer theorem (details regarding it are available in Peter Bartlett's [http://www-users.cselabs.umn.edu/classes/Fall-2008/csci5525/notes/repthm.pdf lecture notes]) that the solution to this problem lives in the [http://en.wikipedia.org/wiki/Linear_span linear span] of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref> A concise proof of this property is on slide number 30 of Abernethy ''et al.'''s [http://cbio.ensmp.fr/~jvert/talks/091023rennes/rennes.pdf slides] titled '''Collaborative filtering in Hilbert spaces with spectral regularization'''.

The authors stated a generalized representer theorem (details regarding it are available in Scholkopf ''et al.'''s [http://users.cecs.anu.edu.au/~williams/papers/P139.pdf paper] titled '''A Generalized Representer Theorem''') which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\,\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the [http://en.wikipedia.org/wiki/Compact_operator compact operators], a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional (set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>\,F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and the <math>\mathbf{u}_i</math>s and the <math>\mathbf{v}_i</math>s are [http://mathworld.wolfram.com/OrthonormalBasis.html orthonormal bases] of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>, respectively. The coefficients <math>\,\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

The dimension of the space in this problem, however, might be prohibitively large for real-world applications where, for example, it is not uncommon to have tens of thousands of users and thousands of objects. Thus, to decrease the complexity at the expense of possibly losing convexity, it is convenient to constrain the [http://en.wikipedia.org/wiki/Rank_%28linear_algebra%29 rank] of the operator by using a well-chosen spectral penalty. Indeed, if we add a rank constraint <math>\,rank(F) \le r</math> to this optimization problem, then the representer theorem still holds whilst we obtain the benefit that the dimension of the parameter <math>\,\alpha</math> becomes significantly reduced from <math>{m_{\mathcal{X}}\times m_{\mathcal{Y}}}</math> to <math>\,r \times ({m_{\mathcal{X}} + m_{\mathcal{Y}}})</math>. However, some problems do in fact result from adding a rank constraint to the Hilbert-Schmidt norm penalty; one problem is that the ''classical'' representer theorem no longer holds, and a more important problem is that
doing so will likely lead to algorithmic / efficiency consequences.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, the authors derived the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\,\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>\,v_i</math> for the <math>\,i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\,\psi^{*}_i</math> and <math>\,\Omega^{*}</math> are the [http://www.wordiq.com/definition/Fenchel_conjugate Fenchel conjugates] of <math>\,\psi_i</math> and <math>\,\Omega</math>, respectively.
For finding <math>\,\alpha</math>, first the optimal dual variable <math>\,\beta</math> is found and then <math>\,\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>\,N</math>, respectively, and based on the available ratings <math>\,N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>\,x</math> and <math>\,y</math> (both having 6 dimensions). Next, they sampled <math>\,z</math> from a bilinear form in <math>\,x</math> and <math>\,y</math> plus noise, and finally, restricting the observed features to only <math>
\,3</math> for both <math>\,x</math> and <math>\,y</math>. The comparison results between two cases of applying the trace norm and the [http://mathworld.wolfram.com/FrobeniusNorm.html Frobenius norm] spectral penalties (in the same rank-constrained problem) showed that the trace norm penalty performed slightly better. Also, the best performance in both cases was achieved when there has been a balance between using the Dirac and the attribute kernels, where the pair of values for <math>\,\eta</math> and <math>\,\zeta</math>
that define this combined usage of the Dirac and the attributes kernels would be in the middle of the square for which the corners are <math>(\,\eta=0,\,\zeta=0)</math>, <math>(\,\eta=0,\,\,\zeta=1)</math>,<math>(\,\eta=1,\,\zeta=0)</math> , and <math>(\,\eta=1,\,\zeta=1)</math>. This is shown in the figure below taken from Abernathy ''et al''.'s paper.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the [http://www.grouplens.org/node/73 MovieLens data-set] which includes rating of <math>\,1,682</math> movies by <math>\,943</math> users. Movie Rating have been selected from the set of numbers <math>\,1</math> to <math>\,5</math>. The data-set also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold [http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29 cross-validation] for the experiments on this data-set. Again, the best balance between the attribute and the Dirac kernels was obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-19T03:08:26Z

Y24Sun: /* Experiments */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the search space of the optimization problem, which is [[File: c5p3.jpg]], can be of infinite-dimensional space. So, they showed that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem-dependent.

The authors first showed that in the case of using a Hilbert-Schmidt norm as the penalty function (<math>\,\Omega(F)</math>), the set of <math> \{F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty \}</math> is the set of [http://en.wikipedia.org/wiki/Hilbert%E2%80%93Schmidt_operator Hilbert-Schmidt operators], and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by the representer theorem (details regarding it are available in Peter Bartlett's [http://www-users.cselabs.umn.edu/classes/Fall-2008/csci5525/notes/repthm.pdf lecture notes]) that the solution to this problem lives in the [http://en.wikipedia.org/wiki/Linear_span linear span] of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref> A concise proof of this property is on slide number 30 of Abernethy ''et al.'''s [http://cbio.ensmp.fr/~jvert/talks/091023rennes/rennes.pdf slides] titled '''Collaborative filtering in Hilbert spaces with spectral regularization'''.

The authors stated a generalized representer theorem (details regarding it are available in Scholkopf ''et al.'''s [http://users.cecs.anu.edu.au/~williams/papers/P139.pdf paper] titled '''A Generalized Representer Theorem''') which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\,\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the [http://en.wikipedia.org/wiki/Compact_operator compact operators], a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional (set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>\,F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and the <math>\mathbf{u}_i</math>s and the <math>\mathbf{v}_i</math>s are [http://mathworld.wolfram.com/OrthonormalBasis.html orthonormal bases] of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>, respectively. The coefficients <math>\,\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

The dimension of the space in this problem, however, might be prohibitively large for real-world applications where, for example, it is not uncommon to have tens of thousands of users and thousands of objects. Thus, to decrease the complexity at the expense of possibly losing convexity, it is convenient to constrain the [http://en.wikipedia.org/wiki/Rank_%28linear_algebra%29 rank] of the operator by using a well-chosen spectral penalty. Indeed, if we add a rank constraint <math>\,rank(F) \le r</math> to this optimization problem, then the representer theorem still holds whilst we obtain the benefit that the dimension of the parameter <math>\,\alpha</math> becomes significantly reduced from <math>{m_{\mathcal{X}}\times m_{\mathcal{Y}}}</math> to <math>\,r \times ({m_{\mathcal{X}} + m_{\mathcal{Y}}})</math>. However, some problems do in fact result from adding a rank constraint to the Hilbert-Schmidt norm penalty; one problem is that the ''classical'' representer theorem no longer holds, and a more important problem is that
doing so will likely lead to algorithmic / efficiency consequences.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, the authors derived the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\,\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>\,v_i</math> for the <math>\,i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\,\psi^{*}_i</math> and <math>\,\Omega^{*}</math> are the [http://www.wordiq.com/definition/Fenchel_conjugate Fenchel conjugates] of <math>\,\psi_i</math> and <math>\,\Omega</math>, respectively.
For finding <math>\,\alpha</math>, first the optimal dual variable <math>\,\beta</math> is found and then <math>\,\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>\,N</math>, respectively, and based on the available ratings <math>\,N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>\,x</math> and <math>\,y</math> (both having 6 dimensions). Next, they sampled <math>\,z</math> from a bilinear form in <math>\,x</math> and <math>\,y</math> plus noise, and finally, restricting the observed features to only <math>
\,3</math> for both <math>\,x</math> and <math>\,y</math>. The comparison results between two cases of applying the trace norm and the [http://mathworld.wolfram.com/FrobeniusNorm.html Frobenius norm] spectral penalties (in the same rank-constrained problem) showed that the trace norm penalty performed slightly better. Also, the best performance in both cases was achieved when there has been a balance between using the Dirac and the attributes kernels, where the pair of values for <math>\,\eta</math> and <math>\,\zeta</math>
that define this combined usage of the Dirac and the attributes kernels would be in the middle of the square for which the corners are <math>(\,\eta=0,\,\zeta=0)</math>, <math>(\,\eta=0,\,\,\zeta=1)</math>,<math>(\,\eta=1,\,\zeta=0)</math> , and <math>(\,\eta=1,\,\zeta=1)</math>. This is shown in the figure below taken from Abernathy ''et al''.'s paper.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the [http://www.grouplens.org/node/73 MovieLens data-set] which includes rating of <math>\,1,682</math> movies by <math>\,943</math> users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-19T02:52:24Z

Y24Sun: /* Experiments */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the search space of the optimization problem, which is [[File: c5p3.jpg]], can be of infinite-dimensional space. So, they showed that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem-dependent.

The authors first showed that in the case of using a Hilbert-Schmidt norm as the penalty function (<math>\,\Omega(F)</math>), the set of <math> \{F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty \}</math> is the set of [http://en.wikipedia.org/wiki/Hilbert%E2%80%93Schmidt_operator Hilbert-Schmidt operators], and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by the representer theorem (details regarding it are available in Peter Bartlett's [http://www-users.cselabs.umn.edu/classes/Fall-2008/csci5525/notes/repthm.pdf lecture notes]) that the solution to this problem lives in the [http://en.wikipedia.org/wiki/Linear_span linear span] of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref> A concise proof of this property is on slide number 30 of Abernethy ''et al.'''s [http://cbio.ensmp.fr/~jvert/talks/091023rennes/rennes.pdf slides] titled '''Collaborative filtering in Hilbert spaces with spectral regularization'''.

The authors stated a generalized representer theorem (details regarding it are available in Scholkopf ''et al.'''s [http://users.cecs.anu.edu.au/~williams/papers/P139.pdf paper] titled '''A Generalized Representer Theorem''') which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\,\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the [http://en.wikipedia.org/wiki/Compact_operator compact operators], a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional (set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>\,F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and the <math>\mathbf{u}_i</math>s and the <math>\mathbf{v}_i</math>s are [http://mathworld.wolfram.com/OrthonormalBasis.html orthonormal bases] of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>, respectively. The coefficients <math>\,\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

The dimension of the space in this problem, however, might be prohibitively large for real-world applications where, for example, it is not uncommon to have tens of thousands of users and thousands of objects. Thus, to decrease the complexity at the expense of possibly losing convexity, it is convenient to constrain the [http://en.wikipedia.org/wiki/Rank_%28linear_algebra%29 rank] of the operator by using a well-chosen spectral penalty. Indeed, if we add a rank constraint <math>\,rank(F) \le r</math> to this optimization problem, then the representer theorem still holds whilst we obtain the benefit that the dimension of the parameter <math>\,\alpha</math> becomes significantly reduced from <math>{m_{\mathcal{X}}\times m_{\mathcal{Y}}}</math> to <math>\,r \times ({m_{\mathcal{X}} + m_{\mathcal{Y}}})</math>. However, some problems do in fact result from adding a rank constraint to the Hilbert-Schmidt norm penalty; one problem is that the ''classical'' representer theorem no longer holds, and a more important problem is that
doing so will likely lead to algorithmic / efficiency consequences.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, the authors derived the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\,\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>\,v_i</math> for the <math>\,i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\,\psi^{*}_i</math> and <math>\,\Omega^{*}</math> are the [http://www.wordiq.com/definition/Fenchel_conjugate Fenchel conjugates] of <math>\,\psi_i</math> and <math>\,\Omega</math>, respectively.
For finding <math>\,\alpha</math>, first the optimal dual variable <math>\,\beta</math> is found and then <math>\,\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>\,N</math>, respectively, and based on the available ratings <math>\,N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>\,x</math> and <math>\,y</math> (both having 6 dimensions). Next, they sampled <math>\,z</math> from a bilinear form in <math>\,x</math> and <math>\,y</math> plus noise, and finally, restricting the observed features to only <math>
\,3</math> for both <math>\,x</math> and <math>\,y</math>. The comparison results between two cases of applying the trace norm and the [http://mathworld.wolfram.com/FrobeniusNorm.html Frobenius norm] spectral penalties (in the same rank-constrained problem) showed that the trace norm penalty performed slightly better. Also, the best performance in both cases was achieved when there has been a balance between using the Dirac and the attributes kernels, where the pair of values for <math>\,\eta</math> and <math>\,\zeta</math>
that define this combined usage of the Dirac and the attributes kernels would be in the middle of the square for which the corners are <math>(\,\eta=0,\,\zeta=0)</math>, <math>(\,\eta=0,\,\,\zeta=1)</math>,<math>(\,\eta=1,\,\zeta=0)</math> , and <math>(\,\eta=1,\,\zeta=1)</math>. This is shown in the figure below taken from Abernathy ''et al''.'s paper.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-19T01:49:58Z

Y24Sun: /* Optimization Problem */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the search space of the optimization problem, which is [[File: c5p3.jpg]], can be of infinite-dimensional space. So, they showed that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem-dependent.

The authors first showed that in the case of using a Hilbert-Schmidt norm as the penalty function (<math>\,\Omega(F)</math>), the set of <math> \{F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty \}</math> is the set of [http://en.wikipedia.org/wiki/Hilbert%E2%80%93Schmidt_operator Hilbert-Schmidt operators], and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by the representer theorem (details regarding it are available in Peter Bartlett's [http://www-users.cselabs.umn.edu/classes/Fall-2008/csci5525/notes/repthm.pdf lecture notes]) that the solution to this problem lives in the [http://en.wikipedia.org/wiki/Linear_span linear span] of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref> A concise proof of this property is on slide number 30 of Abernethy ''et al.'''s [http://cbio.ensmp.fr/~jvert/talks/091023rennes/rennes.pdf slides] titled '''Collaborative filtering in Hilbert spaces with spectral regularization'''.

The authors stated a generalized representer theorem (details regarding it are available in Scholkopf ''et al.'''s [http://users.cecs.anu.edu.au/~williams/papers/P139.pdf paper] titled '''A Generalized Representer Theorem''') which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\,\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the [http://en.wikipedia.org/wiki/Compact_operator compact operators], a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional (set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>\,F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and the <math>\mathbf{u}_i</math>s and the <math>\mathbf{v}_i</math>s are [http://mathworld.wolfram.com/OrthonormalBasis.html orthonormal bases] of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>, respectively. The coefficients <math>\,\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

The dimension of the space in this problem, however, might be prohibitively large for real-world applications where, for example, it is not uncommon to have tens of thousands of users and thousands of objects. Thus, to decrease the complexity at the expense of possibly losing convexity, it is convenient to constrain the [http://en.wikipedia.org/wiki/Rank_%28linear_algebra%29 rank] of the operator by using a well-chosen spectral penalty. Indeed, if we add a rank constraint <math>\,rank(F) \le r</math> to this optimization problem, then the representer theorem still holds whilst we obtain the benefit that the dimension of the parameter <math>\,\alpha</math> becomes significantly reduced from <math>{m_{\mathcal{X}}\times m_{\mathcal{Y}}}</math> to <math>\,r \times ({m_{\mathcal{X}} + m_{\mathcal{Y}}})</math>. However, some problems do in fact result from adding a rank constraint to the Hilbert-Schmidt norm penalty; one problem is that the ''classical'' representer theorem no longer holds, and a more important problem is that
doing so will likely lead to algorithmic / efficiency consequences.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, the authors derived the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\,\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>\,v_i</math> for the <math>\,i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\,\psi^{*}_i</math> and <math>\,\Omega^{*}</math> are the [http://www.wordiq.com/definition/Fenchel_conjugate Fenchel conjugates] of <math>\,\psi_i</math> and <math>\,\Omega</math>, respectively.
For finding <math>\,\alpha</math>, first the optimal dual variable <math>\,\beta</math> is found and then <math>\,\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>\,N</math>, respectively, and based on the available ratings <math>\,N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-19T01:31:31Z

Y24Sun: /* Optimization Problem */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the search space of the optimization problem, which is [[File: c5p3.jpg]], can be of infinite-dimensional space. So, they showed that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem-dependent.

The authors first showed that in the case of using a Hilbert-Schmidt norm as the penalty function (<math>\,\Omega(F)</math>), the set of <math> \{F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty \}</math> is the set of [http://en.wikipedia.org/wiki/Hilbert%E2%80%93Schmidt_operator Hilbert-Schmidt operators], and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by the representer theorem (details regarding it are available in Peter Bartlett's [http://www-users.cselabs.umn.edu/classes/Fall-2008/csci5525/notes/repthm.pdf lecture notes]) that the solution to this problem lives in the [http://en.wikipedia.org/wiki/Linear_span linear span] of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref> A concise proof of this property is on slide number 30 of Abernethy ''et al.'''s [http://cbio.ensmp.fr/~jvert/talks/091023rennes/rennes.pdf slides] titled '''Collaborative filtering in Hilbert spaces with spectral regularization'''.

The authors stated a generalized representer theorem (details regarding it are available in Scholkopf ''et al.'''s [http://users.cecs.anu.edu.au/~williams/papers/P139.pdf paper] titled '''A Generalized Representer Theorem''') which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\,\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the [http://en.wikipedia.org/wiki/Compact_operator compact operators], a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional (set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>\,F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and the <math>\mathbf{u}_i</math>s and the <math>\mathbf{v}_i</math>s are [http://mathworld.wolfram.com/OrthonormalBasis.html orthonormal bases] of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>, respectively. The coefficients <math>\,\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

The dimension of the space in this problem, however, might be prohibitively large for real-world applications where, for example, it is not uncommon to have tens of thousands of users and thousands of objects. Thus, to decrease the complexity at the expense of possibly losing convexity, it is convenient to constrain the [http://en.wikipedia.org/wiki/Rank_%28linear_algebra%29 rank] of the operator by using a well-chosen spectral penalty. Indeed, if we add a rank constraint <math>\,rank(F) \le r</math> to this optimization problem, then the representer theorem still holds whilst we obtain the benefit that the dimension of the parameter <math>\,\alpha</math> becomes significantly reduced from <math>{m_{\mathcal{X}}\times m_{\mathcal{Y}}}</math> to <math>\,r \times ({m_{\mathcal{X}} + m_{\mathcal{Y}}})</math>. However, some problems do in fact result from adding a rank constraint to the Hilbert-Schmidt norm penalty; one problem is that the ''classical'' representer theorem no longer holds, and a more important problem is that
doing so will likely lead to algorithmic / efficiency consequences.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, the authors derived the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\,\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>\,v_i</math> for the <math>\,i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\,\psi^{*}_i</math> and <math>\,\Omega^{*}</math> are the [http://www.wordiq.com/definition/Fenchel_conjugate Fenchel conjugates] of <math>\,\psi_i</math> and <math>\,\Omega</math>, respectively.
For finding <math>\,\alpha</math>, first the optimal dual variable <math>\,\beta</math> is found and then <math>\,\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-19T01:28:16Z

Y24Sun: /* Optimization Problem */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the search space of the optimization problem, which is [[File: c5p3.jpg]], can be of infinite-dimensional space. So, they showed that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem-dependent.

The authors first showed that in the case of using a Hilbert-Schmidt norm as the penalty function (<math>\,\Omega(F)</math>), the set of <math> \{F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty \}</math> is the set of [http://en.wikipedia.org/wiki/Hilbert%E2%80%93Schmidt_operator Hilbert-Schmidt operators], and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by the representer theorem (details regarding it are available in Peter Bartlett's [http://www-users.cselabs.umn.edu/classes/Fall-2008/csci5525/notes/repthm.pdf lecture notes]) that the solution to this problem lives in the [http://en.wikipedia.org/wiki/Linear_span linear span] of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref> A concise proof of this property is on slide number 30 of Abernethy ''et al.'''s [http://cbio.ensmp.fr/~jvert/talks/091023rennes/rennes.pdf slides] titled '''Collaborative filtering in Hilbert spaces with spectral regularization'''.

The authors stated a generalized representer theorem (details regarding it are available in Scholkopf ''et al.'''s [http://users.cecs.anu.edu.au/~williams/papers/P139.pdf paper] titled '''A Generalized Representer Theorem''') which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\,\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the [http://en.wikipedia.org/wiki/Compact_operator compact operators], a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional (set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>\,F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and the <math>\mathbf{u}_i</math>s and the <math>\mathbf{v}_i</math>s are [http://mathworld.wolfram.com/OrthonormalBasis.html orthonormal bases] of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>, respectively. The coefficients <math>\,\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

The dimension of the space in this problem, however, might be prohibitively large for real-world applications where, for example, it is not uncommon to have tens of thousands of users and thousands of objects. Thus, to decrease the complexity at the expense of possibly losing convexity, it is convenient to constrain the [http://en.wikipedia.org/wiki/Rank_%28linear_algebra%29 rank] of the operator by using a well-chosen spectral penalty. Indeed, if we add a rank constraint <math>\,rank(F) \le r</math> to this optimization problem, then the representer theorem still holds whilst we obtain the benefit that the dimension of the parameter <math>\,\alpha</math> becomes significantly reduced from <math>{m_{\mathcal{X}}\times m_{\mathcal{Y}}}</math> to <math>\,r \times ({m_{\mathcal{X}} + m_{\mathcal{Y}}})</math>. However, some problems do in fact result from adding a rank constraint to the Hilbert-Schmidt norm penalty; one problem is that the ''classical'' representer theorem no longer holds, and a more important problem is that
doing so will likely lead to algorithmic / efficiency consequences.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, the authors derived the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\,\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>\,v_i</math> for the <math>\,i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\,\psi^{*}_i</math> and <math>\,\Omega^{*}</math> are the [http://www.wordiq.com/definition/Fenchel_conjugate Fenchel conjugates] of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-19T01:17:09Z

Y24Sun: /* Optimization Problem */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the search space of the optimization problem, which is [[File: c5p3.jpg]], can be of infinite-dimensional space. So, they showed that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem-dependent.

The authors first showed that in the case of using a Hilbert-Schmidt norm as the penalty function (<math>\,\Omega(F)</math>), the set of <math> \{F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty \}</math> is the set of [http://en.wikipedia.org/wiki/Hilbert%E2%80%93Schmidt_operator Hilbert-Schmidt operators], and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by the representer theorem (details regarding it are available in Peter Bartlett's [http://www-users.cselabs.umn.edu/classes/Fall-2008/csci5525/notes/repthm.pdf lecture notes]) that the solution to this problem lives in the [http://en.wikipedia.org/wiki/Linear_span linear span] of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref> A concise proof of this property is on slide number 30 of Abernethy ''et al.'''s [http://cbio.ensmp.fr/~jvert/talks/091023rennes/rennes.pdf slides] titled '''Collaborative filtering in Hilbert spaces with spectral regularization'''.

The authors stated a generalized representer theorem (details regarding it are available in Scholkopf ''et al.'''s [http://users.cecs.anu.edu.au/~williams/papers/P139.pdf paper] titled '''A Generalized Representer Theorem''') which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\,\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the [http://en.wikipedia.org/wiki/Compact_operator compact operators], a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional (set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>\,F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and the <math>\mathbf{u}_i</math>s and the <math>\mathbf{v}_i</math>s are [http://mathworld.wolfram.com/OrthonormalBasis.html orthonormal bases] of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>, respectively. The coefficients <math>\,\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

The dimension of the space in this problem, however, might be prohibitively large for real-world applications where, for example, it is not uncommon to have tens of thousands of users and thousands of objects. Thus, to decrease the complexity at the expense of possibly losing convexity, it is convenient to constrain the [http://en.wikipedia.org/wiki/Rank_%28linear_algebra%29 rank] of the operator by using a well-chosen spectral penalty. Indeed, if we add a rank constraint <math>\,rank(F) \le r</math> to this optimization problem, then the representer theorem still holds whilst we obtain the benefit that the dimension of the parameter <math>\,\alpha</math> becomes significantly reduced from <math>{m_{\mathcal{X}}\times m_{\mathcal{Y}}}</math> to <math>\,r \times ({m_{\mathcal{X}} + m_{\mathcal{Y}}})</math>. However, some problems do in fact result from adding a rank constraint to the Hilbert-Schmidt norm penalty; one problem is that the ''classical'' representer theorem no longer holds, and a more important problem is that
doing so will likely lead to algorithmic / efficiency consequences.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-19T01:10:39Z

Y24Sun: /* Optimization Problem */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the search space of the optimization problem, which is [[File: c5p3.jpg]], can be of infinite-dimensional space. So, they showed that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem-dependent.

The authors first showed that in the case of using a Hilbert-Schmidt norm as the penalty function (<math>\,\Omega(F)</math>), the set of <math> \{F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty \}</math> is the set of [http://en.wikipedia.org/wiki/Hilbert%E2%80%93Schmidt_operator Hilbert-Schmidt operators], and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by the representer theorem (details regarding it are available in Peter Bartlett's [http://www-users.cselabs.umn.edu/classes/Fall-2008/csci5525/notes/repthm.pdf lecture notes]) that the solution to this problem lives in the [http://en.wikipedia.org/wiki/Linear_span linear span] of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref> A concise proof of this property is on slide number 30 of Abernethy ''et al.'''s [http://cbio.ensmp.fr/~jvert/talks/091023rennes/rennes.pdf slides] titled '''Collaborative filtering in Hilbert spaces with spectral regularization'''.

The authors stated a generalized representer theorem (details regarding it are available in Scholkopf ''et al.'''s [http://users.cecs.anu.edu.au/~williams/papers/P139.pdf paper] titled '''A Generalized Representer Theorem''') which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\,\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the [http://en.wikipedia.org/wiki/Compact_operator compact operators], a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional (set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>\,F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and the <math>\mathbf{u}_i</math>s and the <math>\mathbf{v}_i</math>s are [http://mathworld.wolfram.com/OrthonormalBasis.html orthonormal bases] of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>, respectively. The coefficients <math>\,\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

The dimension of the space in this problem, however, might be prohibitively large for real-world applications where, for example, it is not uncommon to have tens of thousands of users and thousands of objects. Thus, to decrease the complexity at the expense of possibly losing convexity, it is convenient to constrain the [http://en.wikipedia.org/wiki/Rank_%28linear_algebra%29 rank] of the operator by using a well-chosen spectral penalty. Indeed, if we add a rank constraint <math>\,rank(F) \le r</math> to this optimization problem, then the representer theorem still holds whilst we obtain the benefit that the dimension of the parameter <math>\,\alpha</math> becomes significantly reduced from <math>{m_{\mathcal{X}}\times m_{\mathcal{Y}}}</math> to <math>\,r \times ({m_{\mathcal{X}} + m_{\mathcal{Y}}})</math>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-19T00:47:30Z

Y24Sun: /* Optimization Problem */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the search space of the optimization problem, which is [[File: c5p3.jpg]], can be of infinite-dimensional space. So, they showed that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem-dependent.

The authors first showed that in the case of using a Hilbert-Schmidt norm as the penalty function (<math>\,\Omega(F)</math>), the set of <math> \{F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty \}</math> is the set of [http://en.wikipedia.org/wiki/Hilbert%E2%80%93Schmidt_operator Hilbert-Schmidt operators], and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by the representer theorem (details regarding it are available in Peter Bartlett's [http://www-users.cselabs.umn.edu/classes/Fall-2008/csci5525/notes/repthm.pdf lecture notes]) that the solution to this problem lives in the [http://en.wikipedia.org/wiki/Linear_span linear span] of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref> A concise proof of this property is on slide number 30 of Abernethy ''et al.'''s [http://cbio.ensmp.fr/~jvert/talks/091023rennes/rennes.pdf slides] titled '''Collaborative filtering in Hilbert spaces with spectral regularization'''.

The authors stated a generalized representer theorem (details regarding it are available in Scholkopf ''et al.'''s [http://users.cecs.anu.edu.au/~williams/papers/P139.pdf paper] titled '''A Generalized Representer Theorem''') which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\,\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the [http://en.wikipedia.org/wiki/Compact_operator compact operators], a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional (set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>\,F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and the <math>\mathbf{u}_i</math>s and the <math>\mathbf{v}_i</math>s are [http://mathworld.wolfram.com/OrthonormalBasis.html orthonormal bases] of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>, respectively. The coefficients <math>\,\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-19T00:42:55Z

Y24Sun: /* Optimization Problem */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the search space of the optimization problem, which is [[File: c5p3.jpg]], can be of infinite-dimensional space. So, they showed that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem-dependent.

The authors first showed that in the case of using a Hilbert-Schmidt norm as the penalty function (<math>\,\Omega(F)</math>), the set of <math> \{F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty \}</math> is the set of [http://en.wikipedia.org/wiki/Hilbert%E2%80%93Schmidt_operator Hilbert-Schmidt operators], and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by the representer theorem (details regarding it are available in Peter Bartlett's [http://www-users.cselabs.umn.edu/classes/Fall-2008/csci5525/notes/repthm.pdf lecture notes]) that the solution to this problem lives in the [http://en.wikipedia.org/wiki/Linear_span linear span] of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref> A concise proof of this property is on slide number 30 of Abernethy ''et al.'''s [http://cbio.ensmp.fr/~jvert/talks/091023rennes/rennes.pdf slides] titled '''Collaborative filtering in Hilbert spaces with spectral regularization'''.

The authors stated a generalized representer theorem (details regarding it are available in Scholkopf ''et al.'''s [http://users.cecs.anu.edu.au/~williams/papers/P139.pdf paper] titled '''A Generalized Representer Theorem''') which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\,\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the [http://en.wikipedia.org/wiki/Compact_operator compact operators], a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional (set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-19T00:40:21Z

Y24Sun: /* Optimization Problem */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the search space of the optimization problem, which is [[File: c5p3.jpg]], can be of infinite-dimensional space. So, they showed that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem-dependent.

The authors first showed that in the case of using a Hilbert-Schmidt norm as the penalty function (<math>\,\Omega(F)</math>), the set of <math> \{F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty \}</math> is the set of [http://en.wikipedia.org/wiki/Hilbert%E2%80%93Schmidt_operator Hilbert-Schmidt operators], and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by the representer theorem (details regarding it are available in Peter Bartlett's [http://www-users.cselabs.umn.edu/classes/Fall-2008/csci5525/notes/repthm.pdf lecture notes]) that the solution to this problem lives in the [http://en.wikipedia.org/wiki/Linear_span linear span] of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref> A concise proof of this property is on slide number 30 of Abernethy ''et al.'''s [http://cbio.ensmp.fr/~jvert/talks/091023rennes/rennes.pdf slides] titled '''Collaborative filtering in Hilbert spaces with spectral regularization'''.

The authors stated a generalized representer theorem (details regarding it are available in Scholkopf ''et al.'''s [http://users.cecs.anu.edu.au/~williams/papers/P139.pdf paper] titled '''A Generalized Representer Theorem''') which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\,\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-19T00:34:05Z

Y24Sun: /* Optimization Problem */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the search space of the optimization problem, which is [[File: c5p3.jpg]], can be of infinite-dimensional space. So, they showed that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem-dependent.

The authors first showed that in the case of using a Hilbert-Schmidt norm as the penalty function (<math>\,\Omega(F)</math>), the set of <math> \{F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty \}</math> is the set of [http://en.wikipedia.org/wiki/Hilbert%E2%80%93Schmidt_operator Hilbert-Schmidt operators], and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by the representer theorem (details regarding it are available in Peter Bartlett's [http://www-users.cselabs.umn.edu/classes/Fall-2008/csci5525/notes/repthm.pdf lecture notes]) that the solution to this problem lives in the [http://en.wikipedia.org/wiki/Linear_span linear span] of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref> A concise proof of this property is on slide number 30 of Abernethy ''et al.'''s [http://cbio.ensmp.fr/~jvert/talks/091023rennes/rennes.pdf slides] titled '''Collaborative filtering in Hilbert spaces with spectral regularization'''.

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-19T00:32:41Z

Y24Sun: /* Optimization Problem */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the search space of the optimization problem, which is [[File: c5p3.jpg]], can be of infinite-dimensional space. So, they showed that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem-dependent.

The authors first showed that in the case of using a Hilbert-Schmidt norm as the penalty function (<math>\,\Omega(F)</math>), the set of <math> \{F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty \}</math> is the set of [http://en.wikipedia.org/wiki/Hilbert%E2%80%93Schmidt_operator Hilbert-Schmidt operators], and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by the representer theorem (details regarding it are available in Peter Bartlett's [http://www-users.cselabs.umn.edu/classes/Fall-2008/csci5525/notes/repthm.pdf lecture notes]) that the solution to this problem lives in the [http://en.wikipedia.org/wiki/Linear_span linear span] of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref> A concise sketch proof of this property is on slide number 30 of Abernethy ''et al.'''s [http://cbio.ensmp.fr/~jvert/talks/091023rennes/rennes.pdf slides] titled '''Collaborative filtering in Hilbert spaces with spectral regularization'''.

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-19T00:13:52Z

Y24Sun: /* Optimization Problem */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the search space of the optimization problem, which is [[File: c5p3.jpg]], can be of infinite-dimensional space. So, they showed that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem-dependent.

The authors first showed that in the case of using a Hilbert-Schmidt norm as the penalty function (<math>\,\Omega(F)</math>), the set of <math> \{F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty \}</math> is the set of [http://en.wikipedia.org/wiki/Hilbert%E2%80%93Schmidt_operator Hilbert-Schmidt operators], and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by the representer theorem (details regarding it are available in Peter Bartlett's [http://www-users.cselabs.umn.edu/classes/Fall-2008/csci5525/notes/repthm.pdf lecture notes]) that the solution to this problem lives in the [http://en.wikipedia.org/wiki/Linear_span linear span] of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-19T00:10:16Z

Y24Sun: /* Optimization Problem */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the search space of the optimization problem, which is [[File: c5p3.jpg]], can be of infinite-dimensional space. So, they showed that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem-dependent.

The authors first showed that in the case of using a Hilbert-Schmidt norm as the penalty function (<math>\,\Omega(F)</math>), the set of <math> \{F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty \}</math> is the set of [http://en.wikipedia.org/wiki/Hilbert%E2%80%93Schmidt_operator Hilbert-Schmidt operators], and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-19T00:01:23Z

Y24Sun: /* Optimization Problem */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the search space of the optimization problem, which is [[File: c5p3.jpg]], can be of infinite-dimensional space. So, they showed that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem-dependent.

The authors first showed that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\,\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-18T23:38:32Z

Y24Sun: /* Optimization Problem */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the search space of the optimization problem, which is [[File: c5p3.jpg]], can be of infinite-dimensional space. So, they showed that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem-dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-18T23:36:53Z

Y24Sun: /* Optimization Problem */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the search space of the optimization problem, which is [[File: c5p3.jpg]], can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem-dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

File:C5p3.jpg

2010-12-18T23:33:57Z

Y24Sun:

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-17T00:28:54Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\,1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-17T00:24:46Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner. This extra degree
of freedom resulting from allowing both <math>\,\eta</math> and <math>\,\zeta</math> to vary continuously between <math>\,0</math> and <math>\1</math> provides a way that optimally balances the influence of the attributes in the function estimation process.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T23:59:27Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0)</math> corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T23:51:49Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>. These kernels combine the Dirac kernels <math>\,(\eta = 0 and \zeta = 0)</math> and the attributes kernels <math>\,(\eta = 1 and \zeta = 1)</math> by interpolating between them. In addition, combining this choice of kernels with, for example, the trace norm penalty function would enable us to continuously interpolate between the different settings that correspond to the different 'corners' in the <math>\,(\eta , \zeta)</math> square, which consist of standard CF with matrix completion at the <math>\,(0,0)</math> corner, multi-task learning at both the <math>\,(0,1)</math> and the <math>\,(1,0) corners, and last but not the least learning over pairs at the <math>\,(1,1)</math> corner.

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T23:29:49Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM if the hinge loss is used as the loss function) that is applied to the [http://en.wikipedia.org/wiki/Tensor_product tensor product] of <math>\,\mathcal{X}</math> and <math>\,\mathcal{Y}</math>. An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T23:22:03Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning. In addition, one can also use a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Using the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM). An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T23:14:20Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. If one selects the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and uses an appropriate loss function (e.g. hinge loss function) and a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier, then one obtains classical machine learning algorithms (e.g. SVM). An interesting point to note is that, as explained by Evgeniou et al. (2005), when <math>\,c = 0</math>, i.e. when we take a Dirac kernel for the users and an attribute kernel for the objects, then penalizing the Hilbert-Schmidt norm is equivalent to estimating [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] models for the users.

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T23:07:12Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. If one selects the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and uses an appropriate loss function (e.g. hinge loss function) and a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier, then one obtains classical machine learning algorithms (e.g. SVM).

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T23:02:40Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. By selecting the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and using an appropriate loss function (e.g. hinge loss function), as well as using a multi-task kernel, i.e. a kernel <math>\,k_{multi-task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier results in classical machine learning algorithms (e.g. SVM).

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T22:58:59Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. By selecting the [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the spectral penalty and using an appropriate loss function (e.g. hinge loss function), as well as using a multi-task kernel, i.e. a kernel <math>\,k_{multi−task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier results in classical machine learning algorithms (e.g. SVM).

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T22:55:23Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. By selecting the Hilbert-Schmidt
norm as the spectral penalty and using an appropriate loss function (e.g. hinge loss function), as well as using a multi-task kernel, i.e. a kernel <math>\,k_{multi−task}</math> over the entire product space <math>\,X \times Y</math> and defined between any two user/object pairs <math>\,(x,y)</math> and <math>\,(x',y')</math> by:

[[File:c5p2.jpg]], where <math>\,c > 0</math> controls how the variances of the classifiers are constrained in comparison to the norm of each classifier results in classical machine learning algorithms (e.g. SVM).

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

File:C5p2.jpg

2010-12-16T22:45:51Z

Y24Sun:

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T22:36:58Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem, i.e. we will learn all of the tasks, the <math>f_{i}</math>'s, simultaneously. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Selecting the Hilbert-Schmidt penalty and the use of an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM).

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T22:32:04Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>\,i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning problem. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Selecting the Hilber-Schmidt penalty and the use of an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM).

There are cases in which we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T20:31:08Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of the available entries of the matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem, a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Selecting the Hilber-Schmit penalty and use of an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM).

There are cases that we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T20:24:57Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

If we assume that each pair of user/object is observed at most once, then the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. In addition, when the Dirac kernel is used for both users and objects, the users <math>\,{x_i}</math>'s and the objects <math>\,{y_i}</math>'s can be organized into <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> groups of identical data points, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of available entries of matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Selecting the Hilber-Schmit penalty and use of an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM).

There are cases that we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T19:57:05Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

When the Dirac kernel is used for both users and objects, we assume that each pair of user/object is observed at most once and the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix, where <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> are the numbers of groups of identical data points that <math>\mathbf{x}_i</math> and <math>\mathbf{y}_i</math> (user and objects) can be organized into, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of available entries of matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Selecting the Hilber-Schmit penalty and use of an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM).

There are cases that we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T19:55:42Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

When the Dirac kernel is used for both users and objects, we assume that each pair of user/object is observed at most once and thus the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix, where <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> are the numbers of groups of identical data points that <math>\mathbf{x}_i</math> and <math>\mathbf{y}_i</math> (user and objects) can be organized into, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of available entries of matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Selecting the Hilber-Schmit penalty and use of an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM).

There are cases that we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T19:41:03Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, and these two kernels are applied to users (with respect to objects) and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

We assume that each pair of user/object is observed at most once and the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> are the number of groups that identical data points of <math>\mathbf{x}_i</math> and <math>\mathbf{y}_i</math> (user and objects) can be organized to, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of available entries of matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Selecting the Hilber-Schmit penalty and use of an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM).

There are cases that we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T19:38:06Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, which are applied to pairs of users/objects and to attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

We assume that each pair of user/object is observed at most once and the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> are the number of groups that identical data points of <math>\mathbf{x}_i</math> and <math>\mathbf{y}_i</math> (user and objects) can be organized to, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of available entries of matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Selecting the Hilber-Schmit penalty and use of an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM).

There are cases that we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T19:36:46Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use [http://en.wikipedia.org/wiki/Dirac_delta_function Dirac] kernels and attribute kernels, which are between two users/objects and attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

We assume that each pair of user/object is observed at most once and the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> are the number of groups that identical data points of <math>\mathbf{x}_i</math> and <math>\mathbf{y}_i</math> (user and objects) can be organized to, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of available entries of matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Selecting the Hilber-Schmit penalty and use of an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM).

There are cases that we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T19:33:19Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, the choice of the kernel does not affect the algorithm, though it could easily affect the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use Dirac kernels and attribute kernels are between two users/objects and attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

We assume that each pair of user/object is observed at most once and the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> are the number of groups that identical data points of <math>\mathbf{x}_i</math> and <math>\mathbf{y}_i</math> (user and objects) can be organized to, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of available entries of matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Selecting the Hilber-Schmit penalty and use of an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM).

There are cases that we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T19:30:38Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be noted that, interestingly, which kernel is selected does not influence the algorithm, though it could easily influence the running time of the algorithm. The kernel should be selected based on the available data and the problem to be solved. The authors use Dirac kernels and attribute kernels are between two users/objects and attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

We assume that each pair of user/object is observed at most once and the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> are the number of groups that identical data points of <math>\mathbf{x}_i</math> and <math>\mathbf{y}_i</math> (user and objects) can be organized to, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of available entries of matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Selecting the Hilber-Schmit penalty and use of an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM).

There are cases that we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T19:25:19Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and the objects in their spaces. It should be selected based on the available data and the problem to be solved. The authors use Dirac kernels and attribute kernels are between two users/objects and attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

We assume that each pair of user/object is observed at most once and the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> are the number of groups that identical data points of <math>\mathbf{x}_i</math> and <math>\mathbf{y}_i</math> (user and objects) can be organized to, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of available entries of matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Selecting the Hilber-Schmit penalty and use of an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM).

There are cases that we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T19:23:22Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the [http://en.wikipedia.org/wiki/Embedding embedding] of the users and objects in their spaces. It should be selected based on the available data and the problem to be solved. The authors use Dirac kernels and attribute kernels are between two users/objects and attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

We assume that each pair of user/object is observed at most once and the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> are the number of groups that identical data points of <math>\mathbf{x}_i</math> and <math>\mathbf{y}_i</math> (user and objects) can be organized to, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of available entries of matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Selecting the Hilber-Schmit penalty and use of an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM).

There are cases that we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T19:20:32Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of the loss function (<math>\,l</math>) defines the empirical risk, and it depends on the type of data and the final objective of the algorithm. The choice of <math>\,\Omega(F)</math> defines the imposed constraint on the operator, e.g. the [http://en.wikipedia.org/wiki/Rank rank] or the [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, this choice affects the efficiency or the feasibility of the learning algorithm. The kernel choice determines the embeddings of the users and objects in their spaces. It should be selected based on the available data and the problem to be solved. The authors use Dirac kernels and attribute kernels are between two users/objects and attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

We assume that each pair of user/object is observed at most once and the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> are the number of groups that identical data points of <math>\mathbf{x}_i</math> and <math>\mathbf{y}_i</math> (user and objects) can be organized to, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of available entries of matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Selecting the Hilber-Schmit penalty and use of an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM).

There are cases that we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T19:18:19Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order. It should be noted that the value of this <math>\,d</math> has no upper bound.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of loss function (<math>l</math>) defines the empirical risk, and it depends on type of data and final objective of the algorithm. The choice of <math>\Omega(F)</math> defines the imposed constraint on the operator, e.g. [http://en.wikipedia.org/wiki/Rank rank] or [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, the this choice affects efficiency or feasibility of the learning algorithm. The kernel choice determines the embeddings of the users and objects in their spaces. It should be selected based on the available data and the problem to be solved. The authors use Dirac kernels and attribute kernels are between two users/objects and attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

We assume that each pair of user/object is observed at most once and the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> are the number of groups that identical data points of <math>\mathbf{x}_i</math> and <math>\mathbf{y}_i</math> (user and objects) can be organized to, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of available entries of matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Selecting the Hilber-Schmit penalty and use of an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM).

There are cases that we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T19:12:26Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\,\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of loss function (<math>l</math>) defines the empirical risk, and it depends on type of data and final objective of the algorithm. The choice of <math>\Omega(F)</math> defines the imposed constraint on the operator, e.g. [http://en.wikipedia.org/wiki/Rank rank] or [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, the this choice affects efficiency or feasibility of the learning algorithm. The kernel choice determines the embeddings of the users and objects in their spaces. It should be selected based on the available data and the problem to be solved. The authors use Dirac kernels and attribute kernels are between two users/objects and attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

We assume that each pair of user/object is observed at most once and the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> are the number of groups that identical data points of <math>\mathbf{x}_i</math> and <math>\mathbf{y}_i</math> (user and objects) can be organized to, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of available entries of matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Selecting the Hilber-Schmit penalty and use of an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM).

There are cases that we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />

a New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

2010-12-16T19:09:53Z

Y24Sun: /* Learning Compact Operators with Spectral Regularization */

==Introduction==

[http://en.wikipedia.org/wiki/Collaborative_filtering Collaborative filtering (CF)] aims to predict preferences of a set of users for objects such as movies, music, or any object based on previously purchased or rated objects by these users. The goal is to offer new objects to the users based on the predicted preferences. A good example of an application of collaborative filtering in the market place is the use of CF by the popular online shopping website Amazon.com for recommending related products to users of Amazon.com based on what these users have recently purchased from this website. Details of the use of CF by Amazon.com for this purpose is available [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1167344&userType=inst here].

[http://en.wikipedia.org/wiki/Regularization_%28mathematics%29 Regularization]-based CF methods have recently been attracting <ref name="srebro03">N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N.Mishra, editors,
Proceedings of the Twentieth International Conference on Machine Learning, pages 720–727.
AAAI Press, 2003. </ref>. These methods only incorporate the users preferences which can be given as a preference matrix in which the rows and columns represent users and objects, respectively. However, a major drawback of the currently-used regularization-based CF methods is that they do not take advantage of additional information, such as known attributes of each user, such as the users' gender and age, as well as the attributes of the object that are associated with the users, such as the authors and genres of books recently purchased by users. These additional information are often directly available at no extra cost and, from an intuitive standpoint, they might be very useful for guiding the inferencing of user preferences, in particular for users and their associated objects that have very few known ratings. In regularization-based CF methods, the problem is to infer the unknown entries based on the known ones.

In order to solve the problem, a low-rank matrix should be found that approximates the partially observed given preference matrix. This rank constraint is a regularization. Heuristics <ref name="srebro03"> </ref> and penalizing the predicted matrix by its [http://en.wikipedia.org/wiki/Matrix_norm trace norm] <ref name="srebro05">N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K.
Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336,
Cambridge, MA, 2005. MIT Press. </ref> (Jason D. M. Rennie provides the different equivalent ways of expressing the trace norm of a matrix [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf here]) are two approaches for solving the resulting optimization problem.

Abernethy et al.<ref name="self"> J. Abernethy, F. Bach, T. Evgeniou, J. P. Vert. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, Vol. 10(Mar):803--826, 2009</ref> present a generalized regularization-based approach for CF which learns the linear operators mapping users space to their desired objects space.

The novelty of this approach is the ability to incorporate additional information, such as different attributes of users/objects by means of kernel methods (details regarding them are available in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf lecture slides]). These attributes can help through predicting the unkown entries of the preference matrix. They also show that some multi-tasking learning methods are special cases of their general framework.

==Learning Compact Operators with Spectral Regularization==

In this section a mathematical formulation for a general CF problem with spectral
regularization (Lorenzo Rosasco provides details regarding it in his [http://www.mit.edu/~9.520/Classes/class07_spectral.pdf lecture slides]) is proposed. Given <math>\,N</math> observations <math>( \mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,..,N}</math> in (<math> \mathcal{X}\times \mathcal{Y}\times \mathbb{R}</math>), where <math>\,t_i</math> is the rating of user <math>\,x_i</math> for object <math>\,y_i</math>, we intend to find a function <math>f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}</math> that can be used in prediction of any user <math>\,x</math>'s rating for any object <math>\,y</math>. <math>\,x_i</math> and <math>\,y_i</math> correspond to <math>\,i</math>-th available rating.

In this paper, learning preference functions <math>\,f(.,.)</math>, or [http://en.wikipedia.org/wiki/Bilinear_map bilinear] of the form <math>f(\mathbf{x},\mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} </math> for some [http://mathworld.wolfram.com/CompactOperator.html compact operator] <math>\,F</math>, on <math>\mathcal{X} \times \mathcal{Y}</math>, is considered. The set of compact operators from <math> \mathcal{X}</math> to <math>\mathcal{Y}</math> is denoted by <math>\mathcal{B}_0(\mathcal{Y},\mathcal{X})</math>.

So, our problem is to answer this question: "Given a training set of ratings, how may we estimate a good compact operator <math>\,F</math> to predict future ratings?"<ref name="self"> </ref>.

The operator is defined as the solution to an [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] over <math>\mathcal{B}_0(\mathcal{Y,X})</math>. The loss function <math>\,l(t',t)</math> quantifies how good a prediction <math>\,t′ \in R</math> is, provided that the true value is <math>\,t \in R</math>. The choice of the loss function depends on the nature of the variable to be predicted and the problem. The data fitting term that is considered is equal to the [http://en.wikipedia.org/wiki/Empirical_risk_minimization empirical risk)(i.e. the mean loss incurred on the training set). This fitting term is:

<center> <math> R_N(F)=\frac{1}{N} \sum^N_{i=1}{l(\langle \mathbf{x}_i,F\mathbf{y}_i \rangle_{\mathcal{X}},t_i)} </math> </center>

The regularization term <math>\Omega(F)</math> controls the complexity of the desired operator, and it is defined as the following spectral penalty function:

<center> <math>\Omega(F)=\sum^d_{i=1}{s_{i}(\sigma_i(F))}</math> </center>
, where <math>\,s_i</math> is a non-decreasing [http://en.wikipedia.org/wiki/Penalty_method penalty function] and <math>\,s(0)= 0</math>, and <math>(\,\sigma_i(F))_{i=1,...,d}</math> are the <math>\,d</math> singular values of <math>\,F</math> in decreasing order.

The objective function of this optimization problem balances the data fitting term <math>\,R_N(F)</math> with a regularization term <math>\,\Omega(F)</math>.

So, an operator <math>\hat{F}</math> should be found that solves this optimization problem:

<center> <math>\hat{F}\in \textrm{argmin}_{F\in\mathcal{B}_{0}(\mathcal{Y,X})} R_N(F)+\lambda\Omega(F) </math> </center>
where the parameter <math>\lambda \in \mathbb{R}_{+}</math> controls the trade-off between fitting and regularization.

The authors show that the above general problem can result in a desired algorithm by selecting an appropriate loss function, kernels or spectral penalty term. For instance, the choice of loss function (<math>l</math>) defines the empirical risk, and it depends on type of data and final objective of the algorithm. The choice of <math>\Omega(F)</math> defines the imposed constraint on the operator, e.g. [http://en.wikipedia.org/wiki/Rank rank] or [http://en.wikipedia.org/wiki/Trace_norm trace norm] constraint. Also, the this choice affects efficiency or feasibility of the learning algorithm. The kernel choice determines the embeddings of the users and objects in their spaces. It should be selected based on the available data and the problem to be solved. The authors use Dirac kernels and attribute kernels are between two users/objects and attributes, respectively. They also state that choice of kernel can lead to new methods for problem such as matrix completion, multi-task learning, and pairwise learning.

We assume that each pair of user/object is observed at most once and the data can be considered as a <math>n_{\mathcal{X}} \times n_{\mathcal{Y}}</math> incomplete matrix. <math>n_{\mathcal{X}}</math> and <math>n_{\mathcal{Y}}</math> are the number of groups that identical data points of <math>\mathbf{x}_i</math> and <math>\mathbf{y}_i</math> (user and objects) can be organized to, respectively. In the case of completing an incomplete matrix (matrix completion), finding a low-rank approximation of available entries of matrix is a popular strategy, for which, the loss function of square loss error can be combined to rank penalty constraint. To overcome the non-convexity of the problem a convex spectral function can be considered. For instance, in the case of binary preferences, hinge loss function with trace norm penalty can be selected. This can be written as a [http://en.wikipedia.org/wiki/Semidefinite_programming semi-definite program]. (maximum margin matrix factorization (MMMF) by <ref name="srebro05"></ref>).

If we access attributes only for objects <math>\mathbf{y}</math>. For a finite number of users, the goal is to estimate separate function <math>f_{i}(\mathbf{y})</math> for each user <math>i</math>. Each estimation is a learning task. Therefore, this problem will be a multi-task learning. In this case, attribute kernel is used for objects and Dirac kernel for the users, and the choice of loss function depends on what task exactly is solved. selecting a rank penalty function and replace it with trace norm penalty function to achieve convexity results in a similar approach proposed by <ref name="argy08">A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.</ref> for multi-task learning.

If we access attributes for both users and objects, we can apply the attributes kernels to each. Selecting the Hilber-Schmit penalty and use of an appropriate loss function (e.g. hinge loss function) results in classical machine learning algorithms (e.g. SVM).

There are cases that we access attributes of users and objects, and want to use them to guide us through better predictions, but the attributes are not informative enough. For instance, we access gender and age of users, but want to infer predictions based on different preferences of users with the same age and gender. So, we may use attribute kernel to apply the attributes of users in inference, but we also should apply Dirac kernel to perform inference for the different users with same attributes. The authors use the following convex combination of Dirac and attributes kernel <ref name="aber06"> J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.</ref>:

<center><math>
\left\{
\begin{array}{lr}
k^\mathcal{X} = \eta k^{\mathcal{X}}_A+(1-\eta)k^{\mathcal{X}}_D\\
k^\mathcal{Y} = \zeta k^{\mathcal{Y}}_A+(1-\zeta)k^{\mathcal{Y}}_D
\end{array}
\right.
</math></center>
where <math>0 \leq \eta \leq 1</math> and <math>0 \leq \zeta \leq 1</math>

==Optimization Problem==

Now the question is how to solve the general optimization problem proposed by the authors.
The biggest problem is that the optimization problem can be of infinite dimension.
So, they show that the proposed optimization problem can be modified to a finite-dimensional equivalent, and there are algorithms to solve it. Note that for optimization, we have two strategies, following the same two strategies in regular
kernel methods: using the primal problem of the actual
dimension of the underlying data or using the dual problem of dimension N (the number
of ratings). The choice between those two formulations is problem dependent.

The authors first show that in the case of using a [http://mathworld.wolfram.com/Hilbert-SchmidtNorm.html Hilbert-Schmidt norm] as the penalty function (<math>\Omega(F)</math>), the set of <math> {F\in \mathcal{B}_{0}(\mathcal{Y,X}): \Omega(F) < \infty }</math> in a set of Hilbert-Schmidt operators, and the optimization problem can be written as:<center> <math> \min_{f\in\mathcal{H}_{\otimes}} R_{N}(f)+\lambda\parallel f \parallel^2_\otimes </math> </center>
It can be shown by representer theorem that the solution to this problem lives in the linear span of the training data. <ref name="arons1950">N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.</ref> <ref name="schol01">B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416–426, Berlin/Heidelberg, 2001. Springer.</ref>

Authors state a generalized representer theorem which shows the reformulation of the problem to a finite-dimensional problem is valid in the case of using a general spectral function <math>\Omega(F)</math>. This theorem shows that when a spectral penalty function controls the complexity of the compact operators, a solution to the problem can be found in the space of <math>\mathcal{X}_N \otimes \mathcal{Y}_N</math> which is finite-dimensional(set of <math>m_{\mathcal{X}} \times m_{\mathcal{Y}}</math> matrices). In other words, this results in finding <math>F</math> such that: <math>F=\sum^{m_{\mathcal{X}}}_{i=1}{\sum^{m_{\mathcal{Y}}}_{j=1}{{\alpha}_{ij}\mathbf{u}_{i}\otimes\mathbf{v}_{j}}}</math>, and <math>\mathbf{u}_i</math>s and <math>\mathbf{v}_i</math>s are orthonormal bases of <math>\mathcal{X}_N</math> and <math>\mathcal{Y}_N</math>. The coefficients <math>\alpha</math> can be found by solving the following finite-dimensional optimization problem:
<center> <math> min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} R_{N}(\textrm{diag}(X\alpha Y^{T})) + \lambda \Omega(\alpha) </math> </center>.

Assuming the loss function is [http://en.wikipedia.org/wiki/Convex convex] which is usually held in practice, authors derive the [http://en.wikipedia.org/wiki/Convex convex] [http://en.wikipedia.org/wiki/Dual_problem dual problem] to solve the optimization problem. <math>\psi_{i}(v_{i}) = l(v_i,t_i)</math> is the loss of predicting <math>v_i</math> for the <math>i</math>-th data point. By applying the representer theorem, the primal form of the [http://en.wikipedia.org/wiki/Optimization_problem optimization problem] will look like:

<center> <math>min_{\alpha\in \mathbb{R}^{m_{\mathcal{X}}\times m_{\mathcal{Y}}}} \sum^N_{i=1}{\psi_{i}((X \alpha Y^{T})_{ii}) } + \lambda \Omega(\alpha)</math> </center>.

The dual problem can be obtained as maximizing the following expression:

<center> <math> - \sum^N_{i=1}{\psi^*_{i}(\beta_i)}-\lambda \Omega^{*}(- \frac{1}{\lambda}X^{T} \textrm{Diag}(\beta)Y) </math> </center>

where <math>\beta \in \mathbb{R}^N</math> is the Lagrangian multiplier, and <math>\psi^{*}_i</math> and <math>\Omega^{*}</math> are Fenchel conjugates of <math>\psi_i</math> and <math>\Omega</math>, respectively.
For finding <math>\alpha</math>, first the optimal dual variable <math>\beta</math> is found and then <math>\alpha</math> is among the Fenchel duals of <math>-\frac{1}{\lambda}X^{T}\textrm{Diag}(\beta)Y</math>.

Choosing either the primal or the dual formulation depends on the problem. For instance, in CF problem, the number of variables of the primal and dual formulation is equal to <math> m_{\mathcal{X}} m_{\mathcal{Y}}</math> and <math>N</math>, respectively, and based on the available ratings <math>N</math> compared to <math>m_{\mathcal{X}} m_{\mathcal{Y}}</math>, one of the primal or dual programs will be chosen.

==Experiments==

Experiments have been performed to show that the incorporating the extra information from users or objects to the CF approach can lead to more accurate predictions. The first series of experiments were done on synthetic data which was created by first sampling i.i.d. multivariate features for <math>x</math> and <math>y</math> (both of dimension 6). Next, they sampled <math>z</math> from a bilinear form in <math>x</math> and <math>y</math> plus noise, and finally, restricting the observed features to only 3 for both <math>x</math> and <math>y</math>. The comparison results between two cases of applying trace and Frobenius norm spectral penalty (in the same rank constrained problem) showed that the trace norm perform slightly better. Also, the best performace in both cases is achieved when there has been a balance between using the Dirac and attributes kernels. It would be in the middle of the square for which the corners are <math>(\eta=0,\zeta=0)</math>, <math>(\eta=0,\zeta=1)</math>,<math>(\eta=1,\zeta=0)</math> , and <math>(\eta=1,\zeta=1)</math>. This is shown in the figure below taken from Abernathy et al.

[[File:Abernathy.jpg|center]]

The second series of experiments were performed on the MovieLens dataset which includes rating of 1682 movies by 943 users. Rates have been selected from the set of numbers 1 to 5. The dataset also includes attribute of genre for movies and attributes of age, gender, and occupation for users. Root mean squared error (RMSE) has been measured in 10-fold croos validation for the expriments on this dataset. Again the best balance between the attribute and Dirac kernels is obtainable inside the mentioned square.

==Conclusion==
Authors presented a generalized formulation for the matrix completion problem. This formulation aims to find a linear compact operator between two Hilbert Schmidt spaces. Also, they showed that their generalized notion includes different approaches as its special cases. The experiments focused on the CF application and indicated that incorporating attribute information improves the accuracy of the predictions.

An future research could be to explore further the multi-task learning algorithms. To study the possibility to derive real-time implementations that may better fit the need for large-scale applications where training data is growing larger and larger. On the theoretical side, a better understanding of the effects of norm and rank regularizations and their interaction would be of an interesting topic.

==References==
<references />