consistency of Trace Norm Minimization

From statwiki
Jump to navigation Jump to search

UNDER CONSTRUCTION

Introduction


In many applications, including collaborative filtering and multi-task learning, the main objective is to estimate a low rank rectangular matrix. More formally, this problem can be written as follows

[math]\displaystyle{ \min_{W\in \mathbb{R}^{p \times q}} \frac{1}{2n} \sum_{i=1}^n (z_i - \textrm{tr}W^TM_i)^2 \;\; \textrm{subject} \; \textrm{to} \;\; \textrm{rank}(W) \leq \delta, }[/math]

where [math]\displaystyle{ \,(z_i,M_i), }[/math] [math]\displaystyle{ i = 1, \ldots, n, }[/math] are given observations. Unfortunately, this is a difficult problem to solve because of the rank constraint. If the rank function can be replaced by a convex function then the optimization problem becomes much easier to solve. However, this convex function must still generate low rank solutions. Given these two objectives, many authors have proposed using the trace norm as an alternative to the rank function. The trace norm, or nuclear norm, of a matrix, is the sum of its singular values and it is the convex envelope (the largest lower bounding convex function) of the rank function over the unit ball of the spectral norm. Since the trace norm is the convex envelope of the rank function, adding the trace norm as a regularization term to the matrix approximation problem should still give low rank solutions. It is then natural to explore what happens when the data is actually generated by a low rank matrix. Under what conditions will the trace norm regularization problem accurately approximate the underlying matrix and under what conditions will the rank be accurately approximated? The author of this paper<ref name="B2008a">F. Bach. Consistency of Trace Norm Minimization. Journal of Machine Learning Research, 8:1019-1048, 2008.</ref> explores these questions and provides necessary and sufficient conditions for rank consistency.

The use of the trace norm regularization term in place of the rank function in approximating low rank matrices is similar to the use of the [math]\displaystyle{ l_1 }[/math]-norm regularization term in place of the [math]\displaystyle{ l_0 }[/math]-norm in approximating sparse vectors. In this case, the main objective is to estimate vectors with low cardinality. However, optimization problems with [math]\displaystyle{ l_0 }[/math]-norm constraints are hard to solve. Instead the [math]\displaystyle{ l_1 }[/math]-norm is often used as a regularization term to induce sparse loadings since the [math]\displaystyle{ l_1 }[/math]-norm is the convex envelop of the [math]\displaystyle{ l_0 }[/math] norm on the unit ball. For least squares regression, this regularization scheme is known as the Lasso<ref name="T1994">R. Tibshirani. Regression shirnkage and selection via the lasso. Journal Royal Statistics, 58(1): 267-299, 1994.</ref>. Once again we are interested in whether this procedure can consistently estimate sparse vectors and their sparsity patterns. Recent work by Yuan and Lin<ref name="YL2007">M. Yuan and Y. Lin. On the non-negative garrotte estimator. Journal of The Royal Statistical Society Series B, 69(2): 143-161, 2007.</ref>, Zhao and You<ref name="ZY2006">P. Zhao and B. Yu. On model selcetion consistency of lasso. Journal of Machine Learning Research, 7:2541-2563, 2006.</ref> and Zou<ref name="Z2006">H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418-1429, December 2006.</ref> look at this question for the Lasso procedure. Similar work by Bach<ref name="B2008b">F.R. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179-1225, 2009.</ref> looks at this question for the group Lasso procedure. Given the similarities between the low rank matrix approximation problem and the sparse vector approximation problem, the author compares the consistency results he obtains with the corresponding results for the Lasso and the group Lasso. In fact, he shows that the Lasso and group Lasso procedures are embedded in the trace norm regularization procedure and finds that their consistency conditions are merely extensions of the results obtained for the Lasso and group Lasso. The author also considers simulated data to test the consistency results.