# consistency of Trace Norm Minimization

## Contents

## Introduction

In many applications, including collaborative filtering, multi-task learning, and classification, the main objective is to estimate a low-rank rectangular matrix. More formally, this problem can be written as follows:

where [math]\,(z_i,M_i),[/math] [math]i = 1, \ldots, n,[/math] are given observations. Unfortunately, this is a difficult problem to solve because of the rank constraint.
In fact, the rank-minimization problem is generally NP-hard<ref name="fazel2004">Fazel, M. and Hindi, H. and Boyd, S. Rank minimization and applications in system theory. Proceedings of the 2004 IEEE American Control Conference (4):3273--3278, 2004.</ref>.
If the rank function can be replaced by a convex function then the optimization problem becomes much easier to solve. However, this convex function must still generate low rank solutions. Given these two objectives, many authors have proposed using the trace norm as an alternative to the rank function.
In other words, because using the rank-function results in an NP-hard problem, we resort to the trace norm as a proxy measure for the rank.
The trace norm, or nuclear norm, of a matrix, is the sum of its singular values and it is the convex envelope (the largest lower bounding convex function) of the rank function over the unit ball of the spectral norm. Since the trace norm is the convex envelope of the rank function, adding the trace norm as a regularization term to the matrix approximation problem should still give low rank solutions. It is then natural to explore what happens when the data is actually generated by a low rank matrix. Under what conditions will the trace norm regularization problem accurately approximate the underlying matrix and under what conditions will the rank be accurately approximated? The author of this paper<ref name="B2008a">F. Bach. Consistency of Trace Norm Minimization. *Journal of Machine Learning Research*, 8:1019-1048, 2008.</ref> explores these questions and provides necessary and sufficient conditions for rank consistency.

The use of the trace norm regularization term in place of the rank function in approximating low-rank matrices is similar to the use of the [math]\,l_1[/math]-norm (more details here) regularization term in place of the [math]\,l_0[/math]-norm (more details here) in approximating sparse vectors. In this case, the main objective is to estimate vectors with low cardinality. However, optimization problems with [math]\,l_0[/math]-norm constraints are hard to solve. Instead the [math]\,l_1[/math]-norm is often used as a regularization term to induce sparse loadings (vector coefficients), since the [math]\,l_1[/math]-norm is the convex envelope of the [math]\,l_0[/math]-norm on the unit ball. For least squares regression, this regularization scheme is known as the LASSO<ref name="T1994">R. Tibshirani. Regression shrinkage and selection via the lasso. *Journal Royal Statistics*, 58(1): 267-299, 1994.</ref>. Once again we are interested in whether this procedure can consistently estimate sparse vectors and their sparsity patterns. Recent work by Yuan and Lin<ref name="YL2007">M. Yuan and Y. Lin. On the non-negative garrotte estimator. *Journal of The Royal Statistical Society Series B*, 69(2): 143-161, 2007.</ref>, Zhao and You<ref name="ZY2006">P. Zhao and B. Yu. On model selection consistency of lasso. *Journal of Machine Learning Research*, 7:2541-2563, 2006.</ref> and Zou<ref name="Z2006">H. Zou. The adaptive lasso and its oracle properties. *Journal of the American Statistical Association*, 101:1418-1429, December 2006.</ref> looked at this question for the LASSO procedure. Similar work by Bach<ref name="B2008b">F.R. Bach. Consistency of the group lasso and multiple kernel learning. *Journal of Machine Learning Research, 9:1179-1225, 2009.</ref> looked at this question for the group Lasso procedure (more details regarding it can be found in Friedman *et al.
's paper). Given the similarities between the low rank matrix approximation problem and the sparse vector approximation problem (one such similarity is detailed in Shen *et al.'*s paper titled *Sparse principal component analysis via regularized low rank matrix approximation*), the author compared the consistency results that he obtained with the corresponding results for the LASSO and the group LASSO. In fact, he showed that the LASSO and the group LASSO procedures are in fact embedded in the trace norm regularization procedure itself, and he found that the consistency conditions are merely extensions of the results obtained for the LASSO and the group LASSO. The author also considered simulated data to test the consistency results. In addition, an adaptive version is developed by the authors that is capable of providing [math] n^{-\frac{1}{2}}[/math]-consistency and rank consistency when the necessary conditions are not satisfied.

## Notation

Before outlining the major results of this paper, we must first introduce some notation. Throughout the remainder of this summary, all vectors will be denoted by lowercase letters and all matrices by uppercase letters. For vectors [math]x \in \mathbb{R}^d[/math], let [math]\|\cdot\|[/math] denote the Euclidean norm, that is [math]\|x\| = \|x\|_2 = (x^Tx)^{1/2}[/math]. On rectangular matrices [math]M \in \mathbb{R}^{p \times q}[/math], consider the following three norms, where [math]\,\sigma[/math] is the vector of singular values.

- The spectral norm: The largest singular value, [math]\| M \|_2 = \|\mathbf{\sigma}\|_\infty[/math].
- The Frobenius norm: The [math]\,l_2[/math]-norm of singular values, [math]\| M \|_F = \|\mathbf{\sigma}\|_2 = (\textrm{tr}(M^TM))^{1/2}[/math].
- The trace norm: The sum of singular values, [math]\| M \|_* = \|\mathbf{\sigma}\|_1[/math].

Finally, we introduce the following standard asymptotic notation. A random variable [math]\,Z_n[/math] is said to be of order [math]\,O_p(a_n)[/math] if for any [math]\,\eta \gt 0[/math], there exists a [math]\,M \gt 0[/math] such that [math]\sup_nP(|Z_n| \gt Ma_n) \lt \eta[/math]. [math]\,Z_n[/math] is said to be of order [math]\,o_p(a_n)[/math] if [math]\,\frac{Z_n}{a_n}[/math] converges to zero in probability. In other words, [math]\,Z_n[/math] is of order [math]\,o_p(a_n)[/math] if [math]P(|Z_n| \geq \eta a_n) \rightarrow 0[/math] as [math]n \rightarrow \infty[/math] for any [math]\,\eta \gt 0[/math].

## Problem Formulation

Suppose we are given [math]\,n[/math] observations [math]\,(M_i,z_i)[/math], [math]i = 1,\ldots, n[/math] and we want to predict each [math]\,z_i[/math] as a linear function of [math]M_i \in \mathbb{R}^{p \times q}[/math] using as few entries of [math]\,M_i[/math] as possible. This optimization problem, which we will refer to as the trace norm matrix approximation problem, can be written formally as

, where the trace norm regularization term is used as a heuristic to minimize the rank of [math]\,W[/math]. Note that, if [math]M_i = \textrm{Diag}(x_i) \in \mathbb{R}^{m \times m}[/math] then the solution, [math]\,\hat{W}[/math], will be a diagonal matrix and in this case [math]\,\|W\|_*[/math] reduces to the [math]\,l_1[/math] norm of the diagonal. The optimization can be rewritten as

which is the LASSO formulation. So, the consistency results for the trace norm estimate should be extensions of the LASSO consistency results. In fact, the author showed that the trace norm consistency results are the same as the LASSO consistency results when the [math]\,M_i[/math]'s are diagonal. The group LASSO is another special case embedded in the trace norm matrix approximation problem. If [math]x_{ij} \in \mathbb{R}^{d_j}[/math] for [math]j = 1,\ldots,m,i = 1, \ldots, n[/math] and each [math]M_{i} \in \mathbb{R}^{\sum_{j=1}^md_j \times m}[/math] is a block diagonal matrix, then the solution will have the same structure. The resulting simplification of the trace norm matrix approximation problem is the group LASSO formulation. Once again, the author was able to show how his consistency results matched the previous group LASSO consistency results.

## Assumptions

Before we can derive the consistency results for [math]\hat{W}[/math], the solution to the matrix approximation problem, we must first make several assumptions about the sampling distributions of each [math]M_i \in \mathbb{R}^{p \times q}[/math]. Let [math]\hat\Sigma_{mm} = \frac{1}{n}\textrm{vec}(M_i)\textrm{vec}({M_i})^T \in \mathbb{R}^{pq \times pq}[/math], then

- Suppose the [math]\,n[/math] values of [math]\,z_i[/math] are [math]i.i.d.[/math], the [math]\,M_i[/math], [math]i = 1,\ldots,n[/math] are given and there exists a [math]\textbf{W} \in \mathbb{R}^{p \times q}[/math] such that for all [math]\,i[/math], [math]\mathbb{E}(z_i | M_1,\ldots, M_n) = \textrm{tr}\textbf{W}^TM_i[/math] and [math]\textrm{var}(z_i | M_1, \ldots, M_n) = \sigma^2[/math] where [math]\,\sigma^2[/math] is a strictly positive constant. [math]\textbf{W}[/math] is not equal to zero and does not have full rank. So, [math]\,rank(\textbf{W}) = \textbf{r} \lt \min\{p,q\}[/math].
- There exists an invertible matrix [math]\Sigma_{mm} \in \mathbb{R}^{pq \times pq}[/math] such that [math]\mathbb{E}\|\hat{\Sigma}_{mm}-\Sigma_{mm}\|_F^2 = O(\xi_n^2)[/math] for a certain sequence [math]\,\xi_n[/math] that tends to zero. In other words, the expected difference between [math]\hat{\Sigma}_{mm}[/math] and [math]\,\Sigma_{mm}[/math] should go to zero as [math]n \rightarrow \infty[/math] and [math]\hat{\Sigma}_{mm}=\frac{1}{n}\sum_{i=1}^n Vec(M_i)Vec(M_i)^T[/math]
- The random variable [math]n^{-1/2}\sum_{i=1}^n\epsilon_i \textrm{vec}(M_i)[/math] is converging in distribution to a normal distribution with mean zero and covariance matrix [math]\,\sigma^2\Sigma_{mm}[/math], where [math]\epsilon_i = z_i - \textrm{tr}\textbf{W}^TM_i[/math].

If the first assumption holds then, given the input matrices [math]\,M_i[/math], [math]i = 1,\ldots, n,[/math], we would have a linear prediction model in which the loading matrix (matrix of the coefficients) [math]\textbf{W}[/math] is non-trivial and rank-deficient (more details regarding it are available in Bindel's lecture notes). This is the matrix we are interested in estimating. Although the second and third conditions seem restrictive, there are two cases where they are naturally satisfied. Assume the first condition holds in both cases.

- Case 1: If the matrices [math]\,M_i[/math] are sampled i.i.d., [math]\,z[/math] and [math]\,M[/math] have finite fourth order moments and [math]\Sigma_{mm}^{-1}[/math] exists, then conditions 2 and 3 will be satisfied with [math]\,\xi_n = n^{-1/2}[/math].
- Case 2: This situation is representative of collaborative filtering. Suppose for each [math]\,i[/math], [math]M_i = x_iy_i^T[/math] where [math]\,x_i[/math] and [math]\,y_i[/math] are independent. If the values of [math]\,x_i[/math] and [math]\,y_i[/math] are both sampled [math]i.i.d[/math] from distributions with finite fourth order moments and invertible second order moment matrices ( [math]\,\Sigma_{xx}[/math] and [math]\,\Sigma_{yy}[/math] ) and if the [math]\,n[/math] values of [math]\,M_i[/math] are chosen uniformly from the [math]\,n_x n_y[/math] possible pairs of [math]\,x_i[/math] and [math]\,y_i[/math], then conditions 2 and 3 are also satisfied with [math]\xi_n = n^{-1/2}+n_x^{-1/2}+n_y^{-1/2}[/math].

## Optimality Conditions

From the expression of the subdifferential (details regarding it are available in Kusraev *et al.'*s paper ) of the trace norm in Proposition 21 (Appendix B) of the author's paper, the author identified the optimality condition for the original trace norm matrix approximation problem ( which is [math]\,(1)[/math] above ), which the author used constantly in his paper. This optimality condition is given as follows:

*Proposition*: The matrix [math]W[/math] with singular value decomposition [math]W = U Diag(s)V^T[/math] (with strictly
positive singular values [math]s[/math]) is optimal for the problem in Eq. (1) if and only if:

with [math]U^TN=0,NV=0[/math] and [math]\|N\|_2\leq\lambda_n[/math].

This implies notably that [math]\hat{\Sigma}_{mm}W-\hat{\Sigma}_{mz}[/math] and [math]W[/math] have simultaneous singular value decompositions, and the largest singular values are less than [math]\lambda[/math]It should be noted that, in the LASSO case mentioned above, the usual optimality conditions would be obtained.

## Consistency Results

The author considered two types of consistency for [math]\hat{W}[/math] defined as:

- Regular consistency: The estimate [math]\hat{W}[/math] is consistent if [math]P(\|\hat{W}-\textbf{W}\| \geq \epsilon) \rightarrow 0[/math] as [math]n \rightarrow \infty[/math] for all [math]\,\epsilon \gt 0[/math].
- Rank consistency: The estimate [math]\hat{W}[/math] is rank consistent if [math]P(\textrm{rank}(\hat{W})\neq \textrm{rank}(\textbf{W})) \rightarrow 0[/math] as [math]n \rightarrow \infty[/math].

Assuming all three conditions from the previous section hold, the author obtained the following results for the consistency of [math]\hat{W}[/math].

- If [math]\,\lambda_n[/math] does not tend to zero, then the trace norm estimate [math]\hat{W}[/math] is not consistent.
- [math]P(\|\hat{W}-\textbf{W}\| \geq \epsilon) \rightarrow l \neq 0[/math] as [math]n \rightarrow \infty[/math].

- If [math]\,\lambda_n[/math] tends to zero faster than [math]\,n^{-1/2}[/math], then the estimate is consistent and its error is [math]\,O_p(n^{-1/2})[/math] while it is not rank-consistent with probability tending to one.
- [math]P(\|\hat{W}-\textbf{W}\| \geq \epsilon) \rightarrow 0[/math] as [math]n \rightarrow \infty[/math].
- [math]P(\textrm{rank}(\hat{W})\neq \textrm{rank}(\textbf{W})) \rightarrow 1[/math] as [math]n \rightarrow \infty[/math]

- If [math]\,\lambda_n[/math] tends to zero exactly at rate [math]\,n^{-1/2}[/math], then the estimate is consistent and its error is [math]\,O_p(n^{-1/2})[/math] but the probability of estimating the correct rank is converging to a limit in [math]\,(0,1)[/math].
- [math]P(\|\hat{W}-\textbf{W}\| \geq \epsilon) \rightarrow 0[/math] as [math]n \rightarrow \infty[/math].
- [math]P(\textrm{rank}(\hat{W})\neq \textrm{rank}(\textbf{W})) \rightarrow l \in (0,1)[/math] as [math]n \rightarrow \infty[/math]

- If [math]\,\lambda_n[/math] tends to zero more slowly than [math]\,n^{-1/2}[/math], then the estimate is consistent and its rank-consistency depends on specific consistency conditions.
- [math]P(\|\hat{W}-\textbf{W}\| \geq \epsilon) \rightarrow 0[/math] as [math]n \rightarrow \infty[/math].
- [math]P(\textrm{rank}(\hat{W})\neq \textrm{rank}(\textbf{W})) \rightarrow \; 0[/math] as [math]n \rightarrow \infty[/math] if certain conditions hold.

So, if [math]\,\lambda_n[/math] tends to zero more slowly than [math]\,n^{-1/2}[/math], then there would be certain conditions under which [math]\hat{W}[/math] will be rank consistent. We can take a closer look at what these conditions are. First we must introduce some new notation. Let [math]\textbf{W} = \textbf{U}\textrm{Diag}(\textbf{s})\textbf{V}^T[/math] be the singular value decomposition of [math]\textbf{W}[/math] where [math]\textbf{U} \in \mathbb{R}^{p \times \textbf{r}}[/math], [math]\textbf{V} \in \mathbb{R}^{q \times \textbf{r}}[/math] and [math]\textbf{r} \lt \min\{p,q\}[/math] is the rank of [math]\textbf{W}[/math]. Also, let [math]\textbf{U}_\perp \in \mathbb{R}^{p \times (p-\textbf{r})}[/math], and [math]\textbf{V}_\perp \in \mathbb{R}^{q \times (q-\textbf{r})}[/math] be the orthogonal complements of [math]\textbf{U}[/math] and [math]\textbf{V}[/math]. Finally, defining the matrix [math]\Lambda \in \mathbb{R}^{(p-r) \times (q-r)}[/math] as

, we then have the following necessary and sufficient conditions for the rank consistency of [math]\hat{W}:[/math]

- Necessary condition: [math]\|\Lambda\|_2 \leq 1[/math].
- Sufficient condition: [math]\|\Lambda\|_2 \lt 1[/math].

Note that if [math]\Sigma_{mm}[/math] is proportional to identity, they are always satisfied because then [math]\Lambda=0[/math].

## Adaptive Version

Suppose we would like a rank-consistent estimator, [math]\hat{W}[/math], without any conditions such as those derived in the previous section (i.e. [math]\|\Lambda\|_2 \leq 1[/math]). We can get such an estimator by modifying the trace norm matrix approximation problem. We begin by considering the least-squares estimate [math]\textrm{vec}(\hat{W}_{LS}) = \hat{\Sigma}_{mm}^{-1}\textrm{vec}(\hat{\Sigma}_{Mz})[/math] where [math]\hat{\Sigma}_{Mz} = \frac{1}{n}\sum_iz_iM_i[/math]. Consider the full singular value decomposition of [math]\hat{W}_{LS} = U_{LS}\textrm{Diag}(s_{LS})V_{LS}^T[/math], where [math]\,U_{LS}[/math] and [math]\,V_{LS}[/math] are orthogonal square matrices and the matrix [math]\,\textrm{Diag}(s_{LS})[/math] is completed by setting the [math]\,r - \min\{p,q\}[/math] singular values in [math]\,s_{LS} \in \mathbb{R}^{\min\{p,q\}}[/math] to [math]\,n^{-1/2}[/math]. For [math]\gamma \in (0,1][/math], define

Then, we can replace [math]\| W \|_*[/math] by [math]\|AWB\|_*[/math] in the trace norm matrix approximation problem to get the adaptive trace norm matrix approximation problem. This same methodology is applied in Zou<ref name="Z2006"/> to derive an adaptive version of the LASSO. Our new optimization problem is

, and the global minimum of this problem, [math]\hat{W}[/math], is both consistent and rank consistent provided that [math]\gamma \in (0,1][/math], [math]n^{1/2}\lambda_n \rightarrow 0[/math], [math]\lambda_nn^{1/2+\gamma/2} \rightarrow \infty [/math] and there is no longer any dependency on the matrix [math]\,\Lambda_2[/math] for rank consistency.

## Simulations

The author performed some simple experiments to demonstrate the consistency results. In these experiments, he let [math]\,X[/math] and [math]\,Y[/math] be i.i.d data from a normal distribution and let [math]\textbf{W}[/math] be a low rank matrix selected at random. He then generated [math]Z = \textrm{diag}(X^T\textbf{W}Y) + \epsilon[/math], where [math]\,\epsilon \sim N(0, \sigma)[/math]. For all of the experiments, [math]\textbf{r} = 2[/math] and [math]\,p = q = 4[/math], while the sample size [math]\,n[/math] varied. The following figure (taken from the author's paper) shows the singular values for [math]\hat{W}[/math] plotted against the true singular values indicated by the dotted lines. Both the consistent case ([math]\|\Lambda\|_2 = 0.49 \lt 1[/math]) and the inconsistent case ([math]\|\Lambda\|_2 = 4.78 \gt 1[/math]) were considered and both the regular and adaptive trace norm penalization terms were used, where [math]\,\gamma = 1/2[/math] and [math]\,\gamma = 1[/math] were considered for the adaptive version. The value of [math]\,n[/math] was [math]\,10^3[/math]. The estimate was consistent if it was able to correctly estimate the singular values and it was rank consistent if the rank was correct, which was determined by the cardinality of the singular values. In the consistent case, in all three versions, the singular values and the rank were correctly estimated for a range of [math]\,\lambda[/math] values, as was expected from the consistency results. From the diagram we can see that the range of [math]\,\lambda[/math] values for which the rank was correctly estimated was much narrower for the regularized version versus the adaptive versions. The results from the inconsistent case were also reflective of the consistency results.

As another test to illustrate the consistency results, the author used the rank-consistent case from the first experiment and for [math]\,n = 10^2, 10^3, 10^4[/math] and [math]\,10^5[/math] computed the paths from [math]\,200[/math] replications. The following figure, also taken from the author's paper, shows the results, where the plots on the left are the proportion of estimates that correctly estimated the rank while the plots on the right are the logarithms of the average root mean squared estimation errors.

These simulations can be reproduced using MATLAB code available from http://www.di.ens.fr/~fbach/tracenorm.

## Further Reading

Searching for low-rank representations of data, (the field of dimensionality reduction), has broad applications. The trace norm is commonly used as proxy for the rank operator in dimensionality reduction, since the rank operator results in an NP-hard problem; so, the trace norm, too, has broad applications. But how can we be sure that switching the rank operator with the trace norm will guarantee that we will get the results that we wish (ie rank minimization)? For more information about the conditions for when the trace norm is an appropriate proxy, see <ref name="recht2010">Null Space Conditions and Thresholds for Rank Minimization (full and updated version of "Necessary and Sufficient Conditions for Success of the Nuclear Norm Heuristic for Rank Minimization"). Benjamin Recht, Weiyu Xu, and Babak Hassibi. To appear in Mathematical Programming Revised, 2010.</ref>

## References

<references />