Matrix Completion with Noise
==Introduction==


In many modern problems, we may face a situation where only a few entries of a data matrix are observed, but our task depends heavily on the accurate recovery of the original matrix. We are curious to find out if this accurate recovery is possible and, if so, the accuracy at which it can be performed.  


In the current paper <ref name="self"> E. J. Candès and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 2009. </ref>, Candès and Plan discuss these questions. They review the recent literature on recovering a low-rank matrix from an almost minimal set of entries by solving a simple nuclear-norm minimization problem.   


They also present results indicating that matrix completion, i.e. recovery of the original unknown matrix, is provably accurate even when a small amount of noise is present and corrupts the few observed entries. The error of the recovery task is proportional to the noise level when the number of noisy samples is about <math>nr\log^{2}{n}</math>, in which <math>n</math> and <math>r</math> are the matrix dimension and rank, respectively.


==Notation==
In this section, the notation used throughout the paper is introduced. Three [http://en.wikipedia.org/wiki/Matrix_norm norms] of a matrix <math> X \in \mathbb{R}^{n_1\times n_2}</math> with singular values <math>\,\{ \sigma_k \}</math> are used frequently: the spectral, [http://en.wikipedia.org/wiki/Matrix_norm#Frobenius_norm Frobenius], and nuclear norms, denoted by <math>\parallel X \parallel </math>, <math>\parallel X \parallel_F</math>, and <math>\parallel X \parallel_* := \Sigma_k \sigma_k </math>, respectively. It is interesting to note that <math>\parallel X \parallel = \|\mathbf{\sigma}\|_{\infty} </math>, <math>\parallel X \parallel_F = \|\mathbf{\sigma}\|_{2}</math>, and <math>\parallel X \parallel_* = \|\mathbf{\sigma}\|_{1}</math>, where <math>\,\mathbf{\sigma}</math> is the vector of singular values.
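These identities are easy to sanity-check numerically. Below is a minimal numpy sketch (an illustration of ours, not from the paper):

<pre>
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
s = np.linalg.svd(X, compute_uv=False)      # vector of singular values

print(np.isclose(np.linalg.norm(X, 2),     s.max()))             # spectral = max sigma_k
print(np.isclose(np.linalg.norm(X, "fro"), np.linalg.norm(s)))   # Frobenius = l2 of sigma
print(np.isclose(np.linalg.norm(X, "nuc"), s.sum()))             # nuclear = sum of sigma
</pre>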


Also, linear transformations on <math>\mathbb{R}^{n_1 \times n_2}</math> are denoted by calligraphic letters; for instance, the identity operator on this space is written <math> \mathcal{I}: \mathbb{R}^{n_1 \times n_2} \to  \mathbb{R}^{n_1 \times n_2}</math>.


==Exact Matrix Completion==


Given a subset of the entries of a matrix <math> M \in \mathbb{R}^{n_1 \times n_2}</math>, we intend to recover this matrix as accurately as possible. The available information about <math>M</math> is denoted by <math>\mathcal{P}_\Omega(M)</math>:  


<center><math>
[\mathcal{P}_\Omega(X)]_{ij} = \left\{ \begin{array}{lr} X_{ij} & : (i,j) \in \Omega\\ 0 & : \text{otherwise} \end{array} \right.
</math></center>

The problem discussed in this paper is whether the matrix can be recovered based on the given information.
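Concretely, <math>\mathcal{P}_\Omega</math> zeroes out every unobserved entry. A small numpy sketch, assuming <math>\,\Omega</math> is stored as a boolean mask:

<pre>
import numpy as np

def P_Omega(X, mask):
    """Keep the entries of X indexed by Omega (a boolean mask); zero the rest."""
    return np.where(mask, X, 0.0)

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
mask = rng.random((4, 4)) < 0.5     # each entry observed with probability 0.5
print(P_Omega(M, mask))
</pre>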
It is worth noting that cases in which a whole row or column is missing should be avoided, so the entries are assumed to be sampled at random without replacement. Also, if the singular vectors of the given matrix are too sparse, there is no hope of recovering the original matrix accurately.  
    
    
The authors impose some simple conditions to avoid these cases and guarantee that the singular vectors of the matrix are sufficiently dense. Assume the SVD of the matrix is <math>M =\sum_{k =1}^r \sigma_k u_k v^{*}_{k} </math>, where the <math>\,\sigma_{k}</math> are singular values, and <math>\,u_{k} \in \mathbb{R}^{n_1}</math> and <math>v_{k} \in \mathbb{R}^{n_2}</math> are singular vectors. They assume that <math>\parallel u_k \parallel_{l_{\infty}} \leq \sqrt{\mu_B / n_1}</math> and <math>\parallel v_k \parallel_{l_{\infty}} \leq \sqrt{\mu_B / n_2}</math>, where <math>\mu_B \geq 1</math> is small. To see that this assumption guarantees dense vectors, consider the case where <math>\,\mu_B = 1</math> and <math>\parallel u_k \parallel_{l_{\infty}} = \sqrt{\mu_B / n_1}= \sqrt{1 / n_1}</math>.  Because <math> \|u_k\|_{l_2}^2 = \sum_{j=1}^{n_1}u_{kj}^2</math> must equal 1 and no squared entry can exceed <math>\,1/n_1</math>, every squared entry must equal exactly <math>\,1/n_1</math>; that is, every entry of the singular vector <math>\,u_k</math> has magnitude <math>\,\sqrt{1/n_1}</math> and is non-zero.  With sufficiently spread singular vectors, we hope to find a unique low-rank matrix satisfying the data constraints.   
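As an illustration (our own check, not from the paper), the smallest <math>\,\mu_B</math> consistent with these conditions can be read directly off the singular vectors; random Gaussian factors typically give a small value:

<pre>
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 200, 150, 3
M = rng.normal(size=(n1, r)) @ rng.normal(size=(r, n2))   # random rank-r matrix
U, _, Vt = np.linalg.svd(M, full_matrices=False)

# smallest mu_B satisfying the max-entry bounds on the top-r singular vectors
mu_B = max(n1 * np.abs(U[:, :r]).max() ** 2,
           n2 * np.abs(Vt[:r, :]).max() ** 2)
print(mu_B)   # close to 1 means the singular vectors are well spread
</pre>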


The recovery is performed by solving the following optimization problem.  
<center><math>\textrm{minimize}\; \textrm{rank}(X) \; </math><br /><br />
<math>\textrm{s.t.} \; \mathcal{P}_\Omega(X)= \mathcal{P}_\Omega(M)</math><br /><br /></center>




Nuclear norm minimization is the tightest convex relaxation of the rank minimization problem above, which is [http://en.wikipedia.org/wiki/NP-hard NP-hard]:


<center><math>\textrm{minimize}\; \parallel X \parallel_* \; </math><br /><br />
<math>\textrm{s.t.} \; \mathcal{P}_\Omega(X)= \mathcal{P}_\Omega(M)</math><br /><br /></center>




Noting that the spectral norm is dual to the nuclear norm, and comparing the LP characterization of the <math>l_1</math> norm with the SDP characterization of the nuclear norm, the authors conclude that the above optimization problem is an SDP. It is shown by Candès and Tao <ref name="ref12">E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory, 56(5), 2053-2080. </ref> that this minimization problem recovers the matrix whenever recovery is possible by any other method.  
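Since the problem is an SDP, it can be handed to an off-the-shelf conic solver. A minimal sketch of ours using cvxpy (assuming cvxpy and its default solver are installed; the 0/1 mask <code>W</code> encodes <math>\Omega</math>, so the equality constraint is exactly <math>\mathcal{P}_\Omega(X)= \mathcal{P}_\Omega(M)</math>):

<pre>
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, r, p = 30, 2, 0.5
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))   # rank-r ground truth
W = (rng.random((n, n)) < p).astype(float)              # 0/1 mask encoding Omega

X = cp.Variable((n, n))
prob = cp.Problem(cp.Minimize(cp.norm(X, "nuc")),
                  [cp.multiply(W, X) == W * M])         # P_Omega(X) = P_Omega(M)
prob.solve()
print(np.linalg.norm(X.value - M, "fro") / np.linalg.norm(M, "fro"))  # ~ 0 on success
</pre>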


The authors mention a theorem (from <ref name="ref12"> </ref>): If the matrix <math>\,M</math> is of rank <math>\,r=O(1)</math> and <math>\,m</math> entries are observed, there is a positive constant <math>\,C</math> such that if <math>\,m \geq C \mu^{4}_{B} n \log^{2}n </math>, then <math>\,M</math> is the unique solution of the optimization problem mentioned above with high probability.
As a side remark, one can obtain a probability of success at least <math>\,1-n^{-\beta}</math> for <math>\,\beta</math> by taking <math>\,C</math> in the previous inequality
of the form <math>\,C'\beta</math> for some universal constant <math>\,C'</math>.


In order to obtain similar results for other values of the rank, instead of the conditions mentioned before (which guarantee the singular vectors are spread across all coordinates), Candès and Tao <ref  name="ref12"> </ref> present the strong incoherence property (see also <ref name="ref8"> E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Found. of Comput. Math., 9 717-772. </ref>). It is shown that an incoherent matrix, one having a small strong incoherence parameter <math>\,\mu</math>, can be recovered from a minimal set of entries.


Consequently, Theorem 2 by Candès and Tao states that under the same conditions as in Theorem 1, there is a constant <math>\,C</math> such that if <math> m \geq C\mu^{2}nr\log^{6}n</math>, then with high probability <math>\,M</math> is the unique solution to the norm minimization problem mentioned above.


==Stable Matrix Completion==


In real-world problems, noise often corrupts the data, so the above model is unsuitable.  This part of the paper investigates whether reasonably accurate matrix completion is possible in the presence of noisy entries. The novelty of the paper comes from the results shown in this section.  The noisy model can be expressed as:  


<center><math>
\mathcal{P}_{\Omega}(Y) = \mathcal{P}_{\Omega}(M) + \mathcal{P}_{\Omega}(Z),
</math></center>




where <math>\,Z</math> is an <math>n \times n</math> noise matrix of which only the entries <math>\{ Z_{ij}:(i,j) \in \Omega \}</math> play a role. We assume that <math>\parallel \mathcal{P}_{\Omega}(Z) \parallel_{F} \leq \delta </math> for some <math> \,\delta > 0 </math>.  
To recover the unknown matrix, this optimization problem should be solved:  
<center><math>\textrm{minimize}\; \parallel X \parallel_* \; </math><br /><br />
<math>\textrm{s.t.} \; \|\mathcal{P}_\Omega(X-Y) \|_F\leq \delta</math><br /><br /></center>
which, when <math>\,\delta = 0</math>, reduces to the noiseless problem:


<center><math>\textrm{minimize}\; \parallel X \parallel_* \; </math><br /><br />
<math>\textrm{s.t.} \; \mathcal{P}_\Omega(X)= \mathcal{P}_\Omega(M)</math><br /><br /></center>


This is an SDP and, denoting its solution by <math>\,\hat{M}</math>, the authors show that the reconstruction is accurate.  
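The inequality-constrained program is just as easy to prototype as the noiseless one. A minimal cvxpy sketch of ours, with <math>\,\delta</math> set to the expected size of <math>\parallel \mathcal{P}_{\Omega}(Z) \parallel_{F}</math> (an assumption for illustration, not the paper's recipe):

<pre>
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, r, p, sigma = 30, 2, 0.5, 0.1
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))   # low-rank ground truth
W = (rng.random((n, n)) < p).astype(float)              # 0/1 mask encoding Omega
Y = M + sigma * rng.normal(size=(n, n))                 # noisy data
delta = sigma * np.sqrt(W.sum())                        # ~ E ||P_Omega(Z)||_F

X = cp.Variable((n, n))
prob = cp.Problem(cp.Minimize(cp.norm(X, "nuc")),
                  [cp.norm(cp.multiply(W, X - Y), "fro") <= delta])
prob.solve()
print(np.linalg.norm(X.value - M, "fro") / np.linalg.norm(M, "fro"))
</pre>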


'''Definition (Dual certificate)''': <math>\,\Lambda</math> is a dual certificate if  
*<math>\Lambda=\mathcal{P}_\Omega(\Lambda)</math> (i.e. <math>\,\Lambda</math> is supported on <math>\,\Omega</math>),  
*<math>\mathcal{P}_T(\Lambda)=E,</math>  
*<math>\parallel \mathcal{P}_{T^\perp}(\Lambda) \parallel \leq 1. </math>  


'''Theorem 7''': Under the assumptions of either Theorem 1 or 2, suppose a dual certificate exists which obeys <math> \parallel \mathcal{P}_{T^{\perp}}(\Lambda) \parallel  \leq 1/2 </math> and also <math> \mathcal{P}_T \mathcal{P}_{\Omega} \mathcal{P}_T \succeq  \frac{p}{2} \mathcal{I}</math>. Then <math>\,\hat{M}</math> obeys:


<center><math>
     \parallel M-\hat{M}\parallel_F \leq 4 \sqrt{ \frac{C_p \min(n_1,n_2)} {p} } \,\delta + 2\delta
</math></center>


where <math>\,C_p = 2+p</math> and <math>\,p</math> is the fraction of observed entries. Here <math>\,T</math> is the linear space spanned by elements of the form <math>\,u_k x^*</math> and <math>\,y v_k^*</math>, <math>k \in [r]</math>, for arbitrary <math>\,x</math> and <math>\,y</math> (recall the SVD of <math>\,M</math>); <math>T^{\perp}</math> is the orthogonal complement of <math>\,T</math>; and <math>\,E = \sum_{k \in [r]} u_k v_k^{*}</math> is the "sign matrix", the matrix analogue of the sign vector in <math>\,l_1</math> minimization.
 
 
 
The theorem roughly states that when perfect noiseless recovery is possible, matrix completion remains stable in the presence of perturbations: the error is proportional to the noise level <math>\,\delta</math>.




===Comparison with an Oracle===


A question that remains to be answered is: what is the best obtainable recovery accuracy? Assume that <math>\,n_1=n_2=n</math>, and that an oracle reveals <math>\,T</math>. One can conclude that <math>\,M</math> lies in a linear space of dimension <math>\,2nr-r^2</math>, and the recovery problem can be solved by least squares, i.e. by finding the matrix in <math>\,T</math> that best fits the data:  


<center><math>\textrm{minimize}\; \parallel \mathcal{P}_\Omega(X)-\mathcal{P}_\Omega(Y) \parallel_F \; </math><br /><br />
<math>\textrm{s.t.} \; X \in T </math><br /><br /></center>




Define <math>\mathcal{A}: T \to \Omega </math> by <math>\mathcal{A}:= \mathcal{P}_\Omega \mathcal{P}_T </math>, where <math>\,\Omega </math> here denotes the range of <math> \mathcal{P}_\Omega </math>, and assume that <math> \mathcal{A}^* \mathcal{A} = \mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T </math> is an invertible operator on <math>\,T</math>. Solving the least-squares problem gives:


<center><math>
\parallel M^{Oracle}-M \parallel_F = \parallel(\mathcal{A}^*\mathcal{A})^{-1} \mathcal{A}^* (Z)\parallel_F
</math></center>


Assume that the minimum eigenvalue of <math>\mathcal{A}^*\mathcal{A}</math> is <math>\, \lambda_{min}</math>, and set <math> Z= \delta \lambda^{-1/2}_{min} \mathcal{A}(Z')</math>, where <math>\,Z'</math> is a unit-norm eigenvector of <math>\mathcal{A}^*\mathcal{A}</math> associated with <math>\,\lambda_{min}</math>. Assuming also that all eigenvalues of <math>\mathcal{A}^*\mathcal{A}</math> lie in <math>\,[p/2,\, 3p/2]</math>, we have <math>\parallel Z \parallel_F = \delta</math> and:


<center><math>
\parallel (\mathcal{A}^* \mathcal{A})^{-1} \mathcal{A}^* (Z) \parallel_F = \lambda^{-1/2}_{min}\delta \gtrsim p^{-1/2} \delta
</math></center>

and

<center><math>
\parallel M^{Oracle}-M \parallel_F \approx p^{-1/2}\delta
</math></center>


The authors then show that the above analysis also applies to a stronger oracle which reveals the row space (and hence the rank) of the unknown matrix, so the achievable error is still about <math>\,p^{-1/2}\delta </math>. Thus, when the only known fact is that <math>\parallel \mathcal{P}_\Omega(Z)\parallel_F \leq \delta </math>, a root-mean squared error better than <math>\,p^{-1/2}\delta</math> cannot be achieved.
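A sketch of the oracle estimator (our construction, for illustration only): project onto <math>\,T</math> using the true singular vectors and solve the normal equations <math>\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T (X) = \mathcal{P}_T \mathcal{P}_\Omega (Y)</math> with conjugate gradients. The projection formula <math>\mathcal{P}_T(X) = P_U X + X P_V - P_U X P_V</math> is standard; since the right-hand side lies in the range of the operator, CG behaves well here despite the operator being singular on <math>T^\perp</math>.

<pre>
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n, r, p, sigma = 40, 2, 0.5, 1.0
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))
mask = rng.random((n, n)) < p
Y = M + sigma * rng.normal(size=(n, n))

U, _, Vt = np.linalg.svd(M, full_matrices=False)         # oracle: true singular vectors
PU, PV = U[:, :r] @ U[:, :r].T, Vt[:r].T @ Vt[:r]

def P_T(X):        # projection onto T = span{u_k x^*, y v_k^*}
    return PU @ X + X @ PV - PU @ X @ PV

def P_Omega(X):    # projection onto the observed entries
    return np.where(mask, X, 0.0)

A = LinearOperator((n * n, n * n),
                   matvec=lambda x: P_T(P_Omega(P_T(x.reshape(n, n)))).ravel())
x, _ = cg(A, P_T(P_Omega(Y)).ravel())                    # least squares over T
M_oracle = x.reshape(n, n)
print(np.sqrt(np.mean((M_oracle - M) ** 2)))             # oracle RMSE per entry
</pre>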


==Experiments==


A series of experiments has been conducted to demonstrate the stability of matrix completion in the presence of noise. The scenario is as follows: the unknown <math>n \times n</math> rank-<math>\,r</math> matrix is <math>\,M</math> (<math>M=M_LM^*_R</math>). The matrices <math>\,M_L</math> and <math>\,M_R</math> are in <math>\mathbb{R}^{n \times r}</math> and their entries are iid <math>N(0, \sigma^2_n := 20 / \sqrt{n})</math>.


The sampled set <math>\,\Omega</math> includes <math>\,m</math> entries picked uniformly at random, so the fraction of observed entries is <math>\,p=m/n^2</math>. The noise <math>\,(\{Z_{ij}\})</math> is iid <math>\,N(0,\sigma^2)</math> with <math>\,\sigma = 1</math>. Experiments have been performed for several values of <math>\,n</math>, <math>\,r</math>, and <math>\,p</math>.
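A sketch of this data-generating process (the function names and the RMSE evaluation are ours, not the paper's):

<pre>
import numpy as np

def make_experiment(n, r, p, sigma=1.0, seed=0):
    """Synthetic setup: M = M_L M_R^*, random mask Omega, noisy observations Y."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(20.0 / np.sqrt(n))            # entries of M_L, M_R ~ N(0, 20/sqrt(n))
    M_L = sd * rng.normal(size=(n, r))
    M_R = sd * rng.normal(size=(n, r))
    M = M_L @ M_R.T
    mask = rng.random((n, n)) < p              # each entry observed w.p. p = m / n^2
    Y = M + sigma * rng.normal(size=(n, n))    # noise Z_ij ~ N(0, sigma^2)
    return M, mask, Y

def rmse_per_entry(M_hat, M):
    return np.sqrt(np.mean((M_hat - M) ** 2))
</pre>

Feeding <code>Y</code> and <code>mask</code> into the noisy program above and measuring <code>rmse_per_entry</code> mirrors how the table below was produced.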


In particular, we can look at the average over 20 repetitions on a rank-2 <math>n\times n</math> matrix in which 20% of the values are known. The table below<ref name="self"></ref> indicates the measured root-mean square error per entry of the matrix.  
<center>
{| class="wikitable"
|-
! n
| 100 || 200 || 500 || 1000
|-
! RMSE
| 0.99 || 0.61 || 0.34 || 0.24
|}
</center>


To understand the significance of this, suppose we were given a 1000 x 1000 matrix <math>\,Y = M+Z</math>: the true matrix plus noise. If we simply accepted <math>\,Y</math> without question as our estimate and the noise had variance 1, then the expected RMSE per entry would be 1. If, instead, we used only 20% of the entries, the RMSE drops to 0.24 (a mean squared error of <math>\,0.24^2 \approx 0.058</math>)<ref name="self"></ref>. This reduction in error is due to the algorithm removing noise from the known entries in addition to estimating the unknown ones.  


===Real-World Example: Daily Weather===
The paper considers daily temperatures collected at 1472 worldwide locations throughout 2008 (366 days, as 2008 was a leap year), giving a 366 x 1472 matrix<ref name="self"></ref>. It was verified through the SVD that, as expected, the matrix is approximately low rank. The authors sampled 30% of the matrix and tried to recover it, obtaining a relative error of 0.166 (versus an error of 0.121 for the best rank-2 approximation computed with exact knowledge of the matrix).
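Here the relative error is presumably measured in the Frobenius norm:

<center><math>
\frac{\parallel \hat{M} - M \parallel_F}{\parallel M \parallel_F}.
</math></center>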


==References==
<references />
