Large-Scale Supervised Sparse Principal Component Analysis
1. Introduction
Most existing sparse PCA techniques suffer from significant drawbacks, summarized below.
1 Drawbacks of existing techniques
Existing techniques include ad-hoc methods (e.g., factor rotation techniques and simple thresholding), greedy algorithms, SCoTLASS, the regularized SVD method, SPCA, and the generalized power method. These methods are all based on non-convex optimization and therefore do not guarantee a global optimum.
A semidefinite relaxation method called DSPCA guarantees global convergence and performs better than the above algorithms; however, it is computationally expensive.
2 Contribution of this paper
This paper solves DSPCA in a computationally cheaper way, making it suitable for large-scale data sets. It applies a block coordinate ascent algorithm with computational complexity [math]\displaystyle{ O(\hat{n}^3) }[/math], where [math]\displaystyle{ \hat{n} }[/math] is the intrinsic dimension of the data. Since [math]\displaystyle{ \hat{n} }[/math] can be much smaller than the ambient dimension [math]\displaystyle{ n }[/math] of the data, the algorithm is computationally cheap.
2. Primal problem
The sparse PCA problem can be formulated as [math]\displaystyle{ \max_x \ x^T \Sigma x - \lambda \| x \|_0 : \| x \|_2=1 }[/math].
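To make the formulation concrete, here is a minimal brute-force sketch (assuming NumPy and a small synthetic data matrix; none of this is from the paper) that solves the penalized problem exactly for tiny [math]\displaystyle{ n }[/math] by enumerating supports: on a fixed support, the best unit-norm [math]\displaystyle{ x }[/math] is the top eigenvector of the corresponding principal submatrix of [math]\displaystyle{ \Sigma }[/math].

<syntaxhighlight lang="python">
import itertools
import numpy as np

# Toy illustration (not the paper's algorithm): solve
#   max_x  x^T Sigma x - lam * ||x||_0   s.t. ||x||_2 = 1
# exactly by enumerating all supports; feasible only for very small n.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))   # hypothetical data matrix, m=20 samples, n=5 features
Sigma = A.T @ A                    # covariance-type matrix, as in Sigma = A^T A
lam = 2.0                          # sparsity penalty lambda (illustrative value)

n = Sigma.shape[0]
best_val, best_x = -np.inf, None
for k in range(1, n + 1):
    for S in itertools.combinations(range(n), k):
        S = list(S)
        w, V = np.linalg.eigh(Sigma[np.ix_(S, S)])   # eigen-decomposition of the submatrix
        val = w[-1] - lam * len(S)                   # explained variance minus l0 penalty
        if val > best_val:                           # max over all supports = exact optimum
            x = np.zeros(n)
            x[S] = V[:, -1]
            best_val, best_x = val, x

print("optimal value:", best_val)
print("sparse loading:", np.round(best_x, 3))
</syntaxhighlight>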
This is equivalent to [math]\displaystyle{ \max_Z \ \mathrm{Tr}(\Sigma Z) - \lambda \sqrt{\| Z \|_0} : Z \succeq 0, \ \mathrm{Tr}(Z)=1, \ \mathrm{Rank}(Z)=1 }[/math].
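To see the equivalence, set [math]\displaystyle{ Z = x x^T }[/math]: then [math]\displaystyle{ \mathrm{Tr}(\Sigma Z) = x^T \Sigma x }[/math], [math]\displaystyle{ \mathrm{Tr}(Z) = \| x \|_2^2 = 1 }[/math], [math]\displaystyle{ \mathrm{Rank}(Z)=1 }[/math], and [math]\displaystyle{ \| Z \|_0 = \| x \|_0^2 }[/math], so that [math]\displaystyle{ \sqrt{\| Z \|_0} = \| x \|_0 }[/math]. Conversely, any [math]\displaystyle{ Z \succeq 0 }[/math] with unit trace and rank one can be written as [math]\displaystyle{ Z = x x^T }[/math] for some [math]\displaystyle{ x }[/math] with [math]\displaystyle{ \| x \|_2 = 1 }[/math].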
Replacing the [math]\displaystyle{ \sqrt{\| Z \|_0} }[/math] with [math]\displaystyle{ \| Z \|_1 }[/math] and dropping the rank constraint gives a relaxation of the original non-convex problem:
[math]\displaystyle{ \max_Z \ \mathrm{Tr}(\Sigma Z) - \lambda \| Z \|_1 : Z \succeq 0, \ \mathrm{Tr}(Z)=1 }[/math].
This relaxation replaces the original non-convex problem with a convex semidefinite program, which can be solved to global optimality.
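For moderate problem sizes, the relaxation is a standard semidefinite program and can be handed to a generic solver. The sketch below uses CVXPY (an assumption of this example; the paper itself uses a block coordinate ascent algorithm for scalability): it maximizes [math]\displaystyle{ \mathrm{Tr}(\Sigma Z) - \lambda \| Z \|_1 }[/math] over unit-trace positive semidefinite matrices and extracts the top eigenvector of the optimal [math]\displaystyle{ Z }[/math] as an approximate sparse loading.

<syntaxhighlight lang="python">
import cvxpy as cp
import numpy as np

def sparse_pca_sdp(Sigma, lam):
    """Sketch: solve  max Tr(Sigma Z) - lam*||Z||_1  s.t. Z PSD, Tr(Z) = 1
    with a generic SDP solver, then round to a single loading vector."""
    n = Sigma.shape[0]
    Z = cp.Variable((n, n), PSD=True)
    objective = cp.Maximize(cp.trace(Sigma @ Z) - lam * cp.sum(cp.abs(Z)))
    prob = cp.Problem(objective, [cp.trace(Z) == 1])
    prob.solve()                         # default SDP-capable solver, e.g. SCS
    # If the optimal Z is rank one, its leading eigenvector solves the original
    # problem; otherwise this extraction is only a heuristic rounding.
    w, V = np.linalg.eigh(Z.value)
    return prob.value, V[:, -1]

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8))         # hypothetical toy data
value, x = sparse_pca_sdp(A.T @ A, lam=5.0)
print(value, np.round(x, 3))
</syntaxhighlight>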
Here is an important theorem used by this paper:
Theorem 2.1. Let [math]\displaystyle{ \Sigma=A^T A }[/math], where [math]\displaystyle{ A=(a_1,a_2,\ldots,a_n) \in {\mathbb R}^{m \times n} }[/math]. Then [math]\displaystyle{ \psi = \max_{\| \xi \|_2=1} \sum_{i=1}^{n} \left( (a_i^T \xi)^2 - \lambda \right)_+ }[/math]. An optimal non-zero pattern corresponds to the indices [math]\displaystyle{ i }[/math] with [math]\displaystyle{ (a_i^T \xi)^2 \gt \lambda }[/math] at the optimal [math]\displaystyle{ \xi }[/math].
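The theorem gives a direct recipe for reading off the sparsity pattern: once a maximizing [math]\displaystyle{ \xi }[/math] is known, the features allowed to be non-zero are exactly those with [math]\displaystyle{ (a_i^T \xi)^2 \gt \lambda }[/math]. The toy snippet below (hypothetical data; [math]\displaystyle{ \xi }[/math] is taken as the top left singular vector of [math]\displaystyle{ A }[/math], which maximizes the objective only when [math]\displaystyle{ \lambda = 0 }[/math]) evaluates the inner objective and reports the implied non-zero pattern.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 30, 10, 4.0
A = rng.standard_normal((m, n))           # columns a_1, ..., a_n (hypothetical data)

# Candidate direction: top left singular vector of A (exact maximizer only
# when lambda = 0; used here purely to illustrate the theorem's quantities).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
xi = U[:, 0]

scores = (A.T @ xi) ** 2                  # (a_i^T xi)^2 for each feature i
objective = np.sum(np.maximum(scores - lam, 0.0))
active = np.where(scores > lam)[0]        # indices allowed to be non-zero

print("inner objective value:", objective)
print("non-zero pattern (feature indices):", active)
</syntaxhighlight>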