Difference between revisions of "large-Scale Supervised Sparse Principal Component Analysis"

Revision as of 20:28, 5 August 2013

1. Introduction

This section summarizes the drawbacks of existing techniques and the contribution of the paper.

1 Drawbacks of existing techniques

Existing techniques include ad hoc methods (e.g., factor rotation techniques and simple thresholding), greedy algorithms, SCoTLASS, the regularized SVD method, SPCA, and the generalized power method. All of these are based on non-convex optimization, so they do not guarantee a global optimum.

DSPCA, a semidefinite relaxation method, does guarantee global convergence and performs better than the algorithms above, but it is computationally expensive.

2 Contribution of this paper

This paper solves the DSPCA problem at a lower computational cost, which makes it suitable for large-scale data sets. It applies a block coordinate ascent algorithm with computational complexity [math]O(\hat{n}^3)[/math], where [math]\hat{n}[/math] is the intrinsic dimension of the data. Since [math]\hat{n}[/math] can be much smaller than the dimension [math]n[/math] of the data, the algorithm is inexpensive in practice.

2. Primal problem

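The sparse PCA problem can be formulated as [math]\max_x \ x^T \Sigma x - \lambda \| x \|_0 : \| x \|_2=1[/math], where [math]\Sigma[/math] is the covariance matrix, [math]\| x \|_0[/math] is the number of non-zero entries of [math]x[/math], and [math]\lambda \geq 0[/math] controls the trade-off between explained variance and sparsity.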
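To make the trade-off in this objective concrete, the following minimal sketch evaluates it for a dense and a sparse unit vector. The function name `sparse_pca_objective`, the 3-by-3 matrix and the value of `lam` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sparse_pca_objective(Sigma, x, lam):
    """Return x^T Sigma x - lam * ||x||_0 for a unit-norm vector x."""
    x = np.asarray(x, dtype=float)
    if not np.isclose(np.linalg.norm(x), 1.0):
        raise ValueError("x must have unit Euclidean norm")
    cardinality = np.count_nonzero(x)  # ||x||_0
    return float(x @ Sigma @ x - lam * cardinality)

# Hypothetical covariance matrix and penalty, chosen only for illustration.
Sigma = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 0.0],
                  [0.0, 0.0, 0.5]])
lam = 0.6

dense = np.ones(3) / np.sqrt(3.0)    # uses all three features
sparse = np.array([1.0, 0.0, 0.0])   # uses a single feature

dense_value = sparse_pca_objective(Sigma, dense, lam)
sparse_value = sparse_pca_objective(Sigma, sparse, lam)
print(dense_value, sparse_value)
```

With this particular penalty the one-feature vector scores higher than the dense one, which is the behaviour the cardinality term is meant to encourage.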

Setting [math]Z = x x^T[/math], so that [math]\| Z \|_0 = \| x \|_0^2[/math], this is equivalent to [math]\max_Z \ Tr(\Sigma Z) - \lambda \sqrt{\| Z \|_0} : Z \succeq 0, \ Tr(Z)=1, \ Rank(Z)=1[/math].

Replacing [math]\sqrt{\| Z \|_0}[/math] with [math]\| Z \|_1[/math] and dropping the rank constraint gives a relaxation of the original non-convex problem:

[math]\phi = \max_Z \ Tr(\Sigma Z) - \lambda \| Z \|_1 : Z \succeq 0, \ Tr(Z)=1 \qquad (1)[/math].

Fortunately, this relaxation is a convex problem that approximates the original non-convex one.
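The lifting from a vector to a matrix variable can be checked numerically. The sketch below is an illustration under assumed data (the helper names `lift`, `relaxed_objective` and `is_feasible`, and the example matrix, are hypothetical); it confirms that a rank-one matrix built from a unit vector satisfies the constraints of the relaxation and shows that the feasible set also contains higher-rank matrices.

```python
import numpy as np

def lift(x):
    """Map a vector x to the rank-one matrix Z = x x^T."""
    return np.outer(x, x)

def relaxed_objective(Sigma, Z, lam):
    """Tr(Sigma Z) - lam * ||Z||_1, with ||Z||_1 the sum of absolute entries."""
    return float(np.trace(Sigma @ Z) - lam * np.abs(Z).sum())

def is_feasible(Z, tol=1e-9):
    """Check the constraints of the relaxation: Z symmetric PSD with unit trace."""
    symmetric = np.allclose(Z, Z.T, atol=tol)
    psd = np.linalg.eigvalsh((Z + Z.T) / 2.0).min() >= -tol
    unit_trace = abs(np.trace(Z) - 1.0) <= tol
    return bool(symmetric and psd and unit_trace)

# Hypothetical data, chosen only for illustration.
Sigma = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 0.0],
                  [0.0, 0.0, 0.5]])
lam = 0.6
x = np.array([0.6, 0.8, 0.0])  # unit norm, two non-zero entries

Z = lift(x)
variance_matches = np.isclose(np.trace(Sigma @ Z), x @ Sigma @ x)
cardinality_matches = np.count_nonzero(Z) == np.count_nonzero(x) ** 2
l1_matches = np.isclose(np.abs(Z).sum(), np.abs(x).sum() ** 2)
rank_one_value = relaxed_objective(Sigma, Z, lam)

# A feasible point of the relaxation that is not rank one.
Z_full_rank = np.eye(3) / 3.0
```

Because `Z_full_rank` is feasible but has rank three, a solution of the relaxation is not guaranteed to be rank one; that is the price paid for convexity.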

The paper relies on the following theorem:

Theorem (2.1) Let [math]\Sigma=A^T A[/math], where [math]A=(a_1,a_2,\ldots,a_n) \in {\mathbb R}^{m \times n}[/math]. Then the optimal value [math]\psi[/math] of the sparse PCA problem satisfies [math]\psi = \max_{\| \xi \|_2=1} \sum_{i=1}^{n} ((a_i^T \xi)^2 - \lambda)_+[/math]. An optimal non-zero pattern corresponds to the indices [math]i[/math] with [math]\lambda \lt (a_i^T \xi)^2[/math] at the optimum.
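The theorem can be checked by brute force on a small instance. The sketch below is only a sanity check under stated assumptions: the random matrix `A`, the value of `lam` and the function names are hypothetical, and the penalty is assumed small enough that the optimal value is positive. It enumerates every support set, then evaluates the thresholded sum at the leading left singular vector of the best sub-matrix.

```python
import itertools
import numpy as np

def thresholded_value(A, xi, lam):
    """Return sum_i ((a_i^T xi)^2 - lam)_+ and the indices with (a_i^T xi)^2 > lam."""
    scores = (A.T @ xi) ** 2
    support = np.flatnonzero(scores > lam)
    return float(np.maximum(scores - lam, 0.0).sum()), support

def brute_force_psi(A, lam):
    """Maximise lambda_max(Sigma_II) - lam * |I| over all non-empty supports I."""
    n = A.shape[1]
    Sigma = A.T @ A
    best_val, best_I = -np.inf, None
    for k in range(1, n + 1):
        for I in itertools.combinations(range(n), k):
            idx = np.array(I)
            # eigvalsh returns eigenvalues in ascending order; take the largest.
            top = np.linalg.eigvalsh(Sigma[np.ix_(idx, idx)])[-1]
            val = top - lam * k
            if val > best_val:
                best_val, best_I = val, idx
    return float(best_val), best_I

# Hypothetical small instance (m = 6 samples, n = 5 features).
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))
lam = 0.5

psi, best_I = brute_force_psi(A, lam)

# Candidate maximiser: leading left singular vector of the selected columns.
U, _, _ = np.linalg.svd(A[:, best_I], full_matrices=False)
xi_star = U[:, 0]
value, support = thresholded_value(A, xi_star, lam)
```

The enumeration costs on the order of 2^n eigenvalue computations, so it is only usable for tiny n; the point of the theorem is that the search can instead be carried out over the unit sphere in the sample space.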

3. Block Coordinate Ascent Algorithm

A row-by-row algorithm is available for problems of the form [math]\min_X \ f(X)-\beta \log \det X : \ L \leq X \leq U, \ X \succ 0[/math].

Problem (1) can be written as [math]\frac{1}{2} \phi^2 = \max_X \ Tr(\Sigma X) - \lambda \| X \|_1 - \frac{1}{2} (Tr X)^2 : X \succeq 0 \qquad (2)[/math].
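The step from (1) to (2) follows from the fact that the objective of (1) is positively homogeneous of degree one. A short derivation, assuming [math]\phi \geq 0[/math] and introducing the auxiliary names [math]g[/math] and [math]t[/math] (which are not from the paper), is:

```latex
\begin{align*}
g(Z) &:= \operatorname{Tr}(\Sigma Z) - \lambda \|Z\|_1,
  \qquad \phi = \max_{Z \succeq 0,\ \operatorname{Tr} Z = 1} g(Z), \\
X &= tZ,\quad t = \operatorname{Tr} X \ge 0,\quad \operatorname{Tr} Z = 1
  \;\Longrightarrow\;
  \operatorname{Tr}(\Sigma X) - \lambda \|X\|_1 - \tfrac{1}{2}(\operatorname{Tr} X)^2
  = t\, g(Z) - \tfrac{1}{2} t^2, \\
\max_{X \succeq 0}\ \operatorname{Tr}(\Sigma X) - \lambda \|X\|_1 - \tfrac{1}{2}(\operatorname{Tr} X)^2
  &= \max_{t \ge 0} \Big( t\,\phi - \tfrac{1}{2} t^2 \Big)
  = \tfrac{1}{2}\phi^2 \quad \text{(attained at } t = \phi\text{)}.
\end{align*}
```

The trace constraint of (1) is thus absorbed into the quadratic term, leaving positive semidefiniteness as the only constraint.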

To apply the row-by-row algorithm, we add the barrier term [math]\beta \log \det X[/math] to (2), where [math]\beta \gt 0[/math] is a penalty parameter.

That is, we address the problem [math]\max_X \ Tr(\Sigma X) - \lambda \| X \|_1 - \frac{1}{2} (Tr X)^2 + \beta \log \det X : X \succ 0 \qquad (3)[/math], where the constraint becomes strict because [math]\log \det X[/math] is finite only for positive definite [math]X[/math].

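To update one row and column at a time, partition [math]X = \begin{pmatrix} Y & y \\ y^T & x \end{pmatrix}[/math] and [math]\Sigma = \begin{pmatrix} S & s \\ s^T & \sigma \end{pmatrix}[/math], and let [math]t = Tr(Y)[/math]. Holding [math]Y[/math] fixed and using the Schur complement identity [math]\det X = \det Y \cdot (x - y^T Y^{\dagger} y)[/math], we obtain the sub-problem in the last row and column [math](y, x)[/math]:

[math]\phi = \max_{x,y} \ 2(y^T s- \lambda \| y \|_1) +(\sigma - \lambda)x - \frac{1}{2}(t+x)^2 + \beta \log(x-y^T Y^{\dagger} y ) : y \in R(Y)[/math],

where [math]Y^{\dagger}[/math] denotes the pseudo-inverse of [math]Y[/math] and [math]R(Y)[/math] its range.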
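The partitioning step can be verified numerically: for a positive definite [math]X[/math], the objective of (3) should equal the sub-problem objective plus terms that depend only on the fixed block. The sketch below assumes the last row and column are the ones being updated and uses random test data; the function names are hypothetical.

```python
import numpy as np

def objective3(Sigma, X, lam, beta):
    """Objective of (3): Tr(Sigma X) - lam*||X||_1 - (Tr X)^2 / 2 + beta*log det X."""
    sign, logdet = np.linalg.slogdet(X)
    if sign <= 0:
        raise ValueError("X must be positive definite")
    return float(np.trace(Sigma @ X) - lam * np.abs(X).sum()
                 - 0.5 * np.trace(X) ** 2 + beta * logdet)

def subproblem_value(Y, s, sigma, y, x, lam, beta):
    """Objective of the sub-problem in the last column y and diagonal entry x."""
    t = np.trace(Y)
    schur = x - y @ np.linalg.pinv(Y) @ y  # Schur complement of Y in X
    return float(2.0 * (y @ s - lam * np.abs(y).sum()) + (sigma - lam) * x
                 - 0.5 * (t + x) ** 2 + beta * np.log(schur))

def fixed_block_terms(S, Y, lam, beta):
    """Terms of (3) that involve only the fixed block Y."""
    return float(np.trace(S @ Y) - lam * np.abs(Y).sum()
                 + beta * np.linalg.slogdet(Y)[1])

# Hypothetical test data: random PSD Sigma and positive definite X.
rng = np.random.default_rng(1)
C = rng.standard_normal((6, 4))
Sigma = C.T @ C
B = rng.standard_normal((4, 4))
X = B @ B.T + 4.0 * np.eye(4)
lam, beta = 0.3, 1e-2

# Partition off the last row and column.
Y, y, x = X[:-1, :-1], X[:-1, -1], X[-1, -1]
S, s, sigma = Sigma[:-1, :-1], Sigma[:-1, -1], Sigma[-1, -1]

full_value = objective3(Sigma, X, lam, beta)
split_value = fixed_block_terms(S, Y, lam, beta) + subproblem_value(Y, s, sigma, y, x, lam, beta)
```

Since `fixed_block_terms` does not depend on `y` or `x`, maximising the sub-problem over the last row and column is the same as maximising (3) over that row and column with `Y` held fixed.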

The sub-problem can be simplified to be

Here is the algorithm: