optimal Solutions forSparse Principal Component Analysis: Difference between revisions

From statwiki
Jump to navigation Jump to search
Line 80: Line 80:




Then, xaximizing in <math>\,u</math> and using the fact that <math>\,max_{v \in \{0,1\}} \beta v = \beta_+</math>, the SPCA problem in the case when <math>\,\rho \le A_{11}</math> becomes:
Then, maximizing in <math>\,u</math> and using the fact that <math>\,max_{v \in \{0,1\}} \beta v = \beta_+</math>, the SPCA problem in the case when <math>\,\rho \le A_{11}</math> becomes:
   
   
<center><math>\phi(\rho)= \max_{\|x\|=1}\sum_{i=1}^n((b_i^Tx)^2-\rho)_{+}</math></center>, which is a non-convex problem in <math>\,x \in R^n</math> and in which we then select variables <math>\,i</math> such that <math>\,(b_i^T x)^2 − \rho > 0</math>.
<center><math>\phi(\rho)= \max_{\|x\|=1}\sum_{i=1}^n((b_i^Tx)^2-\rho)_{+}</math></center>, which is a non-convex problem in <math>\,x \in R^n</math> and in which we then select variables <math>\,i</math> such that <math>\,(b_i^T x)^2 − \rho > 0</math>.

Revision as of 18:58, 6 December 2010

Introduction

Principle component analysis (PCA) is a method for finding a linear combination of features, called principle components, that correspond to the directions that maximize the variance in the data and which are orthogonal to one another. In practice, performing PCA on a data set involves applying singular value decomposition on the data matrix.

PCA facilitates the interpretation of the data if their factors are just the combinations of a few latent variables, and not many or all of the original ones. This is particularly true in many applications in which the coordinate axes that correspond to the factors have a direct physical interpretation; for instance, in financial or biological applications, each axis might correspond to a specific asset or to a specific gene. Constraining the number of non-zero factor coefficients (loadings) in the sparse principal components to a very low number relative to the total number of coefficients whilst having these sparse vectors explain a maximum amount of variance in the data is known as sparse PCA. Sparse PCA has many applications in biology, finance and many machine learning problems. Sparse principal components, like principal components, are vectors that span a lower-dimensional space that explain most of variance in the original data. However, in order to find the sparse principal components using sparse PCA, it is necessary to make some sacrifices:

  • There is a reduction in the explained variance in the original data captured by the sparse principal components as compared to PCA.
  • There is a reduction in the orthogonality (independence or correlation) between the resulting variables (sparse principal components) as compared to PCA.

In this paper we are going to focus on the problem of sparse PCA which can be written as:

[math]\displaystyle{ \max_x \; x^{T}{A}x-\rho\textbf{Card}^{2}(x) }[/math]
[math]\displaystyle{ \textrm{subject} \; \textrm{to} \; \|x\|_2 \le 1 }[/math]

where:

  • [math]\displaystyle{ x\in \mathbb{R}^n }[/math]
  • [math]\displaystyle{ A \in S_n }[/math] is the symmetric positive semidefinite sample covariance matrix
  • [math]\displaystyle{ \,\rho }[/math] is the parameter which controls the sparsity
  • [math]\displaystyle{ \textbf{Card}(x) }[/math] expresses the cardinality ([math]\displaystyle{ \,l_0 }[/math] norm) of [math]\displaystyle{ \,x }[/math].

Note that while solving the standard PCA problem is not complicated (since, for each factor, one simply needs to find a leading eigenvector, and this can be done in [math]\displaystyle{ \,O(n^2) }[/math] time), solving sparse PCA is NP hard (since sparse PCA is a particular case of the sparse generalized eigenvalue problem).

The main part of this paper begins by formulating the sparse PCA (SPCA) problem, whose algorithm is based on the representation of PCA as a regression-type optimization problem (Zou et al., 2006) that allows the application of the LASSO (Tibshirani, 1996) (which is a penalization technique based on the [math]\displaystyle{ \,l_1 }[/math] norm). The main part of this paper then derives a greedy algorithm for computing a full set of good solutions with total complexity [math]\displaystyle{ \,O(n^3) }[/math]. It also formulates a convex relaxation for sparse PCA and uses it to derive tractable sufficient conditions for a vector [math]\displaystyle{ \,x }[/math] to be a global optimum of the above optimization problem. In the general approach to SPCA described in this paper, for a given vector [math]\displaystyle{ \,x }[/math] having support [math]\displaystyle{ \,I }[/math], [math]\displaystyle{ \,x }[/math] can be tested to see if it is a globally optimal solution to the above optimization problem simply by performing a few steps of binary search to solve a one-dimensional convex minimization problem.

Notation

  • For a vector [math]\displaystyle{ \,x \in\mathbb{R}^n }[/math], [math]\displaystyle{ \|x\|_1=\sum_{i=1}^n |x_i| }[/math] and [math]\displaystyle{ \textbf{Card}(x) }[/math] is the cardinality of [math]\displaystyle{ \,x }[/math] (the number of non-zero coefficients of [math]\displaystyle{ \,x }[/math]).
  • The support [math]\displaystyle{ \,I }[/math] of [math]\displaystyle{ \,x }[/math] is the set [math]\displaystyle{ \{i: x_i \neq 0\} }[/math] and [math]\displaystyle{ \,I^c }[/math] denotes its complement.
  • [math]\displaystyle{ \,\beta_{+} = \max\{\beta , 0\} }[/math].
  • For a symmetric [math]\displaystyle{ n \times n }[/math] matrix [math]\displaystyle{ \,X }[/math] with eigenvalues [math]\displaystyle{ \,\lambda_i }[/math], [math]\displaystyle{ \operatorname{Tr}(X)_{+}=\sum_{i=1}^{n}\max\{\lambda_i,0\} }[/math].
  • The vector of all ones is written [math]\displaystyle{ \textbf{1} }[/math], and the identity matrix is written [math]\displaystyle{ \,\textbf{I} }[/math]. The diagonal matrix with the vector [math]\displaystyle{ \,u }[/math] on the diagonal is written [math]\displaystyle{ \textbf{diag}(u) }[/math].
  • For [math]\displaystyle{ A \, }[/math], a symmetric [math]\displaystyle{ n \times n }[/math] matrix, we can define [math]\displaystyle{ \phi(\rho) = \max_{\|x\| \leq 1} x^T A x - \rho \textbf{Card}(x) }[/math].

Sparse PCA

Sparse PCA problem can be written as:

[math]\displaystyle{ \;\phi(\rho) }[/math]
[math]\displaystyle{ \textrm{subject} \; \textrm{to} \; \|x\|_2=1 }[/math]

It is assumed that we have a square root [math]\displaystyle{ \,B }[/math] of [math]\displaystyle{ \,A }[/math] with [math]\displaystyle{ \,A = B^TB }[/math], where [math]\displaystyle{ \,B \in R^{n \times n} }[/math].

The above problem is directly related to the following problem which involves finding a cardinality-constrained maximum eigenvalue:

[math]\displaystyle{ \max_x \; x^{T}{A}x }[/math]

[math]\displaystyle{ \textrm{subject} \; \textrm{to} \; \|x\|_2=1,\,\,\,\,\,\,\,\,\,\,\,\,\,(1) }[/math]

[math]\displaystyle{ \textbf{Card}(x)\leq k. }[/math]

, in the variable [math]\displaystyle{ \,x \in R^n }[/math]. Without loss of generality, assume that [math]\displaystyle{ A \geq 0 }[/math] and the features are ordered in decreasing size of variance, i.e. [math]\displaystyle{ A_{11} \geq \dots \geq A_{nn} }[/math].

Using duality, we can bound the solution of (1) by:

[math]\displaystyle{ \inf_{\rho \in P}\phi(\rho)+\rho k }[/math]

where [math]\displaystyle{ \,P }[/math] is the set of penalty values for which [math]\displaystyle{ \,\phi(\rho) }[/math] has been computed. This tells us that if we prove [math]\displaystyle{ \,x }[/math] is optimal for [math]\displaystyle{ \,\phi(\rho) }[/math] then [math]\displaystyle{ \,x }[/math] is the global optimum for [math]\displaystyle{ \,(1) }[/math], with the cardinality of [math]\displaystyle{ \,x }[/math] being exactly [math]\displaystyle{ \,k }[/math].

Next, since [math]\displaystyle{ x^T A x\leq A_{11}(\sum_{i=1}^n|x_i|)^2 }[/math] and [math]\displaystyle{ (\sum_{i=1}^n|x_i|)^2 \leq \|x\|^2\textbf{Card}(x) \; \forall x \in R^n }[/math] and using the above lemma while assuming [math]\displaystyle{ \rho \geq A_{11} }[/math], the following will be achieved:

[math]\displaystyle{ \phi(\rho)=\textrm{max}_{\|x\| \le 1} \; x^{T}{A}x-\rho\textbf{Card}^{2}(x) }[/math]

[math]\displaystyle{ \leq (A_{11}-\rho)\textbf{Card}(x) }[/math]

[math]\displaystyle{ \leq 0 }[/math]

.

When [math]\displaystyle{ \,\rho \ge A_{11} }[/math], the optimal solution to the SPCA problem is simply [math]\displaystyle{ \,x = 0 }[/math].

Now we shall look at the case when [math]\displaystyle{ \,\rho \le A_{11} }[/math]. In this case, the inequality [math]\displaystyle{ \,\|x\| \le 1 }[/math] is tight. Using the fact that the sparsity pattern of a vector [math]\displaystyle{ \,x }[/math] can be represented by a vector [math]\displaystyle{ \,u \in \{0, 1\}^n }[/math], the fact that [math]\displaystyle{ \,\textbf{diag}(u)^2 = \textbf{diag}(u) }[/math] for all variables [math]\displaystyle{ \,u \in \{0, 1\}^n }[/math], and the fact that for any matrix [math]\displaystyle{ \,B }[/math], [math]\displaystyle{ \,\lambda_{max}(B^TB) = \lambda_{max}(BB^T) }[/math], the SPCA problem can be re-expressed as:


[math]\displaystyle{ \,\phi(\rho) \; = \max_{u \in \{0,1\}^n} \; \lambda_{max}(\textbf{diag}(u) \; A \; \textbf{diag}(u)) - \rho\textbf{1}^Tu }[/math]

[math]\displaystyle{ = \max_{u \in \{0,1\}^n} \; \lambda_{max}(\textbf{diag}(u) \; B^TB \; \textbf{diag}(u)) - \rho\textbf{1}^Tu }[/math]
[math]\displaystyle{ = \max_{u \in \{0,1\}^n} \; \lambda_{max}(B \; \textbf{diag}(u) \; B^T) - \rho\textbf{1}^Tu }[/math]
[math]\displaystyle{ = \max_{ \|x\| = 1} \; \max_{u \in \{0,1\}^n} x^T B \; \textbf{diag}(u) \; B^T x - \rho\textbf{1}^Tu }[/math]
[math]\displaystyle{ = \max_{ \|x\| = 1} \; \max_{u \in \{0,1\}^n} \sum_{i=1}^n u_i((b_i^T x)^2 - \rho) }[/math].


Then, maximizing in [math]\displaystyle{ \,u }[/math] and using the fact that [math]\displaystyle{ \,max_{v \in \{0,1\}} \beta v = \beta_+ }[/math], the SPCA problem in the case when [math]\displaystyle{ \,\rho \le A_{11} }[/math] becomes:

[math]\displaystyle{ \phi(\rho)= \max_{\|x\|=1}\sum_{i=1}^n((b_i^Tx)^2-\rho)_{+} }[/math]

, which is a non-convex problem in [math]\displaystyle{ \,x \in R^n }[/math] and in which we then select variables [math]\displaystyle{ \,i }[/math] such that [math]\displaystyle{ \,(b_i^T x)^2 − \rho \gt 0 }[/math].

Here, the [math]\displaystyle{ b_i }[/math]'s are elements of a matrix [math]\displaystyle{ \,B }[/math] (as mentioned above) such that [math]\displaystyle{ \,B^T B = A }[/math] (i.e. [math]\displaystyle{ \,B }[/math] is the square root of [math]\displaystyle{ \,A }[/math]).

For more detail refer to <ref name= "afl" > Alexandre d'Aspremont, Francis Bach, and Laurent El Ghaoui. 2008. Optimal Solutions for Sparse Principal Component Analysis. J. Mach. Learn. Res. 9 (June 2008), 1269-1294. </ref>.

Greedy Solution

As estimating [math]\displaystyle{ n-k }[/math] eigenvalues at each iteration is costly, we get help from the fact that [math]\displaystyle{ uu^T }[/math] is a subgradient of [math]\displaystyle{ \lambda_{max} }[/math] at [math]\displaystyle{ X }[/math] if [math]\displaystyle{ u }[/math] is a leading eigenvector of [math]\displaystyle{ X }[/math] to get:

[math]\displaystyle{ \lambda_{max}(\sum_{j\in I_k\cup \{i\}}a_ja_j^T)\geq \lambda_{max}(\sum_{j\in I_k}a_ja_j^T)+(x_k^Ta_i)^2 }[/math]

We the derive the following algorithm:

Approximate Greedy Search Algorithm

Input: [math]\displaystyle{ \Sigma \in \textbf{R}^{n\times n} }[/math]

Algorithm:

1.Preprocessing: sort variables decreasingly diagonal elements and permute elements of [math]\displaystyle{ \Sigma }[/math] accordingly. Compute Cholesky decomposition [math]\displaystyle{ \Sigma =A^TA }[/math].

2.Initialization:[math]\displaystyle{ I_1=\{\}, x_1=a_1/\|a_1\| }[/math]

3.Compute [math]\displaystyle{ i_k= {\arg\max}_{i\notin I_k}(x_k^Ta_i)^2 }[/math]

4.Set [math]\displaystyle{ I_{k+1}=I_k\cup\{i_k\} }[/math] and compute [math]\displaystyle{ x_{k+1} }[/math] as the leading eigenvector of [math]\displaystyle{ \sum_{j\in I_{k+1}}a_j a_j^T }[/math]

5.Set [math]\displaystyle{ k=k+1 }[/math] if [math]\displaystyle{ k\lt n }[/math] go back to step 3.

Output: sparsity patterns [math]\displaystyle{ I_k }[/math]

Convex Relaxation

As it is mentioned sparse PCA problem can be written as:

[math]\displaystyle{ \phi(\rho)= \max_{\|x\|=1}\sum_{i=1}^n((a_i^Tx)^2)-\rho)_{+} }[/math]

reformulate the problem by changing [math]\displaystyle{ X=xx^T }[/math],thus we rewrite the equivalant:

[math]\displaystyle{ \phi(\rho)= \max\sum_{i=1}^n((a_i^TXa_i)^2)-\rho)_{+} }[/math]


[math]\displaystyle{ s.t. \textbf{Tr}(X)=1,\textbf{Rank}(X)=1, X\geq 0 }[/math]

As the goal is to maximize a convex function over the convex set,[math]\displaystyle{ \Delta_n=\{X\in S_n : \textbf{Tr}(X)=1, X\geq 0\} }[/math] the solution should be an exterme point of [math]\displaystyle{ \Delta_n }[/math]. unfortunately the adressed problem which we are going to solve is convex in [math]\displaystyle{ X }[/math] and not concave. So the problem is still harf to solve. How ever It is shown on <ref name="afl"/> that on rank one elements of [math]\displaystyle{ \Delta_n }[/math], it is equal to a concave function [math]\displaystyle{ X }[/math] and we use this to produce a semidefinite relaxation of the problem. The proposition is as below and the proof is provided in the paper.

Proposition 1 Let [math]\displaystyle{ A\in{R}^{n\times n}, \rho \geq0 }[/math] and denotes by [math]\displaystyle{ a_1,...,a_n\in R^n }[/math] the columns of [math]\displaystyle{ A }[/math] an upper bound on:

[math]\displaystyle{ \phi(\rho)= \max\sum_{i=1}^n((a_i^TXa_i)^2)-\rho)_{+} }[/math]


[math]\displaystyle{ s.t. \textbf{Tr}(X)=1,\textbf{Rank}(X)=1, X\geq 0 }[/math]

can be computed by solving

[math]\displaystyle{ \psi(\rho)= \max\sum_{i=1}^n(\textbf{Tr}(X^{1/2}B_iX^{1/2})_{+} }[/math]


[math]\displaystyle{ s.t. \textbf{Tr}(X)=1, X\geq 0 }[/math]

in the variable [math]\displaystyle{ X\in S_n }[/math]where [math]\displaystyle{ B_i=a_ia_i^T-\rho I }[/math] or also:

[math]\displaystyle{ \psi(\rho)= \max\sum_{i=1}^n(\textbf{Tr}(P_iB_i)_{+} }[/math]


[math]\displaystyle{ s.t. \textbf{Tr}(X)=1, X\geq 0, X\geq P_i \geq 0 }[/math]

which is semidefinite programming in the variables [math]\displaystyle{ X\in S_n, P_i\in S_n }[/math]

Application

In this section we mention the application of sparse PCA to subset selection. One of the other application is compressed sensing which you can see more detail about it on the main paper:

Subset selection

we consider [math]\displaystyle{ p }[/math] datapoints in [math]\displaystyle{ R^n }[/math] in a data matrix [math]\displaystyle{ X \in R^{p\times n} }[/math]. We are given real numbers [math]\displaystyle{ y \in R^p }[/math] to predict from [math]\displaystyle{ X }[/math] using linear regression, estimated by least squares. In subset selection problem we are looking for sparse coefficient [math]\displaystyle{ w }[/math] i.e a vector [math]\displaystyle{ w }[/math] with many zeros. we thus consider the problem:

[math]\displaystyle{ s(k)=\min_{w\in R^n, \textbf{Card}(w)\leq k}\|y-Xw\|^2 }[/math]

using sparsity patter [math]\displaystyle{ u\in \{0,1\} }[/math] and optimizing with respect to [math]\displaystyle{ w }[/math] and rewriting the formula using generalized eigenvalue we thus have:

[math]\displaystyle{ s(k)=\|y\|^2- \max_{u\in \{0,1\}, \textbf{1}^Tu\leq k}\max_{w\in R^n}\frac{w^T\textbf{diag}(u)X^Tyy^TX\textbf{diag}(u)w}{w^TX(u)^TX(u)w} }[/math]

the simple bound on the optimal value of the subset selection problem with the following form:

[math]\displaystyle{ w^T(X(v)^Tyy^TX(v)-s_0X(v)^TX(v))w\leq B }[/math]

where [math]\displaystyle{ B\geq0 }[/math], thus we have :

[math]\displaystyle{ \|y\|^2-s_0\geq(k)\geq\|y\|^2-s_0-B(\min_{v\in\{0,1\}^n,1^Tv=k}\lambda_{min}(X(v)^TX(v)))^{-1} }[/math]


[math]\displaystyle{ \geq\|y\|^2-s_0-B(\lambda_{min}(X^TX))^{-1} }[/math]

This bound gives a sufficient condition for optimality in subset selection, for any problem instances and any given subset.

Conclusion

Here present a new convex relaxation of sparse principal component analysis. The sufficient conditions are derived for optimality <ref name="afl"/>. Those conditions go together with efficient greedy algorithms that provide candidate solutions, many of which turn out to be optimal in practice. Finally the resulting upper bound has direct application to problems such as sparse recovery and subset selection.

References

<references />