optimal Solutions forSparse Principal Component Analysis

Introduction

Principle component analysis (PCA) is a method for finding a linear combinations of features, called principle components, corresponding to the directions that maximize variance in the data and are orthogonal to one another. PCA facilitates the interpretation of the data if their factors are just the combinations of a few latent variables, and not many or all of the original ones. Constraining the number of nonzero coefficients in PCA is known as sparse PCA. Sparse PCA has many applications in biology, finance and many machine learning problems. Sparse principal components, like principal components, are vectors that span a lower-dimensional space that explain most of variance in the original data. However, in order to find the sparse principal components using sparse PCA, it is necessary to make some sacrifices:

There is a reduction in the explained variance in the original data captured by the sparse principal components as compared to PCA.
There is a reduction in the orthogonality (independence or correlation) between the resulting variables (sparse principal components) as compared to PCA.

In this paper we are going to focus on the problem of sparse PCA which can be written as:

[math]\displaystyle{ \max_x \; x^{T}{A}x-\rho\textbf{Card}^{2}(x) }[/math]
[math]\displaystyle{ \textrm{subject} \; \textrm{to} \; \|x\|_2=1 }[/math]

where:

[math]\displaystyle{ x\in \mathbb{R}^n }[/math]
[math]\displaystyle{ A \in S_n }[/math] is the symmetric positive semidefinite sample covariance matrix
[math]\displaystyle{ \,\rho }[/math] is the parameter which controls the sparsity
[math]\displaystyle{ \textbf{Card}(x) }[/math] expresses the cardinality of [math]\displaystyle{ x }[/math].

Note that while solving the standard PCA problem is not complicated, solving Sparse PCA is NP hard.

This paper first formulates the sparse PCA problem and then derives a greedy algorithm for computing a full set of good solutions with total complexity [math]\displaystyle{ O(n^3) }[/math]. It also formulates a convex relaxation for sparse PCA and uses it to derive a tractable sufficient conditions for a vector [math]\displaystyle{ x }[/math] to be a global optimum of the above formula.

Notation

For a vector [math]\displaystyle{ \,x \in\mathbb{R}^n }[/math], [math]\displaystyle{ \|x\|_1=\sum_{i=1}^n |x_i| }[/math] and [math]\displaystyle{ \textbf{Card}(x) }[/math]is the number of non-zero coefficients of [math]\displaystyle{ x }[/math]
The support [math]\displaystyle{ \,I }[/math] of [math]\displaystyle{ \,x }[/math] is the set [math]\displaystyle{ \{i: x_i \neq 0\} }[/math] and [math]\displaystyle{ \,I^c }[/math] denotes its complement.
[math]\displaystyle{ \,\beta_{+} = \max\{\beta , 0\} }[/math].
For a symmetric matrix [math]\displaystyle{ \,X }[/math] with eigenvalues [math]\displaystyle{ \,\lambda_i }[/math],[math]\displaystyle{ \operatorname{Tr}(X)_{+}=\sum_{i=1}^{n}\max\{\lambda_i,0\} }[/math].
The vector of all ones is written [math]\displaystyle{ \textbf{1} }[/math]. The diagonal matrix with the vector [math]\displaystyle{ u }[/math] on the diagonal is written [math]\displaystyle{ \textbf{diag}(u) }[/math]
For [math]\displaystyle{ A \, }[/math], a symmetric [math]\displaystyle{ n \times n }[/math] matrix, we can define [math]\displaystyle{ \Phi(\rho) = \max_{\|x\| \leq 1} x^T A x - \rho \textbf{Card}(x) }[/math]

Sparse PCA

Sparse PCA problem can be written as:

[math]\displaystyle{ \;\Phi(\rho) }[/math]
[math]\displaystyle{ \textrm{subject} \; \textrm{to} \; \|x\|_2=1 }[/math]

The above problem is directly related to the solving the following:

[math]\displaystyle{ \max_x \; x^{T}{A}x }[/math]

[math]\displaystyle{ \textrm{subject} \; \textrm{to} \; \|x\|_2=1,\,\,\,\,\,\,\,\,\,\,\,\,\,(1) }[/math]

[math]\displaystyle{ \textbf{Card}(x)\leq k. }[/math]

Without loss of generality, assume that [math]\displaystyle{ A \geq 0 }[/math] and the features are ordered in decreasing size of variance, such that [math]\displaystyle{ A_{11} \geq \dots \geq A_{nn} }[/math]

Using duality, we can bound the solution of (1) by:

[math]\displaystyle{ \inf_{\rho \in P}\Phi(\rho)+\rho k }[/math]

where [math]\displaystyle{ P }[/math] is the set of penalty values for which [math]\displaystyle{ \Phi(\rho) }[/math] has been computed. This tells us that if we prove [math]\displaystyle{ x }[/math] is optimal for [math]\displaystyle{ \Phi(\rho) }[/math] then it is the global optimum for (1) and the cardinality of [math]\displaystyle{ x }[/math] is exactly [math]\displaystyle{ k }[/math].

Next, since [math]\displaystyle{ z^T\sigma z\leq \sigma_{11}(\sum_{i=1}^n|x_i|^2)^2 }[/math] and [math]\displaystyle{ (\sum_{i=1}^n|x_i|^2)^2\leq \|z\|^2\textbf{Card}(xz) }[/math] and using the above lemma while assuming [math]\displaystyle{ \rho \geq \Sigma_{11} }[/math], the following will be achieved:

[math]\displaystyle{ \Phi(\rho)=\textrm{max}_{\|z\|_1} \; x^{T}{A}x-\rho\textbf{Card}^{2}(x) }[/math]

[math]\displaystyle{ \leq (\Sigma_{11}-\rho)\textbf{Card}(z) }[/math]

[math]\displaystyle{ \leq 0 }[/math]

Thus, this means that under the assumption that [math]\displaystyle{ \rho \geq \Sigma_{11} }[/math], the optimal solution to [math]\displaystyle{ \Phi(\rho)is \lt math\gt x = 0 }[/math]. So, for this reason we will proceed with the contrary assumption, that [math]\displaystyle{ \rho \leq \Sigma_{11} }[/math]

Rewriting the perliminary formula in the equivelant form we'll have:

[math]\displaystyle{ \phi(\rho)= \max_{\|x\|=1}\sum_{i=1}^n((a_i^Tx)^2)-\rho)_{+} }[/math]

Here, the [math]\displaystyle{ a_i }[/math]'s are elements of a matrix [math]\displaystyle{ \mathcal{A} }[/math] such that [math]\displaystyle{ \mathcal{A}^T \mathcal{A} = A }[/math] (i.e. [math]\displaystyle{ \mathcal{A} }[/math] is the square root of [math]\displaystyle{ A }[/math]).

For more detail refer to <ref name= "afl" > Alexandre d'Aspremont, Francis Bach, and Laurent El Ghaoui. 2008. Optimal Solutions for Sparse Principal Component Analysis. J. Mach. Learn. Res. 9 (June 2008), 1269-1294. </ref>.

Greedy Solution

As estimating [math]\displaystyle{ n-k }[/math] eigenvalues at each iteration is costly, we get help from the fact that [math]\displaystyle{ uu^T }[/math] is a subgradient of [math]\displaystyle{ \lambda_{max} }[/math] at [math]\displaystyle{ X }[/math] if [math]\displaystyle{ u }[/math] is a leading eigenvector of [math]\displaystyle{ X }[/math] to get:

[math]\displaystyle{ \lambda_{max}(\sum_{j\in I_k\cup \{i\}}a_ja_j^T)\geq \lambda_{max}(\sum_{j\in I_k}a_ja_j^T)+(x_k^Ta_i)^2 }[/math]

We the derive the following algorithm:

Approximate Greedy Search Algorithm

Input: [math]\displaystyle{ \Sigma \in \textbf{R}^{n\times n} }[/math]

Algorithm:

1.Preprocessing: sort variables decreasingly diagonal elements and permute elements of [math]\displaystyle{ \Sigma }[/math] accordingly. Compute Cholesky decomposition [math]\displaystyle{ \Sigma =A^TA }[/math].

2.Initialization:[math]\displaystyle{ I_1=\{\}, x_1=a_1/\|a_1\| }[/math]

3.Compute [math]\displaystyle{ i_k= {\arg\max}_{i\notin I_k}(x_k^Ta_i)^2 }[/math]

4.Set [math]\displaystyle{ I_{k+1}=I_k\cup\{i_k\} }[/math] and compute [math]\displaystyle{ x_{k+1} }[/math] as the leading eigenvector of [math]\displaystyle{ \sum_{j\in I_{k+1}}a_j a_j^T }[/math]

5.Set [math]\displaystyle{ k=k+1 }[/math] if [math]\displaystyle{ k\lt n }[/math] go back to step 3.

Output: sparsity patterns [math]\displaystyle{ I_k }[/math]

Convex Relaxation

As it is mentioned sparse PCA problem can be written as:

[math]\displaystyle{ \phi(\rho)= \max_{\|x\|=1}\sum_{i=1}^n((a_i^Tx)^2)-\rho)_{+} }[/math]

reformulate the problem by changing [math]\displaystyle{ X=xx^T }[/math],thus we rewrite the equivalant:

[math]\displaystyle{ \phi(\rho)= \max\sum_{i=1}^n((a_i^TXa_i)^2)-\rho)_{+} }[/math]

[math]\displaystyle{ s.t. \textbf{Tr}(X)=1,\textbf{Rank}(X)=1, X\geq 0 }[/math]

As the goal is to maximize a convex function over the convex set,[math]\displaystyle{ \Delta_n=\{X\in S_n : \textbf{Tr}(X)=1, X\geq 0\} }[/math] the solution should be an exterme point of [math]\displaystyle{ \Delta_n }[/math]. unfortunately the adressed problem which we are going to solve is convex in [math]\displaystyle{ X }[/math] and not concave. So the problem is still harf to solve. How ever It is shown on <ref name="afl"/> that on rank one elements of [math]\displaystyle{ \Delta_n }[/math], it is equal to a concave function [math]\displaystyle{ X }[/math] and we use this to produce a semidefinite relaxation of the problem. The proposition is as below and the proof is provided in the paper.

Proposition 1 Let [math]\displaystyle{ A\in{R}^{n\times n}, \rho \geq0 }[/math] and denotes by [math]\displaystyle{ a_1,...,a_n\in R^n }[/math] the columns of [math]\displaystyle{ A }[/math] an upper bound on:

[math]\displaystyle{ \phi(\rho)= \max\sum_{i=1}^n((a_i^TXa_i)^2)-\rho)_{+} }[/math]

[math]\displaystyle{ s.t. \textbf{Tr}(X)=1,\textbf{Rank}(X)=1, X\geq 0 }[/math]

can be computed by solving

[math]\displaystyle{ \psi(\rho)= \max\sum_{i=1}^n(\textbf{Tr}(X^{1/2}B_iX^{1/2})_{+} }[/math]

[math]\displaystyle{ s.t. \textbf{Tr}(X)=1, X\geq 0 }[/math]

in the variable [math]\displaystyle{ X\in S_n }[/math]where [math]\displaystyle{ B_i=a_ia_i^T-\rho I }[/math] or also:

[math]\displaystyle{ \psi(\rho)= \max\sum_{i=1}^n(\textbf{Tr}(P_iB_i)_{+} }[/math]

[math]\displaystyle{ s.t. \textbf{Tr}(X)=1, X\geq 0, X\geq P_i \geq 0 }[/math]

which is semidefinite programming in the variables [math]\displaystyle{ X\in S_n, P_i\in S_n }[/math]

Application

In this section we mention the application of sparse PCA to subset selection. One of the other application is compressed sensing which you can see more detail about it on the main paper:

Subset selection

we consider [math]\displaystyle{ p }[/math] datapoints in [math]\displaystyle{ R^n }[/math] in a data matrix [math]\displaystyle{ X \in R^{p\times n} }[/math]. We are given real numbers [math]\displaystyle{ y \in R^p }[/math] to predict from [math]\displaystyle{ X }[/math] using linear regression, estimated by least squares. In subset selection problem we are looking for sparse coefficient [math]\displaystyle{ w }[/math] i.e a vector [math]\displaystyle{ w }[/math] with many zeros. we thus consider the problem:

[math]\displaystyle{ s(k)=\min_{w\in R^n, \textbf{Card}(w)\leq k}\|y-Xw\|^2 }[/math]

using sparsity patter [math]\displaystyle{ u\in \{0,1\} }[/math] and optimizing with respect to [math]\displaystyle{ w }[/math] and rewriting the formula using generalized eigenvalue we thus have:

[math]\displaystyle{ s(k)=\|y\|^2- \max_{u\in \{0,1\}, \textbf{1}^Tu\leq k}\max_{w\in R^n}\frac{w^T\textbf{diag}(u)X^Tyy^TX\textbf{diag}(u)w}{w^TX(u)^TX(u)w} }[/math]

the simple bound on the optimal value of the subset selection problem with the following form:

[math]\displaystyle{ w^T(X(v)^Tyy^TX(v)-s_0X(v)^TX(v))w\leq B }[/math]

where [math]\displaystyle{ B\geq0 }[/math], thus we have :

[math]\displaystyle{ \|y\|^2-s_0\geq(k)\geq\|y\|^2-s_0-B(\min_{v\in\{0,1\}^n,1^Tv=k}\lambda_{min}(X(v)^TX(v)))^{-1} }[/math]

[math]\displaystyle{ \geq\|y\|^2-s_0-B(\lambda_{min}(X^TX))^{-1} }[/math]

This bound gives a sufficient condition for optimality in subset selection, for any problem instances and any given subset.

Conclusion

Here present a new convex relaxation of sparse principal component analysis. The sufficient conditions are derived for optimality <ref name="afl"/>. Those conditions go together with efficient greedy algorithms that provide candidate solutions, many of which turn out to be optimal in practice. Finally the resulting upper bound has direct application to problems such as sparse recovery and subset selection.

References