Optimal Solutions for Sparse Principal Component Analysis

Revision as of 13:02, 7 December 2010 by Y24Sun (talk | contribs) (Computational Complexity)


Principal component analysis (PCA) is a method for finding linear combinations of features, called principal components, that correspond to orthogonal directions maximizing the variance in the data. In practice, performing PCA on a data set involves applying a singular value decomposition to the data matrix.

PCA facilitates the interpretation of the data if its factors are combinations of just a few latent variables, and not of many or all of the original ones. This is particularly true in applications in which the coordinate axes that correspond to the factors have a direct physical interpretation; for instance, in financial or biological applications, each axis might correspond to a specific asset or gene. Constraining the number of non-zero factor coefficients (loadings) in the sparse principal components to a number that is very low relative to the total number of coefficients, while having these sparse vectors explain a maximum amount of variance in the data, is known as sparse PCA. Sparse PCA has many applications in biology, finance and machine learning. Sparse principal components, like principal components, are vectors that span a lower-dimensional space explaining most of the variance in the original data. However, in order to find the sparse principal components using sparse PCA, some sacrifices must be made:

  • There is a reduction in the explained variance in the original data captured by the sparse principal components as compared to PCA.
  • There is a reduction in the orthogonality (uncorrelatedness) of the resulting variables (sparse principal components) as compared to PCA.

In this paper we are going to focus on the problem of sparse PCA which can be written as:

[math] \max_x \; x^{T}{\Sigma}x-\rho\,\textbf{Card}(x) [/math]
[math]\textrm{subject} \; \textrm{to} \; \|x\|_2 \le 1[/math]


  • [math]x\in \mathbb{R}^n[/math]
  • [math]\Sigma \in S_n[/math] is the symmetric positive semidefinite sample covariance matrix
  • [math]\,\rho[/math] is the parameter which controls the sparsity
  • [math]\textbf{Card}(x)[/math] expresses the cardinality ([math]\,l_0[/math] norm) of [math]\,x[/math].

Note that while solving the standard PCA problem is not complicated (since, for each factor, one simply needs to find a leading eigenvector, and this can be done in [math]\,O(n^2)[/math] time), solving sparse PCA is NP hard (since sparse PCA is a particular case of the sparse generalized eigenvalue problem).
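As a quick illustration of the standard PCA computation, the sketch below (a minimal NumPy example on synthetic data; all variable names are my own) obtains the leading principal component from the eigendecomposition of the sample covariance matrix and checks it against the SVD of the centered data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))   # synthetic data: 100 samples, 5 features
X -= X.mean(axis=0)                 # center the data
Sigma = X.T @ X / X.shape[0]        # sample covariance matrix

# The first principal component is the eigenvector of Sigma
# associated with its largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
pc1 = eigvecs[:, -1]

# Equivalently, it is the top right singular vector of the centered data.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(np.abs(pc1), np.abs(Vt[0]))   # equal up to sign
```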

The main part of this paper begins by formulating the sparse PCA (SPCA) problem, whose algorithm is based on the representation of PCA as a regression-type optimization problem (Zou et al., 2006) that allows the application of the LASSO (Tibshirani, 1996) (which is a penalization technique based on the [math]\,l_1[/math] norm). The main part of this paper then derives an approximate greedy algorithm for computing an approximate full set of good solutions with total complexity [math]\,O(n^3)[/math]. It also formulates a convex relaxation for sparse PCA and uses it to derive tractable sufficient conditions for a vector [math]\,x[/math] to be a global optimum of the above optimization problem. In the general approach to SPCA described in this paper, for a given vector [math]\,x[/math] having support [math]\,I[/math], [math]\,x[/math] can be tested to see if it is a globally optimal solution to the above optimization problem simply by performing a few steps of binary search to solve a one-dimensional convex minimization problem.


  • For a vector [math]\,x \in\mathbb{R}^n[/math], [math]\|x\|_1=\sum_{i=1}^n |x_i|[/math] and [math] \textbf{Card}(x)[/math] is the cardinality of [math]\,x[/math] (the number of non-zero coefficients of [math]\,x[/math]).
  • The support [math]\,I[/math] of [math]\,x[/math] is the set [math]\{i: x_i \neq 0\}[/math] and [math]\,I^c[/math] denotes its complement.
  • [math]\,\beta_{+} = \max\{\beta , 0\}[/math].
  • For a symmetric [math]n \times n [/math] matrix [math]\,X[/math] with eigenvalues [math]\,\lambda_i[/math], [math]\operatorname{Tr}(X)_{+}=\sum_{i=1}^{n}\max\{\lambda_i,0\}[/math].
  • The vector of all ones is written [math]\textbf{1}[/math], and the identity matrix is written [math]\,\textbf{I}[/math]. The diagonal matrix with the vector [math]\,u[/math] on the diagonal is written [math]\textbf{diag}(u)[/math].
  • For [math]\Sigma \, [/math], a symmetric [math]n \times n [/math] matrix, we can define [math]\phi(\rho) = \max_{\|x\| \leq 1} x^T \Sigma x - \rho \textbf{Card}(x)[/math].

Sparse PCA

The sparse PCA problem can be written as:

[math]\phi(\rho) = \max_{\|x\|_2 \le 1} \; x^{T}{\Sigma}x-\rho\,\textbf{Card}(x)[/math]

It is assumed that we have a square root [math]\,A[/math] of [math]\,\Sigma[/math] with [math]\,\Sigma = A^TA[/math], where [math]\,A \in R^{n \times n}[/math].

The above problem is directly related to the following problem which involves finding a cardinality-constrained maximum eigenvalue:

[math] \max_x \; x^{T}{\Sigma}x[/math]

[math]\textrm{subject} \; \textrm{to} \; \|x\|_2=1,\,\,\,\,\,\,\,\,\,\,\,\,\,(1)[/math]

[math]\textbf{Card}(x)\leq k,[/math]

in the variable [math]\,x \in R^n[/math]. Without loss of generality, assume that [math]\Sigma \geq 0[/math] and that the features are ordered by decreasing variance, i.e. [math]\Sigma_{11} \geq \dots \geq \Sigma_{nn}[/math].

Using duality, we can bound the solution of [math]\,(1)[/math] by:

[math]\inf_{\rho \in P}\phi(\rho)+\rho k [/math]

where [math]\,P[/math] is the set of penalty values for which [math]\,\phi(\rho)[/math] has been computed. This tells us that if [math]\,x[/math] is optimal for [math]\,\phi(\rho)[/math] and has cardinality exactly [math]\,k[/math], then [math]\,x[/math] is also a global optimum of [math]\,(1)[/math].
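For a small instance, both [math]\,\phi(\rho)[/math] and problem [math]\,(1)[/math] can be evaluated by brute force over all supports (for a fixed support [math]\,I[/math], the best objective value is [math]\,\lambda_{max}(\Sigma_{I,I}) - \rho |I|[/math]), which lets us check the duality bound numerically. A sketch with synthetic data (all names hypothetical):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
A0 = rng.standard_normal((4, 4))
Sigma = A0.T @ A0                 # random positive semidefinite covariance, n = 4
n = Sigma.shape[0]

def phi(rho):
    # phi(rho) by brute force: for support I the best value is
    # lambda_max(Sigma_II) - rho * |I|; the empty support gives 0.
    best = 0.0
    for k in range(1, n + 1):
        for I in combinations(range(n), k):
            best = max(best, np.linalg.eigvalsh(Sigma[np.ix_(I, I)])[-1] - rho * k)
    return best

def problem_one(k):
    # value of problem (1): max x^T Sigma x  s.t. ||x|| = 1, Card(x) <= k
    return max(np.linalg.eigvalsh(Sigma[np.ix_(I, I)])[-1]
               for r in range(1, k + 1) for I in combinations(range(n), r))

k = 2
bound = min(phi(rho) + rho * k for rho in np.linspace(0.0, Sigma.max(), 50))
assert bound >= problem_one(k) - 1e-9     # the duality bound holds
```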

Next, since [math]x^T \Sigma x\leq \Sigma_{11}(\sum_{i=1}^n|x_i|)^2[/math] and [math](\sum_{i=1}^n|x_i|)^2 \leq \|x\|^2\textbf{Card}(x)[/math] for all [math]x \in R^n[/math], assuming [math]\rho \geq \Sigma_{11}[/math] we obtain:

[math] \phi(\rho)=\max_{\|x\| \le 1} \; x^{T}{\Sigma}x-\rho\,\textbf{Card}(x) [/math]

[math]\leq \max_{\|x\| \le 1} \; (\Sigma_{11}-\rho)\textbf{Card}(x)[/math]

[math]\leq 0.[/math]

When [math]\,\rho \ge \Sigma_{11}[/math], the optimal solution to the SPCA problem is simply [math]\,x = 0[/math].

Now we shall look at the case when [math]\,\rho \lt \Sigma_{11}[/math]. In this case, the constraint [math]\,\|x\| \le 1[/math] is tight at the optimum. Using the fact that the sparsity pattern of a vector [math]\,x[/math] can be represented by a vector [math]\,u \in \{0, 1\}^n[/math], the fact that [math]\,\textbf{diag}(u)^2 = \textbf{diag}(u)[/math] for all [math]\,u \in \{0, 1\}^n[/math], and the fact that for any matrix [math]\,B[/math], [math]\,\lambda_{max}(B^TB) = \lambda_{max}(BB^T)[/math], the SPCA problem can be re-expressed as:

[math]\,\phi(\rho) \; = \max_{u \in \{0,1\}^n} \; \lambda_{max}(\textbf{diag}(u) \; \Sigma \; \textbf{diag}(u)) - \rho\textbf{1}^Tu[/math]

[math]= \max_{u \in \{0,1\}^n} \; \lambda_{max}(\textbf{diag}(u) \; A^TA \; \textbf{diag}(u)) - \rho\textbf{1}^Tu[/math]
[math]= \max_{u \in \{0,1\}^n} \; \lambda_{max}(A \; \textbf{diag}(u) \; A^T) - \rho\textbf{1}^Tu[/math]
[math]= \max_{ \|x\| = 1} \; \max_{u \in \{0,1\}^n} x^T A \; \textbf{diag}(u) \; A^T x - \rho\textbf{1}^Tu[/math]
[math]= \max_{ \|x\| = 1} \; \max_{u \in \{0,1\}^n} \sum_{i=1}^n u_i((a_i^T x)^2 - \rho) [/math].

Then, maximizing in [math]\,u[/math] and using the fact that [math]\max_{v \in \{0,1\}} \beta v = \beta_+[/math], the SPCA problem in the case when [math]\,\rho \lt \Sigma_{11}[/math] becomes:

[math]\phi(\rho)= \max_{\|x\|=1}\sum_{i=1}^n((a_i^Tx)^2-\rho)_{+},[/math]

which is a non-convex problem in [math]\,x \in R^n[/math]. Note that only the indices [math]\,i[/math] with [math]\,(a_i^T x)^2 - \rho \gt 0[/math] contribute to the sum.

Here, the [math]\,a_i[/math]'s are the elements of a matrix [math]\,A[/math] (which is mentioned above) such that [math]\,A^T A = \Sigma[/math] (i.e. [math]\,A[/math] is the square root of [math]\,\Sigma[/math]).
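The chain of equalities above can be spot-checked numerically. The sketch below (synthetic data, names my own) verifies that [math]\lambda_{max}(\textbf{diag}(u)\Sigma\textbf{diag}(u)) = \lambda_{max}(A\,\textbf{diag}(u)A^T)[/math] and that, for a fixed pattern [math]\,u[/math], the inner maximum over [math]\,\|x\|=1[/math] is attained at the leading eigenvector of [math]A\,\textbf{diag}(u)A^T[/math]:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n))    # plays the role of the square root of Sigma
Sigma = A.T @ A
rho = 0.3
u = np.array([1, 0, 1, 1, 0])      # an arbitrary sparsity pattern
D = np.diag(u)

# lambda_max(B^T B) = lambda_max(B B^T) with B = A diag(u), using diag(u)^2 = diag(u)
lam_left = np.linalg.eigvalsh(D @ Sigma @ D)[-1]
M = A @ D @ A.T
lam_right = np.linalg.eigvalsh(M)[-1]
assert np.allclose(lam_left, lam_right)

# For fixed u, max over ||x|| = 1 of sum_i u_i((a_i^T x)^2 - rho) is attained
# at the leading eigenvector of A diag(u) A^T, with value lam - rho * 1^T u.
x = np.linalg.eigh(M)[1][:, -1]
val = sum(u[i] * ((A[:, i] @ x) ** 2 - rho) for i in range(n))
assert np.allclose(val, lam_right - rho * u.sum())
```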

For more detail refer to <ref name= "afl" > Alexandre d'Aspremont, Francis Bach, and Laurent El Ghaoui. 2008. Optimal Solutions for Sparse Principal Component Analysis. J. Mach. Learn. Res. 9 (June 2008), 1269-1294. </ref>.

Greedy Solution

Before presenting their approximate greedy search algorithm for solving the SPCA problem, the authors first presented the full greedy search algorithm that follows directly from Moghaddam et al. (2006a). This algorithm starts from an initial solution (having cardinality one) at [math]\,\rho = \Sigma_{11}[/math], and then it updates an increasing sequence of index sets [math]\,I_k \subseteq [1, n][/math] by scanning all the remaining variables to find the index that gives the maximum contribution in terms of variance.

Pseudo-code for this full greedy search algorithm appears in the authors' paper (see References): at each step [math]\,k[/math], it computes the maximum eigenvalue of each of the [math]\,n-k[/math] candidate augmented sub-matrices and keeps the index achieving the largest value.


At every step, [math]\,I_k[/math] represents the set of non-zero elements, or the sparsity pattern, of the current point. Given [math]\,I_k[/math], the solution to the SPCA problem can be defined as [math]\,x_k = \underset{\{x_{I_k^c} = 0, \|x\| = 1\}}{\operatorname{argmax}} x^T \Sigma x - \rho k[/math], i.e. [math]\,x_k[/math] is formed simply by padding zeros to the leading eigenvector of the sub-matrix [math]\,\Sigma_{I_k,I_k}[/math].

As computing [math]\,n-k[/math] maximum eigenvalues at each iteration is costly, the authors instead use the fact that [math]\,uu^T[/math] is a sub-gradient of [math]\,\lambda_{max}[/math] at [math]\,X[/math] whenever [math]\,u[/math] is a leading eigenvector of [math]\,X[/math], which gives [math]\lambda_{max}(\sum_{j\in I_k\cup \{i\}}a_ja_j^T)\geq \lambda_{max}(\sum_{j\in I_k}a_ja_j^T)+(x_k^Ta_i)^2[/math]. This provides a lower bound on the objective that does not require computing [math]\,n - k[/math] eigenvalues at each iteration.
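This lower bound is easy to verify numerically: adding [math]\,a_ia_i^T[/math] to the current matrix increases its largest eigenvalue by at least [math]\,(x_k^Ta_i)^2[/math]. A small synthetic check (names my own):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 6))               # columns a_j play the role of data
I_k = [0, 1]                                  # current index set
M = sum(np.outer(A[:, j], A[:, j]) for j in I_k)
x_k = np.linalg.eigh(M)[1][:, -1]             # leading eigenvector of M
lam = np.linalg.eigvalsh(M)[-1]

for i in range(2, 6):                         # candidate indices outside I_k
    a_i = A[:, i]
    lhs = np.linalg.eigvalsh(M + np.outer(a_i, a_i))[-1]
    assert lhs >= lam + (x_k @ a_i) ** 2 - 1e-9   # sub-gradient lower bound
```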

The authors then derived the following algorithm for solving the SPCA problem:

Approximate Greedy Search Algorithm

Input: [math]\Sigma \in \textbf{R}^{n\times n}[/math]


1. Preprocessing: sort the variables by decreasing diagonal elements and permute the elements of [math]\Sigma[/math] accordingly. Compute the Cholesky decomposition [math]\,\Sigma =A^TA[/math].

2. Initialization: [math]I_1=\{1\}, x_1=a_1/\|a_1\|[/math].

3. Compute [math]i_k= {\arg\max}_{i\notin I_k}(x_k^Ta_i)^2[/math].

4. Set [math]I_{k+1}=I_k\cup\{i_k\}[/math] and compute [math]\,x_{k+1}[/math] as the leading eigenvector of [math]\sum_{j\in I_{k+1}}a_j a_j^T[/math].

5. Set [math]\,k=k+1[/math]; if [math]\,k\lt n[/math], go back to step 3.

Output: sparsity patterns [math]\,I_k[/math]

As in the full greedy search algorithm, at every step, [math]\,I_k[/math] represents the set of non-zero elements, or the sparsity pattern, of the current point and, given [math]\,I_k[/math], the solution to the SPCA problem can be defined as [math]\,x_k = \underset{\{x_{I_k^c} = 0, \|x\| = 1\}}{\operatorname{argmax}} x^T \Sigma x - \rho k[/math], i.e. we form [math]\,x_k[/math] simply by padding zeros to the leading eigenvector of the sub-matrix [math]\,\Sigma_{I_k,I_k}[/math].
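A minimal NumPy sketch of the approximate greedy search (function and variable names are my own; the Cholesky step assumes [math]\,\Sigma[/math] is positive definite):

```python
import numpy as np

def approx_greedy_spca(Sigma):
    """Return the nested sparsity patterns I_1, ..., I_n (original indices)."""
    n = Sigma.shape[0]
    order = np.argsort(-np.diag(Sigma))     # step 1: sort by decreasing variance
    Sigma = Sigma[np.ix_(order, order)]
    A = np.linalg.cholesky(Sigma).T         # Sigma = A^T A (requires Sigma > 0)
    I = [0]                                 # step 2: start from the top variable
    x = A[:, 0] / np.linalg.norm(A[:, 0])
    patterns = [set(map(int, order[I]))]
    while len(I) < n:                       # steps 3-5
        rest = [i for i in range(n) if i not in I]
        i_k = max(rest, key=lambda i: (x @ A[:, i]) ** 2)
        I.append(i_k)
        M = sum(np.outer(A[:, j], A[:, j]) for j in I)
        x = np.linalg.eigh(M)[1][:, -1]     # leading eigenvector update
        patterns.append(set(map(int, order[I])))
    return patterns

rng = np.random.default_rng(4)
B = rng.standard_normal((5, 5))
pats = approx_greedy_spca(B @ B.T + 0.1 * np.eye(5))
assert pats[-1] == set(range(5))
```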

Computational Complexity

The full greedy search algorithm for solving the SPCA problem has a complexity of [math]\,O(n^4)[/math] because, at each step [math]\,k[/math], it computes [math]\,n-k[/math] maximum eigenvalues of matrices having size [math]\,k[/math]. On the other hand, the authors' approximate greedy search algorithm for solving the SPCA problem has a complexity of [math]\,O(n^3)[/math]. This is because the first Cholesky decomposition has a complexity of [math]\,O(n^3)[/math] and, in the [math]\,k[/math]th iteration, there is a complexity of [math]\,O(k^2)[/math] for the maximum eigenvalue problem and a complexity of [math]\,O(n^2)[/math] for finding all products [math]\,x^T a_j[/math].

Convex Relaxation

As mentioned above, the sparse PCA problem can be written as:

[math]\phi(\rho)= \max_{\|x\|=1}\sum_{i=1}^n((a_i^Tx)^2-\rho)_{+}[/math]

We reformulate the problem in terms of [math]X=xx^T[/math], which gives the equivalent problem:

[math]\phi(\rho)= \max\sum_{i=1}^n(a_i^TXa_i-\rho)_{+}[/math]

[math]s.t. \; \textbf{Tr}(X)=1,\,\textbf{Rank}(X)=1,\, X\geq 0 [/math]

Since the goal is to maximize a convex function over the convex set [math]\Delta_n=\{X\in S_n : \textbf{Tr}(X)=1, X\geq 0\}[/math], the solution must be an extreme point of [math]\Delta_n[/math]. Unfortunately, the problem is convex in [math]X[/math] and not concave, so it remains hard to solve. However, it is shown in <ref name="afl"/> that on rank-one elements of [math]\Delta_n[/math] the objective is equal to a concave function of [math]X[/math], and this fact is used to produce a semidefinite relaxation of the problem. The proposition is stated below; the proof is provided in the paper.

Proposition 1 Let [math]A\in{R}^{n\times n}[/math], [math]\rho \geq0[/math], and denote by [math]a_1,...,a_n\in R^n[/math] the columns of [math]A[/math]. An upper bound on:

[math]\phi(\rho)= \max\sum_{i=1}^n(a_i^TXa_i-\rho)_{+}[/math]

[math]s.t. \; \textbf{Tr}(X)=1,\,\textbf{Rank}(X)=1,\, X\geq 0 [/math]

can be computed by solving

[math]\psi(\rho)= \max\sum_{i=1}^n\textbf{Tr}(X^{1/2}B_iX^{1/2})_{+}[/math]

[math]s.t. \; \textbf{Tr}(X)=1,\, X\geq 0 [/math]

in the variable [math]X\in S_n[/math], where [math]B_i=a_ia_i^T-\rho I[/math], or also:

[math]\psi(\rho)= \max\sum_{i=1}^n\textbf{Tr}(P_iB_i)[/math]

[math]s.t. \; \textbf{Tr}(X)=1,\, X\geq 0,\, X\geq P_i \geq 0,[/math]

which is a semidefinite program in the variables [math]X\in S_n[/math] and [math]P_i\in S_n[/math].
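On rank-one feasible points [math]\,X = xx^T[/math] with [math]\,\|x\|=1[/math], the relaxed objective coincides with the original one, since [math]\,X^{1/2} = X[/math] for a rank-one projection. This can be checked numerically (synthetic data, names my own):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5
A = rng.standard_normal((n, n))    # plays the role of the square root of Sigma
rho = 0.4

x = rng.standard_normal(n)
x /= np.linalg.norm(x)
X = np.outer(x, x)                 # rank-one feasible point: Tr(X) = 1, X >= 0

def tr_plus(M):
    # sum of the positive eigenvalues of a symmetric matrix
    return np.maximum(np.linalg.eigvalsh(M), 0).sum()

# Relaxed objective at X; X^{1/2} = X because X is a rank-one projection.
relaxed = sum(tr_plus(X @ (np.outer(A[:, i], A[:, i]) - rho * np.eye(n)) @ X)
              for i in range(n))

# Original objective at x.
original = sum(max((A[:, i] @ x) ** 2 - rho, 0.0) for i in range(n))

assert np.allclose(relaxed, original)
```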


This section describes the application of sparse PCA to subset selection. Another application is compressed sensing, which is described in detail in the original paper <ref name="afl"/>.

Subset selection

We consider [math]p[/math] data points in [math]R^n[/math], collected in a data matrix [math]X \in R^{p\times n}[/math], and real numbers [math] y \in R^p[/math] to be predicted from [math]X[/math] using linear regression, estimated by least squares. In the subset selection problem we look for a sparse coefficient vector [math]w[/math], i.e. a vector [math]w[/math] with many zero entries. We thus consider the problem:

[math]s(k)=\min_{w\in R^n, \textbf{Card}(w)\leq k}\|y-Xw\|^2[/math]
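For small [math]\,n[/math], [math]\,s(k)[/math] can be evaluated exactly by enumerating all supports of size at most [math]\,k[/math] and solving an ordinary least-squares problem on each. A brute-force sketch with synthetic data (names my own):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
p, n = 30, 6
X = rng.standard_normal((p, n))            # synthetic design matrix
y = rng.standard_normal(p)                 # synthetic response

def s(k):
    # min over supports of size <= k of the least-squares residual ||y - Xw||^2
    best = float(y @ y)                    # w = 0 is always feasible
    for r in range(1, k + 1):
        for I in combinations(range(n), r):
            w, *_ = np.linalg.lstsq(X[:, list(I)], y, rcond=None)
            best = min(best, float(np.sum((y - X[:, list(I)] @ w) ** 2)))
    return best

# The residual is non-increasing in the cardinality budget k.
assert s(1) >= s(2) >= s(n)
```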

Using a sparsity pattern [math]u\in \{0,1\}^n[/math], optimizing with respect to [math]w[/math], and rewriting the inner problem as a generalized eigenvalue problem, we have:

[math]s(k)=\|y\|^2- \max_{u\in \{0,1\}^n,\, \textbf{1}^Tu\leq k}\;\max_{w\in R^n}\frac{w^T\textbf{diag}(u)X^Tyy^TX\textbf{diag}(u)w}{w^TX(u)^TX(u)w}[/math]

where [math]X(u)[/math] denotes the matrix formed by the columns of [math]X[/math] selected by [math]u[/math].

The paper then derives a simple bound on the optimal value of the subset selection problem of the form:

[math]w^T(X(v)^Tyy^TX(v)-s_0X(v)^TX(v))w\leq B[/math]

where [math]B\geq0[/math]. This bound gives a sufficient condition for optimality in subset selection, for any problem instance and any given subset.


Conclusion

This paper presents a new convex relaxation of sparse principal component analysis, and derives tractable sufficient conditions for optimality <ref name="afl"/>. These conditions go together with efficient greedy algorithms that provide candidate solutions, many of which turn out to be optimal in practice. Finally, the resulting upper bound has direct applications to problems such as sparse recovery and subset selection.


<references />