# Optimal Solutions for Sparse Principal Component Analysis

## Introduction

Principal component analysis (PCA) is a method for finding linear combinations of features, called principal components, that correspond to the directions of maximum variance in the data, which are orthogonal to one another. In practice, performing PCA on a data set involves applying the singular value decomposition to the data matrix.
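As a minimal sketch of the last statement, the snippet below computes principal components from the SVD of the centered data matrix. The function name `pca_via_svd` and the synthetic data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def pca_via_svd(data, num_components):
    """PCA of a (samples x features) array via the SVD of the centered data."""
    centered = data - data.mean(axis=0)
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained_variance = singular_values**2 / (data.shape[0] - 1)
    return vt[:num_components], explained_variance[:num_components]

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # synthetic data
components, variances = pca_via_svd(data, num_components=2)
```

The returned variances coincide with the largest eigenvalues of the sample covariance matrix, and the components are dense in general, which is what sparse PCA sets out to change.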

PCA facilitates the interpretation of the data if the components are linear combinations of only a few latent variables, and not many or all of the original ones. This is particularly true in applications in which the coordinate axes that correspond to the factors have a direct physical interpretation; for instance, in financial or biological applications, each axis might correspond to a specific asset or to a specific gene. Sparse PCA constrains the number of non-zero factor coefficients (loadings) in the principal components to be very small relative to the total number of coefficients, while requiring these sparse vectors to explain as much of the variance in the data as possible. In other words, sparse PCA is an extension of the PCA method that trades off statistical fidelity against interpretability by computing principal components that use as few non-zero coefficients as possible while preserving as much data variation as possible. Sparse PCA has many applications in biology, finance and machine learning. Sparse principal components, like principal components, are vectors that span a lower-dimensional space that explains most of the variance in the original data. However, in order to find the sparse principal components using sparse PCA, it is necessary to make some sacrifices:

• There is a reduction in the explained variance in the original data captured by the sparse principal components as compared to PCA.
• There is a loss of orthogonality between the resulting variables: unlike ordinary principal components, the sparse principal components are in general neither orthogonal nor uncorrelated.

This paper focuses on the sparse PCA problem, which can be written as:

$\max_x \; x^{T}{\Sigma}x-\rho\textbf{Card}(x)$
$\textrm{subject} \; \textrm{to} \; \|x\|_2 \le 1$

where:

• $x\in \mathbb{R}^n$
• $\Sigma \in S_n$ is the symmetric positive semidefinite sample covariance matrix
• $\,\rho$ is the parameter which controls the sparsity
• $\textbf{Card}(x)$ expresses the cardinality ($\,l_0$ norm) of $\,x$.

Note that while solving the standard PCA problem is not complicated (since, for each factor, one simply needs to find a leading eigenvector, and this can be done in $\,O(n^2)$ time), solving sparse PCA is NP-hard (since sparse PCA is a particular case of the sparse generalized eigenvalue problem).

The paper begins by formulating the sparse PCA (SPCA) problem, whose algorithm is based on the representation of PCA as a regression-type optimization problem (Zou et al., 2006) that allows the application of the LASSO (Tibshirani, 1996), a penalization technique based on the $\,l_1$ norm. The $l_0$ cardinality objective and the $l_1$ LASSO penalty are fundamentally related: the conditions that guarantee sparse variable selection or recovery when the $l_1$ norm is used to solve $l_0$ problems are based on the restricted isometry property. This connection was established recently by Candes and Tao<ref name="candes2005">E. J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory. 51(12):4203--4215, 2005.</ref> and Donoho and Tanner<ref name="donoho2005">D. L. Donoho and J. Tanner. Sparse nonnegative solutions of underdetermined linear equations by linear programming. Proceedings of the National Academy of Sciences, 102(27):9446--9451, 2005.</ref>, among others. Interestingly, it also serves as the basis for compressed sensing, another topic we have seen in this course.

The main part of this paper then derives an approximate greedy algorithm for computing a full set of good solutions with total complexity $\,O(n^3)$. It also formulates a convex relaxation for sparse PCA and uses it to derive tractable sufficient conditions for a vector $\,x$ to be a global optimum of the above optimization problem. In the general approach to SPCA described in this paper, a given vector $\,x$ with support $\,I$ can be tested for global optimality simply by performing a few steps of binary search to solve a one-dimensional convex minimization problem.

## Notation

• For a vector $\,x \in\mathbb{R}^n$, $\|x\|_1=\sum_{i=1}^n |x_i|$ and $\textbf{Card}(x)$ is the cardinality of $\,x$ (the number of non-zero coefficients of $\,x$).
• The support $\,I$ of $\,x$ is the set $\{i: x_i \neq 0\}$ and $\,I^c$ denotes its complement.
• $\,\beta_{+} = \max\{\beta , 0\}$.
• For a symmetric $n \times n$ matrix $\,X$ with eigenvalues $\,\lambda_i$, $\operatorname{Tr}(X)_{+}=\sum_{i=1}^{n}\max\{\lambda_i,0\}$.
• The vector of all ones is written $\textbf{1}$, and the identity matrix is written $\,\textbf{I}$. The diagonal matrix with the vector $\,u$ on the diagonal is written $\textbf{diag}(u)$.
• For $\Sigma \,$, a symmetric $n \times n$ matrix, we can define $\phi(\rho) = \max_{\|x\| \leq 1} x^T \Sigma x - \rho \textbf{Card}(x)$.

## Sparse PCA

The sparse PCA problem can be written as:

$\;\phi(\rho)$
$\textrm{subject} \; \textrm{to} \; \|x\|_2=1.$

Since $\,\Sigma \in S_{n}$ is positive semidefinite, it has a square root. Let $\,A \in R^{n \times n}$ denote a square root of $\,\Sigma$, i.e. a matrix with $\,\Sigma = A^TA$.

The above problem is directly related to the following problem which involves finding a cardinality-constrained maximum eigenvalue:

$\max_x \; x^{T}{\Sigma}x$

$\textrm{subject} \; \textrm{to} \; \|x\|_2=1,\,\,\,\,\,\,\,\,\,\,\,\,\,(1)$

$\textbf{Card}(x)\leq k,$

in the variable $\,x \in R^n$. Suppose the features of $\,\Sigma$ are ordered in decreasing size of variance, i.e. $\Sigma_{11} \geq \dots \geq \Sigma_{nn}$.

Using duality, we can bound the solution of $\,(1)$ by:

$\inf_{\rho \in P}\phi(\rho)+\rho k$

where $\,P$ is the set of penalty values for which $\,\phi(\rho)$ has been computed. This tells us that if we prove $\,x$ is optimal for $\,\phi(\rho)$, then $\,x$ is also the global optimum of $\,(1)$ with $\,k = \textbf{Card}(x)$.

For the remainder of the paper, the authors assume $\rho \leq \Sigma_{11}$. To see why this is the case, we will assume for the moment that $\rho \geq \Sigma_{11}$. Then, since $x^T \Sigma x\leq \Sigma_{11}(\sum_{i=1}^n|x_i|)^2$ and $(\sum_{i=1}^n|x_i|)^2 \leq \|x\|^2\textbf{Card}(x) \; \forall x \in R^n$ we get the following:

$\phi(\rho)=\max_{\|x\| \le 1} \; x^{T}{\Sigma}x-\rho\textbf{Card}(x)$

$\leq (\Sigma_{11}-\rho)\textbf{Card}(x)$

$\leq 0.$

This implies that the optimal solution to the SPCA problem is simply $\,x = 0$. Thus, we assume $\,\rho \leq \Sigma_{11}$ and in this case the inequality $\,\|x\| \le 1$ is tight.
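The two inequalities used above follow from standard facts, spelled out here as a short sketch. Because $\,\Sigma$ is positive semidefinite, every $2 \times 2$ principal minor is non-negative, so $|\Sigma_{ij}| \leq \sqrt{\Sigma_{ii}\Sigma_{jj}} \leq \Sigma_{11}$, which gives

$x^T\Sigma x = \sum_{i,j}\Sigma_{ij}x_ix_j \leq \Sigma_{11}\sum_{i,j}|x_i||x_j| = \Sigma_{11}(\sum_{i=1}^n|x_i|)^2.$

The second inequality is the Cauchy-Schwarz inequality applied on the support $\,I$ of $\,x$:

$(\sum_{i\in I}|x_i|\cdot 1)^2 \leq (\sum_{i\in I}x_i^2)(\sum_{i\in I}1) = \|x\|^2\,\textbf{Card}(x).$

Combining both with $\,\|x\| \leq 1$ yields $x^T\Sigma x - \rho\textbf{Card}(x) \leq (\Sigma_{11}-\rho)\textbf{Card}(x)$, which is non-positive whenever $\rho \geq \Sigma_{11}$.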

Using the fact that the sparsity pattern of a vector $\,x$ can be represented by a vector $\,u \in \{0, 1\}^n$, the fact that $\,\textbf{diag}(u)^2 = \textbf{diag}(u)$ for all variables $\,u \in \{0, 1\}^n$, and the fact that for any matrix $\,B$, $\,\lambda_{max}(B^TB) = \lambda_{max}(BB^T)$, the SPCA problem can be re-expressed as:

$\,\phi(\rho) \; = \max_{u \in \{0,1\}^n} \; \lambda_{max}(\textbf{diag}(u) \; \Sigma \; \textbf{diag}(u)) - \rho\textbf{1}^Tu$

$= \max_{u \in \{0,1\}^n} \; \lambda_{max}(\textbf{diag}(u) \; A^TA \; \textbf{diag}(u)) - \rho\textbf{1}^Tu$
$= \max_{u \in \{0,1\}^n} \; \lambda_{max}(A \; \textbf{diag}(u) \; A^T) - \rho\textbf{1}^Tu$
$= \max_{ \|x\| = 1} \; \max_{u \in \{0,1\}^n} x^T A \; \textbf{diag}(u) \; A^T x - \rho\textbf{1}^Tu$
$= \max_{ \|x\| = 1} \; \max_{u \in \{0,1\}^n} \sum_{i=1}^n u_i((a_i^T x)^2 - \rho)$.

Then, if we maximize in $\,u$ and use the fact that $\,\max_{v \in \{0,1\}} \beta v = \beta_+$, the SPCA problem, in the case where $\,\rho \le \Sigma_{11}$, becomes:

$\phi(\rho)= \max_{\|x\|=1}\sum_{i=1}^n((a_i^Tx)^2-\rho)_{+},$
which is a non-convex problem in $\,x \in R^n$. Note that, in this non-convex problem, we only need to select the values $\,i$ at which $\,(a_i^T x)^2 - \rho \gt 0$.

Here, the $\,a_i$'s are the columns of the matrix $\,A$ where $\,A^T A = \Sigma$ (i.e. $\,A$ is the square root of $\,\Sigma$).

For more details refer to <ref name= "afl" > Alexandre d'Aspremont, Francis Bach, and Laurent El Ghaoui. Optimal Solutions for Sparse Principal Component Analysis. J. Mach. Learn. Res. 9 (June 2008), 1269-1294. </ref>.
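The reformulation can be checked numerically on a tiny instance. The sketch below is illustrative only: the helper names `phi_brute_force` and `thresholded_objective` are hypothetical, the data are synthetic, and the Cholesky factor is used as the square root $\,A$, which assumes $\,\Sigma$ is positive definite. It enumerates all supports to obtain $\,\phi(\rho)$ and then evaluates $\sum_i((a_i^Tx)^2-\rho)_+$ at the leading eigenvector associated with the best support.

```python
import itertools
import numpy as np

def phi_brute_force(sigma, rho):
    """phi(rho) by enumerating every support (exponential cost; tiny n only)."""
    n = sigma.shape[0]
    best_value, best_support = 0.0, ()        # x = 0 gives objective 0
    for k in range(1, n + 1):
        for support in itertools.combinations(range(n), k):
            idx = list(support)
            top = np.linalg.eigvalsh(sigma[np.ix_(idx, idx)])[-1]
            if top - rho * k > best_value:
                best_value, best_support = top - rho * k, support
    return best_value, best_support

def thresholded_objective(a, x, rho):
    """sum_i ((a_i^T x)^2 - rho)_+ where a_i are the columns of a."""
    return np.maximum((a.T @ x) ** 2 - rho, 0.0).sum()

rng = np.random.default_rng(1)
sigma = np.cov(rng.normal(size=(50, 6)), rowvar=False)
a = np.linalg.cholesky(sigma).T               # sigma = a.T @ a
rho = 0.3 * sigma.diagonal().max()            # rho below the largest variance

phi, support = phi_brute_force(sigma, rho)
# Leading eigenvector of sum_{i in support} a_i a_i^T.
a_sub = a[:, list(support)]
x_star = np.linalg.eigh(a_sub @ a_sub.T)[1][:, -1]
value_at_x_star = thresholded_objective(a, x_star, rho)
```

On such an instance `value_at_x_star` agrees with `phi` up to rounding, while any other unit vector gives a value no larger than `phi`.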

## Greedy Solution

Before presenting their approximate greedy search algorithm for solving the SPCA problem, the authors first presented the full greedy search algorithm which follows directly from Moghaddam et al. <ref name="M2006a">B. Moghaddam, Y. Weiss, and S. Avidan. Spectral bounds for sparse PCA: Exact and greedy algorithms. Advances in Neural Information Processing Systems, 18, 2006.</ref>. This algorithm starts from an initial solution (having cardinality one) at $\,\rho = \Sigma_{11}$, and then it updates an increasing sequence of index sets $\,I_k \subseteq [1, n]$ by scanning all the remaining variables to find the index that gives the maximum contribution in terms of variance.

The pseudo-code for this full greedy search algorithm is given in <ref name = "afl"/>.

At every step, $\,I_k$ represents the set of non-zero elements, or the sparsity pattern, of the current point. Given $\,I_k$, the solution to the SPCA problem can be defined as $\,x_k = \underset{\{x_{I_k^c} = 0, \|x\| = 1\}}{\operatorname{argmax}} x^T \Sigma x - \rho k$, i.e. $\,x_k$ is formed simply by padding zeros to the leading eigenvector of the sub-matrix $\,\Sigma_{I_k,I_k}$.
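A minimal Python sketch of the full greedy search described above follows. It is reconstructed from the prose description rather than copied from the paper's pseudo-code; the function names `full_greedy_search` and `sparse_component` are hypothetical, indices are 0-based, and NumPy's dense symmetric eigensolver is assumed for all eigenvalue computations.

```python
import numpy as np

def full_greedy_search(sigma):
    """Return nested supports I_1, ..., I_n, one per cardinality."""
    n = sigma.shape[0]
    support = [int(np.argmax(np.diag(sigma)))]    # best cardinality-one solution
    supports = [list(support)]
    while len(support) < n:
        remaining = [i for i in range(n) if i not in support]

        def variance_with(i):
            # Leading eigenvalue of the principal submatrix on support + [i].
            idx = support + [i]
            return np.linalg.eigvalsh(sigma[np.ix_(idx, idx)])[-1]

        # Scan all remaining variables and keep the best one.
        support = support + [max(remaining, key=variance_with)]
        supports.append(list(support))
    return supports

def sparse_component(sigma, support):
    """Zero-pad the leading eigenvector of sigma[support, support]."""
    x = np.zeros(sigma.shape[0])
    x[support] = np.linalg.eigh(sigma[np.ix_(support, support)])[1][:, -1]
    return x

rng = np.random.default_rng(2)
sigma = np.cov(rng.normal(size=(40, 6)), rowvar=False)
supports = full_greedy_search(sigma)
x3 = sparse_component(sigma, supports[2])         # candidate with 3 non-zeros
```

Each pass over the remaining variables solves one eigenvalue problem per candidate, which is the cost the approximate variant is designed to avoid.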

Since estimating $\,n-k$ eigenvalues at each iteration is costly, we can use the fact that $\,uu^T$ is a sub-gradient of $\,\lambda_{max}$ at $\,X$ if $\,u$ is a leading eigenvector of $\,X$ to get, with $\,x_k$ a unit-norm leading eigenvector of $\sum_{j\in I_k}a_ja_j^T$, the bound $\lambda_{max}(\sum_{j\in I_k\cup \{i\}}a_ja_j^T)\geq \lambda_{max}(\sum_{j\in I_k}a_ja_j^T)+(x_k^Ta_i)^2$.

With this, the authors have a lower bound on the objective which does not require finding $\,n - k$ eigenvalues at each iteration.
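The lower bound can be verified numerically. The following sketch uses synthetic data and an arbitrary support `[0, 2]` (both assumptions made purely for illustration) and compares the exact leading eigenvalue after adding each candidate index with the cheap bound $\lambda_{max}(\sum_{j\in I_k}a_ja_j^T)+(x_k^Ta_i)^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = np.cov(rng.normal(size=(40, 6)), rowvar=False)
a = np.linalg.cholesky(sigma).T               # sigma = a.T @ a, columns a_i

support = [0, 2]                              # arbitrary current support I_k
m = a[:, support] @ a[:, support].T           # sum_{j in I_k} a_j a_j^T
eigenvalues, eigenvectors = np.linalg.eigh(m)
lam_k, x_k = eigenvalues[-1], eigenvectors[:, -1]

gaps = []
for i in sorted(set(range(6)) - set(support)):
    exact = np.linalg.eigvalsh(m + np.outer(a[:, i], a[:, i]))[-1]
    lower_bound = lam_k + (x_k @ a[:, i]) ** 2
    gaps.append(exact - lower_bound)          # non-negative up to rounding
```

Every entry of `gaps` is non-negative up to rounding, and evaluating the bound needs only one inner product per candidate instead of one eigenvalue problem.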

The authors then derive the following algorithm for solving the SPCA problem:

### Approximate Greedy Search Algorithm

Input: $\Sigma \in \textbf{R}^{n\times n}$

Algorithm:

1. Preprocessing: sort the variables by decreasing diagonal elements and permute the elements of $\Sigma$ accordingly. Compute the Cholesky decomposition $\,\Sigma =A^TA$.

2. Initialization: $I_1=\{1\}, x_1=a_1/\|a_1\|$

3. Compute $i_k= {\arg\max}_{i\notin I_k}(x_k^Ta_i)^2$

4. Set $I_{k+1}=I_k\cup\{i_k\}$ and compute $\,x_{k+1}$ as the leading eigenvector of $\sum_{j\in I_{k+1}}a_j a_j^T$

5. Set $\,k=k+1$. If $\,k\lt n$, go back to step 3.

Output: sparsity patterns $\,I_k$

As in the full greedy search algorithm, at every step, $\,I_k$ represents the set of non-zero elements, or the sparsity pattern, of the current point and, given $\,I_k$, the solution to the SPCA problem can be defined as $\,x_k = \underset{\{x_{I_k^c} = 0, \|x\| = 1\}}{\operatorname{argmax}} x^T \Sigma x - \rho k$, i.e. we form $\,x_k$ simply by padding zeros to the leading eigenvector of the sub-matrix $\,\Sigma_{I_k,I_k}$.
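The steps of the approximate greedy search can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `approximate_greedy_search` is hypothetical, $\,\Sigma$ is assumed positive definite so that NumPy's Cholesky routine applies, indices are 0-based, and the initial support is taken to be the largest-variance variable. The leading eigenvector is obtained from a dense eigensolver on the $k \times k$ submatrix, so the sketch does not attain the per-iteration cost quoted for the paper's algorithm.

```python
import numpy as np

def approximate_greedy_search(sigma):
    """Return nested supports (in the original variable indexing)."""
    n = sigma.shape[0]
    order = np.argsort(-np.diag(sigma))           # step 1: sort by variance
    sigma_sorted = sigma[np.ix_(order, order)]
    a = np.linalg.cholesky(sigma_sorted).T        # sigma_sorted = a.T @ a
    support = [0]                                 # step 2: first sorted variable
    x = a[:, 0] / np.linalg.norm(a[:, 0])
    supports = [[int(order[0])]]
    while len(support) < n:
        scores = (a.T @ x) ** 2                   # step 3: (x_k^T a_i)^2 for all i
        scores[support] = -np.inf                 # exclude indices already chosen
        support.append(int(np.argmax(scores)))    # step 4: grow the support
        a_sub = a[:, support]
        # Leading eigenvector of a_sub @ a_sub.T via the k x k matrix a_sub.T @ a_sub.
        v = np.linalg.eigh(a_sub.T @ a_sub)[1][:, -1]
        x = a_sub @ v
        x /= np.linalg.norm(x)
        supports.append([int(order[i]) for i in support])
    return supports

rng = np.random.default_rng(6)
sigma = np.cov(rng.normal(size=(40, 6)), rowvar=False)
supports = approximate_greedy_search(sigma)
```

Step 3 costs one matrix-vector product, in contrast with the one eigenvalue problem per candidate needed by the full greedy search.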

### Computational Complexity

The full greedy search algorithm for solving the SPCA problem has a complexity of $\,O(n^4)$ because, at each step $\,k$, it computes $\,n-k$ maximum eigenvalues of matrices having size $\,k$. On the other hand, the authors' approximate greedy search algorithm for solving the SPCA problem has a complexity of $\,O(n^3)$. This is because the first Cholesky decomposition has a complexity of $\,O(n^3)$ and, in the $\,k$th iteration, there is a complexity of $\,O(k^2)$ for the maximum eigenvalue problem and a complexity of $\,O(n^2)$ for finding all products $\,x^T a_j$.

## Convex Relaxation

As mentioned above, the sparse PCA problem can be written as:

$\phi(\rho)= \max_{\|x\|=1}\sum_{i=1}^n((a_i^Tx)^2-\rho)_{+}.$

Because the variable $\,x$ only appears through $\,X = xx^T$, the above form of the SPCA problem can be reformulated in terms of only $\,X$ by using the fact that, when $\,\|x\| = 1$, $\,X = xx^T$ is equivalent to $\,\textbf{Tr}(X) = 1$, $\,X \ge 0$ and $\,\textbf{Rank}(X) = 1$.

Thus, the authors obtained the following form of the SPCA problem:

$\phi(\rho)= \max\sum_{i=1}^n(a_i^TXa_i -\rho)_{+} \;\; s.t. \; \textbf{Tr}(X)=1,\textbf{Rank}(X)=1, X\geq 0$

As the goal of the above form of the SPCA problem is to maximize a convex function over the convex set $\,\Delta_n=\{X\in S_n : \textbf{Tr}(X)=1, X\geq 0\}$ (the [spectrahedron](http://en.wikipedia.org/wiki/Spectrahedron)), the solution must be an extreme point of $\,\Delta_n$ and is therefore a rank-one matrix. Unfortunately, the objective of this form of the SPCA problem is convex in $\,X$ and not concave, so the problem is still hard to solve. However, it is shown in <ref name="afl"/> that, on rank-one elements of $\,\Delta_n$, this objective is equal to a concave function of $X$. Using this fact, the authors produced a convex relaxation of this form of the SPCA problem.

The proposition is given below and the proof is provided in the authors' paper listed in References.

Proposition 1. Let $A\in{R}^{n\times n}$, $\rho \geq0$, and denote by $a_1,...,a_n\in R^n$ the columns of $A$. An upper bound on:

$\phi(\rho)= \max\sum_{i=1}^n(a_i^TXa_i-\rho)_{+}$
$s.t. \; \textbf{Tr}(X)=1,\textbf{Rank}(X)=1, X\geq 0$

can be computed by solving

$\psi(\rho)= \max\sum_{i=1}^n\textbf{Tr}(X^{1/2}B_iX^{1/2})_{+}$
$s.t. \; \textbf{Tr}(X)=1, X\geq 0$

in the variable $X\in S_n$, where $B_i=a_ia_i^T-\rho I$, or equivalently:

$\psi(\rho)= \max\sum_{i=1}^n\textbf{Tr}(P_iB_i)$
$s.t. \; \textbf{Tr}(X)=1, X\geq 0, X\geq P_i \geq 0$

which is a semi-definite program in the variables $X\in S_n, P_i\in S_n$.

It is always true that $\,\psi(\rho) \ge \phi(\rho)$, and, when the solution to the above semi-definite program has rank one, we have that $\,\psi(\rho) = \phi(\rho)$ and the convex relaxation (which is $\psi(\rho)=\max\sum_{i=1}^n\textbf{Tr}(P_iB_i) \;\; s.t. \; \textbf{Tr}(X)=1, X\geq 0, X\geq P_i \geq 0$) is tight.
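The agreement of the two objectives on rank-one matrices can be illustrated numerically. In the sketch below, the helper names `trace_plus`, `psd_sqrt` and `relaxed_objective` are hypothetical and the data are synthetic. For a unit vector $\,x$, the concave objective $\sum_i\textbf{Tr}(X^{1/2}B_iX^{1/2})_+$ evaluated at $\,X = xx^T$ is compared with $\sum_i((a_i^Tx)^2-\rho)_+$.

```python
import numpy as np

def trace_plus(m):
    """Tr(M)_+: sum of the positive eigenvalues of a symmetric matrix."""
    return np.maximum(np.linalg.eigvalsh((m + m.T) / 2), 0.0).sum()

def psd_sqrt(x_mat):
    """Symmetric square root of a positive semidefinite matrix."""
    w, v = np.linalg.eigh(x_mat)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def relaxed_objective(a, x_mat, rho):
    """sum_i Tr(X^{1/2} B_i X^{1/2})_+ with B_i = a_i a_i^T - rho * I."""
    n = a.shape[0]
    root = psd_sqrt(x_mat)
    return sum(
        trace_plus(root @ (np.outer(a[:, i], a[:, i]) - rho * np.eye(n)) @ root)
        for i in range(a.shape[1])
    )

rng = np.random.default_rng(4)
sigma = np.cov(rng.normal(size=(30, 5)), rowvar=False)
a = np.linalg.cholesky(sigma).T               # sigma = a.T @ a
rho = 0.2 * sigma.diagonal().max()

x = rng.normal(size=5)
x /= np.linalg.norm(x)                        # unit-norm candidate
rank_one_value = relaxed_objective(a, np.outer(x, x), rho)
original_value = np.maximum((a.T @ x) ** 2 - rho, 0.0).sum()
```

The two values coincide up to rounding because, for $\,X = xx^T$ with $\,\|x\| = 1$, the matrix $X^{1/2}B_iX^{1/2}$ has the single non-zero eigenvalue $(a_i^Tx)^2-\rho$.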

## Optimality Conditions

In this section, the optimality conditions considered by the authors are briefly discussed.

### Dual problem and optimality conditions

The authors first derived the dual problem to the convex relaxation of SPCA (which is $\psi(\rho)=\max\sum_{i=1}^n\textbf{Tr}(P_iB_i) \;\; s.t. \; \textbf{Tr}(X)=1, X\geq 0, X\geq P_i \geq 0$) as well as the associated Karush-Kuhn-Tucker (KKT) optimality conditions. These are stated as Lemma 2 in the authors' paper listed in References, where the proof is also given.
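As a sketch of where such a dual comes from (a standard Lagrangian argument written in this summary's notation, not a quotation of the paper's lemma), introduce multipliers $Y_i \geq 0$ for the constraints $X \geq P_i$ and $Z_i \geq 0$ for the constraints $P_i \geq 0$, with $\geq$ denoting the semidefinite order as above:

$L(X,P_i,Y_i,Z_i)=\sum_{i=1}^n\textbf{Tr}(P_iB_i)+\sum_{i=1}^n\textbf{Tr}(Y_i(X-P_i))+\sum_{i=1}^n\textbf{Tr}(Z_iP_i)$

The maximum over $P_i$ is finite only if $B_i - Y_i + Z_i = 0$, i.e. $Y_i \geq B_i$, and maximizing $\textbf{Tr}(X\sum_{i=1}^nY_i)$ over $\{X : \textbf{Tr}(X)=1, X\geq 0\}$ gives $\lambda_{max}(\sum_{i=1}^nY_i)$. The dual problem therefore reads:

$\min \; \lambda_{max}(\sum_{i=1}^nY_i)$
$s.t. \; Y_i\geq B_i, \; Y_i\geq 0, \; i=1,\ldots,n$

in the variables $Y_i \in S_n$.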

### Optimality conditions for rank one solutions

The authors then derived the KKT conditions for the convex relaxation of SPCA for the particular case in which we have a rank one candidate solution $\,X = xx^T$ and we need to test our candidate solution's optimality. Lemma 3 of the authors' paper listed in References (where its proof is also given) then provides an important connection between the convex relaxation of SPCA and the original non-convex form of the SPCA problem (which is $\;\phi(\rho) \;\; \textrm{subject} \; \textrm{to} \; \|x\|_2=1$).

Based on the necessary and sufficient optimality conditions for the convex relaxation of SPCA given in Lemma 2, Lemma 3 states that, for any candidate vector $\,x$, we can test the optimality of $\,X = xx^T$ for the convex relaxation of SPCA by solving a semi-definite feasibility problem in the variables $\,Y_i \in S_n$, and that, if the rank one candidate solution $X = xx^T$ is optimal for the convex relaxation of SPCA, then $\,x$ is globally optimal for the original non-convex combinatorial form of SPCA (which is $\;\phi(\rho) \;\; \textrm{subject} \; \textrm{to} \; \|x\|_2=1$).

### Solution improvements and randomization

If the conditions are not met, then the rank of the optimal solution of the convex relaxation of SPCA is strictly greater than one, and hence the relaxation is not tight. In this case, a different relaxation such as DSPCA by d’Aspremont et al. (2007b) (more details are available in d’Aspremont et al.'s paper) may be used to try to get a better solution for SPCA. Furthermore, following Ben-Tal and Nemirovski (2002), randomization techniques may also be applied to improve the quality of the solution of the convex relaxation of SPCA.

## Application

In this section we describe the application of sparse PCA to subset selection. Another application is compressed sensing, which is discussed in more detail in the main paper.

### Subset selection

We consider $\,p$ data points in $\,R^n$, stored in a data matrix $X \in R^{p\times n}$. We are given real numbers $y \in R^p$ to predict from $\,X$ using linear regression, estimated by least squares. In the subset selection problem we look for sparse coefficients $\,w$, i.e. a vector $\,w$ with many zero entries. We thus consider the problem:

$s(k)=\min_{w\in R^n, \textbf{Card}(w)\leq k}\|y-Xw\|^2$

Using the sparsity pattern $u\in \{0,1\}^n$, writing $\,X(u) = X\textbf{diag}(u)$, optimizing with respect to $\,w$ and rewriting the formula using a generalized eigenvalue, we have:

$s(k)=\|y\|^2- \max_{u\in \{0,1\}^n, \textbf{1}^Tu\leq k}\max_{w\in R^n}\frac{w^T\textbf{diag}(u)X^Tyy^TX\textbf{diag}(u)w}{w^TX(u)^TX(u)w}$
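For very small problems, $\,s(k)$ can be computed by brute force, which makes the formula above concrete. The sketch below is illustrative only: the function name `subset_selection` and the synthetic regression data are assumptions, and the closed form is evaluated on the selected columns of $\,X$ only, which avoids the singular matrix that zeroed-out columns would produce.

```python
import itertools
import numpy as np

def subset_selection(x_mat, y, k):
    """s(k) by brute force: best least-squares fit using at most k columns."""
    best_residual, best_support = float(y @ y), ()    # w = 0 uses no column
    for size in range(1, k + 1):
        for support in itertools.combinations(range(x_mat.shape[1]), size):
            cols = x_mat[:, list(support)]
            w, *_ = np.linalg.lstsq(cols, y, rcond=None)
            residual = float(np.sum((y - cols @ w) ** 2))
            if residual < best_residual:
                best_residual, best_support = residual, support
    return best_residual, best_support

rng = np.random.default_rng(5)
x_mat = rng.normal(size=(30, 6))
# Synthetic target that depends on columns 1 and 4 only, plus small noise.
y = x_mat[:, [1, 4]] @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=30)

s_k, support = subset_selection(x_mat, y, k=2)

# Same value from ||y||^2 - y^T X_u (X_u^T X_u)^{-1} X_u^T y on the chosen columns.
cols = x_mat[:, list(support)]
explained = y @ cols @ np.linalg.solve(cols.T @ cols, cols.T @ y)
closed_form = y @ y - explained
```

The subtracted term is the largest generalized eigenvalue of the pair built from the selected columns, since the numerator matrix has rank one.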

The simple bound on the optimal value of the subset selection problem has the following form:

$w^T(X(v)^Tyy^TX(v)-s_0X(v)^TX(v))w\leq B$

where $B\geq0$, and thus we have:

$\|y\|^2-s_0\geq s(k)\geq\|y\|^2-s_0-B(\min_{v\in\{0,1\}^n,1^Tv=k}\lambda_{min}(X(v)^TX(v)))^{-1} \geq\|y\|^2-s_0-B(\lambda_{min}(X^TX))^{-1}$

This bound gives a sufficient condition for optimality in subset selection, for any problem instance and any given subset.

## Conclusion

This paper presented a novel formulation of the sparse PCA (SPCA) problem based on a semidefinite relaxation scheme. Using this formulation, a greedy algorithm was developed to compute a full set of good solutions to the SPCA problem. The algorithm was shown to be efficient, i.e. to have complexity $O(n^3)$, and to provide candidate solutions, many of which turn out to be optimal in practice. Furthermore, sufficient conditions for global optimality of candidate solutions were derived <ref name="afl"/>. Finally, the resulting upper bound was shown to have direct application to problems such as sparse recovery and subset selection.