# A Direct Formulation For Sparse PCA Using Semidefinite Programming

## Introduction

• The sparse principal components explain less of the variance in the original data than the components obtained by PCA.
• The resulting sparse principal components are no longer orthogonal (uncorrelated), unlike the components obtained by PCA.

Semidefinite programs<ref name="VB1996">L. Vandenberghe and S. Boyd, Semidefinite programming, SIAM Review, 38 (1996), pp. 49–95.</ref> are convex optimization problems that optimize a linear function subject to the constraint that an affine combination of symmetric matrices is positive semidefinite. They can be solved in polynomial time by interior-point solvers such as SEDUMI or SDPT3. Unfortunately, these solvers are not practical for high-dimensional data sets.

This paper proposes a direct approach to a convex formulation of sparse PCA via semidefinite programming. Since interior-point solvers cannot handle large data sets, Nesterov's smoothing technique <ref name='N2004'>Yu. Nesterov, Smoothing technique and its application in semidefinite optimization, CORE Discussion Paper No. 2004/73, (2004).</ref> is used to solve large-dimensional problems efficiently. First-order methods require far less memory, which is the main bottleneck of interior-point solvers. Their convergence is slower, but this is not a major concern when only moderate accuracy is needed. Thus, an optimal first-order minimization algorithm is applied to solve the optimization problem.

The content of the paper can be summarized as follows:

First, the authors maximize the explained variance of a projection while bounding the number of non-zero elements, via semidefinite programming. Then, they show how this method can be used to decompose a matrix into factors involving only a limited number of variables. Since the resulting problems are too large for the techniques commonly used on small convex optimization problems, they show how Nesterov's smoothing method <ref name="N2004"/> helps to compute the solution.

## Semidefinite relaxation

### Sparse variance maximization

Let $\textbf{S}^n$ be the set of symmetric matrices of size $\, n$, and let $A \in \textbf{S}^n$; WLOG, assume $\,A$ is a covariance matrix. Then, we want to find the unit vector $x \in \textbf{R}^n$ maximizing the variance $x^TAx$ while constraining $x$ to be sparse. Let $\, k$ be an integer with $1 \leq k \leq n$; then the problem can be written formally as

$\textrm{maximize}_x \; x^{T}{A}x$

$\textrm{ subject } \; \textrm{ to }\;\|x\|_2=1,$

$\textbf{Card}(x)\leq k.$

Unfortunately, this problem is NP-hard because of the cardinality constraint, so in order to solve it we look for a convex relaxation.
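For very small instances the cardinality-constrained problem can still be solved exactly by enumerating supports, which makes a useful baseline: for a fixed support $S$, the optimum is the leading eigenpair of the submatrix $A_{S,S}$. A sketch (not from the paper), illustrating why the problem is combinatorial:

```python
import itertools
import numpy as np

def sparse_pca_exhaustive(A, k):
    """Solve the cardinality-constrained variance maximization exactly by
    enumerating all supports of size k (tractable only for tiny n).
    For a fixed support S, the optimum is the leading eigenpair of A[S, S]."""
    n = A.shape[0]
    best_val, best_x = -np.inf, None
    for S in itertools.combinations(range(n), k):
        S = list(S)
        w, V = np.linalg.eigh(A[np.ix_(S, S)])  # eigenvalues in ascending order
        if w[-1] > best_val:
            best_val = w[-1]
            x = np.zeros(n)
            x[S] = V[:, -1]                      # embed the local eigenvector
            best_x = x
    return best_val, best_x

rng = np.random.default_rng(0)
B = rng.standard_normal((20, 6))
A = B.T @ B / 20                 # a small sample covariance matrix
val, x = sparse_pca_exhaustive(A, k=2)
print(val, np.count_nonzero(x))  # optimal sparse variance; at most 2 nonzeros
```

The number of supports grows as $\binom{n}{k}$, which is exactly why a convex relaxation is needed for realistic dimensions.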

### Semidefinite relaxation

Lifting the problem to the matrix variable $X = xx^T$, we can rewrite it as:

$\textrm{maximize}\;\textbf{Tr}({A}{X})$

$\textrm{ subject } \; \textrm{ to } \; \textbf{Tr}({X})=1$,

$\textbf{Card}({X})\leq k^2,$

${X}\succeq 0,$

$\textbf{Rank}({X})=1.$

where $X \in \textbf{S}^n.$

If $\, X$ is a solution to the above problem, then the last two constraints imply that $\,X=xx^T$. Finally, replacing the only non-convex constraint, $\textbf{Card}({X})\leq k^2$, by a weaker but convex one gives a semidefinite relaxation of the sparse PCA problem for square matrices:

$\textrm{maximize}\;\textbf{Tr}({A}{X})$

$\textrm{ subject } \; \textrm{ to } \; \textbf{Tr}({X})=1,$

$\textbf{1}^{T}|{X}|\textbf{1}\leq k,$

${X}\succeq 0.$

Note that if $\,X=xx^T$ then $\textbf{1}^{T}|{X}|\textbf{1}\leq k$ is equivalent to the constraint $\,(\sum|x_i|)^2 \leq k$ or $\,\|x\|^2_1 \leq k$.
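This identity is easy to check numerically, since for a rank-one $X = xx^T$ each entry satisfies $|X_{ij}| = |x_i||x_j|$. A quick NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(5)
x /= np.linalg.norm(x)          # unit-norm loading vector
X = np.outer(x, x)              # rank-one matrix X = x x^T

lhs = np.ones(5) @ np.abs(X) @ np.ones(5)   # 1^T |X| 1
rhs = np.linalg.norm(x, 1) ** 2             # ||x||_1^2
print(abs(lhs - rhs))                       # agrees up to rounding error
```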

The above formulation can be extended to the case of non-square matrices. In this case, the sparse variance maximization problem can be written as follows:

$\textrm{maximize}\;u^{T}{A}v$

$\textrm{ subject } \; \textrm{ to } \; \|u\|_{2}=\|v\|_{2}=1,$

$\textbf{Card}(u)\leq k_{1},\textbf{Card}(v)\leq k_{2}.$

Relaxing the constraints to convex ones, in the same way as for the square case, gives the following formulation:

$\textrm{maximize}\;\textbf{Tr}({A}^{T}{X}^{12})$

$\textrm{ subject } \; \textrm{ to } \; {X}\succeq0, \textbf{Tr}({X}^{ii})=1,$

$\textbf{1}^{T}|{X}^{ii}|\textbf{1}\leq k_{i}, i=1,2,$

$\textbf{1}^{T}|{X}^{12}|\textbf{1}\leq \sqrt{k_1k_2}.$

## Sparse decomposition

To obtain the sparse equivalent to PCA we begin by solving the semidefinite relaxation problem. So, given the matrix $A_1 \in \textbf{S}^n$, we solve

$\textrm{maximize} \; \textbf{Tr}({A}_1{X})$

$\textrm{subject} \; \textrm{to} \;\textbf{Tr}({X})=1,$

$\textbf{1}^{T}|{X}|\textbf{1}\leq k,$

${X}\succeq 0.$

From the solution, $\,X$, we keep only the dominant eigenvector $\,x_1$. Then $\,x_1$ is the first sparse loading vector in our decomposition. To obtain the next component define

${A}_2={A}_1-(x_{1}^{T}{A}_1x_1)x_1x_1^T$

We can then use $\,A_2$ as our new data matrix and repeat the optimization problem, continuing iteratively to obtain further components. But when should we stop?

We stop when all entries of $\,A_i$ are smaller in magnitude than a constant $\,\rho$. Here $\,\rho$ can be interpreted as the noise level: once the matrix cannot be distinguished from noise, we terminate the iterations.
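The deflation loop can be sketched as follows. For illustration only, the SDP-relaxation step is replaced by ordinary PCA (returning $xx^T$ for the leading eigenvector), since any solver for the relaxation can be plugged in; this placeholder yields no sparsity:

```python
import numpy as np

def sparse_decomposition(A, solve_relaxation, rho, max_components=10):
    """Iterative deflation: extract a loading vector, subtract its variance
    contribution, and repeat until A is indistinguishable from noise
    (all entries below rho in magnitude)."""
    loadings = []
    for _ in range(max_components):
        if np.max(np.abs(A)) < rho:     # stopping rule: noise level reached
            break
        X = solve_relaxation(A)         # solution of the relaxed problem
        w, V = np.linalg.eigh(X)
        x = V[:, -1]                    # dominant eigenvector of X
        loadings.append(x)
        A = A - (x @ A @ x) * np.outer(x, x)   # deflation step
    return loadings

# Placeholder for the SDP solver: plain PCA, just to exercise the loop.
def pca_relaxation(A):
    w, V = np.linalg.eigh(A)
    return np.outer(V[:, -1], V[:, -1])

rng = np.random.default_rng(2)
B = rng.standard_normal((50, 8))
A = B.T @ B / 50
comps = sparse_decomposition(A, pca_relaxation, rho=1e-3)
print(len(comps))
```

With the PCA placeholder the loop recovers all eight eigenvectors before the residual drops below $\rho$; with the sparse relaxation, each extracted loading vector is sparse instead.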

## Algorithm

SEDUMI and SDPT3 can be used to solve small semidefinite programming problems, but for larger problems these methods are impractical, since the second-order information required by interior-point methods becomes impossible to store once the problem grows beyond a certain size. In this case, first-order methods are more appropriate. They have a smaller memory cost per iteration, but they need $O(1/\epsilon)$ iterations to reach accuracy $\epsilon$, whereas interior-point methods converge in $O(\log(1/\epsilon))$ iterations. Fortunately, high accuracy is not critical for our applications.

### Smoothing technique

There are two important issues to consider when solving large-scale semidefinite programming problems. The first is memory: to address it we will use first-order techniques. The other is smoothness: since the objective is not smooth, the number of iterations required by generic first-order methods to reach an accuracy of $\epsilon$ is $O(1/\epsilon^2)$, and this bound is tight in the absence of further structural information. Since such structural information is available here, the complexity can be brought down to $O(1/\epsilon)$.

As in the previous sections, the optimization problem is:

$\textrm{maximize} \; x^{T}{A}x$

$\textrm{subject} \; \textrm{to} \; \|x\|_2=1,$

$\textbf{Card}(x)\leq k.$

We can rewrite the optimization problem by replacing the cardinality constraint with a penalty term in the objective function:

$\textrm{maximize} \; x^{T}{A}x-\rho\textbf{Card}^{2}(x)$

$\textrm{subject} \; \textrm{to} \; \|x\|_2=1,$

which is equivalent to:

$\textrm{maximize} \; \textbf{Tr}({A}{X})-\rho\textbf{Card}({X})$

$\textrm{subject} \; \textrm{to} \; \textbf{Tr}({X})=1,$

${X}\succeq 0,$

$\textbf{Rank}({X})=1.$

Relaxing this to a semidefinite programming problem, we have:

$\textrm{maximize} \; \textbf{Tr}({A}{X})-\rho\textbf{1}^{T}|{X}|\textbf{1}$

$\textrm{subject} \; \textrm{to} \; \textbf{Tr}({X})=1,$

${X}\succeq 0,$

which we can rewrite as:

$\max _{{X}\succeq 0, \textbf{Tr}({X})=1} \; \min _{|{U}_{ij}|\leq \rho}\textbf{Tr}({X}({A}+{U})),$

which can be written in saddle-function format as:

$f(x)=\hat{f}(x)+\max_{u}\{\langle \textbf{T}x,u\rangle -\hat{\phi}(u) : u \in \textbf{Q}_2\},$

where $\,f$ is defined over a compact convex set $\,Q_1$, $\,\hat{f}$ is convex and differentiable, and $\hat{\phi}(u)$ is a continuous convex function over $\,Q_2$. When a function can be written in saddle-function format, [smoothing](http://en.wikipedia.org/wiki/Smooth_function) techniques help reduce the complexity of solving

$\min _{x\in \textbf{Q}_1}f(x)$

from $O(1/\epsilon ^2)$ to $O(1/\epsilon)$.

This process can be done in two steps.

1. Regularization: add a strongly convex penalty to the saddle function, producing a smooth $\epsilon$-approximation of $f$ with Lipschitz continuous gradient.
2. Optimal first-order minimization: apply the optimal first-order scheme for functions with Lipschitz continuous gradient.

### Application of sparse PCA

We consider the problem discussed in the previous section:

$\textrm{maximize} \; \textbf{Tr}({A}{X})-\textbf{1}^{T}|{X}|\textbf{1}$

$\textrm{subject} \; \textrm{to} \; \textbf{Tr}({X})=1,$

${X}\succeq 0,$

where $\,\rho$ is assumed to be one without loss of generality. Duality allows us to rewrite this in saddle-function format:

$\min_{{U}\in{Q}_1}f({U})$

where

${Q}_1=\{{U}\in \textbf{S}^{n}:|{U}_{ij}|\leq 1,\; i,j=1,\ldots,n\}, \quad {Q}_2=\{{X}\in\textbf{S}^n:\textbf{Tr}({X})=1,\;X\succeq0\}$

and

$f(U):=\max_{X \in Q_2}\langle TU,X\rangle -\hat{\phi}(X)\; \textrm{with} \; T=I_{n^2}, \; \hat{\phi}(X)=-\textbf{Tr}(AX).$
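For this choice of $T$ and $\hat{\phi}$, the inner maximization over the spectahedron $Q_2$ has a closed form: $f(U)=\max_{X\in Q_2}\textbf{Tr}((A+U)X)=\lambda_{\max}(A+U)$, attained at $X=vv^T$ for the leading eigenvector $v$. A quick numerical check (a NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
U = rng.uniform(-1, 1, (n, n)); U = (U + U.T) / 2   # a point in Q_1

# max over Q_2 of Tr((A+U) X) is lambda_max(A+U), attained at X = v v^T.
w, V = np.linalg.eigh(A + U)    # eigenvalues in ascending order
v = V[:, -1]                    # leading eigenvector
X_star = np.outer(v, v)         # rank-one maximizer, Tr(X_star) = 1
print(np.trace((A + U) @ X_star), w[-1])   # the two values coincide
```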

Exactly as in Nesterov's <ref name='N2004'/> paper, we define the Frobenius norm and prox-functions as follows:

$d_1(U)=\frac{1}{2}\|U\|_F^2=\frac{1}{2}\textbf{Tr}(U^TU)$

where $U \in Q_1$ and $\,U_0$ is its center which satisfies $\,d_1(U_0)=0$ and is defined as

$U_0:=\arg \min_{U\in Q_{1}}d_1(U).$

$D_1:=\max_{U \in Q_1}d_1(U)=n^2/2.$

The following can be concluded for $Q_2$ as well:

$d_2(X)=\textbf{Tr}(X\log X)+\log(n)$

$\max_{X\in Q_2}d_2(X)\leq \log(n):=D_2$

Next we estimate the $(1,2)$-norm of the operator $\,T$, which is defined as:

$\| T\| _{1,2}:= \max_{X,U}\{\langle TX,U\rangle \;:\;\| X\| _{F}=1,\;\|U\|^*_2=1\}$

$=\max_X\{\|X\|_2\;:\;\|X\|_F\leq 1\}$

$\,=1.$

So the parameters are set to the following values:

$D_1=n^2/2,\;\sigma _1=1,D_2=\log(n),\;\sigma_2=1/2,\;\|T\|_{1,2}=1.$

#### Regularization

As discussed earlier and in Nesterov's paper <ref name="N2004"/>, the regularization parameter $\,\mu$,

$\mu:=\frac{\epsilon}{2D_2}$

changes the non-smooth function $\,f(X)$ of the original problem to

$\min_{U \in Q_1}f_{\mu}(U)$

where $\,f_{\mu}$ is the penalized function with prox-function $d_2$:

$f_{\mu}(U):=\max_{X \in Q_2}\lt TU,X\gt -\hat{\phi}(X)-\mu d_2(X)$

Now $\,f_\mu$ is a smooth uniform approximation of $\,f$ on $\,Q_1$, and its gradient is Lipschitz continuous with constant $\,L$:

$L:=\frac{D_2\|T\|_{1,2}^2}{2\sigma_2\epsilon}$

In our problem, the function can be computed explicitly as:

$f_\mu(U)=\mu \log(\textbf{Tr}\exp((A+U)/\mu))-\mu \log n.$
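Evaluating this expression naively overflows for small $\,\mu$, since the exponentials grow like $e^{\lambda/\mu}$. A stable sketch evaluates it through the eigenvalues of $(A+U)/\mu$ with the usual log-sum-exp shift (pure NumPy, not the authors' code):

```python
import numpy as np

def f_mu(U, A, mu):
    """f_mu(U) = mu*log(Tr exp((A+U)/mu)) - mu*log(n), evaluated stably
    via the eigenvalues of (A+U)/mu and a log-sum-exp shift."""
    n = A.shape[0]
    w = np.linalg.eigvalsh((A + U) / mu)
    wmax = w.max()                      # shift avoids overflow in exp
    return mu * (wmax + np.log(np.exp(w - wmax).sum())) - mu * np.log(n)

rng = np.random.default_rng(4)
n = 5
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
U = np.zeros((n, n))
mu = 1e-2

# As mu -> 0, f_mu(U) approaches f(U) = lambda_max(A+U) from below;
# the gap is bounded by mu*log(n).
approx = f_mu(U, A, mu)
exact = np.linalg.eigvalsh(A + U)[-1]
print(exact - approx)   # small: between 0 and mu*log(n)
```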

#### First order minimization

The input to this section is the smooth convex function $\,f_\mu$ defined and computed in the previous section. The algorithm proceeds as follows <ref name="book2003">Yu. Nesterov, Introductory Lectures on Convex Optimization, Springer, 2003.</ref>:

Repeat

1. Compute $\,f_\mu(U_k)$ and $\,\nabla f_\mu (U_k).$
2. Find $Y_k={\arg\min}_{Y\in Q_1}\langle \nabla f_\mu (U_k), Y\rangle +\frac{1}{2}L\|U_k-Y\|_F^2.$
3. Find $Z_k={\arg\min}_{U \in Q_1}\{\frac{L}{\sigma _1}d_1 (U)+\sum_{i=0}^{k}\frac{i+1}{2}\langle \nabla f_\mu (U_i), U-U_i\rangle\}.$
4. Set $U_{k+1}= \frac{2}{k+3}Z_k+\frac{k+1}{k+3}Y_k.$

Until $\textrm{gap} \; \leq \epsilon$

The first step computes the smooth function and its gradient. The second step computes the gradient mapping, and the third and fourth steps update a sequence of simple estimates of $\,f_\mu$ whose minimum can be computed explicitly. For more details about these steps see <ref name='N1983'>Y. Nesterov, A method of solving a convex programming problem with convergence rate $O(1/k^2)$, Soviet Mathematics Doklady, 27 (1983), pp. 372–376.</ref> and <ref name="N2004"/>.
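The scheme above can be sketched in NumPy. For this problem $Q_1$ is the box of symmetric matrices with $|U_{ij}|\leq 1$ and $d_1(U)=\tfrac12\|U\|_F^2$, so both subproblems reduce to entrywise clipping; the gradient of $f_\mu$ is the matrix analogue of the softmax. This is only a sketch under those assumptions, with a fixed iteration count in place of the paper's duality-gap test:

```python
import numpy as np

def grad_f_mu(U, A, mu):
    """Gradient of f_mu: the 'matrix softmax' exp((A+U)/mu)/Tr exp((A+U)/mu),
    computed stably from an eigendecomposition. The result lies in Q_2."""
    w, V = np.linalg.eigh((A + U) / mu)
    e = np.exp(w - w.max())          # shift for numerical stability
    return (V * e) @ V.T / e.sum()

def nesterov_dspca(A, eps=1e-2, max_iter=500):
    """Sketch of the optimal first-order scheme minimizing f_mu over the box
    Q_1 = {U symmetric : |U_ij| <= 1}. With d_1(U) = ||U||_F^2/2, both
    subproblems are Euclidean projections onto the box, i.e. clipping."""
    n = A.shape[0]
    sigma2 = 0.5
    mu = eps / (2 * np.log(n))       # regularization parameter eps/(2 D_2)
    L = 1.0 / (mu * sigma2)          # Lipschitz constant ||T||^2/(mu sigma_2)
    U = np.zeros((n, n))             # prox-center of Q_1
    g_sum = np.zeros((n, n))         # weighted sum of past gradients
    for k in range(max_iter):
        G = grad_f_mu(U, A, mu)
        Y = np.clip(U - G / L, -1, 1)            # gradient-mapping step
        g_sum += (k + 1) / 2 * G
        Z = np.clip(-g_sum / L, -1, 1)           # minimize the estimate
        U = 2 / (k + 3) * Z + (k + 1) / (k + 3) * Y
    return U, Y

rng = np.random.default_rng(5)
n = 4
B = rng.standard_normal((30, n))
A = B.T @ B / 30
U, Y = nesterov_dspca(A)
print(np.max(np.abs(Y)))   # the iterate stays inside the box Q_1
```

Each iteration costs one eigendecomposition plus $O(n^2)$ work, which is what makes the method viable at sizes where interior-point solvers run out of memory.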

## An Experiment Comparing SPCA To PCA

Fig. 1 (taken from <ref name="R6">Alexandre d’Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. G. Lanckriet. A Direct Formulation For Sparse PCA Using Semidefinite Programming. SIAM Review, 49 (2007), pp. 434-448.</ref>) illustrates the results of this experiment.

Fig.1 A comparison of the clusters obtained by applying PCA and DSPCA to gene expression data.

## Code

The numerical codes used in this paper are provided by A. d'Aspremont and can be found here.

<references />