optimal Solutions forSparse Principal Component Analysis: Difference between revisions
Line 90: | Line 90: | ||
==Greedy Solution== | ==Greedy Solution== | ||
Before presenting their approximate greedy search algorithm for solving the SPCA problem, the authors first presented the full greedy search algorithm that follows directly from Moghaddam ''et al.'' | Before presenting their approximate greedy search algorithm for solving the SPCA problem, the authors first presented the full greedy search algorithm that follows directly from Moghaddam ''et al.'' <ref name="M2006a">B. Moghaddam, Y. Weiss, and S. Avidan. Spectral bounds for sparse PCA: Exact and greedy algorithms. ''Advances in Neural Information Processing Systems, 18, 2006.</ref> This algorithm starts from an initial solution (having cardinality one) at <math>\,\rho = \Sigma_{11}</math>, and then it updates an increasing sequence of index sets <math>\,I_k \subseteq [1, n]</math> by scanning all the remaining variables to find the index that gives the maximum contribution in terms of variance. | ||
The following pseudo-code (taken from the authors' paper listed in References) summarizes this full greedy search algorithm: | The following pseudo-code (taken from the authors' paper listed in References) summarizes this full greedy search algorithm: |
Revision as of 18:21, 12 December 2010
Introduction
Principle component analysis (PCA) is a method for finding linear combinations of features, called principle components, that correspond to the directions of maximize variance in the data and which are orthogonal to one another. In practice, performing PCA on a data set involves applying the singular value decomposition to the data matrix.
PCA facilitates the interpretation of the data if the components are linear combinations of only a few latent variables, and not many or all of the original ones. This is particularly true in many applications in which the coordinate axes that correspond to the factors have a direct physical interpretation; for instance, in financial or biological applications, each axis might correspond to a specific asset or to a specific gene. Constraining the number of non-zero factor coefficients (loadings) in sparse principal components to a very low number relative to the total number of coefficients whilst having these sparse vectors explain a maximum amount of variance in the data is known as sparse PCA. Sparse PCA has many applications in biology, finance and many machine learning problems. Sparse principal components, like principal components, are vectors that span a lower-dimensional space that explain most of variance in the original data. However, in order to find the sparse principal components using sparse PCA, it is necessary to make some sacrifices:
- There is a reduction in the explained variance in the original data captured by the sparse principal components as compared to PCA.
- There is a reduction in the orthogonality (independence or correlation) between the resulting variables (sparse principal components) as compared to PCA.
In this paper we are going to focus on the problem of sparse PCA which can be written as:
[math]\displaystyle{ \textrm{subject} \; \textrm{to} \; \|x\|_2 \le 1 }[/math]
where:
- [math]\displaystyle{ x\in \mathbb{R}^n }[/math]
- [math]\displaystyle{ \Sigma \in S_n }[/math] is the symmetric positive semidefinite sample covariance matrix
- [math]\displaystyle{ \,\rho }[/math] is the parameter which controls the sparsity
- [math]\displaystyle{ \textbf{Card}(x) }[/math] expresses the cardinality ([math]\displaystyle{ \,l_0 }[/math] norm) of [math]\displaystyle{ \,x }[/math].
Note that while solving the standard PCA problem is not complicated (since, for each factor, one simply needs to find a leading eigenvector, and this can be done in [math]\displaystyle{ \,O(n^2) }[/math] time), solving sparse PCA is NP hard (since sparse PCA is a particular case of the sparse generalized eigenvalue problem).
The main part of this paper begins by formulating the sparse PCA (SPCA) problem, whose algorithm is based on the representation of PCA as a regression-type optimization problem (Zou et al., 2006) that allows the application of the LASSO (Tibshirani, 1996) (which is a penalization technique based on the [math]\displaystyle{ \,l_1 }[/math] norm). The main part of this paper then derives an approximate greedy algorithm for computing an approximate full set of good solutions with total complexity [math]\displaystyle{ \,O(n^3) }[/math]. It also formulates a convex relaxation for sparse PCA and uses it to derive tractable sufficient conditions for a vector [math]\displaystyle{ \,x }[/math] to be a global optimum of the above optimization problem. In the general approach to SPCA described in this paper, for a given vector [math]\displaystyle{ \,x }[/math] having support [math]\displaystyle{ \,I }[/math], [math]\displaystyle{ \,x }[/math] can be tested to see if it is a globally optimal solution to the above optimization problem simply by performing a few steps of binary search to solve a one-dimensional convex minimization problem.
Notation
- For a vector [math]\displaystyle{ \,x \in\mathbb{R}^n }[/math], [math]\displaystyle{ \|x\|_1=\sum_{i=1}^n |x_i| }[/math] and [math]\displaystyle{ \textbf{Card}(x) }[/math] is the cardinality of [math]\displaystyle{ \,x }[/math] (the number of non-zero coefficients of [math]\displaystyle{ \,x }[/math]).
- The support [math]\displaystyle{ \,I }[/math] of [math]\displaystyle{ \,x }[/math] is the set [math]\displaystyle{ \{i: x_i \neq 0\} }[/math] and [math]\displaystyle{ \,I^c }[/math] denotes its complement.
- [math]\displaystyle{ \,\beta_{+} = \max\{\beta , 0\} }[/math].
- For a symmetric [math]\displaystyle{ n \times n }[/math] matrix [math]\displaystyle{ \,X }[/math] with eigenvalues [math]\displaystyle{ \,\lambda_i }[/math], [math]\displaystyle{ \operatorname{Tr}(X)_{+}=\sum_{i=1}^{n}\max\{\lambda_i,0\} }[/math].
- The vector of all ones is written [math]\displaystyle{ \textbf{1} }[/math], and the identity matrix is written [math]\displaystyle{ \,\textbf{I} }[/math]. The diagonal matrix with the vector [math]\displaystyle{ \,u }[/math] on the diagonal is written [math]\displaystyle{ \textbf{diag}(u) }[/math].
- For [math]\displaystyle{ \Sigma \, }[/math], a symmetric [math]\displaystyle{ n \times n }[/math] matrix, we can define [math]\displaystyle{ \phi(\rho) = \max_{\|x\| \leq 1} x^T \Sigma x - \rho \textbf{Card}(x) }[/math].
Sparse PCA
The sparse PCA problem can be written as:
[math]\displaystyle{ \textrm{subject} \; \textrm{to} \; \|x\|_2=1. }[/math]
Since [math]\displaystyle{ \,\Sigma \in S_{n} }[/math], [math]\displaystyle{ \,\Sigma }[/math] has a square root. Let [math]\displaystyle{ \,A \in R^{n \times n} }[/math] denote the square root of [math]\displaystyle{ \,\Sigma }[/math] where [math]\displaystyle{ \,\Sigma = A^TA }[/math].
The above problem is directly related to the following problem which involves finding a cardinality-constrained maximum eigenvalue:
[math]\displaystyle{ \textrm{subject} \; \textrm{to} \; \|x\|_2=1,\,\,\,\,\,\,\,\,\,\,\,\,\,(1) }[/math]
[math]\displaystyle{ \textbf{Card}(x)\leq k, }[/math]in the variable [math]\displaystyle{ \,x \in R^n }[/math]. Suppose the features of [math]\displaystyle{ \,\Sigma }[/math] are ordered in decreasing size of variance, i.e. [math]\displaystyle{ \Sigma_{11} \geq \dots \geq \Sigma_{nn} }[/math].
Using duality, we can bound the solution of [math]\displaystyle{ \,(1) }[/math] by:
where [math]\displaystyle{ \,P }[/math] is the set of penalty values for which [math]\displaystyle{ \,\phi(\rho) }[/math] has been computed. This tells us that if we prove [math]\displaystyle{ \,x }[/math] is optimal for [math]\displaystyle{ \,\phi(\rho) }[/math] then [math]\displaystyle{ \,x }[/math] is the global optimum for [math]\displaystyle{ \,(1) }[/math], with the cardinality of [math]\displaystyle{ \,x }[/math] being exactly [math]\displaystyle{ \,k }[/math].
For the remainder of the paper, the authors assume [math]\displaystyle{ \rho \leq \Sigma_{11} }[/math]. To see why this is the case, we will assume for the moment that [math]\displaystyle{ \rho \geq \Sigma_{11} }[/math]. Then, since [math]\displaystyle{ x^T \Sigma x\leq \Sigma_{11}(\sum_{i=1}^n|x_i|)^2 }[/math] and [math]\displaystyle{ (\sum_{i=1}^n|x_i|)^2 \leq \|x\|^2\textbf{Card}(x) \; \forall x \in R^n }[/math] we get the following:
[math]\displaystyle{ \leq (\Sigma_{11}-\rho)\textbf{Card}(x) }[/math]
This implies that the optimal solution to the SPCA problem is simply [math]\displaystyle{ \,x = 0 }[/math]. Thus, we assume [math]\displaystyle{ \,\rho \leq \Sigma_{11} }[/math] and in this case the inequality [math]\displaystyle{ \,\|x\| \le 1 }[/math] is tight.
Using the fact that the sparsity pattern of a vector [math]\displaystyle{ \,x }[/math] can be represented by a vector [math]\displaystyle{ \,u \in \{0, 1\}^n }[/math], the fact that [math]\displaystyle{ \,\textbf{diag}(u)^2 = \textbf{diag}(u) }[/math] for all variables [math]\displaystyle{ \,u \in \{0, 1\}^n }[/math], and the fact that for any matrix [math]\displaystyle{ \,B }[/math], [math]\displaystyle{ \,\lambda_{max}(B^TB) = \lambda_{max}(BB^T) }[/math], the SPCA problem can be re-expressed as:
[math]\displaystyle{ \,\phi(\rho) \; = \max_{u \in \{0,1\}^n} \; \lambda_{max}(\textbf{diag}(u) \; \Sigma \; \textbf{diag}(u)) - \rho\textbf{1}^Tu }[/math]
- [math]\displaystyle{ = \max_{u \in \{0,1\}^n} \; \lambda_{max}(\textbf{diag}(u) \; A^TA \; \textbf{diag}(u)) - \rho\textbf{1}^Tu }[/math]
- [math]\displaystyle{ = \max_{u \in \{0,1\}^n} \; \lambda_{max}(A \; \textbf{diag}(u) \; A^T) - \rho\textbf{1}^Tu }[/math]
- [math]\displaystyle{ = \max_{ \|x\| = 1} \; \max_{u \in \{0,1\}^n} x^T A \; \textbf{diag}(u) \; A^T x - \rho\textbf{1}^Tu }[/math]
- [math]\displaystyle{ = \max_{ \|x\| = 1} \; \max_{u \in \{0,1\}^n} \sum_{i=1}^n u_i((a_i^T x)^2 - \rho) }[/math].
Then, if we maximize in [math]\displaystyle{ \,u }[/math] and use the fact that [math]\displaystyle{ \,max_{v \in \{0,1\}} \beta v = \beta_+ }[/math], the SPCA problem, in the case where [math]\displaystyle{ \,\rho \le \Sigma_{11} }[/math], becomes:
which is a non-convex problem in [math]\displaystyle{ \,x \in R^n }[/math]. Note that, in this non-convex problem, we only need to select the values [math]\displaystyle{ \,i }[/math] at which [math]\displaystyle{ \,(a_i^T x)^2 - \rho \gt 0 }[/math].
Here, the [math]\displaystyle{ \,a_i }[/math]'s are the columns of the matrix [math]\displaystyle{ \,A }[/math] where [math]\displaystyle{ \,A^T A = \Sigma }[/math] (i.e. [math]\displaystyle{ \,A }[/math] is the square root of [math]\displaystyle{ \,\Sigma }[/math]).
For more details refer to <ref name= "afl" > Alexandre d'Aspremont, Francis Bach, and Laurent El Ghaoui. 2008. Optimal Solutions for Sparse Principal Component Analysis. J. Mach. Learn. Res. 9 (June 2008), 1269-1294. </ref>.
Greedy Solution
Before presenting their approximate greedy search algorithm for solving the SPCA problem, the authors first presented the full greedy search algorithm that follows directly from Moghaddam et al. <ref name="M2006a">B. Moghaddam, Y. Weiss, and S. Avidan. Spectral bounds for sparse PCA: Exact and greedy algorithms. Advances in Neural Information Processing Systems, 18, 2006.</ref> This algorithm starts from an initial solution (having cardinality one) at [math]\displaystyle{ \,\rho = \Sigma_{11} }[/math], and then it updates an increasing sequence of index sets [math]\displaystyle{ \,I_k \subseteq [1, n] }[/math] by scanning all the remaining variables to find the index that gives the maximum contribution in terms of variance.
The following pseudo-code (taken from the authors' paper listed in References) summarizes this full greedy search algorithm:
At every step, [math]\displaystyle{ \,I_k }[/math] represents the set of non-zero elements, or the sparsity pattern, of the current point. Given [math]\displaystyle{ \,I_k }[/math], the solution to the SPCA problem can be defined as [math]\displaystyle{ \,x_k = \underset{\{x_{I_k^c} = 0, \|x\| = 1\}}{\operatorname{argmax}} x^T \Sigma x - \rho k }[/math], i.e. [math]\displaystyle{ \,x_k }[/math] is formed simply by padding zeros to the leading eigenvector of the sub-matrix [math]\displaystyle{ \,\Sigma_{I_k,I_k} }[/math].
As estimating [math]\displaystyle{ \,n-k }[/math] eigenvalues at each iteration is costly, we get help from the fact that [math]\displaystyle{ \,uu^T }[/math] is a sub-gradient of [math]\displaystyle{ \,\lambda_{max} }[/math] at [math]\displaystyle{ \,X }[/math] if [math]\displaystyle{ \,u }[/math] is a leading eigenvector of [math]\displaystyle{ \,X }[/math] to get [math]\displaystyle{ \lambda_{max}(\sum_{j\in I_k\cup \{i\}}a_ja_j^T)\geq \lambda_{max}(\sum_{j\in I_k}a_ja_j^T)+(x_k^Ta_i)^2 }[/math]. With this, the authors had a lower bound on the objective which does not require finding [math]\displaystyle{ \,n - k }[/math] eigenvalues at each iteration.
The authors then derived the following algorithm for solving the SPCA problem:
Approximate Greedy Search Algorithm
Input: [math]\displaystyle{ \Sigma \in \textbf{R}^{n\times n} }[/math]
Algorithm:
1.Preprocessing: sort variables decreasingly diagonal elements and permute elements of [math]\displaystyle{ \Sigma }[/math] accordingly. Compute Cholesky decomposition [math]\displaystyle{ \,\Sigma =A^TA }[/math].
2.Initialization:[math]\displaystyle{ I_1=\{\}, x_1=a_1/\|a_1\| }[/math]
3.Compute [math]\displaystyle{ i_k= {\arg\max}_{i\notin I_k}(x_k^Ta_i)^2 }[/math]
4.Set [math]\displaystyle{ I_{k+1}=I_k\cup\{i_k\} }[/math] and compute [math]\displaystyle{ \,x_{k+1} }[/math] as the leading eigenvector of [math]\displaystyle{ \sum_{j\in I_{k+1}}a_j a_j^T }[/math]
5.Set [math]\displaystyle{ \,k=k+1 }[/math] if [math]\displaystyle{ \,k\lt n }[/math] go back to step 3.
Output: sparsity patterns [math]\displaystyle{ \,I_k }[/math]
As in the full greedy search algorithm, at every step, [math]\displaystyle{ \,I_k }[/math] represents the set of non-zero elements, or the sparsity pattern, of the current point and, given [math]\displaystyle{ \,I_k }[/math], the solution to the SPCA problem can be defined as [math]\displaystyle{ \,x_k =
\underset{\{x_{I_k^c} = 0, \|x\| = 1\}}{\operatorname{argmax}} x^T \Sigma x - \rho k }[/math], i.e. we form [math]\displaystyle{ \,x_k }[/math] simply by padding zeros to the leading eigenvector of the sub-matrix [math]\displaystyle{ \,\Sigma_{I_k,I_k} }[/math].
Computational Complexity
The full greedy search algorithm for solving the SPCA problem has a complexity of [math]\displaystyle{ \,O(n^4) }[/math] because, at each step [math]\displaystyle{ \,k }[/math], it computes [math]\displaystyle{ \,n-k }[/math] maximum eigenvalues of matrices having size [math]\displaystyle{ \,k }[/math]. On the other hand, the authors' approximate greedy search algorithm for solving the SPCA problem has a complexity of [math]\displaystyle{ \,O(n^3) }[/math]. This is because the first Cholesky decomposition has a complexity of [math]\displaystyle{ \,O(n^3) }[/math] and, in the [math]\displaystyle{ \,k }[/math]th iteration, there is a complexity of [math]\displaystyle{ \,O(k^2) }[/math] for the maximum eigenvalue problem and a complexity of [math]\displaystyle{ \,O(n^2) }[/math] for finding all products [math]\displaystyle{ \,x^T a_j }[/math].
Convex Relaxation
As mentioned above, the sparse PCA problem can be written as:
.
Because the variable [math]\displaystyle{ \,x }[/math] only appears through [math]\displaystyle{ \,X = xx^T }[/math], the above form of the SPCA problem can be reformulated in terms of only [math]\displaystyle{ \,X }[/math] by using the fact that, when [math]\displaystyle{ \,\|x\| = 1 }[/math], [math]\displaystyle{ \,X = xx^T }[/math] is equivalent to [math]\displaystyle{ \,\textbf{Tr}(X) = 1 }[/math], [math]\displaystyle{ \,X \ge 0 }[/math] and [math]\displaystyle{ \,\textbf{Rank}(X) = 1 }[/math].
Thus, the authors obtained the following form of the SPCA problem:
As the goal of the above form of the SPCA problem is to maximize a convex function over the convex set (http://en.wikipedia.org/wiki/Spectrahedron spectahedron]) [math]\displaystyle{ \,\Delta_n=\{X\in S_n : \textbf{Tr}(X)=1, X\geq 0\} }[/math], the solution must be an extreme point of [math]\displaystyle{ \,\Delta_n }[/math] and is therefore a rank-one matrix. Unfortunately, this form of the SPCA problem is convex in [math]\displaystyle{ \,X }[/math] and not concave, so the problem is still hard to solve. However, it is shown in <ref name="afl"/> that, on rank-one elements of [math]\displaystyle{ \,\Delta_n }[/math], this form of the SPCA problem is equal to a concave function of [math]\displaystyle{ X }[/math]. Using this fact, the authors produced a convex relaxation of this form of the SPCA problem.
The proposition is given below and the proof is provided in the authors' paper listed in Reference.
Proposition 1 Let [math]\displaystyle{ A\in{R}^{n\times n}, \rho \geq0 }[/math] and denotes by [math]\displaystyle{ a_1,...,a_n\in R^n }[/math] the columns of [math]\displaystyle{ A }[/math], an upper bound on:
can be computed by solving
in the variable [math]\displaystyle{ X\in S_n }[/math], where [math]\displaystyle{ B_i=a_ia_i^T-\rho I }[/math] or also:
, which is a semi-definite program in the variables [math]\displaystyle{ X\in S_n, P_i\in S_n }[/math].
It is always true that [math]\displaystyle{ \,\psi(\rho) \ge \phi(\rho) }[/math], and, when the solution to the above semi-definite program has rank one, we have that [math]\displaystyle{ \,\psi(\rho) = \phi(\rho) }[/math] and the convex relaxation ( which is [math]\displaystyle{ \psi(\rho)=\max\sum_{i=1}^n(\textbf{Tr}(P_iB_i)_{+} \;\; s.t. \; \textbf{Tr}(X)=1, X\geq 0, X\geq P_i \geq 0 }[/math] ) is tight.
Optimality Conditions
In this section, the optimality conditions considered by the authors are briefly discussed.
Dual problem and optimality conditions
The authors first derived the dual problem to the convex relaxation of SPCA ( which is [math]\displaystyle{ \psi(\rho)=\max\sum_{i=1}^n(\textbf{Tr}(P_iB_i)_{+} \;\; s.t. \; \textbf{Tr}(X)=1, X\geq 0, X\geq P_i \geq 0 }[/math] ) as well as the associated Karush-Kuhn-Tucker (KKT) optimality conditions. They began by presenting the following lemma (taken from the authors' paper listed in References and whose proof is also given in that paper):
Optimality conditions for rank one solutions
The authors then derived the KKT conditions for the convex relaxation of SPCA for the particular case in which we have a rank one candidate solution [math]\displaystyle{ \,X = xx^T }[/math] and we need to test our candidate solution's optimality. The following lemma (taken from the authors' paper listed in References and whose proof is also given in that paper) then provides an important connection between the convex relaxation of SPCA and the original non-convex form of the SPCA problem (which is [math]\displaystyle{ \;\phi(\rho) \;\; \textrm{subject} \; \textrm{to} \; \|x\|_2=1 }[/math]):
Based on the necessary and sufficient optimality conditions for the convex relaxation of SPCA as given in lemma 2, lemma 3 gives that, for any candidate vector [math]\displaystyle{ \,x }[/math], we can test the optimality of [math]\displaystyle{ \,X = xx^T }[/math] for the convex relaxation of SPCA by solving a semi-definite feasibility problem in the variables [math]\displaystyle{ \,Y_i \in S_n }[/math], and that, if the rank one candidate solution [math]\displaystyle{ X = xx^T }[/math] is optimal for the convex relaxation of SPCA, then [math]\displaystyle{ \,x }[/math] is globally optimal for the original non-convex combinatorial form of SPCA ( which is [math]\displaystyle{ \;\phi(\rho) \;\; \textrm{subject} \; \textrm{to} \; \|x\|_2=1 }[/math] ).
Solution improvements and randomization
If the conditions are not met, then the rank of the convex relaxation of SPCA's optimal solution would be strictly greater than one, and hence the convex relaxation of SPCA would not be tight. If this is the case, then a different relaxation such as DSPCA by d’Aspremont et al. (2007b) (more details regarding it is available in d’Aspremont et al.'s paper) may be used to try to get a better solution for SPCA. Furthermore, following Ben-Tal and Nemirovski (2002), randomization techniques may also be applied to improve the quality of the convex relaxation of SPCA's solution.
Application
In this section we mention the application of sparse PCA to subset selection. One of the other application is compressed sensing which you can see more detail about it on the main paper:
Subset selection
we consider [math]\displaystyle{ \,p }[/math] datapoints in [math]\displaystyle{ \,R^n }[/math] in a data matrix [math]\displaystyle{ X \in R^{p\times n} }[/math]. We are given real numbers [math]\displaystyle{ y \in R^p }[/math] to predict from [math]\displaystyle{ \,X }[/math] using linear regression, estimated by least squares. In subset selection problem we are looking for sparse coefficients [math]\displaystyle{ \,w }[/math], i.e a vector [math]\displaystyle{ \,w }[/math] with many zeros in its entries. We thus consider the problem:
Using sparsity pattern [math]\displaystyle{ u\in \{0,1\} }[/math] and optimizing with respect to [math]\displaystyle{ \,w }[/math] and rewriting the formula using generalized eigenvalue we thus have:
The simple bound on the optimal value of the subset selection problem has the following form:
where [math]\displaystyle{ B\geq0 }[/math], and thus we have:
[math]\displaystyle{ \|y\|^2-s_0\geq(k)\geq\|y\|^2-s_0-B(\min_{v\in\{0,1\}^n,1^Tv=k}\lambda_{min}(X(v)^TX(v)))^{-1} \geq\|y\|^2-s_0-B(\lambda_{min}(X^TX))^{-1} }[/math]
This bound gives a sufficient condition for optimality in subset selection, for any problem instances and any given subset.
Conclusion
Here present a new convex relaxation of sparse principal component analysis. The sufficient conditions are derived for optimality <ref name="afl"/>. Those conditions go together with efficient greedy algorithms that provide candidate solutions, many of which turn out to be optimal in practice. Finally the resulting upper bound has direct application to problems such as sparse recovery and subset selection.
References
<references />