Optimal Solutions for Sparse Principal Component Analysis
Revision as of 17:05, 2 December 2010
Introduction
Principal component analysis (PCA) is a method for finding linear combinations of the variables, called principal components, corresponding to orthogonal directions maximizing the variance in the data. PCA facilitates the interpretation of the data when these factors are combinations of only a few variables, not all or many of them. Constraining the number of nonzero coefficients in PCA is known as sparse PCA. Sparse PCA has many applications in biology, finance and many machine learning problems. Sparse principal components, like principal components, are vectors spanning a lower-dimensional space that explains most of the variance in the original data. However, in order to find the sparse principal components using sparse PCA, it is necessary to make some sacrifices:
- There is a reduction in the explained variance in the original data captured by the sparse principal components as compared to PCA.
- There is a reduction in the orthogonality (uncorrelatedness) of the resulting variables (sparse principal components) as compared to PCA.
In this paper we are going to focus on the sparse PCA problem, which can be written as:
[math]\displaystyle{ \phi(\rho)=\max_{z} \; z^T\Sigma z-\rho\,\textbf{Card}(z) \quad \textrm{subject} \; \textrm{to} \; \|z\|_2=1 }[/math]
where [math]\displaystyle{ z\in \textbf{R}^n }[/math], [math]\displaystyle{ \Sigma \in S_n }[/math] is the symmetric positive semidefinite sample covariance matrix, [math]\displaystyle{ \rho }[/math] is a parameter which controls the sparsity, and [math]\displaystyle{ \textbf{Card}(z) }[/math] expresses the cardinality of [math]\displaystyle{ z }[/math]. Note that solving the PCA problem is not complicated, while solving sparse PCA is NP-hard.
This paper first formulates the sparse PCA problem and then derives a greedy algorithm for computing a full set of good solutions with total complexity [math]\displaystyle{ O(n^3) }[/math]. It also formulates a convex relaxation of sparse PCA and uses it to derive tractable sufficient conditions for a vector [math]\displaystyle{ z }[/math] to be a global optimum of the above formula.
Notation
- For a vector [math]\displaystyle{ z }[/math], [math]\displaystyle{ \|z\|_1=\sum_{i=1}^n |z_i| }[/math] and [math]\displaystyle{ \textbf{Card}(z) }[/math] is the number of non-zero coefficients of [math]\displaystyle{ z }[/math], while the support [math]\displaystyle{ I }[/math] of [math]\displaystyle{ z }[/math] is the set [math]\displaystyle{ \{i: z_i \neq 0\} }[/math] and [math]\displaystyle{ I^c }[/math] denotes its complement. [math]\displaystyle{ \beta_{+} }[/math] denotes the maximum of [math]\displaystyle{ \{\beta , 0\} }[/math].
- For a symmetric matrix [math]\displaystyle{ X }[/math] with eigenvalues [math]\displaystyle{ \lambda_i }[/math], [math]\displaystyle{ \textbf{Tr}(X)_{+}=\sum_{i=1}^{n}\max\{\lambda_i,0\} }[/math].
- The vector of all ones is written [math]\displaystyle{ \textbf{1} }[/math]. The diagonal matrix with the vector [math]\displaystyle{ u }[/math] on the diagonal is written [math]\displaystyle{ \textbf{diag}(u) }[/math].
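As a quick illustration of this notation, the quantities above can be computed directly with NumPy (the example vector and matrix here are arbitrary, chosen only for the demonstration):

```python
import numpy as np

# Arbitrary example vector and symmetric matrix (not from the paper).
z = np.array([0.0, 3.0, -1.0, 0.0, 2.0])
X = np.array([[2.0, 1.0],
              [1.0, -3.0]])

l1_norm = np.sum(np.abs(z))               # ||z||_1 = 6.0
card = np.count_nonzero(z)                # Card(z) = 3
support = np.flatnonzero(z)               # I = {i : z_i != 0}
tr_plus = np.maximum(np.linalg.eigvalsh(X), 0).sum()  # Tr(X)_+

print(l1_norm, card, support, tr_plus)
```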
Sparse PCA
The sparse PCA problem can be written as:
[math]\displaystyle{ \phi(\rho)=\max_{x} \; x^T\Sigma x-\rho\,\textbf{Card}(x) \quad \textrm{subject} \; \textrm{to} \; \|x\|_2=1 }[/math]
The above problem is directly related to solving the following:
[math]\displaystyle{ \max_{x} \; x^T\Sigma x \quad \textrm{subject} \; \textrm{to} \; \|x\|_2=1, }[/math]
[math]\displaystyle{ \textbf{Card}(x)\leq k. }[/math]
By duality, an upper bound on the optimal value of the above problem is given by:
[math]\displaystyle{ \inf_{\rho\in P} \; \phi(\rho)+\rho k }[/math]
where [math]\displaystyle{ P }[/math] is the set of penalty values for which [math]\displaystyle{ \phi(\rho) }[/math] has been computed. Since [math]\displaystyle{ z^T\Sigma z\leq \Sigma_{11}(\sum_{i=1}^n|z_i|)^2 }[/math] and [math]\displaystyle{ (\sum_{i=1}^n|z_i|)^2\leq \|z\|^2\textbf{Card}(z) }[/math], using the above lemma while assuming [math]\displaystyle{ \rho \geq \Sigma_{11} }[/math] (the largest diagonal element, with the variables sorted), we obtain:
[math]\displaystyle{ z^T\Sigma z-\rho\,\textbf{Card}(z)\leq (\Sigma_{11}-\rho)\textbf{Card}(z)\leq 0 }[/math]
Rewriting the preliminary formula in the equivalent form, we have:
[math]\displaystyle{ \phi(\rho)=\max_{\|x\|_2=1} \; \sum_{i=1}^n \left((a_i^Tx)^2-\rho\right)_{+} }[/math]
where [math]\displaystyle{ a_i }[/math] denotes the [math]\displaystyle{ i }[/math]-th column of [math]\displaystyle{ A }[/math], with [math]\displaystyle{ \Sigma=A^TA }[/math].
For more detail refer to <ref name= "afl" > Alexandre d'Aspremont, Francis Bach, and Laurent El Ghaoui. 2008. Optimal Solutions for Sparse Principal Component Analysis. J. Mach. Learn. Res. 9 (June 2008), 1269-1294. </ref>.
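For very small instances, [math]\displaystyle{ \phi(\rho) }[/math] can be evaluated exactly by enumerating supports, since for a fixed support [math]\displaystyle{ I }[/math] the best unit vector attains [math]\displaystyle{ \lambda_{max}(\Sigma_{II}) }[/math]. A minimal brute-force sketch (the function name `phi_brute` and the random test matrix are ours, not from the paper):

```python
import numpy as np
from itertools import combinations

def phi_brute(Sigma, rho):
    """Exact phi(rho) = max_{||z||_2=1} z^T Sigma z - rho * Card(z),
    by enumerating every support I and using the fact that the best unit
    vector supported on I attains lambda_max(Sigma_II). Exponential cost,
    so this is only a sanity check for toy problems."""
    n = Sigma.shape[0]
    best = -np.inf
    for k in range(1, n + 1):
        for I in combinations(range(n), k):
            lam = np.linalg.eigvalsh(Sigma[np.ix_(I, I)]).max()
            best = max(best, lam - rho * k)
    return best

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
Sigma = A.T @ A                          # a PSD sample covariance

# rho = 0 recovers ordinary PCA: phi(0) = lambda_max(Sigma).
print(phi_brute(Sigma, 0.0), np.linalg.eigvalsh(Sigma).max())
# For rho >= Sigma_11 (largest diagonal entry), the bound gives phi(rho) <= 0.
print(phi_brute(Sigma, Sigma.diagonal().max()))
```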
Greedy Solution
The full greedy algorithm requires estimating [math]\displaystyle{ n-k }[/math] eigenvalues at each iteration, which is costly. To compute an approximate full path more cheaply, we use the fact that [math]\displaystyle{ uu^T }[/math] is a subgradient of [math]\displaystyle{ \lambda_{max} }[/math] at [math]\displaystyle{ X }[/math] if [math]\displaystyle{ u }[/math] is a leading eigenvector of [math]\displaystyle{ X }[/math], to get:
[math]\displaystyle{ \lambda_{max}\Big(\sum_{j\in I_k\cup\{i\}}a_ja_j^T\Big)\geq \lambda_{max}\Big(\sum_{j\in I_k}a_ja_j^T\Big)+(x_k^Ta_i)^2 }[/math]
We then derive the following algorithm:
______________________________________________
Approximate Greedy Search Algorithm
Input: [math]\displaystyle{ \Sigma \in \textbf{R}^{n\times n} }[/math]
Algorithm:
1. Preprocessing: sort variables by decreasing diagonal elements and permute the elements of [math]\displaystyle{ \Sigma }[/math] accordingly. Compute the Cholesky decomposition [math]\displaystyle{ \Sigma =A^TA }[/math].
2. Initialization: [math]\displaystyle{ I_1=\{\}, x_1=a_1/\|a_1\| }[/math]
3. Compute [math]\displaystyle{ i_k= \mathrm{argmax}_{i\notin I_k}(x_k^Ta_i)^2 }[/math]
4. Set [math]\displaystyle{ I_{k+1}=I_k\cup\{i_k\} }[/math] and compute [math]\displaystyle{ x_{k+1} }[/math] as the leading eigenvector of [math]\displaystyle{ \sum_{j\in I_{k+1}}a_j a_j^T }[/math]
5. Set [math]\displaystyle{ k=k+1 }[/math]; if [math]\displaystyle{ k\lt n }[/math], go back to step 3.
Output: sparsity patterns [math]\displaystyle{ I_k }[/math]
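The steps above can be sketched in Python as follows; the function name and return convention are ours, and the sketch assumes [math]\displaystyle{ \Sigma }[/math] is positive definite so the Cholesky step succeeds:

```python
import numpy as np

def approximate_greedy_path(Sigma):
    """Sketch of the approximate greedy search above (helper name assumed).
    Sigma must be positive definite for the Cholesky step.
    Returns the nested sparsity patterns I_1, I_2, ..., I_n,
    expressed in the original (unsorted) variable indices."""
    n = Sigma.shape[0]
    # 1. Preprocessing: sort variables by decreasing diagonal elements.
    order = np.argsort(-np.diag(Sigma))
    S = Sigma[np.ix_(order, order)]
    A = np.linalg.cholesky(S).T          # S = A^T A
    a = [A[:, i] for i in range(n)]
    # 2. Initialization: empty support, x_1 = a_1 / ||a_1||.
    I, patterns = [], []
    x = a[0] / np.linalg.norm(a[0])
    while len(I) < n:
        # 3. Pick the variable maximizing the subgradient bound (x^T a_i)^2.
        i_k = max((i for i in range(n) if i not in I),
                  key=lambda i: (x @ a[i]) ** 2)
        # 4. Add it and recompute the leading eigenvector.
        I.append(i_k)
        M = sum(np.outer(a[j], a[j]) for j in I)
        x = np.linalg.eigh(M)[1][:, -1]
        patterns.append(set(int(j) for j in order[I]))
    return patterns
```

Running this on a small covariance matrix returns one nested support per target cardinality, matching the algorithm's output of sparsity patterns [math]\displaystyle{ I_k }[/math].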
Convex Relaxation
As mentioned, the sparse PCA problem can be written as:
[math]\displaystyle{ \phi(\rho)=\max_{\|x\|_2=1} \; \sum_{i=1}^n \left((a_i^Tx)^2-\rho\right)_{+} }[/math]
We reformulate the problem by setting [math]\displaystyle{ X=xx^T }[/math]; thus we rewrite the equivalent problem:
[math]\displaystyle{ \max \; \sum_{i=1}^n \left(a_i^TXa_i-\rho\right)_{+} \quad \textrm{s.t.} \; \textbf{Tr}(X)=1,\; \textbf{Rank}(X)=1,\; X\geq 0 }[/math]
Note that because we are maximizing a convex function over the convex set [math]\displaystyle{ \Delta_n=\{X\in S_n : \textbf{Tr}(X)=1, X\geq 0\} }[/math], the solution must be an extreme point of [math]\displaystyle{ \Delta_n }[/math]. Unfortunately, the function we are maximizing is convex in [math]\displaystyle{ X }[/math] and not concave, so the problem is still hard to solve. However, it is shown in <ref name="afl"/> that on rank-one elements of [math]\displaystyle{ \Delta_n }[/math] it is equal to a concave function of [math]\displaystyle{ X }[/math], and we use this to produce a semidefinite relaxation of the problem. The proposition is as below; the proof is provided in the paper.
Proposition 1 Let [math]\displaystyle{ A\in{R}^{n\times n} }[/math], [math]\displaystyle{ \rho \geq0 }[/math], and denote by [math]\displaystyle{ a_1,...,a_n\in R^n }[/math] the columns of [math]\displaystyle{ A }[/math]. An upper bound on:
[math]\displaystyle{ \phi(\rho)=\max \; \sum_{i=1}^n \left(a_i^TXa_i-\rho\right)_{+} \quad s.t. \; \textbf{Tr}(X)=1,\textbf{Rank}(X)=1, X\geq 0 }[/math]
can be computed by solving
[math]\displaystyle{ \psi(\rho)=\max \; \sum_{i=1}^n \textbf{Tr}(X^{1/2}B_iX^{1/2})_{+} \quad s.t. \; \textbf{Tr}(X)=1,\; X\geq 0 }[/math]
in the variable [math]\displaystyle{ X\in S_n }[/math], where [math]\displaystyle{ B_i=a_ia_i^T-\rho I }[/math], or also:
[math]\displaystyle{ \psi(\rho)=\max \; \sum_{i=1}^n \textbf{Tr}(P_iB_i) \quad s.t. \; \textbf{Tr}(X)=1,\; X\geq P_i\geq 0 }[/math]
which is a semidefinite program in the variables [math]\displaystyle{ X\in S_n, P_i\in S_n }[/math].
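As a numerical sanity check of the claim that the relaxed objective agrees with the original one on rank-one matrices: for a unit-norm [math]\displaystyle{ x }[/math] and [math]\displaystyle{ X=xx^T }[/math] we have [math]\displaystyle{ X^{1/2}=X }[/math], so each term [math]\displaystyle{ \textbf{Tr}(X^{1/2}B_iX^{1/2})_{+} }[/math] reduces to [math]\displaystyle{ ((a_i^Tx)^2-\rho)_{+} }[/math]. A short sketch with arbitrary random data (our own example, not from the paper):

```python
import numpy as np

def tr_plus(M):
    """Tr(M)_+ : sum of the positive eigenvalues of a symmetric matrix."""
    return np.maximum(np.linalg.eigvalsh(M), 0).sum()

rng = np.random.default_rng(1)
n, rho = 4, 0.3
A = rng.standard_normal((n, n))          # arbitrary data matrix
a = [A[:, i] for i in range(n)]
B = [np.outer(ai, ai) - rho * np.eye(n) for ai in a]

x = rng.standard_normal(n)
x /= np.linalg.norm(x)
X = np.outer(x, x)                       # rank one, Tr(X) = 1, X >= 0
# For this X, X^{1/2} = X, so the relaxed terms can be evaluated directly.
relaxed = sum(tr_plus(X @ Bi @ X) for Bi in B)
original = sum(max((ai @ x) ** 2 - rho, 0.0) for ai in a)
print(relaxed, original)                 # the two objectives coincide
```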
Application
In this section we mention the application of sparse PCA to subset selection and compressed sensing:
Subset selection
We consider [math]\displaystyle{ p }[/math] data points in [math]\displaystyle{ R^n }[/math], in a data matrix [math]\displaystyle{ X \in R^{p\times n} }[/math]. We are given real numbers [math]\displaystyle{ y \in R^p }[/math] to predict from [math]\displaystyle{ X }[/math] using linear regression, estimated by least squares. In the subset selection problem we are looking for a sparse coefficient vector [math]\displaystyle{ w }[/math], i.e., a vector [math]\displaystyle{ w }[/math] with many zeros. We thus consider the problem:
[math]\displaystyle{ \min_{w\in R^n} \; \|y-Xw\|^2 \quad \textrm{s.t.} \; \textbf{Card}(w)\leq s_0 }[/math]
Using a sparsity pattern [math]\displaystyle{ u\in \{0,1\}^n }[/math], optimizing with respect to [math]\displaystyle{ w }[/math], and rewriting the formula using generalized eigenvalues, we thus have:
A simple bound on the optimal value of the subset selection problem, where [math]\displaystyle{ B\geq0 }[/math], takes the following form:
[math]\displaystyle{ \geq\|y\|^2-s_0-B(\lambda_{min}(X^TX))^{-1} }[/math]
This bound gives a sufficient condition for optimality in subset selection, for any problem instance and any given subset.
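For toy sizes, the subset selection problem itself can be solved exactly by enumerating supports and solving a least-squares problem on each, which is useful when checking bounds such as the one above (the function name `subset_selection_brute` is ours):

```python
import numpy as np
from itertools import combinations

def subset_selection_brute(X, y, s0):
    """Exact subset selection by enumeration (toy sizes only):
    minimize ||y - X w||^2 over w with Card(w) <= s0."""
    p, n = X.shape
    best_val, best_w = np.sum(y ** 2), np.zeros(n)   # w = 0 baseline
    for k in range(1, s0 + 1):
        for I in combinations(range(n), k):
            w = np.zeros(n)
            # Least squares restricted to the candidate support I.
            wI, *_ = np.linalg.lstsq(X[:, list(I)], y, rcond=None)
            w[list(I)] = wI
            val = np.sum((y - X @ w) ** 2)
            if val < best_val:
                best_val, best_w = val, w
    return best_val, best_w
```

On noiseless data generated from a sparse coefficient vector, this enumeration recovers the true support exactly.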
Conclusion
References
<references />