a Direct Formulation For Sparse PCA Using Semidefinite Programming
Still under construction
Introduction
Principle Component Analysis is a popular technique which finds a transformation from correlated variables to uncorrelated ones. In sparse PCA, the embedded variables are called principle components that are linear combination of the input variables subject to the constraint that the number of non-zero elements in this combination is limited. Sparse PCA decomposition interpretation will be facilitated if the coordinate axes have a physical meaning. It is also helpful in expressing a space of a set of low-dimensional vectors with minimum loss of information.
Semidefinite programs are convex optimization. It optimize a linear function by constraining it to an affine combination of symmetric matrices is positive semidefinite. Semidefinite programs can be solved in polynomial time by interior-point solvers like SEDUMI or SDPT3. Unfortunately it is not viable and practical for high dimensional data sets.
The related paper suggests a direct approach for formulation of sparse PCA via semidefinite programming which is convex. Since interior-point solvers cannot handle large data sets, Nesterov’s smoothing technique is used efficiently to help solving large dimensional problems.
First order methods require less memory which is an important issue in interior-point solvers. On the other hand their convergence is slower but it is not a major concern in the certain case.
The content of the paper could be summarized as below:
1-Semidefinite relaxation
2-A robustness interpretation
3-Sparse decomposition
4-Algorithm
4-1-A smoothing technique
4-2-Application o sparse PCA
5-Numerical result and application
Semidefinite relaxation
A is assumed to be a symmetric matrix and WLOG, A is a covariance matrix and we are going to maximize the variance of vector x ∈ Rn while it is sparse.
Rewriting the above formulas in semidefinite programming, the following formulas are achived in square and non-square cases:
maximize [math]\displaystyle{ x^{T}{A}x }[/math]
subject to [math]\displaystyle{ \|x\|_2=1 }[/math]
[math]\displaystyle{ \textbf{Card}(x)\leq k }[/math]
maximize [math]\displaystyle{ \textbf{Tr}({A}{X}) }[/math]
subject to [math]\displaystyle{ \textbf{Tr}({X})=1 }[/math],
[math]\displaystyle{ \textbf{Card}({X})\leq k^2 }[/math],
[math]\displaystyle{ {X}\geq 0 }[/math]
[math]\displaystyle{ \textbf{Rank}({X})=1 }[/math]
maximize [math]\displaystyle{ \textbf{Tr}({A}{X}) }[/math]
subject to [math]\displaystyle{ \textbf{Tr}({X})=1 }[/math],
[math]\displaystyle{ \textbf{1}^{T}|{X}|\textbf{1}\leq k }[/math],
[math]\displaystyle{ {X}\geq 0 }[/math]
maximize [math]\displaystyle{ u^{T}{A}v }[/math]
subject to [math]\displaystyle{ \|u\|_{2}=\|v\|_{2}=1 }[/math],
[math]\displaystyle{ \textbf{Card}(u)\leq k_{1},\textbf{Card}(u)\leq k_{1} }[/math],
maximize [math]\displaystyle{ \textbf{Tr}({A}^{T}{X}^{12}) }[/math]
subject to [math]\displaystyle{ {X}\leq0, \textbf{Tr}({X}^{ii})=1 }[/math],
[math]\displaystyle{ \textbf{1}^{T}|{X}^{ii}|\textbf{1}\leq k_{i}, i=1,2 }[/math],
[math]\displaystyle{ \textbf{1}^{T}|{X}^{12}|\textbf{1}\leq \sqrt{k_1k_2} }[/math]
Robustness interpretation
maximize [math]\displaystyle{ x^{T}{A}x }[/math]
subject to [math]\displaystyle{ \|x\|_2=1 }[/math]
[math]\displaystyle{ \textbf{Card}(x)\leq k }[/math]
maximize [math]\displaystyle{ x^{T}{A}x-\textbf{Card}^{2}(x) \rho }[/math]
subject to [math]\displaystyle{ \|x\|_2=1 }[/math]
maximize [math]\displaystyle{ \textbf{Tr}({A}{X})-\textbf{Card}({X})\rho }[/math]
subject to [math]\displaystyle{ \textbf{Tr}({X})=1 }[/math],
[math]\displaystyle{ {X}\geq 0 }[/math]
[math]\displaystyle{ \textbf{Rank}({X})=1 }[/math]
maximize [math]\displaystyle{ \textbf{Tr}({A}{X})-\rho\textbf{1}^{T}|{X}|\textbf{1} }[/math]
subject to [math]\displaystyle{ \textbf{Tr}({X})=1 }[/math],
[math]\displaystyle{ {X}\geq 0 }[/math]
[math]\displaystyle{ max _{{X}\geq 0, \textbf{Tr}({X})=1} min _{|{U}_{ij}\leq \rho|}\textbf{Tr}({X}({A}+{U})) }[/math]
minimize [math]\displaystyle{ \lambda^{max}({A}+{U}) }[/math]
subject to [math]\displaystyle{ |{U}_{ij}|\leq \rho, i,j=1,...,n }[/math]
[math]\displaystyle{ ({A}+{U}){X}=\lambda^{max}({A}+{U}){X} }[/math]
Sparse decomposition
maximize [math]\displaystyle{ \textbf{Tr}({A}{X}) }[/math]
subject to [math]\displaystyle{ \textbf{Tr}({X})=1 }[/math],
[math]\displaystyle{ \textbf{1}^{T}|{X}|\textbf{1}\leq k }[/math],
[math]\displaystyle{ {X}\geq 0 }[/math]
[math]\displaystyle{ {A}_2={A}_1-(x_{1}^{T}{A}_1x_1)x_1x_1^T }[/math]
Algorithm
Smoothing technique
[math]\displaystyle{ f(x)=\hat{f}(x)+max_{u}\{\lt \textbf{T}x,u\gt -\hat{\phi}(u) : u \in \textbf{Q}_2\} }[/math]
[math]\displaystyle{ min _{x\in \textbf{Q}_1}f(x) }[/math]
Application o sparse PCA
maximize [math]\displaystyle{ \textbf{Tr}({A}{X})-\textbf{1}^{T}|{X}|\textbf{1} }[/math]
subject to [math]\displaystyle{ \textbf{Tr}({X})=1 }[/math],
[math]\displaystyle{ {X}\geq 0 }[/math]
[math]\displaystyle{ min_{{U}\in{Q}_1}f({U}) }[/math]
[math]\displaystyle{ {Q}_1=\{{U}\in \textbf{S}^{n}:|{U}_{ij}|\leq 1,i,j=1,...,n\},{Q}_2=\{{X}\in\textbf{S}^n:\textbf{Tr} X=1,X\geq0\} }[/math]
[math]\displaystyle{ f(U):=max_{X \in Q_2}\lt TU,X\gt -\hat{\phi}(X) }[/math] with [math]\displaystyle{ T=I_{n^2}, \hat{\phi}(X)=-\textbf{Tr}(AX) }[/math]
[math]\displaystyle{ d_1(U)=\frac{1}{2}U^T U }[/math]
[math]\displaystyle{ U_0:=arg min_{U\in Q_{1}}d_1(U) }[/math]
[math]\displaystyle{ D_1:=max_{U \in Q_1}d_1(U)=n^2/2 }[/math]
[math]\displaystyle{ d_2(X)=\textbf{Tr}(XlogX)+log(n) }[/math]
[math]\displaystyle{ max_{X\in Q_2}d_2(X)\leq log n:=D_2 }[/math]
[math]\displaystyle{ \| T\| _{1,2}:= max_{X,U}\lt TX,U\gt :\| X\| _{F}=1,\|U\|^*_2=1 }[/math]
[math]\displaystyle{ =max_X\|X\|_2:\|X\|_F\leq 1 }[/math]
[math]\displaystyle{ =1 }[/math]
[math]\displaystyle{ D_1=n^2/2,\sigma _1=1,D_2=log(n),\sigma_2=1/2,\|T\|_{1,2}=1 }[/math]
Regularization
[math]\displaystyle{ \mu:=\frac{\epsilon}{2D_2} }[/math]
[math]\displaystyle{ N=\frac{4\|T\|_{1,2}}{\epsilon}\sqrt{\frac{D_1D_2}{\sigma_1\sigma_2}} }[/math]
[math]\displaystyle{ min_{U \in Q_1}f_{\mu}(U) }[/math]
[math]\displaystyle{ f_{\mu}(U):=max_{X \in Q_2}\lt TU,X\gt -\hat{\phi}(X)-\mu d_2(X) }[/math]
[math]\displaystyle{ L:=\frac{D_2\|T\|_{1,2}^2}{\epsilon 2\sigma_2} }[/math]
[math]\displaystyle{ f_\mu(U)=\mu log(\textbf{Tr}exp((A+U)/\mu))-\mu log n }[/math]
First order minimization
References
<references />