A Penalized Matrix Decomposition, with Applications to Sparse Principal Components and Canonical Correlation Analysis
Revision as of 19:11, 7 November 2010

Still under construction

Introduction

Matrix decompositions or factorizations are a useful tool for identifying the underlying structure of a matrix and the data it represents. However, many of these decompositions produce dense factors which are hard to interpret. Enforcing sparsity in the factors gives factorizations which are more amenable to interpretation. In their paper, Witten, Tibshirani, and Hastie<ref name="WTH2009">Daniela M. Witten, Robert Tibshirani, and Trevor Hastie. (2009) "A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis". Biostatistics, 10(3):515–534.</ref> develop a penalized matrix decomposition (PMD) that uses penalty functions on the factors to ensure sparsity and ease of interpretation. Their paper has three major components. They begin by presenting their algorithm for the PMD and deriving efficient versions for two common sets of penalty functions. Next, they use a particular form of their algorithm to derive a sparse version of principal component analysis (PCA). By comparing this version with two other sparse PCA methods, by Jolliffe and others<ref name="JTU2003">Ian T. Jolliffe, Nickolay T. Trendafilov, and Mudassir Uddin. (2003) "A modified principal component technique based on the lasso". Journal of Computational and Graphical Statistics, 12(3):531–547.</ref> and by Zou and others<ref name="ZHT2006">Hui Zou, Trevor Hastie, and Robert Tibshirani. (2006) "Sparse Principal Component Analysis". Journal of Computational and Graphical Statistics, 15(2):265–286.</ref>, they show how the three methods are related. In particular, they show how their sparse PCA algorithm can be used to efficiently solve the SCoTLASS problem proposed by Jolliffe and others<ref name="JTU2003"/>, a problem that is computationally hard to solve in its original form. Finally, they use the PMD to derive a new method for penalized canonical correlation analysis (CCA). The main application of this procedure is to genomic data.
They argue that, since it is becoming increasingly common for biologists to perform multiple assays on the same set of samples, there is an increased need for methods that perform inference across data sets. To this end, they demonstrate their penalized CCA method on a genomic data set consisting of gene expression and DNA copy number measurements on the same set of patient samples. Using penalized CCA they can identify sets of genes whose expression is correlated with regions of copy number change.

Penalized Matrix Decomposition

The PMD is a generalization of the singular value decomposition (SVD) with additional constraints. The SVD of a matrix [math]\displaystyle{ \textbf{X} \in \mathbb{R}^{n \times p} }[/math] with [math]\displaystyle{ \textrm{rank}(\textbf{X}) = K \leq \min(n,p) }[/math] can be written as follows

[math]\displaystyle{ \textbf{X} = \textbf{U}\textbf{D}\textbf{V}^T, \; \textbf{U}^T\textbf{U} = \textbf{I}_K, \; \textbf{V}^T\textbf{V} = \textbf{I}_K, \; d_1 \geq d_2 \geq \cdots \geq d_K \gt 0, }[/math]

where the overall mean of [math]\displaystyle{ \textbf{X} }[/math] is assumed equal to 0. Alternatively, if we let [math]\displaystyle{ \textbf{u}_k }[/math] be the [math]\displaystyle{ k }[/math]th column of [math]\displaystyle{ \textbf{U} }[/math], [math]\displaystyle{ \textbf{v}_k }[/math] the [math]\displaystyle{ k }[/math]th column of [math]\displaystyle{ \textbf{V} }[/math], and [math]\displaystyle{ \ d_k }[/math] the [math]\displaystyle{ k }[/math]th diagonal element of the diagonal matrix [math]\displaystyle{ \textbf{D} }[/math] then the SVD of [math]\displaystyle{ \textbf{X} }[/math] can be written as

[math]\displaystyle{ \textbf{X} = \sum_{k=1}^K d_k\textbf{u}_k\textbf{v}_k^T. }[/math]
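This sum-of-rank-1-terms form of the SVD can be checked numerically. The following sketch (an illustration, not from the paper) centers a small random matrix and rebuilds it from its singular triplets using numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
X -= X.mean()  # center so the overall mean of X is 0, as assumed above

# full_matrices=False gives the thin SVD: U is n-by-K, Vt is K-by-p
U, d, Vt = np.linalg.svd(X, full_matrices=False)
K = np.linalg.matrix_rank(X)

# X equals the sum of rank-1 terms d_k u_k v_k^T
X_rebuilt = sum(d[k] * np.outer(U[:, k], Vt[k, :]) for k in range(K))
```

Here `X_rebuilt` agrees with `X` to machine precision, since the full rank-[math]\displaystyle{ K }[/math] sum is used.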

By imposing additional constraints on [math]\displaystyle{ \textbf{u}_k }[/math] and [math]\displaystyle{ \textbf{v}_k }[/math] the authors derive a penalized matrix decomposition. For simplicity, the authors begin by considering the rank-1 approximation. In this case, the PMD is the solution to the following optimization problem:

[math]\displaystyle{ \min_{d,\textbf{u},\textbf{v}} \frac{1}{2}\|\textbf{X} - d\textbf{u}\textbf{v}^T \|^2_F \;\; \textrm{ subject to } \;\; \| \textbf{u} \|_2^2 = 1, \; \| \textbf{v} \|^2_2 = 1, \; P_1(\textbf{u}) \leq c_1, \; P_2(\textbf{v}) \leq c_2, \; d \geq 0. }[/math]
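With lasso penalties on both factors, this rank-1 problem can be attacked by alternating updates of [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math], each a soft-thresholded projection renormalized to unit length, with the threshold chosen by bisection to meet the L1 constraint. The sketch below is an illustrative numpy implementation in that spirit; the iteration counts, tolerances, and initialization are assumptions, not taken from the paper:

```python
import numpy as np

def soft(a, delta):
    # soft-thresholding: S(a, delta) = sign(a) * max(|a| - delta, 0)
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def l1_unit(a, c, n_bisect=50):
    """Return S(a, delta)/||S(a, delta)||_2 with delta >= 0 chosen by
    bisection so the result has L1 norm <= c (delta = 0 if unneeded)."""
    v = a / np.linalg.norm(a)
    if np.abs(v).sum() <= c:
        return v
    lo, hi = 0.0, np.abs(a).max()
    for _ in range(n_bisect):
        mid = (lo + hi) / 2
        s = soft(a, mid)
        if np.abs(s / np.linalg.norm(s)).sum() > c:
            lo = mid  # constraint violated: threshold harder
        else:
            hi = mid  # constraint met: try a smaller threshold
    s = soft(a, hi)
    if not s.any():  # guard against an all-zero vector
        s = soft(a, lo)
    return s / np.linalg.norm(s)

def pmd_rank1(X, c1, c2, n_iter=100):
    # alternate penalized, normalized projections of u and v
    v = np.linalg.svd(X)[2][0]  # start from the leading right singular vector
    for _ in range(n_iter):
        u = l1_unit(X @ v, c1)
        v = l1_unit(X.T @ u, c2)
    d = u @ X @ v
    return d, u, v
```

Note that [math]\displaystyle{ c_1 }[/math] must lie in [math]\displaystyle{ [1, \sqrt{n}] }[/math] (and [math]\displaystyle{ c_2 }[/math] in [math]\displaystyle{ [1, \sqrt{p}] }[/math]) for the constraints to be active yet feasible, since [math]\displaystyle{ \|\textbf{u}\|_1 \geq \|\textbf{u}\|_2 }[/math] always holds.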

The penalty functions [math]\displaystyle{ \ P_1 }[/math] and [math]\displaystyle{ \ P_2 }[/math] can be any convex penalty functions. The authors consider the lasso and fused lasso penalties, which have the following functional forms:

1. lasso: [math]\displaystyle{ P_1(\textbf{u}) = \sum_{i=1}^n |u_i| }[/math]
2. fused lasso: [math]\displaystyle{ P_1(\textbf{u}) = \sum_{i=1}^n |u_i| + \lambda \sum_{i=2}^n |u_i-u_{i-1}| }[/math], where [math]\displaystyle{ \ \lambda \gt 0 }[/math].
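The two penalties behave differently on vectors with the same number of nonzero entries: the extra term in the fused lasso also charges for jumps between successive entries, encouraging factors that are both sparse and piecewise smooth. A toy evaluation (illustrative only, not from the paper):

```python
import numpy as np

def lasso(u):
    # P(u) = sum_i |u_i|
    return np.abs(u).sum()

def fused_lasso(u, lam):
    # lasso term plus lam times the sum of |u_i - u_{i-1}| over successive entries
    return np.abs(u).sum() + lam * np.abs(np.diff(u)).sum()

u_smooth = np.array([1.0, 1.0, 1.0, 0.0, 0.0])  # one jump
u_jumpy = np.array([1.0, 0.0, 1.0, 0.0, 1.0])   # four jumps
```

Both vectors have the same lasso penalty (3.0), but the fused lasso penalizes `u_jumpy` more heavily because of its many sign changes between neighbors.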

The values of [math]\displaystyle{ \ c_1 }[/math] and [math]\displaystyle{ \ c_2 }[/math] are user-defined parameters, which can be determined by cross-validation or chosen to give a desired level of sparsity in the factors.



References

<references />