a Penalized Matrix Decomposition, with Applications to Sparse Principal Components and Canonical Correlation Analysis

From statwiki
Revision as of 18:44, 7 November 2010 by Mwinlaw (talk | contribs)
Jump to navigation Jump to search

Still under construction

Introduction

Matrix decompositions or factorizations are a useful tool in identifying the underlying structure of a matrix and the data it represents. However, many of these decompositions produce dense factors which are hard to interpret. Enforcing sparse factors gives factorizations which are more amenable to interpretation. In their paper, Witten, Tibshirani, and Hastie<ref name="WTH2009">Daniela M. Witten, Robert Tibshirani, and Trevor Hastie. (2009) "A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis". Biostatistics, 10(3):515–534.</ref> develop a penalized matrix decomposition (PMD) using penalty functions on the factors to ensure sparsity and ease of interpretation. They divide their paper into three major components. They begin by presenting their algorithm for PMD and derive efficient versions for two sets of common penalty functions. In addition, they use a particular form of their algorithm to derive a sparse version of principal component analysis (PCA). Comparing this version to two other sparse PCA methods by Jolliffe and others<ref name="JTU2003">Ian T. Jolliffe, Nickolay T. Trendafilov, and Mudassir Uddin. (2003) "A modified principal component technique based on the lasso". Journal of Computational and Graphical Statistics, 12(3):531–547. </ref> and Zou and others<ref name="ZHT2006>Hui Zou, Trevor Hastie, and Robert Tibshirani. (2006) "Sparse Principal Component Analysis". Journal of Computational and Graphical Statistics, 15(2):265–286.</ref> they show how these three methods are related. In particular, they show how their sparse PCA algorithm can be used to efficiently solve the SCoTLASS problem proposed by Jolliffe and others<ref name="JTU2003"/>; a computationally hard problem to solve in its original form. Finally, they use their PMD to yield a new method for penalized canonical correlation analysis (CCA). The main application of this procedure is to genomic data. They argue that since it is becoming increasingly common for biologists to perform multiple assays on the same set of samples there is an increased need for methods that perform inference across data sets. To this end they demonstrate their penalized CCA method on a genomic data set consisting of gene expression and DNA copy number measurements on the same set of patient samples. Using penalized CCA they can identify sets of genes that are correlated with regions of copy number change.

References

<references />