a Penalized Matrix Decomposition, with Applications to Sparse Principal Components and Canonical Correlation Analysis

From statwiki
Jump to: navigation, search


Matrix decompositions or factorizations are useful tools in identifying the underlying structure of a matrix and the data it represents. However, many of these decompositions produce dense factors which are hard to interpret. Enforcing sparsity within the factors produces factorizations which are more amenable to interpretation. In their paper, Witten, Tibshirani, and Hastie<ref name="WTH2009">Daniela M. Witten, Robert Tibshirani, and Trevor Hastie. (2009) "A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis". Biostatistics, 10(3):515–534.</ref> develop a penalized matrix decomposition (PMD) using penalty functions on the factors to ensure sparsity and ease of interpretation. They divide their paper into three major components. They begin by presenting their algorithm for PMD and derive efficient versions for two sets of common penalty functions. In addition, they use a particular form of their algorithm to derive a sparse version of principal component analysis (PCA). Comparing this version to two other sparse PCA methods by Jolliffe and others<ref name="JTU2003">Ian T. Jolliffe, Nickolay T. Trendafilov, and Mudassir Uddin. (2003) "A modified principal component technique based on the lasso". Journal of Computational and Graphical Statistics, 12(3):531–547. </ref> and Zou and others<ref name="ZHT2006">Hui Zou, Trevor Hastie, and Robert Tibshirani. (2006) "Sparse Principal Component Analysis". Journal of Computational and Graphical Statistics, 15(2):265–286.</ref> they show how these three methods are related. In particular, they show how their sparse PCA algorithm can be used to efficiently solve the SCoTLASS problem proposed by Jolliffe and others<ref name="JTU2003"/>; a computationally hard problem to solve in its original form. Finally, they use their PMD to yield a new method for penalized canonical correlation analysis (CCA). The main application of this procedure is to genomic data. They argue that since it is becoming increasingly common for biologists to perform multiple assays on the same set of samples, there is an increased need for methods that perform inference across data sets. To this end, they demonstrate their penalized CCA method on a genomic data set consisting of gene expression and DNA copy number measurements on the same set of patient samples. Using penalized CCA they can identify sets of genes that are correlated with regions of copy number change.

Penalized Matrix Decomposition

The PMD is a generalization of the singular value decomposition (SVD). The SVD of a matrix [math]\textbf{X} \in \mathbb{R}^{n \times p}[/math] with rank([math]K[/math]) [math]\leq \min(n,p)[/math] can be written as follows

[math]\textbf{X} = \textbf{U}\textbf{D}\textbf{V}^T, \; \textbf{U}^T\textbf{U} = \textbf{I}_n, \; \textbf{V}^T\textbf{V} = \textbf{I}_p, \; d_1 \geq d_2 \geq \cdots \geq d_k \gt 0. \;\; (1) [/math]

In their paper the authors assume the overall mean of [math]\textbf{X}[/math] is 0. If we let [math]\textbf{u}_k[/math] be the [math]k[/math]th column of [math]\textbf{U}[/math], [math]\textbf{v}_k[/math] the [math]k[/math]th column of [math]\textbf{V}[/math], and [math]d_k[/math] the [math]k[/math]th diagonal element of the diagonal matrix [math]\textbf{D}[/math] then the SVD of [math]\textbf{X}[/math] can also be written in the following form

[math]\textbf{X} = \sum_{k=1}^K d_k\textbf{u}_k\textbf{v}_k^T= arg \min_{\hat\textbf{X}\in M(r)}\|\textbf{X}-\hat\textbf{X}|^2_F . \;\; (2) [/math]

It can be seen that the first [math]r[/math] components of the singular value decomposition will give the best rank-[math]r[/math] approximation to the matrix.

By imposing alternative constraints on [math]\textbf{u}_k[/math] and [math]\textbf{v}_k[/math] the authors derive a penalized matrix decomposition (PMD). For simplicity, the authors begin by considering the rank-1 approximation. In this case, the PMD is the solution to the following optimization problem:

[math] \min_{d,\textbf{u},\textbf{v}} \frac{1}{2}\|\textbf{X} - d\textbf{u}\textbf{v}^T \|^2_F \; \textrm{ subject } \; \textrm{ to } \; \| \textbf{u} \|_2^2 = 1, \; \| \textbf{v} \|^2_2 = 1, \; P_1(\textbf{u}) \leq c_1, \; P_2(\textbf{v}) \leq c_2, \; d \geq 0, \;\; (3) [/math]

where the convex penalty functions, [math]P_1[/math] and [math]P_2[/math], can take on a number of different forms. The authors focus their analysis on the lasso function and fused lasso function which have the following functional forms:

  1. lasso: [math]P_1(\textbf{u}) = \sum_{i=1}^n |u_i|[/math] and
  2. fused lasso: [math]P_1(\textbf{u}) = \sum_{i=1}^n |u_i| + \lambda \sum_{i=1}^n |u_i-u_{i-1}|[/math], where [math]\lambda \gt 0[/math].

The values of [math]c_1[/math] and [math]c_2[/math] in (3) are user defined parameters which can be determined by cross-validation or can be chosen to give a desired level of sparsity in the factors.

Before solving (3), the authors simplify the objective function using the following theorem.

Theorem 1

Let [math]\textbf{U}[/math] and [math]\textbf{V}[/math] be [math]n \times K[/math] and [math]p \times K[/math] orthogonal matrices and [math]\textbf{D}[/math] a diagonal matrix with diagonal elements [math]d_k[/math]. Then

[math]\frac{1}{2}\|\textbf{X} - \textbf{U}\textbf{D}\textbf{V}^T \|^2_F = \frac{1}{2}\|\textbf{X}\|^2_F - \sum_{k=1}^K\textbf{u}_k^T\textbf{X}\textbf{v}_k{d_k} + \frac{1}{2}\sum_{k=1}^K d_k^2. \;\;\;(4) [/math]


From Theorem 1 it is clear that the values of [math]\textbf{u}[/math] and [math]\textbf{v}[/math] that solve (3) must also solve the following problem:

[math] \max_{\textbf{u},\textbf{v}} \textbf{u}^T\textbf{X}\textbf{v} \; \textrm{ subject } \; \textrm{ to } \; \| \textbf{u} \|_2^2 = 1, \; \| \textbf{v} \|^2_2 = 1, \; P_1(\textbf{u}) \leq c_1, \; P_2(\textbf{v}) \leq c_2, \;\;\; (5) [/math]

where [math]d = \textbf{u}^T\textbf{X}\textbf{v}[/math]. The goal is to transform (5) into a biconvex problem in [math]\textbf{u}[/math] and [math]\textbf{v}[/math]. Then each of the convex subproblems can be used to iteratively solve the optimization problem. The objective function [math]\textbf{u}^T\textbf{X}\textbf{v}[/math] is bilinear in [math]\textbf{u}[/math] and [math]\textbf{v}[/math]; however, each of the subproblems in not convex due to the [math]L_2[/math]-equality penalty on [math]\textbf{u}[/math] and [math]\textbf{v}[/math]. Yet, if the [math]L_2[/math]-penalty is relaxed then the (rank-1) PMD becomes

[math] \max_{\textbf{u},\textbf{v}} \textbf{u}^T\textbf{X}\textbf{v} \; \textrm{ subject } \; \textrm{ to } \; \| \textbf{u} \|_2^2 \leq 1, \; \| \textbf{v} \|^2_2 \leq 1, \; P_1(\textbf{u}) \leq c_1, \; P_2(\textbf{v}) \leq c_2, \;\;\; (6) [/math]

and with [math]\textbf{v}[/math] fixed the following convex subproblem results:

[math]\max_{\textbf{u}} \textbf{u}^T\textbf{X}\textbf{v} \; \textrm{ subject } \; \textrm{ to } \; \| \textbf{u} \|_2^2 \leq 1, \; P_1(\textbf{u}) \leq c_1, \;\;\;(7) [/math]

A similar convex sub-problem is obtained for [math]\textbf{v}[/math] with [math]\textbf{u}[/math] fixed. Thus, (6) is biconvex in [math]\textbf{u}[/math] and [math]\textbf{v}[/math] and the following iterative algorithm can be used to solve (6).

Algorithm 1: Computation of single-factor PMD

  1. Initialize [math]\textbf{v}[/math] to have [math]L_2[/math]-norm equal to 1.
  2. Iterate until convergence:
    1. [math]\textbf{u} \leftarrow \arg \max_{\textbf{u}} \textbf{u}^T\textbf{X}\textbf{v} \; \textrm{ subject } \; \textrm{ to } \; \| \textbf{u} \|_2^2 \leq 1, \; P_1(\textbf{u}) \leq c_1[/math].
    2. [math]\textbf{v} \leftarrow \arg \max_{\textbf{v}} \textbf{u}^T\textbf{X}\textbf{v} \; \textrm{ subject } \; \textrm{ to } \; \| \textbf{v} \|_2^2 \leq 1, \; P_2(\textbf{v}) \leq c_2[/math].
  3. [math]d \leftarrow \textbf{u}^T\textbf{X}\textbf{v}[/math].

Since the PMD is a generalization of the SVD, we should expect the rank-1 PMD algorithm to give the leading singular vectors, if the constraints on [math]\textbf{u}[/math] and [math]\textbf{v}[/math] imposed by the penalty functions are removed.


Without the penalty functions, Algorithm 1 reduces to the power method ,that is shown in the above equation, for computing the largest eigenvector of [math]\textbf{X}^T\textbf{X}[/math] which gives the leading singular vector of [math]\textbf{X}[/math]; thus this goal is attained by the proposed algorithm. Although the algorithm does not necessarily converge to a global optimum for the relaxed problem; the empirical studies performed by the authors indicate that the algorithm does converge to interpretable factors for appropriate choices of the penalty terms. The authors also note that the algorithm has the desirable property that at each iteration of Step 2, the objective function in the relaxed problem decreases.

Algorithm 1 can easily be extended to obtain multiple factors. This can be done by minimizing the single-factor criterion (6) repeatedly, each time using as the [math]\textbf{X}[/math] matrix the residual obtained by subtracting from the data matrix the previous factors found. This results in the following algorithm

Algorithm 2: Computation of [math]K[/math] factors of PMD

  1. Let [math]\textbf{X}^1 \leftarrow \textbf{X} [/math].
    1. Find [math]\textbf{u}_k[/math], [math]\textbf{v}_k[/math] and [math]\ d_k[/math] by applying the single-factor PMD algorithm (Algorithm 1) to data [math]\textbf{X}^k[/math].
    2. [math]\textbf{X}^1 \leftarrow \textbf{X}^k - d_k\textbf{u}_k\textbf{v}_k^T[/math].

Without the [math]P_1[/math]- and [math]P_2[/math]-penalty constraints, the [math]K[/math]-factor PMD algorithm leads to the rank-[math]K[/math] SVD of [math]\textbf{X}[/math]. However, with [math]P_1[/math] and/or [math]P_2[/math], it is important to note that the solutions are not orthogonal, unlike the SVD, since they are not in row and vector space.

Algorithms 1 and 2 are written for generic penalty functions, [math]P_1[/math] and [math]P_2[/math]. By specifying functional forms for [math]P_1[/math] and [math]P_2[/math] neither Algorithm 1 nor Algorithm 2 change, only Step 2 of Algorithm 1 must be modified to solve the specific optimization problem given by the functional forms of [math]P_1[/math] and [math]P_2[/math]. There are two specific forms that authors focus on in their analysis which they call PMD([math]L_1[/math],[math]L_1[/math]) and PMD([math]L_1[/math],FL).


The PMD([math]L_1[/math],[math]L_1[/math]) criterion is as follows:

[math] \max_{\textbf{u},\textbf{v}} \textbf{u}^T\textbf{X}\textbf{v} \; \textrm{ subject } \; \textrm{ to } \; \| \textbf{u} \|_2^2 \leq 1, \; \| \textbf{v} \|^2_2 \leq 1, \; \| \textbf{u} \|_1 \leq c_1, \; \| \textbf{v} \|_1 \leq c_2. \;\;\; (8) [/math]

This method results in factors [math]\textbf{u}[/math] and [math]\textbf{v}[/math] that are sparse for [math]c_1[/math] and [math]c_2[/math] chosen appropriately. In order to guarantee a feasible solution [math]c_1[/math] and [math]c_2[/math] must be restricted to the ranges [math]1 \leq c_1 \leq \sqrt{n}[/math] and [math]1 \leq c_2 \leq \sqrt{p}[/math]. The solution to the optimization problem in (8) is a function of the soft thresholding operator, denoted by [math]S[/math] where [math]S(a,c) = \textrm{sgn}(a)(|a| - c)_+[/math], [math]c \gt 0[/math] is a constant and [math]x_+[/math] is defined equal to [math]x[/math] if [math]x \gt 0[/math] and 0 otherwise. Using the following lemma, we can see how to define a solution to (8) in terms of [math]S[/math].

Lemma 1

Consider the optimization problem

[math]\max_{\textbf{u}} \textbf{u}^T\textbf{a} \; \textrm{ subject } \; \textrm{ to } \; \| \textbf{u} \|_2^2 \leq 1, \; \| \textbf{u} \|_1 \leq c. \;\;\; (9) [/math]

The solution satisfies [math]\textbf{u} = \frac{S(\textbf{a},\Delta)}{\|S(\textbf{a},\Delta)\|_2}[/math], with [math]\Delta = 0[/math] if this results in [math]\| \textbf{u} \|_1 \leq c[/math]; otherwise [math]\Delta[/math] is chosen so that [math]\| \textbf{u} \|_1 = c[/math].

So if [math]a = \textbf{X}\textbf{v}[/math] in Lemma 1, then to solve the PMD criterion in (8) using Algorithm 1, Steps 2(a) and 2(b) can be adjusted as follows.

  1. Iterate until convergence:
    1. [math]\textbf{u} \leftarrow \frac{S(\textbf{X}\textbf{v},\Delta_1)}{\|S(\textbf{X}\textbf{v},\Delta_1)\|_2}[/math], where [math]\Delta_1 = 0[/math] if this results in [math]\| \textbf{u} \|_1 \leq c_1[/math]; otherwise [math]\Delta_1[/math] is chosen to be a positive constant such that [math]\| \textbf{u} \|_1 = c_1[/math].
    2. [math]\textbf{v} \leftarrow \frac{S(\textbf{X}^T\textbf{u},\Delta_2)}{\|S(\textbf{X}^T\textbf{u},\Delta_2)\|_2}[/math], where [math]\Delta_2 = 0[/math] if this results in [math]\| \textbf{v} \|_1 \leq c_2[/math]; otherwise [math]\Delta_2[/math] is chosen to be a positive constant such that [math]\| \textbf{v} \|_1 = c_2[/math].

To find the values of [math]\Delta_1[/math] and [math]\Delta_2[/math] for each update of [math]\textbf{u}[/math] and [math]\textbf{v}[/math] the authors suggest using a binary search.
It can be seen that the intersection of [math]L_1[/math] and [math]L_2[/math]constraint results in both [math]\|u_1\|[/math] and [math]u_2[/math] nonzero.


The PMD([math]L_1[/math],FL) criterion is as follows (where "FL"; stands for fused lasso penalty):

[math] \max_{\textbf{u},\textbf{v}} \textbf{u}^T\textbf{X}\textbf{v} \; \textrm{ subject } \; \textrm{ to } \; \| \textbf{u} \|_2^2 \leq 1, \; \| \textbf{v} \|^2_2 \leq 1, \; \| \textbf{u} \|_1 \leq c_1, \; \sum_j|v_j| + \lambda\sum_j|v_j-v_{j-1}| \leq c_2. \;\;\; (10) [/math]

Since the constraints on [math]\textbf{u}[/math] remain the same we expect the resulting [math]\textbf{u}[/math] to be sparse and the fused lasso constraint on [math]\textbf{v}[/math] should result in [math]\textbf{v}[/math] sparse and somewhat smooth (depending on the value of [math]\lambda \geq 0[/math]). By recasting (10) using the Lagrange form, rather than the bound form of the constraints on [math]\textbf{v}[/math], we get

[math] \min_{\textbf{u},\textbf{v}} -\textbf{u}^T\textbf{X}\textbf{v} + \frac{1}{2}\textbf{v}^T\textbf{v} + \lambda_1\sum_j|v_j| + \lambda_2\sum_j|v_j-v_{j-1}| \; \textrm{ subject } \; \textrm{ to } \; \| \textbf{u} \|_2^2 \leq 1, \; \| \textbf{u} \|_1 \leq c_1. \;\;\;(11)[/math]

Solving (11) can be done by replacing Steps 2(a) and 2(b) in Algorithm 1 with the appropriate updates:

  1. Iterate until convergence:
    1. [math] \textbf{u} \leftarrow \frac{S(\textbf{X}\textbf{v},\Delta_1)}{\|S(\textbf{X}\textbf{v},\Delta_1)\|_2}[/math], where [math]\Delta_1 = 0[/math] if this results in [math]\| \textbf{u} \|_1 \leq c_1[/math]; otherwise [math]\Delta_1[/math] is chosen to be a positive constant such that [math]\| \textbf{u} \|_1 = c_1[/math].
    2. [math] \textbf{v} \leftarrow \arg \min_{\textbf{v}}\{\frac{1}{2}\|\textbf{X}^T\textbf{u}-\textbf{v}\|^2 + \lambda_1\sum_j|v_j| + \lambda_2\sum_j|v_j-v_{j-1}|\}[/math].

Step 2(b) can be performed using fast software implementing fused lasso regression as described in Friedman and others<ref name="F2007">Jerome Friedman, Trevor Hastie, Holger Hoefling, and Robert Tibshirani. (2007) "Pathwise coordinate optimization." Annals of Applied Statistics, 1:302–332.</ref> and Tibshirani and Wang<ref name ="TW2008">Robert Tibshirani and Pei Wang. (2008) "Spatial smoothing and hot spot detection for CGH data using the fused lasso." Biostatistics (Oxford, England), 9(1):18–29.</ref>.

The PMD algorithm can be used for missing value imputation which is related to SVD-based missing value imputation methods. for more information refer to <ref name="TCB2001" >TROYANSKAYA, O., CANTOR, M., SHERLOCK, G., BROWN, P., HASTIE, T., TIBSHIRANI, R., BOTSTEIN, D. AND ALTMAN, R. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics 16, 520–525.</ref>

Sparse PCA

PCA is a tool used for both dimensionality reduction and data analysis. The principal components derived in PCA are linear combination of the original variables that have maximum variance and also minimize the reconstruction error. However, these derived components are often linear combination of all the original variables. Since the principal components can be computed using the SVD, we can use the earlier SVD notation, to define [math]\textbf{u}_k[/math] as a principal component, [math]\textbf{X}[/math] as the data matrix and [math]\textbf{v}_k[/math] as a loading vector where [math]\textbf{u}_k = \textbf{X}\textbf{v}_k[/math]. The columns of the data matrix represent the original variables so it is dense nature of the loading vectors that makes interpretation of the principal components difficult and has led to the development of sparse PCA methods. The authors present two existing methods for sparse PCA and derive a new method based on the PMD. They outline the three methods and their similarities as follows:

  1. SPC: The authors propose a new sparse PCA method they label SPC. They define it using the PMD criterion with [math]P_2(\textbf{v}) = \|\textbf{v}\|_1[/math], and no [math]\ P_1[/math]-constraint on [math]\textbf{u}[/math]. The authors refer to this criterion as PMD([math]\cdot[/math],[math]L_1[/math]), and it can be written as follows:

    [math] \max_{\textbf{u},\textbf{v}} \textbf{u}^T\textbf{X}\textbf{v} \; \textrm{ subject } \; \textrm{ to } \; \|\textbf{v}\|_1 \leq c_2, \; \|\textbf{u}\|_2^2 \leq 1, \; \|\textbf{v}\|_2^2 \leq 1. \;\;\; (12) [/math]

    The algorithm for PMD([math]\cdot[/math],[math]L_1[/math]) is obtained by replacing Step 2(a) of the single-factor PMD([math]L_1[/math], [math]L_1[/math]) algorithm with [math]\textbf{u} \leftarrow \frac{\textbf{X}\textbf{v}}{\|\textbf{X}\textbf{v}\|_2}[/math].

  2. SCoTLASS: Jolliffe and others<ref name="JTU2003"/> derive the SCoTLASS procedure for sparse PCA using the maximal variance property of principal components. The first sparse principal component solves the problem

    [math] \max_{\textbf{v}} \textbf{v}^T\textbf{X}^T\textbf{X}\textbf{v} \; \textrm{ subject } \; \textrm{ to } \; \|\textbf{v} \|^2_2 \leq 1, \; \|\textbf{v} \|_1 \leq c. \;\;\; (13) [/math]

    Subsequent components solve the same problem with the additional constraint that they must be orthogonal to previous components. The authors in <ref name="WTH2009"/> argue that the SCoTLASS criterion is the simplest, most natural way to define the notation of sparse principal components; however, the problem in (13) is not convex, and therefore the computations are difficult. The authors show how SPC can easily be recast as the SCoTLASS criterion and thus argue that the efficient SPC algorithm can be used to find the first SCoTLASS component. Only the first component is the solution to the SCoTLASS criterion since the SPC method does not enforce the orthogonality constraint on subsequent components.

  3. SPCA: To obtain sparse PCA, Zou and others<ref name="ZHT2006"/> use the reconstruction error property of principal components. For a single component, their sparse PCA technique solves

    [math] \min_{\theta,\textbf{v}} \| \textbf{X} - \textbf{X}\textbf{v}\theta^T \|^2_F + \lambda_1 \|\textbf{v} \|^2_2 + \lambda_2\| \textbf{v} \|_1 \; \textrm{ subject } \; \textrm{ to } \; \| \theta \|_2 = 1, \;\;\; (14) [/math]

    where [math]\ \lambda_1[/math], [math]\lambda_2 \geq 0[/math] and [math]\textbf{v}[/math] and [math]\ \theta[/math] are [math]p[/math]-vectors. By considering the bound form of (14), relaxing the [math]L_2[/math]-constraint on [math]\theta[/math] and introducing a [math]L_1[/math]-constraint on [math]\theta[/math], the authors in <ref name="WTH2009"/> recast the SPCA problem into the same form as the SCoTLASS problem. However, since constraints were introduced to recast the SPCA problem as the SCoTLASS criterion, not all SPCA solutions will be solutions to the SCoTLASS criterion and as the authors note this implies that SPCA will give solutions with lower reconstruction error than the SCoTLASS criterion.

After introducing the three sparse PCA methods and comparing their formulations, the authors then compare the performance of SPCA and SPC by comparing the proportion of variance explained by SPC and SPCA on a publicly available gene expression data set from http://icbp.lbl.gov/breastcancer/ and described in <ref name="C2006">Koei Chin, Sandy DeVries, Jane Fridlyand, Paul T. Spellman, Ritu Roydasgupta, Wen-Lin Kuo, Anna Lapuk, Richard M. Neve, Zuwei Qian, Tom Ryder, and others. (2006) "Genomic and transcriptional aberrations linked to breast cancer pathophysiologies." Cancer Cell, 10:529–541.</ref>. Since the sparse principal components for both methods are not orthogonal, special care must be taken in determining the the proportion of variance explained by the first [math]k[/math] sparse principal components. If the principal component vectors are not orthogonal then the information content in each of the components may overlap. As well, it is incorrect to calculate the total variance as the sum of the individual component variance if the principal components are correlated. These factors are accounted for in the method proposed by Shen and Huang<ref name="SH2008">Haipeng Shen and Jianhua Z. Huang. (2008) "Sparse principal component analysis via regularized low rank matrix approximation." J. Multivar. Anal., 99(6):1015–1034</ref> which determines the proportion of variance explained by the first [math]k[/math] sparse principal components using the formula [math]\textrm{tr}(\textbf{X}_k^T\textbf{X}_k)[/math], where [math]\textbf{X}_k = \textbf{X}\textbf{V}_k(\textbf{V}_k^T\textbf{V}_k)^{-1}\textbf{V}_k^T[/math]. Using this formula, SPC results in a substantially greater proportion of variance explained than SPCA.

To complete their analysis on sparse PCA, the authors consider one additional form of the PMD algorithm. In light of the fact that SPC does not produce orthogonal components, like the SCoTLASS procedure, they develop a sparse PCA method that enforces orthogonality among the [math]\textbf{u}_k[/math] vectors (but not the [math]\textbf{v}_k[/math] vectors).

Penalized CCA via PMD

Let [math]\textbf{Z}[/math] be an [math]n \times (p+q)[/math] data matrix where [math]n[/math] is the number of observations and [math]p+q[/math] is the number of variables. Suppose the number of variables can easily be partitioned into 2 meaningful sets to give [math]\textbf{Z}=[\textbf{X} \; \textbf{Y}][/math] where [math]\textbf{X} \in \mathbb{R}^{n \times p}[/math] and [math]\textbf{Y}\in \mathbb{R}^{n \times q}[/math] and where the columns of [math]\textbf{X}[/math] and [math]\textbf{Y}[/math] are centered and scaled. Then it is natural to investigate whether a relationship between [math]\textbf{X}[/math] and [math]\textbf{Y}[/math] can be found. CCA, developed by Hotelling <ref name="H1936>Harold Hotelling. (1936) "Relations between two sets of variates." Biometrika, 28:321–377.</ref> involves finding the canonical variates, [math]\textbf{u}[/math] and [math]\textbf{v}[/math], that maximize [math]\textrm{cor}(\textbf{X}\textbf{u},\textbf{Y}\textbf{v})[/math] subject to [math]\textbf{u}^T\textbf{X}^T\textbf{X}\textbf{u} \leq 1, \; \textbf{v}^T\textbf{Y}^T\textbf{Y}\textbf{v} \leq 1[/math]. The solution can be determined from the eigenvectors of a function of the covariance matrices of [math]\textbf{X}[/math] and [math]\textbf{Y}[/math]. However, the canonical variates given by the standard CCA algorithm are not sparse. In addition, these variates are not unique if [math]p[/math] or [math]q[/math] exceeds [math]n[/math]. To find sparse linear combinations of [math]\textbf{X}[/math] and [math]\textbf{Y}[/math] that are correlated, the authors add penalty constraints to the original formulation of the CCA problem. As well, the authors assume the covariance matrices of [math]\textbf{X}[/math] and [math]\textbf{Y}[/math] are equal to [math]\textbf{I}_n[/math] since diagonalizing these matrices has been shown to yield good results in other high-dimensional problems (<ref name="DFS2001">Sandrine Dudoit, Jane Fridlyand, and Terence P. Speed. (2001) "Comparison of discrimination methods for the classification of tumors using gene expression data." Journal of American Statistical Association, 96:1151–1160.</ref>, <ref name="THNC2003">Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu. (2003) "Class prediction by nearest shrunken centroids, with applications to DNA microarrays." Statistical Sciences, 18(1):104–117.</ref>). This yields the following "diagonal penalized CCA":

[math] \max_{\textbf{u},\textbf{v}} \textbf{u}^T\textbf{X}^T\textbf{Y}\textbf{v} \; \textrm{ subject } \; \textrm{ to } \;\| \textbf{u} \|^2_2 \leq 1, \; \| \textbf{v} \|^2_2 \leq 1, \; P_1(\textbf{u}) \leq c_1, \; P_2(\textbf{v}) \leq c_2. \;\;\; (15) [/math]

If [math]\textbf{X}^T\textbf{Y}[/math] is replaced by [math]\textbf{X}[/math] then (15) becomes the original PMD formulation (6) and thus we can solve (15) using Algorithm 1. Following the authors earlier notation, this method is denoted PMD([math]A[/math],[math]B[/math]) where [math]A[/math] is the penalty on [math]\textbf{u}[/math] and [math]B[/math] is the penalty on [math]\textbf{v}[/math].

The authors demonstrate the PMD([math]L_1[/math],[math]L_1[/math]) method on a simple simulated example in which classical CCA cannot be applied since [math]p, q \gt n[/math]. Two sparse, orthogonal, latent factors are used to generate [math]\textbf{X}[/math] and [math]\textbf{Y}[/math]. For comparison, the authors also computed the SVD of [math]\textbf{X}^T\textbf{Y}[/math]. Compared to the SVD, PMD([math]L_1[/math], [math]L_1[/math]) does fairly well at identifying linear combinations of the underlying factors.

The authors use genomic data as their main motivation for introducing penalized CCA. They argue that since multiple assays are often used to characterize a single set of samples, some measure is needed to perform inference across the different data sets. As an example, they consider the breast cancer data set, publicly available at http://icbp.lbl.gov/breastcancer and described in <ref name="C2006"/>, where both CGH and gene expression data are available for a set of cancer samples. In this case, the goal is to identify a set of genes that have expression correlated with a set of chromosomal gains or losses. The version of PMD used in this case is PMD([math]L_1[/math], FL), where the [math]L_1[/math]-penalty is applied to the canonical variate corresponding to genes and the fused lasso penalty is applied to the canonical variate corresponding to copy number. PMD([math]L_1[/math], FL) is performed once for each of the 23 chromosomes and nonzero canonical variates were found for all chromosomes except for chromosome 2. It is clear that PMD([math]L_1[/math], FL) resulted in both sparsity and smoothness of the [math]\textbf{v}[/math] vectors. In order to asses whether P([math]L_1[/math],FL) is capturing real structure in the breast cancer data, the authors computed [math]p[/math]-values for the penalized canonical variates. For each chromosome, a [math]p[/math]-value for the penalized canonical variates was computed using permuted versions of [math]\textbf{X}[/math] and for every chromosome, except chromosome 2, the [math]p[/math]-values were found to be significant. As an additional test on the resulting canonical variates, the authors use a training set/test set approach. They divide the sample into a training set [math](\textbf{X}_{tr},\textbf{Y}_{tr})[/math] and test set [math](\textbf{X}_{te},\textbf{Y}_{te})[/math] where the test set contains about 3/4 of the entire sample. Using the training set, they calculate the canonical variates from penalized CCA and then compute [math]\textrm{cor}(\textbf{X}_{tr}\textbf{u}_{tr},\textbf{Y}_{tr}\textbf{v}_{tr})[/math] and [math]\textrm{cor}(\textbf{X}_{te}\textbf{u}_{tr},\textbf{Y}_{te}\textbf{v}_{tr})[/math]. They repeat this experiment multiple times drawing different training sets each time. For most chromosomes, the correlation in the test set is quite high given that the average value should be zero in the absence of a signal.


<references />