a Penalized Matrix Decomposition, with Applications to Sparse Principal Components and Canonical Correlation Analysis
Still under construction
Introduction
Matrix decompositions or factorizations are a useful tool in identifying the underlying structure of a matrix and the data it represents. However, many of these decompositions produce dense factors which are hard to interpret. Enforcing sparsity within the factors produces factorizations which are more amenable to interpretation. In their paper, Witten, Tibshirani, and Hastie<ref name="WTH2009">Daniela M. Witten, Robert Tibshirani, and Trevor Hastie. (2009) "A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis". Biostatistics, 10(3):515–534.</ref> develop a penalized matrix decomposition (PMD) using penalty functions on the factors to ensure sparsity and ease of interpretation. They divide their paper into three major components. They begin by presenting their algorithm for PMD and derive efficient versions for two sets of common penalty functions. In addition, they use a particular form of their algorithm to derive a sparse version of principal component analysis (PCA). Comparing this version to two other sparse PCA methods by Jolliffe and others<ref name="JTU2003">Ian T. Jolliffe, Nickolay T. Trendafilov, and Mudassir Uddin. (2003) "A modified principal component technique based on the lasso". Journal of Computational and Graphical Statistics, 12(3):531–547. </ref> and Zou and others<ref name="ZHT2006>Hui Zou, Trevor Hastie, and Robert Tibshirani. (2006) "Sparse Principal Component Analysis". Journal of Computational and Graphical Statistics, 15(2):265–286.</ref> they show how these three methods are related. In particular, they show how their sparse PCA algorithm can be used to efficiently solve the SCoTLASS problem proposed by Jolliffe and others<ref name="JTU2003"/>; a computationally hard problem to solve in its original form. Finally, they use their PMD to yield a new method for penalized canonical correlation analysis (CCA). The main application of this procedure is to genomic data. They argue that since it is becoming increasingly common for biologists to perform multiple assays on the same set of samples there is an increased need for methods that perform inference across data sets. To this end they demonstrate their penalized CCA method on a genomic data set consisting of gene expression and DNA copy number measurements on the same set of patient samples. Using penalized CCA they can identify sets of genes that are correlated with regions of copy number change.
Penalized Matrix Decomposition
The PMD is a generalization of the singular value decomposition (SVD). The SVD of a matrix [math]\displaystyle{ \textbf{X} \in \mathbb{R}^{n \times p} }[/math] with rank([math]\displaystyle{ K }[/math]) [math]\displaystyle{ \leq \min(n,p) }[/math]
can be written as follows
In their paper the authors assume the overall mean of [math]\displaystyle{ \textbf{X} }[/math] is 0. Alternatively, if we let [math]\displaystyle{ \textbf{u}_k }[/math] be the [math]\displaystyle{ k }[/math]th column of [math]\displaystyle{ \textbf{U} }[/math], [math]\displaystyle{ \textbf{v}_k }[/math] the [math]\displaystyle{ k }[/math]th column of [math]\displaystyle{ \textbf{V} }[/math], and [math]\displaystyle{ d_k }[/math] the [math]\displaystyle{ k }[/math]th diagonal element of the diagonal matrix [math]\displaystyle{ \textbf{D} }[/math] then the SVD of [math]\displaystyle{ \textbf{X} }[/math] can be written as
By imposing alternative constraints on [math]\displaystyle{ \textbf{u}_k }[/math] and [math]\displaystyle{ \textbf{v}_k }[/math] the authors derive a penalized matrix decomposition. For simplicity, the authors begin by considering the rank-1 approximation. In this case, the PMD is the solution to the following optimization problem:
where the convex penalty functions [math]\displaystyle{ P_1 }[/math] and [math]\displaystyle{ P_2 }[/math] can take on a number of different forms. The authors focus their analysis on the lasso function and fussed lasso function which have the following functional forms:
- lasso: [math]\displaystyle{ P_1(\textbf{u}) = \sum_{i=1}^n |u_i| }[/math] and
- fussed lasso: [math]\displaystyle{ P_1(\textbf{u}) = \sum_{i=1}^n |u_i| + \lambda \sum_{i=1}^n |u_i-u_i-1| }[/math], where [math]\displaystyle{ \lambda \gt 0 }[/math].
The values of [math]\displaystyle{ c_1 }[/math] and [math]\displaystyle{ c_2 }[/math] in (3) are user defined parameters which can be determined by cross-validation or can be chosen to give a desired level of sparsity in the factors.
Before solving (3), the authors simplify the objective function using the following theorem.
Theorem 1
Let [math]\displaystyle{ \textbf{U} }[/math] and [math]\displaystyle{ \textbf{V} }[/math] be [math]\displaystyle{ n \times K }[/math] and [math]\displaystyle{ p \times K }[/math] orthogonal matrices and [math]\displaystyle{ \textbf{D} }[/math] a diagonal matrix with diagonal elements [math]\displaystyle{ d_k }[/math]. Then
From Theorem 1 it is clear that the values of [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math] that solve (3) must also solve the following problem:
where [math]\displaystyle{ d = \textbf{u}^T\textbf{X}\textbf{v} }[/math] solves the original problem. The goal is to transform (5) into a biconvex problem in [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math]. Then each of the convex subproblems can be used to iteratively solve the original optimization problem. The objective function [math]\displaystyle{ \textbf{u}^T\textbf{X}\textbf{v} }[/math] is bilinear in [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math]; however, each of the subproblems in not convex due to the [math]\displaystyle{ L_2 }[/math]-equality penalty on [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math]. Yet, if the [math]\displaystyle{ L_2 }[/math]-penalty is relaxed then the (rank-1) PMD becomes
and with [math]\displaystyle{ \textbf{v} }[/math] fixed the following convex subproblem results:
A similar convex sub-problem is obtained for [math]\displaystyle{ \textbf{v} }[/math] with [math]\displaystyle{ \textbf{u} }[/math] fixed. Thus, (6) is biconvex in [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math] and the following iterative algorithm can be used to solve (6).
Algorithm 1: Computation of single-factor PMD model
- Initialize [math]\displaystyle{ \textbf{v} }[/math] to have [math]\displaystyle{ L_2 }[/math]-norm equal to 1.
- Iterate until convergence:
- [math]\displaystyle{ \textbf{u} \leftarrow \arg \max_{\textbf{u}} \textbf{u}^T\textbf{X}\textbf{v} \; \textrm{subject to} \; \| \textbf{u} \|_2^2 \leq 1, \; P_1(\textbf{u}) \leq c_1 }[/math].
- [math]\displaystyle{ \textbf{v} \leftarrow \arg \max_{\textbf{v}} \textbf{u}^T\textbf{X}\textbf{v} \; \textrm{subject to} \; \| \textbf{v} \|_2^2 \leq 1, \; P_2(\textbf{v}) \leq c_2 }[/math].
- [math]\displaystyle{ d \leftarrow \textbf{u}^T\textbf{X}\textbf{v} }[/math].
Since the PMD is a generalization of the SVD, we should expect the rank-1 PMD algorithm to give the leading singular vectors, if the constraints on [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math] imposed by the penalty functions are removed. Without the penalty functions, Algorithm 1 reduces to the power method for computing the largest eigenvector of [math]\displaystyle{ \textbf{X}^T\textbf{X} }[/math] which gives the leading singular vector of [math]\displaystyle{ \textbf{X} }[/math]; thus the goal is attained by the proposed algorithm. Although the algorithm does not necessarily converge to a global optimum for the relaxed problem; the empirical studies performed by the authors indicate that the algorithm does converge to interpretable factors for appropriate choices of the penalty terms. The authors also note that the algorithm has the desirable property that at each iteration of Step 2 the objective function in the relaxed problem decreases.
Algorithm 1 can easily be extended to obtain multiple factors. This can be done by minimizing the single-factor criterion (-rank1optrelax-) repeatedly, each time using as the [math]\displaystyle{ \textbf{X} }[/math] matrix the residual obtained by subtracting from the data matrix the previous factors found. This results in the following algorithm
Algorithm 2: Computation of [math]\displaystyle{ K }[/math] factors of PMD
- Let [math]\displaystyle{ \textbf{X}^1 \leftarrow \textbf{X} }[/math].
- Find [math]\displaystyle{ \textbf{u}_k }[/math], [math]\displaystyle{ \textbf{v}_k }[/math] and [math]\displaystyle{ d_k }[/math] by applying the single-factor PMD algorithm (Algorithm 1) to data [math]\displaystyle{ \textbf{X}^k }[/math].
- [math]\displaystyle{ \textbf{X}^1 \leftarrow \textbf{X}^k - d_k\textbf{u}_k\textbf{v}_k^T }[/math].
Without the [math]\displaystyle{ P_1 }[/math]- and [math]\displaystyle{ P_2 }[/math]-penalty constraints, the [math]\displaystyle{ K }[/math]-factor PMD algorithm leads to the rank-[math]\displaystyle{ K }[/math] SVD of [math]\displaystyle{ \textbf{X} }[/math]. However, with [math]\displaystyle{ P_1 }[/math] and/or [math]\displaystyle{ P_2 }[/math], it is important to note that the solutions are not orthogonal, unlike the SVD.
Algorithms 1 and 2 are written for generic penalty functions [math]\displaystyle{ P_1 }[/math] and [math]\displaystyle{ P_2 }[/math]. By specifying functional forms for [math]\displaystyle{ P_1 }[/math] and [math]\displaystyle{ P_2 }[/math] neither Algorithm 1 nor Algorithm 2 change, only Step 2 of Algorithm 1 must be modified to solve the specific optimization problem given by the functional forms of [math]\displaystyle{ P_1 }[/math] and [math]\displaystyle{ P_2 }[/math]. There are two specific forms that authors focus on in their analysis which they call PMD([math]\displaystyle{ L_1 }[/math],[math]\displaystyle{ L_1 }[/math]) and PMD([math]\displaystyle{ L_1 }[/math],FL) methods.
PMD([math]\displaystyle{ L_1 }[/math],[math]\displaystyle{ L_1 }[/math])
The PMD([math]\displaystyle{ L_1 }[/math],[math]\displaystyle{ L_1 }[/math]) criterion is as follows:
This method results in factors [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math] that are sparse for [math]\displaystyle{ c_1 }[/math] and [math]\displaystyle{ c_2 }[/math] chosen appropriately. In order to guarantee a feasible solution [math]\displaystyle{ c_1 }[/math] and [math]\displaystyle{ c_2 }[/math] must be restricted to the ranges [math]\displaystyle{ 1 \leq c_1 \leq \sqrt{n} }[/math] and [math]\displaystyle{ 1 \leq c_2 \leq \sqrt{p} }[/math]. The solution to the optimization problem in (-PMDL1L1-) is a function of the soft thresholding operator, denoted by [math]\displaystyle{ S }[/math] where [math]\displaystyle{ S(a,c) = \textrm{sgn}(a)(|a| - c)_+ }[/math], [math]\displaystyle{ c \gt 0 }[/math] is a constant and [math]\displaystyle{ x_+ }[/math] is defined equal to [math]\displaystyle{ x }[/math] if [math]\displaystyle{ x \gt 0 }[/math] and 0 otherwise. Using the following lemma we can define a solution to (-PMDL1L1-) in terms of [math]\displaystyle{ S }[/math].
lma1 Consider the optimization problem
The solution satisfies [math]\displaystyle{ \textbf{u} = \frac{S(\textbf{a},\Delta)}{\|S(\textbf{a},\Delta)\|_2} }[/math], with [math]\displaystyle{ \Delta = 0 }[/math] if this results in [math]\displaystyle{ \| \textbf{u} \|_1 \leq c }[/math]; otherwise [math]\displaystyle{ \Delta }[/math] is chosen so that [math]\displaystyle{ \| \textbf{u} \|_1 = c }[/math]. So if [math]\displaystyle{ a = \textbf{X}\textbf{v} }[/math] in Lemma -lma1-, then to solve the PMD criterion in (-PMDL1L1-) using Algorithm 1, Steps 2(a) and 2(b) can be adjusted as follows.
- Iterate until convergence:
- [math]\displaystyle{ \textbf{u} \leftarrow \frac{S(\textbf{X}\textbf{v},\Delta_1)}{\|S(\textbf{X}\textbf{v},\Delta_1)\|_2} }[/math], where [math]\displaystyle{ \Delta_1 = 0 }[/math] if this results in [math]\displaystyle{ \| \textbf{u} \|_1 \leq c_1 }[/math]; otherwise [math]\displaystyle{ \Delta_1 }[/math] is chosen to be a positive constant such that [math]\displaystyle{ \| \textbf{u} \|_1 = c_1 }[/math].
- [math]\displaystyle{ \textbf{v} \leftarrow \frac{S(\textbf{X}^T\textbf{u},\Delta_2)}{\|S(\textbf{X}^T\textbf{u},\Delta_2)\|_2} }[/math], where [math]\displaystyle{ \Delta_2 = 0 }[/math] if this results in [math]\displaystyle{ \| \textbf{v} \|_1 \leq c_2 }[/math]; otherwise [math]\displaystyle{ \Delta_2 }[/math] is chosen to be a positive constant such that [math]\displaystyle{ \| \textbf{v} \|_1 = c_2 }[/math].
To find the values of [math]\displaystyle{ \Delta_1 }[/math] and [math]\displaystyle{ \Delta_2 }[/math] for each update of [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math] the authors suggest using a binary search.
PMD([math]\displaystyle{ L_1 }[/math],FL)
The PMD([math]\displaystyle{ L_1 }[/math],FL) criterion is as follows (where ||||quot;FL||||quot; stands for ||||quot;fussed lasso||||quot; penalty):
Since the constraints on [math]\displaystyle{ \textbf{u} }[/math] remain the same we expect the resulting [math]\displaystyle{ \textbf{u} }[/math] to be sparse and the fussed lasso constraint on [math]\displaystyle{ \textbf{v} }[/math] should result in [math]\displaystyle{ \textbf{v} }[/math] sparse and somewhat smooth (depending on the value of [math]\displaystyle{ \lambda \geq 0 }[/math]). By recasting (-PMDL1FL-) using the Lagrange form, rather than the bound form on the constraints on [math]\displaystyle{ \textbf{v} }[/math] we get
Solving (-LPMDL1FL-) can be done by replacing Steps 2(a) and 2(b) in Algorithm 1 with the appropriate updates:
Algorithm 4: Computation of single-factor PMD([math]\displaystyle{ L_1 }[/math],FL) model
- Iterate until convergence:
- [math]\displaystyle{ \textbf{u} \leftarrow \frac{S(\textbf{X}\textbf{v},\Delta_1)}{\|S(\textbf{X}\textbf{v},\Delta_1)\|_2} }[/math], where [math]\displaystyle{ \Delta_1 = 0 }[/math] if this results in [math]\displaystyle{ \| \textbf{u} \|_1 \leq c_1 }[/math]; otherwise [math]\displaystyle{ \Delta_1 }[/math] is chosen to be a positive constant such that [math]\displaystyle{ \| \textbf{u} \|_1 = c_1 }[/math].
- [math]\displaystyle{ \textbf{v} \leftarrow \arg \min_{\textbf{v}}\{\frac{1}{2}\|\textbf{X}^T\textbf{u}-\textbf{v}\|^2 + \lambda_1\sum_j|v_j| + \lambda_2\sum_j|v_j-v_{j-1}|\} }[/math].
Step 2(b) can be performed using fast software implementing fused lasso regression as described in [F2007] and [TW2008].
plain
PMD([math]\displaystyle{ L_1 }[/math],[math]\displaystyle{ L_1 }[/math])
The PMD([math]\displaystyle{ L_1 }[/math],[math]\displaystyle{ L_1 }[/math]) criterion is as follows:
[math]\displaystyle{
\max_{\textbf{u},\textbf{v}} \textbf{u}^T\textbf{X}\textbf{v} \; \textrm{ subject to } \; \| \textbf{u} \|_2^2 \leq 1, \; \| \textbf{v} \|^2_2 \leq 1, \; \| \textbf{u} \|_1 \leq c_1, \; \| \textbf{v} \|_1 \leq c_2.
}[/math]
This method results in factors [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math] that are sparse for [math]\displaystyle{ c_1 }[/math] and [math]\displaystyle{ c_2 }[/math] chosen appropriately. In order to guarantee a feasible solution [math]\displaystyle{ c_1 }[/math] and [math]\displaystyle{ c_2 }[/math] must be restricted to the ranges [math]\displaystyle{ 1 \leq c_1 \leq \sqrt{n} }[/math] and [math]\displaystyle{ 1 \leq c_2 \leq \sqrt{p} }[/math]. The solution to the optimization problem in (\ref{PMDL1L1}) is a function of the soft thresholding operator, denoted by [math]\displaystyle{ S }[/math] where [math]\displaystyle{ S(a,c) = \textrm{sgn}(a)(|a| - c)_+ }[/math], [math]\displaystyle{ c \gt 0 }[/math] is a constant and [math]\displaystyle{ x_+ }[/math] is defined equal to [math]\displaystyle{ x }[/math] if [math]\displaystyle{ x \gt 0 }[/math] and 0 otherwise. The solution is given by the following lemma.
Lemma 1
Consider the optimization problem
[math]\displaystyle{ \max_{\textbf{u}} \textbf{u}^T\textbf{a} \;\; \textrm{ subject to } \;\; \| \textbf{u} \|_2^2 = 1, \; \| \textbf{u} \|_1 \leq c. }[/math]
The solution satisfies [math]\displaystyle{ \textbf{u} = \frac{S(\textbf{a},\Delta)}{\|S(\textbf{a},\Delta)\|_2} }[/math], with [math]\displaystyle{ \Delta = 0 }[/math] if this results in [math]\displaystyle{ \| \textbf{u} \|_1 \leq c }[/math]; otherwise [math]\displaystyle{ \Delta }[/math] is chosen so that [math]\displaystyle{ \| \textbf{u} \|_1 = c }[/math].
If [math]\displaystyle{ a = \textbf{X}\textbf{v} }[/math] in Lemma 1 then to solve the PMD criterion in (\ref{PMDL1L1}) using Algorithm 1, Steps 2(a) and 2(b) can be adjusted as follows.
- Iterate until convergence:
- [math]\displaystyle{ \textbf{u} \leftarrow \frac{S(\textbf{X}\textbf{v},\Delta_1)}{\|S(\textbf{X}\textbf{v},\Delta_1)\|_2} }[/math], where [math]\displaystyle{ \Delta_1 = 0 }[/math] if this results in [math]\displaystyle{ \| \textbf{u} \|_1 \leq c_1 }[/math]; otherwise [math]\displaystyle{ \Delta_1 }[/math] is chosen to be a positive constant such that [math]\displaystyle{ \| \textbf{u} \|_1 = c_1 }[/math].
- [math]\displaystyle{ \textbf{v} \leftarrow \frac{S(\textbf{X}^T\textbf{u},\Delta_2)}{\|S(\textbf{X}^T\textbf{u},\Delta_2)\|_2} }[/math], where [math]\displaystyle{ \Delta_2 = 0 }[/math] if this results in $\| \textbf{v} \|_1 \leq c_2</math>; otherwise [math]\displaystyle{ \Delta_2 }[/math] is chosen to be a positive constant such that [math]\displaystyle{ \| \textbf{v} \|_1 = c_2 }[/math].
To find the values of [math]\displaystyle{ \Delta_1 }[/math] and [math]\displaystyle{ \Delta_2 }[/math] for each update of [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math] the authors suggest using a binary search.
PMD([math]\displaystyle{ L_1 }[/math],FL)
The PMD([math]\displaystyle{ L_1 }[/math],FL) criterion is as follows (where "FL" stands for fussed lasso penalty):
[math]\displaystyle{ \max_{\textbf{u},\textbf{v}} \textbf{u}^T\textbf{X}\textbf{v} \; \textrm{subject to} \; \| \textbf{u} \|_2^2 \leq 1, \; \| \textbf{v} \|^2_2 \leq 1, \; \| \textbf{u} \|_1 \leq c_1, \; \sum_j|v_j| + \lambda\sum_j|v_j-v_{j-1}| \leq c_2. }[/math]
This method results in sparse [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math] sparse and somewhat smooth (depending on the value of [math]\displaystyle{ \lambda \geq 0 }[/math]). However, for simplicity, rather than solving the above problem the authors solve a slightly different criterion which results from using the Lagrange form, rather than the bound form of the constraints on [math]\displaystyle{ \textbf{v} }[/math].
[math]\displaystyle{ \min_{\textbf{u},\textbf{v}} -\textbf{u}^T\textbf{X}\textbf{v} + \frac{1}{2}\textbf{v}^T\textbf{v} + \lambda_1\sum_j|v_j| + \lambda_2\sum_j|v_j-v_{j-1}| \; \textrm{subject to} \; \| \textbf{u} \|_2^2 \leq 1, \; \| \textbf{u} \|_1 \leq c_1. }[/math]
We can solve this problem by replacing Steps 2(a) and 2(b) in Algorithm 1 with the appropriate updates:
- Iterate until convergence:
- [math]\displaystyle{ \textbf{u} \leftarrow \frac{S(\textbf{X}\textbf{v},\Delta_1)}{\|S(\textbf{X}\textbf{v},\Delta_1)\|_2} }[/math], where [math]\displaystyle{ \Delta_1 = 0 }[/math] if this results in [math]\displaystyle{ \| \textbf{u} \|_1 \leq c_1 }[/math]; otherwise [math]\displaystyle{ \Delta_1 }[/math] is chosen to be a positive constant such that [math]\displaystyle{ \| \textbf{u} \|_1 = c_1 }[/math].
- [math]\displaystyle{ \textbf{v} \leftarrow \arg \min_{\textbf{v}}\{\frac{1}{2}\|\textbf{X}^T\textbf{u}-\textbf{v}\|^2 + \lambda_1\sum_j|v_j| + \lambda_2\sum_j|v_j-v_{j-1}|\} }[/math].
Step 2(b) can be performed using fast software implementing fused lasso regression as described in \cite{F2007}, and \cite{TW2008}.
References
<references />