a Penalized Matrix Decomposition, with Applications to Sparse Principal Components and Canonical Correlation Analysis

From statwiki
Jump to navigation Jump to search

Still under construction

Introduction

Matrix decompositions or factorizations are a useful tool in identifying the underlying structure of a matrix and the data it represents. However, many of these decompositions produce dense factors which are hard to interpret. Enforcing sparse factors gives factorizations which are more amenable to interpretation. In their paper, Witten, Tibshirani, and Hastie<ref name="WTH2009">Daniela M. Witten, Robert Tibshirani, and Trevor Hastie. (2009) "A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis". Biostatistics, 10(3):515–534.</ref> develop a penalized matrix decomposition (PMD) using penalty functions on the factors to ensure sparsity and ease of interpretation. They divide their paper into three major components. They begin by presenting their algorithm for PMD and derive efficient versions for two sets of common penalty functions. In addition, they use a particular form of their algorithm to derive a sparse version of principal component analysis (PCA). Comparing this version to two other sparse PCA methods by Jolliffe and others<ref name="JTU2003">Ian T. Jolliffe, Nickolay T. Trendafilov, and Mudassir Uddin. (2003) "A modified principal component technique based on the lasso". Journal of Computational and Graphical Statistics, 12(3):531–547. </ref> and Zou and others<ref name="ZHT2006>Hui Zou, Trevor Hastie, and Robert Tibshirani. (2006) "Sparse Principal Component Analysis". Journal of Computational and Graphical Statistics, 15(2):265–286.</ref> they show how these three methods are related. In particular, they show how their sparse PCA algorithm can be used to efficiently solve the SCoTLASS problem proposed by Jolliffe and others<ref name="JTU2003"/>; a computationally hard problem to solve in its original form. Finally, they use their PMD to yield a new method for penalized canonical correlation analysis (CCA). The main application of this procedure is to genomic data. They argue that since it is becoming increasingly common for biologists to perform multiple assays on the same set of samples there is an increased need for methods that perform inference across data sets. To this end they demonstrate their penalized CCA method on a genomic data set consisting of gene expression and DNA copy number measurements on the same set of patient samples. Using penalized CCA they can identify sets of genes that are correlated with regions of copy number change.

Penalized Matrix Decomposition

The PMD is a generalization of the singular value decomposition (SVD) with additional constraints. The SVD of a matrix [math]\displaystyle{ \textbf{X} \in \mathbb{R}^{n \times p} }[/math] with [math]\displaystyle{ \textrm{rank}(K)\leq \min(n,p) }[/math] can be written as follows

[math]\displaystyle{ \textbf{X} = \textbf{U}\textbf{D}\textbf{V}^T, \; \textbf{U}^T\textbf{U} = \textbf{I}_n, \; \textbf{V}^T\textbf{V} = \textbf{I}_p, \; d_1 \geq d_2 \geq \cdots \geq d_k \gt 0, \;\;(1) }[/math]

where the overall mean of [math]\displaystyle{ \textbf{X} }[/math] is assumed equal to 0. Alternatively, if we let [math]\displaystyle{ \textbf{u}_k }[/math] be the [math]\displaystyle{ k }[/math]th column of [math]\displaystyle{ \textbf{U} }[/math], [math]\displaystyle{ \textbf{v}_k }[/math] the [math]\displaystyle{ k }[/math]th column of [math]\displaystyle{ \textbf{V} }[/math], and [math]\displaystyle{ \ d_k }[/math] the [math]\displaystyle{ k }[/math]th diagonal element of the diagonal matrix [math]\displaystyle{ \textbf{D} }[/math] then the SVD of [math]\displaystyle{ \textbf{X} }[/math] can be written as

[math]\displaystyle{ \textbf{X} = \sum_{k=1}^K d_k\textbf{u}_k\textbf{v}_k^T. \;\;(2) }[/math]

By imposing additional constraints on [math]\displaystyle{ \textbf{u}_k }[/math] and [math]\displaystyle{ \textbf{v}_k }[/math] the authors derive a penalized matrix decomposition. For simplicity, the authors begin by considering the rank-1 approximation. In this case, the PMD is the solution to the following optimization problem:

[math]\displaystyle{ \min_{d,\textbf{u},\textbf{v}} \frac{1}{2}\|\textbf{X} - d\textbf{u}\textbf{v}^T \|^2_F \;\; \textrm{ subject to } \;\; \| \textbf{u} \|_2^2 = 1, \; \| \textbf{v} \|^2_2 = 1, \; P_1(\textbf{u}) \leq c_1, \; P_2(\textbf{v}) \leq c_2, \; d \geq 0. \;\;\;\; (3) }[/math]

The penalty functions [math]\displaystyle{ \ P_1 }[/math] and [math]\displaystyle{ \ P_2 }[/math] are any convex functions. The authors consider the lasso function and fussed lasso function which have the following functional forms:

1. lasso: [math]\displaystyle{ P_1(\textbf{u}) = \sum_{i=1}^n |u_i| }[/math]
2. fussed lasso: [math]\displaystyle{ P_1(\textbf{u}) = \sum_{i=1}^n |u_i| + \lambda \sum_{i=1}^n |u_i-u_i-1| }[/math], where [math]\displaystyle{ \ \lambda \gt 0 }[/math].

The values of [math]\displaystyle{ \ c_1 }[/math] and [math]\displaystyle{ \ c_2 }[/math] are user defined parameters which can be determined by cross-validation or can be chosen to give a desired level of sparsity in the factors.

Before solving (3), the authors simplify the objective function using the following theorem.

Theorem 1

Let [math]\displaystyle{ \textbf{U} }[/math] and [math]\displaystyle{ \textbf{V} }[/math] be [math]\displaystyle{ n \times K }[/math] and [math]\displaystyle{ p \times K }[/math] orthogonal matrices and [math]\displaystyle{ \textbf{D} }[/math] a diagonal matrix with diagonal elements [math]\displaystyle{ \ d_k }[/math]. Then

[math]\displaystyle{ \frac{1}{2}\|\textbf{X} - \textbf{U}\textbf{D}\textbf{V}^T \|^2_F = \frac{1}{2}\|\textbf{X}\|^2_F - \sum_{k=1}^K\textbf{u}_k^T\textbf{X}\textbf{v}_k{d_k} + \frac{1}{2}\sum_{k=1}^K d_k^2. }[/math]

From Theorem 1 it is clear that the values of [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math] that solve (3) must also solve the following problem:
[math]\displaystyle{ \max_{\textbf{u},\textbf{v}} \textbf{u}^T\textbf{X}\textbf{v} \; \textrm{subject to} \; \| \textbf{u} \|_2^2 = 1, \; \| \textbf{v} \|^2_2 = 1, \; P_1(\textbf{u}) \leq c_1, \; P_2(\textbf{v}) \leq c_2, }[/math]

where [math]\displaystyle{ d = \textbf{u}^T\textbf{X}\textbf{v} }[/math] solves the original problem. The goal is to transform (5) into a biconvex problem in [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math] meaning that if [math]\displaystyle{ \textbf{u} }[/math] is fixed then the problem is convex in [math]\displaystyle{ \textbf{v} }[/math] and vice versa. Then each of the convex sub-problems can be used to iteratively solve the optimization problem. The objective function [math]\displaystyle{ \textbf{u}^T\textbf{X}\textbf{v} }[/math] is bilinear in [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math]; however, each of the sub-problems in not convex due to the [math]\displaystyle{ L_2 }[/math]-equality penalty on [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math]. Yet, if the [math]\displaystyle{ L_2 }[/math]-penalty is relaxed then the (rank-1) PMD becomes

[math]\displaystyle{ \max_{\textbf{u},\textbf{v}} \textbf{u}^T\textbf{X}\textbf{v} \; \textrm{subject to} \; \| \textbf{u} \|_2^2 \leq 1, \; \| \textbf{v} \|^2_2 \leq 1, \; P_1(\textbf{u}) \leq c_1, \; P_2(\textbf{v}) \leq c_2, }[/math]

and with [math]\displaystyle{ \textbf{v} }[/math] fixed the following convex problem results:

[math]\displaystyle{ \max_{\textbf{u}} \textbf{u}^T\textbf{X}\textbf{v} \; \textrm{subject to} \; \| \textbf{u} \|_2^2 \leq 1, \; P_1(\textbf{u}) \leq c_1, }[/math]

A similar convex sub-problem is obtained for [math]\displaystyle{ \textbf{v} }[/math] with [math]\displaystyle{ \textbf{u} }[/math] fixed. Thus, (\ref{rank1optrelax}) is biconvex in [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math] and this leads to the following iterative algorithm using the two convex sub-problems.

Algorithm 1: Computation of single-factor PMD model

  1. Initialize [math]\displaystyle{ \textbf{v} }[/math] to have [math]\displaystyle{ L_2 }[/math]-norm equal to 1.
  2. Iterate until convergence:
    1. [math]\displaystyle{ \textbf{u} \leftarrow \arg \max_{\textbf{u}} \textbf{u}^T\textbf{X}\textbf{v} \; \textrm{subject to} \; \| \textbf{u} \|_2^2 \leq 1, \; P_1(\textbf{u}) \leq c_1 }[/math].
    2. [math]\displaystyle{ \textbf{v} \leftarrow \arg \max_{\textbf{v}} \textbf{u}^T\textbf{X}\textbf{v} \; \textrm{subject to} \; \| \textbf{v} \|_2^2 \leq 1, \; P_2(\textbf{v}) \leq c_2 }[/math].
  3. [math]\displaystyle{ d \leftarrow \textbf{u}^T\textbf{X}\textbf{v} }[/math].


Since the PMD is a generalization of the SVD it is important that if the constraints on [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math] imposed by the penalty functions are removed, any algorithm used to generate the rank-1 PMD should compute the leading singular vectors. Without the penalty functions, Algorithm 1 reduces to the power method for computing the largest eigenvector of [math]\displaystyle{ \textbf{X}^T\textbf{X} }[/math]; thus the goal is attained by the proposed algorithm. Although the algorithm does not necessarily converge to a global optimum for the relaxed problem; the empirical studies performed by the authors indicate that the algorithm does converge to interpretable factors for appropriate choices of the penalty terms. The authors also note that the algorithm has the desirable property that at each iteration of Step 2 the objective function in the relaxed problem decreases.

Algorithm 1 can easily be extended to obtain multiple factors of the PMD. This can be done by minimizing the single-factor criterion (\ref{rank1optrelax}) repeatedly, each time using as the [math]\displaystyle{ \textbf{X} }[/math] matrix the residual obtained by subtracting from the data matrix the previous factors found. The algorithm is as follows

Algorithm 2: Computation of [math]\displaystyle{ K }[/math] factors of PMD

  1. Let [math]\displaystyle{ \textbf{X}^1 \leftarrow \textbf{X} }[/math].
    1. Find [math]\displaystyle{ \textbf{u}_k }[/math], [math]\displaystyle{ \textbf{v}_k }[/math] and [math]\displaystyle{ \ d_k }[/math] by applying the single-factor PMD algorithm (Algorithm 1) to data [math]\displaystyle{ \textbf{X}^k }[/math].
    2. [math]\displaystyle{ \textbf{X}^1 \leftarrow \textbf{X}^k - d_k\textbf{u}_k\textbf{v}_k^T }[/math].


Without the [math]\displaystyle{ P_1 }[/math]- and [math]\displaystyle{ P_2 }[/math]-penalty constraints, it can be shown that the [math]\displaystyle{ K }[/math]-factor PMD algorithm leads to the rank-[math]\displaystyle{ K }[/math] SVD of [math]\displaystyle{ \textbf{X} }[/math]. However, with [math]\displaystyle{ P_1 }[/math] and/or [math]\displaystyle{ P_2 }[/math], it is important to note that the solutions are not orthogonal.

Algorithms 1 and 2 are written for generic forms of the penalty functions [math]\displaystyle{ P_1 }[/math] and [math]\displaystyle{ P_2 }[/math]. By specifying specific functional forms for [math]\displaystyle{ P_1 }[/math] and [math]\displaystyle{ P_2 }[/math] neither Algorithm 1 nor Algorithm 2 change, only Step 2 of Algorithm 1 must be modified to solve the specific optimization problem given by the forms of [math]\displaystyle{ P_1 }[/math] and [math]\displaystyle{ P_2 }[/math]. There are two specific forms that authors focus on in their analysis which they denote PMD([math]\displaystyle{ L_1 }[/math],[math]\displaystyle{ L_1 }[/math]) and PMD([math]\displaystyle{ L_1 }[/math],FL).

PMD([math]\displaystyle{ L_1 }[/math],[math]\displaystyle{ L_1 }[/math])


The PMD([math]\displaystyle{ L_1 }[/math],[math]\displaystyle{ L_1 }[/math]) criterion is as follows:

[math]\displaystyle{ \max_{\textbf{u},\textbf{v}} \textbf{u}^T\textbf{X}\textbf{v} \; \textrm{ subject to } \; \| \textbf{u} \|_2^2 \leq 1, \; \| \textbf{v} \|^2_2 \leq 1, \; \| \textbf{u} \|_1 \leq c_1, \; \| \textbf{v} \|_1 \leq c_2. }[/math]

This method results in factors [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math] that are sparse for [math]\displaystyle{ c_1 }[/math] and [math]\displaystyle{ c_2 }[/math] chosen appropriately. In order to guarantee a feasible solution [math]\displaystyle{ c_1 }[/math] and [math]\displaystyle{ c_2 }[/math] must be restricted to the ranges [math]\displaystyle{ 1 \leq c_1 \leq \sqrt{n} }[/math] and [math]\displaystyle{ 1 \leq c_2 \leq \sqrt{p} }[/math]. The solution to the optimization problem in (\ref{PMDL1L1}) is a function of the soft thresholding operator, denoted by [math]\displaystyle{ S }[/math] where [math]\displaystyle{ S(a,c) = \textrm{sgn}(a)(|a| - c)_+ }[/math], [math]\displaystyle{ c \gt 0 }[/math] is a constant and [math]\displaystyle{ x_+ }[/math] is defined equal to [math]\displaystyle{ x }[/math] if [math]\displaystyle{ x \gt 0 }[/math] and 0 otherwise. The solution is given by the following lemma.

Lemma 1

Consider the optimization problem

[math]\displaystyle{ \max_{\textbf{u}} \textbf{u}^T\textbf{a} \;\; \textrm{ subject to } \;\; \| \textbf{u} \|_2^2 = 1, \; \| \textbf{u} \|_1 \leq c. }[/math]

The solution satisfies [math]\displaystyle{ \textbf{u} = \frac{S(\textbf{a},\Delta)}{\|S(\textbf{a},\Delta)\|_2} }[/math], with [math]\displaystyle{ \Delta = 0 }[/math] if this results in [math]\displaystyle{ \| \textbf{u} \|_1 \leq c }[/math]; otherwise [math]\displaystyle{ \Delta }[/math] is chosen so that [math]\displaystyle{ \| \textbf{u} \|_1 = c }[/math].


If [math]\displaystyle{ a = \textbf{X}\textbf{v} }[/math] in Lemma 1 then to solve the PMD criterion in (\ref{PMDL1L1}) using Algorithm 1, Steps 2(a) and 2(b) can be adjusted as follows.

  1. Iterate until convergence:
    1. [math]\displaystyle{ \textbf{u} \leftarrow \frac{S(\textbf{X}\textbf{v},\Delta_1)}{\|S(\textbf{X}\textbf{v},\Delta_1)\|_2} }[/math], where [math]\displaystyle{ \Delta_1 = 0 }[/math] if this results in [math]\displaystyle{ \| \textbf{u} \|_1 \leq c_1 }[/math]; otherwise [math]\displaystyle{ \Delta_1 }[/math] is chosen to be a positive constant such that [math]\displaystyle{ \| \textbf{u} \|_1 = c_1 }[/math].
    2. [math]\displaystyle{ \textbf{v} \leftarrow \frac{S(\textbf{X}^T\textbf{u},\Delta_2)}{\|S(\textbf{X}^T\textbf{u},\Delta_2)\|_2} }[/math], where [math]\displaystyle{ \Delta_2 = 0 }[/math] if this results in $\| \textbf{v} \|_1 \leq c_2</math>; otherwise [math]\displaystyle{ \Delta_2 }[/math] is chosen to be a positive constant such that [math]\displaystyle{ \| \textbf{v} \|_1 = c_2 }[/math].

To find the values of [math]\displaystyle{ \Delta_1 }[/math] and [math]\displaystyle{ \Delta_2 }[/math] for each update of [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math] the authors suggest using a binary search.

PMD([math]\displaystyle{ L_1 }[/math],FL)

The PMD([math]\displaystyle{ L_1 }[/math],FL) criterion is as follows (where "FL" stands for fussed lasso penalty):

[math]\displaystyle{ \max_{\textbf{u},\textbf{v}} \textbf{u}^T\textbf{X}\textbf{v} \; \textrm{subject to} \; \| \textbf{u} \|_2^2 \leq 1, \; \| \textbf{v} \|^2_2 \leq 1, \; \| \textbf{u} \|_1 \leq c_1, \; \sum_j|v_j| + \lambda\sum_j|v_j-v_{j-1}| \leq c_2. }[/math]

This method results in sparse [math]\displaystyle{ \textbf{u} }[/math] and [math]\displaystyle{ \textbf{v} }[/math] sparse and somewhat smooth (depending on the value of [math]\displaystyle{ \lambda \geq 0 }[/math]). However, for simplicity, rather than solving the above problem the authors solve a slightly different criterion which results from using the Lagrange form, rather than the bound form of the constraints on [math]\displaystyle{ \textbf{v} }[/math].

[math]\displaystyle{ \min_{\textbf{u},\textbf{v}} -\textbf{u}^T\textbf{X}\textbf{v} + \frac{1}{2}\textbf{v}^T\textbf{v} + \lambda_1\sum_j|v_j| + \lambda_2\sum_j|v_j-v_{j-1}| \; \textrm{subject to} \; \| \textbf{u} \|_2^2 \leq 1, \; \| \textbf{u} \|_1 \leq c_1. }[/math]

We can solve this problem by replacing Steps 2(a) and 2(b) in Algorithm 1 with the appropriate updates:

  1. Iterate until convergence:
    1. [math]\displaystyle{ \textbf{u} \leftarrow \frac{S(\textbf{X}\textbf{v},\Delta_1)}{\|S(\textbf{X}\textbf{v},\Delta_1)\|_2} }[/math], where [math]\displaystyle{ \Delta_1 = 0 }[/math] if this results in [math]\displaystyle{ \| \textbf{u} \|_1 \leq c_1 }[/math]; otherwise [math]\displaystyle{ \Delta_1 }[/math] is chosen to be a positive constant such that [math]\displaystyle{ \| \textbf{u} \|_1 = c_1 }[/math].
    2. [math]\displaystyle{ \textbf{v} \leftarrow \arg \min_{\textbf{v}}\{\frac{1}{2}\|\textbf{X}^T\textbf{u}-\textbf{v}\|^2 + \lambda_1\sum_j|v_j| + \lambda_2\sum_j|v_j-v_{j-1}|\} }[/math].

Step 2(b) can be performed using fast software implementing fused lasso regression as described in \cite{F2007}, and \cite{TW2008}.

References

<references />

References

<references />