residual Component Analysis: Generalizing PCA for more flexible inference in linear-Gaussian models

Introduction

Probabilistic principle component analysis (PPCA) is a method based on an isotropic error model to estimate the principal axes when any data vector has one or more missing values. It assumes that the values are missing at random through the data set. This means that whether a data value is missing or not does not depend on the latent variable given the observed data values. PPCA decomposes the covariance of a data vector $y$ in $\mathbb{R}^p$, into a low-rank term and a spherical noise term.
$y \sim \mathcal{N} (0, WW^T+\sigma I )$
$W \in \mathbb{R}^{p \times q}$ such that $q \lt p-1$ imposes a reduced rank structure on the covariance. The log-likelihood of the centered dataset $Y$ in $\mathbb{R}^{n \times p}$ with n data points and p features
$ln p(Y) = \sum_{j=1}^p ln \mathcal{N} (y_{i,:}|0, WW^T+\sigma^2 I)$
can be maximized<ref name="tipping1999">

Tipping, M. E. and Bishop, C.M. Probabilistic principle component analysis. Journal of the Royal Statistical Society. Series B(Statistical Methodology), 61(3):611-622,1999

</ref> with the result
$W_{ML} = U_qL_qR^T$

where $U_q$ are $q$ principle eigenvectors of the sample covariance $\tilde S$, with $\tilde S = n^{-1}Y^TY$ and $L^q$ is a diagonal matrix with elements $l_{i,i} = (\lambda_i - \sigma^2)^{1/2}$, where $\lambda_i$ is the ith eigenvalue of the sample covariance and $\sigma^2$ is the noise variance. This max-likelihood solution is rotation invariant; $R$ is an arbitrary rotation matrix. The matrix $W$ spans the principle subspace of the data and the model is known as probabilistic PCA. In face, we can get the traditional MLE by using Expectation Maximization algorithm when $\sigma$ goes into 0.

The underlying assumption of the model is that the data set can be represented by $Y = XW^T+E$ where $X$ in $\mathbb{R}^{n \times p}$ is a matrix of $q$ dimensional latent variables and $E$ is a matrix of noise variables $e_{ij} \sim \mathcal{N} (0,\sigma^2)$. The marginal log-likelihood above is obtained by placing an isotropic prior independently on the elements of $X$ with $x_{ij} \sim \mathcal{N}(0,1)$.

It is shown<ref name="lawerence2005"> Lawrence N.D. Probabilistic non-linear principle component analysis with Gaussian process latent variable models. Journal of Machine Learning . MIT Press, Cambridge, MA, 2006.

</ref> that the PCA solution is also obtained for log-likelihoods of the form
$ln p(Y) = \sum_{j=1}^p ln \mathcal{N} (y_{:,j}|0, XX^T+\sigma^2 I)$
This is recovered when we marginalize the loadings $W$, instead of latent variable $X$, with a Gaussian isotropic prior. This is the dual form of probabilistic PCA. This is analogous to the Dual form of PCA and similarly to the primal form, the max likelihood solution solves for the latent coordinates $X_{ML} = U^'_q L_qR^T$, instead of the principle subspace basis. Here, $U^'_q$ are the first $q$ principle eigenvectors of the inner product matrix $p^{-1}YY^T$ with $Lq$ define as before. Both primal and dual scenarios involve maximizing likelihoods of a similar covariance structure, namely when the covariance of the Gaussians is given by a low-rank term plus a spherical term. This paper considers a more general form
$XX^T+\Sigma$
Where $\Sigma$ is a general positive definite matrix. The log-likelihood of this general problem is given by
$ln p(Y) = \sum_{j=1}^p ln \mathcal{N} (y_{:,j}|0, XX^T+\Sigma)$......(*)
where $\Sigma = ZZ^T+ \sigma^2I$. The underlying model of this log-liklihood function can be considered as a linear mixed effect model with two factors and noise,
$Y = XW^T+ZV^T+E$
where $Z$ is a matrix of known covariates and $X$ is a matrix of latent variables.

The question this papers attempts to answer is that given $\Sigma$ how can we solve for $X$( respectively $W$), and for what values of $\Sigma$ we can formulate useful new algorithm for machine learning? This paper shows that the maximum likelihood solution for $X$ is simply based on generalized eigenvalue problem (GEP) on the sample-covariance matrix. Hence the low-rank term $XX^T$ can be optimized for general $\Sigma$. The authors call this approach residual component analysis (RCA).

Maximum likelihood RCA

Theorem: The maximum likelihood estimate of the parameter $X$ in the likelihood model in equation (*), for positive-definite and invertible $\Sigma$, is $X_{ML} = \Sigma S(D-I)^{1/2}$ where $S$ is the solution to the generalized eigenvalue problem $\frac{1}{p}YY^TS=\Sigma SD$, with its columns as the generalised eigenvectors and $D$ is diagonal with the corresponding generalized eigenvalues.

<proof>

The RCA log-likelihood is given by
$L(X,\Sigma) = - {\frac p 2} ln |K| - {\frac 1 2} tr(YY^TK^{-1})-{\frac np 2}ln(2\pi)$

Where $K=XX^T+\Sigma$.

Since $\Sigma$ is positive-deifinite we can consider the eigen-decompostion on $\Sigma$

$\Sigma = U \Lambda U^T$

where $U U^T = U^T U=I$ and $\Lambda$ is diagonal.

The projection of the covariance on to this eigen-basis, scaled by the eigenvalues gives

$\hat K = \Lambda^{-1/2}U^T K U \Lambda^{-1/2} =\Lambda^{-1/2}U^T (XX^T+\Sigma) U \Lambda^{-1/2}$.

Note that

$\Lambda^{-1/2}U^T \Sigma U \Lambda^{-1/2} = \Lambda^{-1/2}U^T U \Lambda U^T U \Lambda^{-1/2} = \Lambda^{-1/2} \Lambda \Lambda^{-1/2}=I$

Hence

$\hat K=\Lambda^{-1/2}U^TXX^TU\Lambda^{-1/2} +I$

The maximum likelihood of the RCA can be re-written as
$L(\hat X) = -(p/2)ln(|K| |\Lambda|) - (1/2) tr(\hat Y \hat Y^T \hat K^{-1})-(np/2)ln(2\pi)$

Then solve the maximum likelihood solution of for $\hat X$. Relating the stationary point of$\hat X$ to the solution for $X$and then we proceed by expressing this eigenvalue problem in terms of $YY^T$. Eventually we can recover X up to an arbitrary rotation (R, which for convenience is normally set to I), via the first q generalised eigenvectors of$(1/q)YY^T$,

$X = TL = \Sigma SL=\Sigma S(D-I)^{1/2}$

</proof>

Aside from $\Sigma$, we note a subtle difference from the PPCA solution for $W$: Whereas PPCA explicitly subtracts the noise variance from the $q$ retained principal eigenvalues, RCA implicitly incorporates any noise terms into $\Sigma$ and standardises them when it projects the total covariance onto the eigen-basis of $\Sigma$. Thus we get a reduction of unity from the retained generalised eigenvalues from the theorem. For $\Sigma=I$ the two solutions are identical.

The posterior density for the RCA probabilistic model (primal case) and $\mu_y=0$.
$x|y \sim \mathcal N (\Sigma_{x|y} W_{ML}^T \Sigma^{-1}y, \Sigma_{x|y}),$

where $\Sigma_{x|y} = (W^T_{ML} \Sigma^{-1}W_{ML} + I)^{-1}$.

Low Rank Plus Sparse Inverse

The graphical model optimised by the EM/RCA hybrid algorithm.
$y|x,z \sim \mathcal N( Wx+z, \sigma^2I),$
$x \sim \mathcal N(0,I), z \sim \mathcal N(0, \Lambda^{-1})$
where $\Lambda$ is sampled from a Laplace prior density,
$p(\Lambda) \sim exp(-\lambda \| \Lambda \|_1).$

Marginalizing $X$, yields

$log p (Y, \Lambda) = \Sigma^{n}_{i=1} log{\mathcal N(y_{i,:}|0, WW^T+\Sigma_{GL})p(\Lambda)} \ge \int q(Z)log \frac{p(Y,Z,\Lambda)}{q(Z)}dZ$

Where $q(Z)$ is the variational distribution and $\Sigma = \Lambda^{-1}+ \sigma^2I$, which we wish to optimise for some known $W$. This is an intractable problem, so instead we optimized the lower bound in an EM fashion.

E-step: Replacing $q(Z)$ with the posterior $p(Z|Y,\Lambda ')$ for a current estimate $\Lambda '$, amounts to the E-step for updating the posterior density of $\,z_n|y_n$ with
$\,cov[z|y] = ((WW^T+ \sigma^2I)^{-1} +\Lambda ')^{-1}$
$\langle z_n|y_n \rangle = cov[z_n|y_n]((WW^T+ \sigma^2I)^{-1} y_n$
$\langle z_nz_n^T \rangle = cov[z|y] + \langle z_n \rangle \langle z_n \rangle^T$

M-step: Then for fixed $Z'$, the only free parameter in the expected complete data log likelihood $Q = {\mathbb{E}}_{Z|Y} (log p(Z', \Lambda))$ is $\Lambda$.

Therefore,

$Q= \underset{\Lambda}{\text{argmax}}(｛\frac n 2｝ln |\Lambda|-{\frac 1 2} {\sum}_{i=1}^n tr(\lt z_n {z_n}^T\gt \Lambda)-{\frac n 2} \lambda \|\Lambda \|_1 )$
.

This amounts to standard GLASSO optimization with covariance matrix.

RCA-step: After one interation of EM, we update $W$ via RCA based on the newly estimated $\Lambda$,

$\,W= \Sigma S(D-I)^{1/2}$
for the generalized eigen-value problem. $\frac{1}{n}Y^TYS = \Sigma SD$ and $\, \Lambda = \Lambda^{-1}+\sigma^2I$</center>

Iterate until the lower-bound converge.

Experiments

We describe three experiments with EM/RCA and one purely with RCA analysing the residual left from a Gaussian process (GP) in a time-series.

The four experiments with EM/RCA are as following:

Experiment (1), simulation: the authors consider an artificial dataset sampled from the generative model to illustrate the effects of confounders on the estimation of the sparse-inverse covariance.

Figure (a) shows the precision-recall curve for GLASSO and EM/RCA. The EM/RCA curve shows significantly better performance than GLASSO on the confounded data, while the dashed line shows the performance of GLASSO on similarly generated data without the confounding effects $(W = 0)$. We note that EM/RCA performs better on confounded data than GLASSO on non-confounded data, because of the lower signal-to-noise ratio in the non-confounded data.

Experiment (2), reconstruction of a biological network, we applied EM/RCA on the protein-signaling data of <ref name="Sachs2008"> Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A., and Nolan, G. P. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308:523– 529, April 2008. </ref>. Figure (b), EM/RCA performs slightly better than all other methods. Figure 4 shows the reconstructed networks for recall 0.4. We note that EM/RCA is more conservative in calling edges.

Experiment (3), reconstruction of human form, the objective here is to reconstruct the underlying connectivity of a human being, given only the 3 dimensional locations of 31 sensors placed about the figures body. The aim is to construct a model which recovers connectivity between these points. EM/RCA method also showed promising result.

Figure 5 shows the comparison in the form of recall/precision curves between GLASSO and the EM/RCA implementation of a sparse-inverse plus lowrank model. As can be seen, the EM/RCA algorithm outperforms the GLASSO. The recovered stickmen of EM/RCA and GLASSO are shown in figure 6.

Experiment (4), differences in gene-expression profiles, the authors applied the RCA method to address the common challenge in data analysis is to summarize the difference between treatment and control samples. Assuming that both time-series are identical, implies $y^T = (y_1^T y_2^T)$ can be modeled by a Gaussian process (GP) with a temporal covariance function, y is distributed as N(0, K), where $K \in R^{n\times n}$ for $n=n_1+n_2$is structured such that both y1 and y2 are generated from the same function. A RBF kernel is used.

For the figure above, (a)RBF kernel computed on augmented time- input vectors of gene-expression. The kernel is computed across times $(0 : 20 : 240, 0, 20, 40, 60, 120, 180, 240)$, jointly for control and treatment. (b) shows the ROC curves of RCA and BATS variants with different noise models. We note that RCA outperforms BATS in terms the area under the ROC curve for all of its noise models.

Discussion

RCA is an algorithm for describing a low-dimensional representation of the residuals of a data set, given partial explanation by a covariance matrix $\Sigma$.The low-rank component of the model can be determined through a generalized eigenvalue problem. The paper illustrated how a treatment and a control time series could have their differences highlighted through appropriate selection of $\Sigma$(in this case we used an RBF kernel). The paper also introduced an algorithm for fitting a variant of CCA where the private spaces are explained through low dimensional latent variables.

Full covariance matrix model is often run into problem as their parameterization scales with $D^2$. This technique combined sparse-inverse covariance (as in GLASSO) with low rank (as in probabilistic PCA) approaches, and have good effect in the experiment. It was demonstrated to good effect in a motion capture and protein network example.

Another way to impose sparsity is to put a prior such as Laplace prior or Jeffrey prior on the covariance matrix in probabilistic PCA, which is closely related to Bayesian PCA.

<references />