Residual Component Analysis: Generalizing PCA for more flexible inference in linear-Gaussian models

Introduction

Probabilistic principal component analysis (PPCA) decomposes the covariance of a data vector [math]\displaystyle{ y \in \mathbb{R}^p }[/math] into a low-rank term and a spherical noise term:

[math]\displaystyle{ y \sim \mathcal{N} (0, WW^T+\sigma^2 I ) }[/math]

Here [math]\displaystyle{ W \in \mathbb{R}^{p \times q} }[/math], with [math]\displaystyle{ q \lt p-1 }[/math], imposes a reduced-rank structure on the covariance. The log-likelihood of the centered dataset [math]\displaystyle{ Y \in \mathbb{R}^{n \times p} }[/math], with n data points and p features,

[math]\displaystyle{ \ln p(Y) = \sum_{i=1}^n \ln \mathcal{N} (y_{i,:}|0, WW^T+\sigma^2 I) }[/math]

can be maximized<ref name="tipping1999">

Tipping, M. E. and Bishop, C. M. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611-622, 1999.

</ref> with the result

[math]\displaystyle{ W_{ML} = U_qL_qR^T }[/math]

where [math]\displaystyle{ U_q }[/math] contains the first [math]\displaystyle{ q }[/math] principal eigenvectors of the sample covariance [math]\displaystyle{ \tilde S = n^{-1}Y^TY }[/math], and [math]\displaystyle{ L_q }[/math] is a diagonal matrix with elements [math]\displaystyle{ l_{i,i} = (\lambda_i - \sigma^2)^{1/2} }[/math], where [math]\displaystyle{ \lambda_i }[/math] is the i-th eigenvalue of the sample covariance and [math]\displaystyle{ \sigma^2 }[/math] is the noise variance. This maximum-likelihood solution is rotation invariant: [math]\displaystyle{ R }[/math] is an arbitrary rotation matrix. The matrix [math]\displaystyle{ W }[/math] spans the principal subspace of the data, and the model is known as probabilistic PCA.
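
To make this closed form concrete, here is a minimal NumPy sketch (not from the paper; the synthetic data, the choice of [math]\displaystyle{ q }[/math], and taking [math]\displaystyle{ R=I }[/math] are illustrative assumptions) that builds [math]\displaystyle{ W_{ML} }[/math] from the eigendecomposition of the sample covariance. The noise variance is set to its maximum-likelihood value, the mean of the discarded eigenvalues, as in Tipping and Bishop.

<syntaxhighlight lang="python">
# Minimal sketch: probabilistic-PCA loadings W_ML = U_q L_q R^T from centred data.
# Everything below (data, q, R = I) is illustrative, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 500, 10, 3
Y = rng.standard_normal((n, q)) @ rng.standard_normal((q, p))  # low-rank signal
Y += 0.1 * rng.standard_normal((n, p))                         # spherical noise
Y -= Y.mean(axis=0)                                            # centre the data

S = Y.T @ Y / n                               # sample covariance, p x p
lam, U = np.linalg.eigh(S)                    # eigh returns ascending eigenvalues
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

sigma2 = lam[q:].mean()                       # ML noise variance: mean of discarded eigenvalues
U_q = U[:, :q]                                # q principal eigenvectors
L_q = np.diag(np.sqrt(lam[:q] - sigma2))      # l_ii = (lambda_i - sigma^2)^(1/2)
W_ml = U_q @ L_q                              # take R = I; any rotation gives the same likelihood

# The implied covariance matches S in the principal subspace; the difference is small.
print(np.abs(W_ml @ W_ml.T + sigma2 * np.eye(p) - S).max())
</syntaxhighlight>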

The underlying assumption of the model is that the data set can be represented as [math]\displaystyle{ Y = XW^T+E }[/math], where [math]\displaystyle{ X \in \mathbb{R}^{n \times q} }[/math] is a matrix of [math]\displaystyle{ q }[/math]-dimensional latent variables and [math]\displaystyle{ E }[/math] is a matrix of noise variables [math]\displaystyle{ e_{ij} \sim \mathcal{N} (0,\sigma^2) }[/math]. The marginal log-likelihood above is obtained by placing an independent standard-normal prior on each element of [math]\displaystyle{ X }[/math], [math]\displaystyle{ x_{ij} \sim \mathcal{N}(0,1) }[/math], and integrating [math]\displaystyle{ X }[/math] out.

It is shown<ref name="lawerence2005"> Lawrence, N. D. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783-1816, 2005.

</ref> that the PCA solution is also obtained for log-likelihoods of the form

[math]\displaystyle{ \ln p(Y) = \sum_{j=1}^p \ln \mathcal{N} (y_{:,j}|0, XX^T+\sigma^2 I) }[/math]

This form is recovered when we marginalize the loadings [math]\displaystyle{ W }[/math], instead of the latent variables [math]\displaystyle{ X }[/math], under an isotropic Gaussian prior; it is the dual form of probabilistic PCA, analogous to dual PCA. As in the primal form, the maximum-likelihood solution solves for the latent coordinates, [math]\displaystyle{ X_{ML} = U'_q L_qR^T }[/math], rather than for a basis of the principal subspace. Here [math]\displaystyle{ U'_q }[/math] contains the first [math]\displaystyle{ q }[/math] principal eigenvectors of the inner-product matrix [math]\displaystyle{ p^{-1}YY^T }[/math], with [math]\displaystyle{ L_q }[/math] defined as before. Both the primal and the dual scenario maximize a likelihood with the same covariance structure: a low-rank term plus a spherical term. This paper considers the more general form
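
A corresponding sketch of the dual solution (again with invented data, the noise variance assumed known, and [math]\displaystyle{ R=I }[/math]): the latent coordinates [math]\displaystyle{ X_{ML} }[/math] come from the top eigenvectors of the inner-product matrix [math]\displaystyle{ p^{-1}YY^T }[/math].

<syntaxhighlight lang="python">
# Dual probabilistic PCA sketch: X_ML = U'_q L_q R^T from the n x n inner-product
# matrix. Data, dimensions and sigma2 are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, p, q, sigma2 = 50, 200, 3, 0.01
Y = rng.standard_normal((n, q)) @ rng.standard_normal((q, p))  # low-rank signal
Y += np.sqrt(sigma2) * rng.standard_normal((n, p))             # noise with known variance
Y -= Y.mean(axis=0)

K = Y @ Y.T / p                               # inner-product matrix p^{-1} Y Y^T, n x n
lam, U = np.linalg.eigh(K)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

U_q = U[:, :q]                                # first q principal eigenvectors of K
L_q = np.diag(np.sqrt(lam[:q] - sigma2))      # L_q defined exactly as in the primal case
X_ml = U_q @ L_q                              # latent coordinates, up to an arbitrary rotation R
print(X_ml.shape)                             # (n, q)
</syntaxhighlight>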

[math]\displaystyle{ XX^T+\Sigma }[/math]

where [math]\displaystyle{ \Sigma }[/math] is a general positive-definite matrix. The log-likelihood of this more general problem is given by

[math]\displaystyle{ \ln p(Y) = \sum_{j=1}^p \ln \mathcal{N} (y_{:,j}|0, XX^T+\Sigma) }[/math]

where [math]\displaystyle{ \Sigma = ZZ^T+ \sigma^2I }[/math]. The underlying model of this log-likelihood can be viewed as a linear mixed-effects model with two factors plus noise,

[math]\displaystyle{ Y = XW^T+ZV^T+E }[/math]

where [math]\displaystyle{ Z }[/math] is a matrix of known covariates and [math]\displaystyle{ X }[/math] is a matrix of latent variables. Marginalizing the loadings [math]\displaystyle{ W }[/math] and [math]\displaystyle{ V }[/math] (with independent standard-normal entries) together with the noise yields the per-column covariance [math]\displaystyle{ XX^T + ZZ^T + \sigma^2 I = XX^T + \Sigma }[/math] above.
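
To see how this model gives rise to that covariance, the short simulation below (all dimensions are made up) draws [math]\displaystyle{ Y = XW^T+ZV^T+E }[/math] with standard-normal entries in [math]\displaystyle{ W }[/math] and [math]\displaystyle{ V }[/math] and checks that the empirical covariance of the columns approaches [math]\displaystyle{ XX^T + ZZ^T + \sigma^2 I }[/math].

<syntaxhighlight lang="python">
# Illustrative check (not from the paper): with w_ij, v_ij ~ N(0, 1) and noise
# variance sigma2, each column of Y = X W^T + Z V^T + E has covariance
# X X^T + Z Z^T + sigma2 * I, i.e. X X^T + Sigma.
import numpy as np

rng = np.random.default_rng(2)
n, p, q, d, sigma2 = 20, 100000, 2, 3, 0.5    # many columns so the empirical estimate is tight
X = rng.standard_normal((n, q))               # latent variables (unknown in practice)
Z = rng.standard_normal((n, d))               # known covariates
W = rng.standard_normal((p, q))               # marginalised loadings for X
V = rng.standard_normal((p, d))               # marginalised loadings for Z
E = np.sqrt(sigma2) * rng.standard_normal((n, p))

Y = X @ W.T + Z @ V.T + E
C_emp = Y @ Y.T / p                           # empirical covariance over the p columns, n x n
C_model = X @ X.T + Z @ Z.T + sigma2 * np.eye(n)
print(np.abs(C_emp - C_model).max())          # small compared with the entries of C_model
</syntaxhighlight>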


The question this paper attempts to answer is: given [math]\displaystyle{ \Sigma }[/math], how can we solve for [math]\displaystyle{ X }[/math] (respectively [math]\displaystyle{ W }[/math]), and for which choices of [math]\displaystyle{ \Sigma }[/math] can we formulate useful new machine learning algorithms? The paper shows that the maximum-likelihood solution for [math]\displaystyle{ X }[/math] is obtained from a generalized eigenvalue problem (GEP) involving the sample covariance matrix, so the low-rank term [math]\displaystyle{ XX^T }[/math] can be optimized for a general [math]\displaystyle{ \Sigma }[/math]. The authors call this approach residual component analysis (RCA).
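
A hedged sketch of that idea (not the paper's derivation; the helper rca_ml and the synthetic data below are invented for illustration): whitening the data with the Cholesky factor of [math]\displaystyle{ \Sigma }[/math] reduces the problem to dual probabilistic PCA with unit noise, and the eigenvectors involved correspond to generalized eigenvectors of the pencil formed by the sample covariance and [math]\displaystyle{ \Sigma }[/math].

<syntaxhighlight lang="python">
# Sketch: ML low-rank term X for column covariance X X^T + Sigma, via whitening.
# The function name, data and dimensions are illustrative assumptions.
import numpy as np

def rca_ml(Y, Sigma, q):
    """Return an n x q matrix X maximising sum_j ln N(y_:,j | 0, X X^T + Sigma)."""
    n, p = Y.shape
    L = np.linalg.cholesky(Sigma)                 # Sigma = L L^T
    Y_w = np.linalg.solve(L, Y)                   # whitened data: columns ~ N(0, Xw Xw^T + I)
    S_w = Y_w @ Y_w.T / p                         # transformed sample covariance L^{-1} S L^{-T}
    lam, U = np.linalg.eigh(S_w)
    order = np.argsort(lam)[::-1][:q]             # top q eigenpairs
    lam, U = lam[order], U[:, order]
    X_w = U @ np.diag(np.sqrt(np.maximum(lam - 1.0, 0.0)))   # dual PPCA with unit noise
    return L @ X_w                                # undo the whitening; rotation R taken as I

# Usage with synthetic data and Sigma = Z Z^T + sigma2 * I.
rng = np.random.default_rng(3)
n, p, q, d, sigma2 = 20, 5000, 2, 3, 0.5
X, Z = rng.standard_normal((n, q)), rng.standard_normal((n, d))
Y = X @ rng.standard_normal((p, q)).T + Z @ rng.standard_normal((p, d)).T \
    + np.sqrt(sigma2) * rng.standard_normal((n, p))
Sigma = Z @ Z.T + sigma2 * np.eye(n)
print(rca_ml(Y, Sigma, q).shape)                  # (n, q): the residual low-rank component
</syntaxhighlight>

The whitening view and the GEP view coincide: if [math]\displaystyle{ \tilde u }[/math] is an eigenvector of [math]\displaystyle{ L^{-1}SL^{-T} }[/math] with eigenvalue [math]\displaystyle{ \lambda }[/math], then [math]\displaystyle{ u = L^{-T}\tilde u }[/math] satisfies [math]\displaystyle{ Su = \lambda \Sigma u }[/math].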

Maximum likelihood RCA

Low Rank Plus Sparse Inverse

Experiments

Discussion

Full covariance models often run into problems because their parameterization scales quadratically with the data dimension. This technique combines a sparse-inverse-covariance approach (as in the graphical lasso, GLASSO) with a low-rank approach (as in probabilistic PCA), and performs well in the experiments.
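
As a rough illustration of pairing the two ideas (this is not the paper's algorithm; it assumes scikit-learn's GraphicalLasso, and the data and the regularization strength alpha are arbitrary), one can strip out a low-rank PCA component and fit a sparse inverse covariance to the residuals:

<syntaxhighlight lang="python">
# Rough illustration only: low-rank (PCA) component removed first, then a sparse
# inverse covariance (graphical lasso) fitted to the residuals. Not the paper's method.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(4)
n, p, q = 400, 15, 2
Y = rng.standard_normal((n, q)) @ rng.standard_normal((q, p))  # low-rank structure
Y += 0.3 * rng.standard_normal((n, p))                         # residual variation
Y -= Y.mean(axis=0)

# Low-rank part: project out the top-q principal directions.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
Y_residual = Y - (U[:, :q] * s[:q]) @ Vt[:q, :]

# Sparse-inverse part: graphical lasso on the residuals.
glasso = GraphicalLasso(alpha=0.05).fit(Y_residual)
print(np.count_nonzero(glasso.precision_), "nonzero entries in the", p, "x", p, "precision matrix")
</syntaxhighlight>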