probabilistic PCA with GPLVM: Difference between revisions

From statwiki
Jump to navigation Jump to search
Line 69: Line 69:
<center><math>\mathrm{KL}(N(\mathbf{z|0,S)} || N(\mathbf{z|0,K})) = \frac{1}{2} ln |\mathbf{K}| - \frac{1}{2} ln |\mathbf{S}| + \frac{1}{2} tr(\mathbf{SK^{-1}}) - \frac{N}{2}</math></center>
<center><math>\mathrm{KL}(N(\mathbf{z|0,S)} || N(\mathbf{z|0,K})) = \frac{1}{2} ln |\mathbf{K}| - \frac{1}{2} ln |\mathbf{S}| + \frac{1}{2} tr(\mathbf{SK^{-1}}) - \frac{N}{2}</math></center>


,where <math>\,S</math> is a matrix of similarities like those used in methods like [http://en.wikipedia.org/wiki/Multidimensional_scaling MDS] and [http://en.wikipedia.org/wiki/Kernel_PCA kernel PCA]. In fact, if <math>\,S = D^{-1}YY^T</math>, then this KL divergence is identical to the log-likelihood objective function <math>\,L</math> given above up to a scaling factor of <math>\,-D</math>. 
,where <math>\,S</math> is a matrix of similarities like those used in methods like [http://en.wikipedia.org/wiki/Multidimensional_scaling MDS] and [http://en.wikipedia.org/wiki/Kernel_PCA kernel PCA].  


This KL divergence thus provides a unifying objective function that covers PCA, kernel PCA, PPCA and the GPLVM.
This KL divergence provides a unifying objective function that covers PCA, kernel PCA, PPCA and the GPLVM, since when the values for <math>\,S</math> and <math>\,K</math> are appropriately chosen, it is a valid objective function for PCA, kernel PCA, classical
MDS and the GP-LVM. In the case of kernel PCA, <math>\,S</math> is a non-linear kernel and <math>\,K</math> is the linear kernel and, in the case of the GP-LVM, it is the other way around.


To solve it, the author uses gradient-based optimization, and uses deterministic dimensionality reduction to generate initial values.
To solve it, the author uses gradient-based optimization, and uses deterministic dimensionality reduction to generate initial values.

Revision as of 22:53, 4 December 2010

Overview

The development of principal component analysis (PCA) (a tutorial of it by Lindsay I. Smith can be found here), a statistical tool which may be used for dimensionality reduction, spans from the initial formulation of PCA 100 years ago to the more recent formulation of the dual problem (Dual PCA) and subsequent kernelization of the PCA algorithm (Kernel PCA). Inspired by this progression of the deterministic formulation of PCA, Neil Lawrence builds on a probabilistic PCA model (PPCA) developed by Tipping and Bishop, and proposes a novel dual formulation of PPCA and subsequently the Gaussian Process Latent Variable Model (GPLVM). A graphical representation of this can be see in Figure 1.

Fig. 1: Overview of the GPLVM Formulation

While the probabilistic approach to dimensionality reduction - latent variable model (Jeroen K. Vermunt gives details regarding this type of model here) - accomplishes the same task of dimensionality reduction, the paradigm is not entirely the same. With dimensionality reduction, we have some high-dimensional observations, and we wish to learn a functional mapping which will transform the data from high-dimensional space into a lower-dimensional, embedded space. With latent variable models, however, we specify a functional form [math]\displaystyle{ \,y = g(x) + \epsilon }[/math], where [math]\displaystyle{ \,y }[/math] is a high-dimensional original data point, [math]\displaystyle{ \,x }[/math] is the low-dimensional mapping of [math]\displaystyle{ \,y }[/math], and [math]\displaystyle{ \,\epsilon }[/math] is a noise model. We then estimate the parameters using sampling techniques or expectation maximization (Sean Borman provides a tutorial of it here). Similar to the deterministic case, the probabilistic formulation has a linear form when [math]\displaystyle{ \,g }[/math] is linear, and a nonlinear form when [math]\displaystyle{ \,g }[/math] is nonlinear. Examples of probabilistic linear latent variable modelling include Probabilistic PCA, which will be discussed next, and nonlinear techniques include Generative Topographic Modelling (GTM). It is important to note, however, that the GPLVM does not define [math]\displaystyle{ \,g }[/math] directly; instead, [math]\displaystyle{ \,g }[/math] is defined via the kernel trick (Carlos C. Rodriguez gives details regarding it here).

Probabilistic PCA

In order to understand the GPLVM, we must first consider the previous work done by Tipping and Bishop in developing the PPCA. Consider a set of centered data with [math]\displaystyle{ \,n }[/math] observations and [math]\displaystyle{ \,d }[/math] dimensions, [math]\displaystyle{ \mathbf{Y} = [y_1,...,y_n]^T }[/math].

We can view this data as having a linear relationship with some embedded latent-space data [math]\displaystyle{ \mathbf{x_n} }[/math], assuming Gaussian noise, as [math]\displaystyle{ \mathbf{y_n} = \mathbf{Wx_n} + \mathbf{n} }[/math], where [math]\displaystyle{ \,x_n }[/math] is the [math]\displaystyle{ \,q }[/math]-dimensional latent variable associated with each observation, and [math]\displaystyle{ W \in \mathbb{R}^{D\times q} }[/math] is the transformation matrix which relates the observed space to the latent space. Assuming a spherical Gaussian distribution for the noise with a mean of zero and a covariance of [math]\displaystyle{ \beta^{-1}\mathbf{I} }[/math], the likelihood for a particular observation [math]\displaystyle{ \,y_n }[/math] can be expressed as [math]\displaystyle{ p(\mathbf{y_n | x_n, W},\beta) = N(\mathbf{y_n | Wx_n},\beta^{-1}\mathbf{I}) }[/math], where [math]\displaystyle{ \,N }[/math] denotes the distribution function of a multivariate normal distribution for [math]\displaystyle{ \,y_n }[/math] with, in this case, mean [math]\displaystyle{ \,Wx_n }[/math] and covariance matrix [math]\displaystyle{ \beta^{-1}\mathbf{I} }[/math].

Thus, we define a conditional (marginal) probability model for the high dimensional data and, in doing so, specify a linear relationship between the latent space and data space, along with some Gaussian noise. Marginalizing both hidden variables together ([math]\displaystyle{ \,x_n }[/math] and [math]\displaystyle{ \,W }[/math]) is intractable, so we then take the standard approach of marginalizing [math]\displaystyle{ \,x_n }[/math], putting a Gaussian prior on [math]\displaystyle{ \mathbf{W} }[/math] and solving using maximum likelihood.

Then the marginal likelihood of [math]\displaystyle{ \,y_n }[/math] (after marginalizing [math]\displaystyle{ \,x_n }[/math]) is described as [math]\displaystyle{ p(\mathbf{y_n | W},\beta) = N(\mathbf{y_n |0, WW^{T}} +\beta^{-1}\mathbf{I}) }[/math]. Assuming that the data is i.i.d., the likelihood of the full set is the product of the individual probabilities, i.e. it is [math]\displaystyle{ p(\mathbf{Y | W},\beta) = \prod^{N}_{n=1} p(\mathbf{y_n | W},\beta) }[/math].

We can then find a solution for [math]\displaystyle{ \,W }[/math] by maximizing this probability. Tipping and Bishop did so in 1999; they found that the closed-form solution for [math]\displaystyle{ \,W }[/math] is achieved when [math]\displaystyle{ \,W }[/math] spans the principal sub-space of the data. This is the same answer that we arrive at with PCA, where we project the data onto the subspace which maximizes the variance of the data; consequently, this model can be interpreted as a probabilistic version of PCA, since the solution is obtained by maximizing a probability.

This is an acceptable formulation; however, it is only a linear model. Can it be extended to capture non-linear features? The following re-formulation as a dual and the subsequent presentation of the GPLVM will show that this is in fact possible.

Dual PPCA

Dual PPCA is formulated by optimizing the latent variable [math]\displaystyle{ \,x_n }[/math], and marginalizing [math]\displaystyle{ \mathbf{W} }[/math]. It, in essence, follows the same steps as the work done by Bishop and Tipping. In this case, we start by defining a conjugate prior, which is [math]\displaystyle{ P(\mathbf{W}) = \prod^{D}_{d=1} N(\mathbf{W_d|0,I}) }[/math], where [math]\displaystyle{ \,w_d }[/math] is the [math]\displaystyle{ \,d }[/math]th row of [math]\displaystyle{ \,W }[/math] (i.e. it is [math]\displaystyle{ \,w_{d,:} }[/math]).

Because it is straightforward due to using a conjugate prior, we marginalise [math]\displaystyle{ \mathbf{W} }[/math] rather than marginalise [math]\displaystyle{ \,X = [x_1 \dots x_N]^T }[/math]. By doing so, the resulting marginalised likelihood is [math]\displaystyle{ P(Y|X,\beta) = \prod^{D}_{d=1} p(y_{:,d}|X,\beta) }[/math], where [math]\displaystyle{ \,y_{:,d} }[/math] is the [math]\displaystyle{ \,d }[/math]th column of [math]\displaystyle{ \,Y }[/math] and [math]\displaystyle{ p(\mathbf{y_{:,d} | X},\beta) = N(y_{:,d} | 0 , XX^T + \beta^{-1}I) }[/math].

This is very similar to what we saw previously, with PPCA. Here, each dimension is a multivariate normal with a covariance specified by the dot product-matrix. Assuming each dimension is independent, we can then obtain the joint likelihood of the data matrix [math]\displaystyle{ \,Y }[/math] as a product of each dimension, and the log-likelihood objective function is then [math]\displaystyle{ L = -\frac{DN}{2}\text{ln}2\pi - \frac{D}{2}\text{ln}|\mathbf{K}| - \frac{1}{2}\text{tr}(\mathbf{K^{-1}YY^T}) }[/math], where the linear kernel [math]\displaystyle{ \,K }[/math] is [math]\displaystyle{ \mathbf{K = XX^T} + \beta^{-1}\mathbf{I} }[/math].

Gradient descent is then applied to this objective function [math]\displaystyle{ \,L }[/math]. Magnus and Neudecker, in 1999, found that the gradients of [math]\displaystyle{ \,L }[/math] with respect to [math]\displaystyle{ \,X }[/math] are [math]\displaystyle{ \,\frac{\delta L}{\delta X} = K^{-1}YY^TK^{-1}X - DK^{-1}X }[/math]. A point where the gradients are all zero can be found by setting [math]\displaystyle{ \,\frac{1}{D}YY^TK^{-1}X = X }[/math]. Solving this, a solution is obtained in the form of an eigenvalue decomposition problem [math]\displaystyle{ \mathbf{X = ULV^T} }[/math]. Here, [math]\displaystyle{ \mathbf{U} }[/math] is an [math]\displaystyle{ \,N \times q }[/math] matrix, with the columns being the first [math]\displaystyle{ \,q }[/math] eigenvectors of the matrix [math]\displaystyle{ \mathbf{YY^T} }[/math], [math]\displaystyle{ \,L }[/math] is a [math]\displaystyle{ \,q \times q }[/math] diagonal matrix whose [math]\displaystyle{ \,j }[/math]th element is [math]\displaystyle{ \,l_j = (\lambda_j - \frac{1}{\beta})^{-\frac{1}{2}} }[/math], where [math]\displaystyle{ \,\lambda_j }[/math] is the eigenvalue associated with the [math]\displaystyle{ \,j }[/math]th eigenvector of [math]\displaystyle{ \,D^{−1}YY^T }[/math] and [math]\displaystyle{ \,V }[/math] is a [math]\displaystyle{ \,q \times q }[/math] rotation matrix that is chosen arbitrarily. Thus, the solution of this likelihood function will then be the projection of the data points onto their principal subspace.


Note that in this formulation, each dimension of the marginal distribution can be interpreted as a Gaussian process, and each dimension of the joint marginal is a product of Gaussian processes. Thus, the objective function is expressed in terms of a linear kernel covariance.

The advantage of this dual formulation is two-fold:

  • The eigendecomposition is done on an [math]\displaystyle{ \,N \times q }[/math] matrix instead of a [math]\displaystyle{ \,d \times q }[/math] matrix, where [math]\displaystyle{ \,N \lt \lt d }[/math].
  • [math]\displaystyle{ \,U }[/math] is now an inner product matrix, which can be kernelized.

Through the selection of an appropriate kernel, this technique can now be used to solve a much broader range of dimensionality reduction problems instead of being restricted to those with linear features.

GP-LVM

GP-LVM can be accomplished by replacing the inner product kernel with one that allows for non-linear functions. We use Gaussian processes to map from a latent space, [math]\displaystyle{ \,X }[/math], to the observed-data space, [math]\displaystyle{ \,Y }[/math].

In the above, we used a Gaussian prior over the space of functions that are fundamentally linear (but are corrupted by Gaussian noise having variance [math]\displaystyle{ \,\beta^{−1}I }[/math]. This Gaussian prior's covariance function, or kernel, is [math]\displaystyle{ k(x_i,x_j) = x_i^Tx_j + \beta^{-1} \delta_{i,j} }[/math], where [math]\displaystyle{ \,x_i }[/math] and [math]\displaystyle{ \,x_j }[/math] are input vectors and [math]\displaystyle{ \,\delta_{i,j} }[/math] is the Kronecker delta. Indeed, if the inputs are from [math]\displaystyle{ \,X }[/math] and the covariance function is evaluated at each data point, then we would recover the covariance matrix [math]\displaystyle{ \,K = XX^T + \beta^{-1}I }[/math] whose [math]\displaystyle{ \,(i,j) }[/math]-th element is [math]\displaystyle{ \,k(x_i,x_j) }[/math] as given above.

Instead of using this linear kernel, we can consider arbitrary non-linear kernels such as the Radial Basis Function (RBF) kernel [math]\displaystyle{ k(\mathbf{x_i,x_j}) = \theta_{\mathrm{rbf}} e^{\frac{-\gamma}{2}(\mathbf{x_i - x_j)}^{T}(\mathbf{x_i - x_j)}} + \theta_{\mathrm{bias}} + \theta_{\mathrm{white}}\delta_{i,j} }[/math].

This kernel leads to a non-linear probabilistic latent variable model - the GPLVM. However, with these non-linear kernels, we no longer have a closed-form solution.

Unifying Objective Function

We must now solve a non-convex objective function. This is is based on minimizing the following Kullback-Leibler (KL) divergence between the two Gaussians which are induced by kernel matrices on the latent and data spaces:

[math]\displaystyle{ \mathrm{KL}(N(\mathbf{z|0,S)} || N(\mathbf{z|0,K})) = \frac{1}{2} ln |\mathbf{K}| - \frac{1}{2} ln |\mathbf{S}| + \frac{1}{2} tr(\mathbf{SK^{-1}}) - \frac{N}{2} }[/math]

,where [math]\displaystyle{ \,S }[/math] is a matrix of similarities like those used in methods like MDS and kernel PCA.

This KL divergence provides a unifying objective function that covers PCA, kernel PCA, PPCA and the GPLVM, since when the values for [math]\displaystyle{ \,S }[/math] and [math]\displaystyle{ \,K }[/math] are appropriately chosen, it is a valid objective function for PCA, kernel PCA, classical MDS and the GP-LVM. In the case of kernel PCA, [math]\displaystyle{ \,S }[/math] is a non-linear kernel and [math]\displaystyle{ \,K }[/math] is the linear kernel and, in the case of the GP-LVM, it is the other way around.

To solve it, the author uses gradient-based optimization, and uses deterministic dimensionality reduction to generate initial values. So, not only is this nonconvex, but each evaluation of the objective function is very costly, requiring the inversion of the kernel matrix, an operation of the order [math]\displaystyle{ O(N^3) }[/math].

To speed up the optimization process, the author presents sparsification techniques to infer which subset of the data is deemed to contain the most information.

Results

The author empirically evaluates the GPLVM on several datasets. He begins with an experiment using simulated oil data, where each point is 12 dimensional, and the points can be classified into one of three flow rates---stratified, annular and homogeneous. However, the author uses only a reduced dataset of 100 points in order to make the problem computationally tractable. If we evaluate the performance of the GPLVM against PCA through classification error, using a nearest-neighbour classifier, PCA has 14 errors, while the GPLVM has four errors.

Thus, while there is some improvement in classification, it is not remarkable. This is likely due, in part, to the fact that PCA is used to initialize the GPLVM algorithm.

The author then applies the sparsified GPLVM to all 1000 oil data points; he compares it with GTM and PCA. We see that the sparsification process required to make the GPLVM tractable degrades the performance considerably.

However, while the paper shows that the GPLVM does not provide a significantly better method for classification, other domains have already put the model to use. For example, immediately after this paper was published, other researchers in the field of animation (style-based inverse kinematics) used the GPLVM on motion capture data to interpolate natural poses.

Conclusion

The GPLVM is a novel approach to probabilistic latent embedding, which is distinct from other approaches. It is probabilistic and can therefore be used to evaluate likelihoods, handle missing data and illustrate areas of confidence in the latent embedding. Further, it is nonlinear, and can consider different types of noise and kernel types. However, solving the optimization problem is difficult. It does not learn manifolds effectively, and it is computationally expensive so it does not scale well. Ultimately, however, the paper has already proved useful, particularly for missing data problems.