regression on Manifold using Kernel Dimension Reduction: Difference between revisions

From statwiki
Jump to navigation Jump to search
No edit summary
No edit summary
Line 10: Line 10:
The purpose of Sufficient Dimension Reduction (SDR) is to find a linear subspace S such that the response vector Y is conditionally independent of the covariate vector X. More specifically, let <math>(X,B_X)</math> and <math>(Y,B_Y)</math> be measurable spaces of covariates X and response variable Y. SDR aims to find a linear subspace <math>S \subset X </math> such that <math>S</math> contains as much predictive information about the response <math>Y</math> as the original covariate space. As seen before in (Fukumizu, K., Bach, F. R., & Jordan, M. I. (2004))<ref>Fukumizu, K., Bach, F. R., & Jordan, M. I. (2004):Kernel Dimensionality Reduction for Supervised Learning</ref> this can be written more formally as a conditional independence assertion.
The purpose of Sufficient Dimension Reduction (SDR) is to find a linear subspace S such that the response vector Y is conditionally independent of the covariate vector X. More specifically, let <math>(X,B_X)</math> and <math>(Y,B_Y)</math> be measurable spaces of covariates X and response variable Y. SDR aims to find a linear subspace <math>S \subset X </math> such that <math>S</math> contains as much predictive information about the response <math>Y</math> as the original covariate space. As seen before in (Fukumizu, K., Bach, F. R., & Jordan, M. I. (2004))<ref>Fukumizu, K., Bach, F. R., & Jordan, M. I. (2004):Kernel Dimensionality Reduction for Supervised Learning</ref> this can be written more formally as a conditional independence assertion.


<math>Y \perp B^T X | B^T X </math>.  
<math>Y \perp B^T X | B^T X </math>  <math> \Longleftrightarrow Y \perp (X - B^T X) | B^T X </math>. <ref>[http://www.machinelearning.org/proceedings/icml2007/papers/491.pdf] Jen Nilsson, Fei Sha, Michael I. Jordan, Regression on Manifold using Kernel Dimension Reduction, 2007 - cs.utah.edu
</ref>
   
   
The above statement says that <math>S \subset X</math> such that the conditional probability density function <math>p_{Y|X}(y|x)\,</math> is preserved in the sense that <math>p_{Y|X}(y|x) = p_{Y|B^T X}(y|b^T x)\,</math> for all <math>x \in X \,</math> and <math>y \in Y \,</math>, where <math>B^T X\,</math> is the orthogonal projection of <math>X\,</math> onto <math>S\,</math>. The subspace <math>S\,</math> is referred to as a ''dimension reduction subspace''. Note that <math>S\,</math> is not unique.  
The above statement says that <math>S \subset X</math> such that the conditional probability density function <math>p_{Y|X}(y|x)\,</math> is preserved in the sense that <math>p_{Y|X}(y|x) = p_{Y|B^T X}(y|b^T x)\,</math> for all <math>x \in X \,</math> and <math>y \in Y \,</math>, where <math>B^T X\,</math> is the orthogonal projection of <math>X\,</math> onto <math>S\,</math>. The subspace <math>S\,</math> is referred to as a ''dimension reduction subspace''. Note that <math>S\,</math> is not unique.  
Line 78: Line 79:




One thing to note about the theorem specifically is that it doesn't impose any strong assumptions on the distribution of X, Y or their marginal distribution <math>\mathbb{P}</math>(Y|X). (See Fukumizu et. al. 2006). This theorem leads to the new algorithm for estimating the central subspace characterized by '''B'''.  
One thing to note about the theorem specifically is that it doesn't impose any strong assumptions on the distribution of X, Y or their marginal distribution <math>\mathbb{P}</math>(Y|X). (See Fukumizu et. al. 2006). This theorem leads to the new algorithm for estimating the central subspace characterized by '''B'''. Let <math>\{x_i,y_i\}_{i=1}^{N}</math> denote the N samples from the joint distribution of <math>\mathbb{P}</math>(X, Y) and let <math>K_Y \in \mathbb{R}^{N \times N}</math> and <math>K_Z \in \mathbb{R}^{N \times N}</math> denote the Gram matrices comuted over  {y_<sub>i</sub>} and {z<sub>i</sub> = B<sup>T</sup> x<sub>i</sub>}. Then Fukumizu et al. (2006) <ref>Fukumizu, K., Bach, F. R., & Jordan, M. I. (2006). Kernel dimension reduction in regression ''(Technical Report)''. Department of Statistics, University of California, Berkeley.</ref> show that this problem can be formulated as
 
min Tr <math>\mathbb{[} K_Y^C(K_Z^C + N \in I) \mathbb{]}</math>
 
such that  <math>B^T B = I</math>


===Manifold Learning===
===Manifold Learning===

Revision as of 00:51, 28 July 2009

An Algorithm for finding a new linear map for dimension reduction.

Introduction

This paper <ref>[1] Jen Nilsson, Fei Sha, Michael I. Jordan, Regression on Manifold using Kernel Dimension Reduction, 2007 - cs.utah.edu </ref>introduces a new algorithm for discovering a manifold that best preserves the information relevant to a non-linear regression. The approach introduced by the authors involves combining the machinery of Kernel Dimension Reduction (KDR) with Laplacian Eigenmaps by optimizing the cross-covariance operators in kernel feature space.

Two main challenges that we usually come across in supervised learning are making a choice of manifold to represent the covariance vector and to choose a function to represent the boundary for classification (i.e. regression surface). As a result of these two complexities, most of the research in supervised learning has been focused on learning linear manifolds. The authors introduce a new algorithm that makes use of methodologies developed in Sufficient Dimension Reduction (SDR) and Kernel Dimension Reduction (KDR). The algorithm is called Manifold Kernel Dimension Reduction (mKDR).

Sufficient Dimension Reduction

The purpose of Sufficient Dimension Reduction (SDR) is to find a linear subspace S such that the response vector Y is conditionally independent of the covariate vector X. More specifically, let [math]\displaystyle{ (X,B_X) }[/math] and [math]\displaystyle{ (Y,B_Y) }[/math] be measurable spaces of covariates X and response variable Y. SDR aims to find a linear subspace [math]\displaystyle{ S \subset X }[/math] such that [math]\displaystyle{ S }[/math] contains as much predictive information about the response [math]\displaystyle{ Y }[/math] as the original covariate space. As seen before in (Fukumizu, K., Bach, F. R., & Jordan, M. I. (2004))<ref>Fukumizu, K., Bach, F. R., & Jordan, M. I. (2004):Kernel Dimensionality Reduction for Supervised Learning</ref> this can be written more formally as a conditional independence assertion.

[math]\displaystyle{ Y \perp B^T X | B^T X }[/math] [math]\displaystyle{ \Longleftrightarrow Y \perp (X - B^T X) | B^T X }[/math]. <ref>[2] Jen Nilsson, Fei Sha, Michael I. Jordan, Regression on Manifold using Kernel Dimension Reduction, 2007 - cs.utah.edu </ref>

The above statement says that [math]\displaystyle{ S \subset X }[/math] such that the conditional probability density function [math]\displaystyle{ p_{Y|X}(y|x)\, }[/math] is preserved in the sense that [math]\displaystyle{ p_{Y|X}(y|x) = p_{Y|B^T X}(y|b^T x)\, }[/math] for all [math]\displaystyle{ x \in X \, }[/math] and [math]\displaystyle{ y \in Y \, }[/math], where [math]\displaystyle{ B^T X\, }[/math] is the orthogonal projection of [math]\displaystyle{ X\, }[/math] onto [math]\displaystyle{ S\, }[/math]. The subspace [math]\displaystyle{ S\, }[/math] is referred to as a dimension reduction subspace. Note that [math]\displaystyle{ S\, }[/math] is not unique.

We can define a minimal subspace as the intersection of all dimension reduction subspaces [math]\displaystyle{ S\, }[/math]. However, a minimal subspaces will not necessarily satisfy the conditional independence assertion specified above. But when it does, it is referred to as the central subspace.

This is one of the primary goals of the method i.e. to find a central subspace. Several approaches have been introduced in the past, mostly based on inverse regression (Li, 1991)<ref>Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316–327</ref> (Li, 1992)<ref>On principal Hessian directions for data visualization and dimension reduction: Another application of Stein’s lemma. Journal of the American Statistical Association, 86, 316–342.</ref>. The main intuition behind this approach is to find [math]\displaystyle{ \mathbb{E[} X|Y \mathbb{]} }[/math] becuase if the the forward regression model [math]\displaystyle{ P(X|Y) }[/math] is concenterated in a subspace of [math]\displaystyle{ X }[/math], then [math]\displaystyle{ \mathbb{E[} X|Y \mathbb{]} }[/math] should also lie in [math]\displaystyle{ X }[/math] (See Li. 1991 for more details<ref>Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316–327</ref>). Unfortunately, such an approach proposes a difficulty of making strong assumptions on the distribution of X (e.g. the distribution should be elliptical) and the methods of inverse regression fail if such assumptions are not satisfied. In order to overcome this problem, the authors turn to the description of KDR, i.e. an approach to SDR which does not make such strong assumptions.


Kernel Dimension Reduction

The framework for Kernel Dimension Reduction was primarily described by Kenji Fukumizu <ref>Fukumizu, K., Bach, F. R., & Jordan, M. I. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5, 73–99.</ref><ref>Fukumizu, K., Bach, F. R., & Jordan, M. I. (2006). Kernel dimension reduction in regression (Technical Report). Department of Statistics, University of California, Berkeley.</ref>. The key idea behind KDR is to map random variables X and Y to Reproducing Kernel Hilbert Spaces (RHKS). Before going ahead, we make some preliminary definitions.


[math]\displaystyle{ \mathbb{D} }[/math]:- Reproducing Kernel Hilbert Space A Hilbert space is a (possibly infinite dimension) inner product space that is a complete metric space. Elements of a Hilbert space may be functions. A reproducing kernel Hilbert space is a Hilbert space of functions on some set [math]\displaystyle{ T }[/math] such that there exists a function [math]\displaystyle{ K }[/math] (known as the reproducing kernel) on [math]\displaystyle{ T \times T }[/math], where for any [math]\displaystyle{ t \in T }[/math], [math]\displaystyle{ K( \cdot , t ) }[/math] is in the RKHS.


[math]\displaystyle{ \mathbb{D} }[/math]:- Cross-Covariance Operators Let [math]\displaystyle{ ({ H}_1, k_1) }[/math] and [math]\displaystyle{ ({H}_2, k_2) }[/math] be RKHS over [math]\displaystyle{ (\Omega_1, { B}_1) }[/math] and [math]\displaystyle{ (\Omega_2, {B}_2) }[/math], respectively, with [math]\displaystyle{ k_1 }[/math] and [math]\displaystyle{ k_2 }[/math] measurable. For a random vector [math]\displaystyle{ (X, Y) }[/math] on [math]\displaystyle{ \Omega_1 \times \Omega_2 }[/math]. Using the Reisz representation theorem, one may show that there exists a unique operator [math]\displaystyle{ \Sigma_{YX} }[/math] from [math]\displaystyle{ H_1 }[/math] to [math]\displaystyle{ H_2 }[/math] such that
[math]\displaystyle{ \lt g, \Sigma_{YX} f\gt _{H_2} = \mathbb{E}_{XY} [f(X)g(Y)] - \mathbb{E}[f(X)]\mathbb{E}[g(Y)] }[/math]
holds for all [math]\displaystyle{ f \in H_1 }[/math] and [math]\displaystyle{ g \in H_2 }[/math], which is called the cross-covariance operator.


[math]\displaystyle{ \mathbb{D} }[/math]:- Condtional Covariance Operators Let [math]\displaystyle{ (H_1, k_1) }[/math] and [math]\displaystyle{ (H_2, k_2) }[/math] be RKHS on [math]\displaystyle{ \Omega_1 \times \Omega_2 }[/math], and let [math]\displaystyle{ (X,Y) }[/math] be a random vector on measurable space [math]\displaystyle{ \Omega_1 \times \Omega_2 }[/math]. The conditonal cross-covariance operator of [math]\displaystyle{ (Y,Y) }[/math] given [math]\displaystyle{ X }[/math] is defined by
[math]\displaystyle{ \Sigma_{YY|x}: = \Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY} }[/math].


[math]\displaystyle{ \mathbb{D} }[/math] :- Conditional Covariance Operators and Condtional Indpendence Let [math]\displaystyle{ (H_{11}, k_{11}) }[/math], [math]\displaystyle{ (H_{12},k_{12}) }[/math] and [math]\displaystyle{ (H_2, k_2) }[/math] be RKHS on measurable space [math]\displaystyle{ \Omega_{11} }[/math], [math]\displaystyle{ \Omega_{12} }[/math] and [math]\displaystyle{ \Omega_2 }[/math], respectively, with continuous and bounded kernels. Let [math]\displaystyle{ (X,Y)=(U,V,Y) }[/math] be a random vector on [math]\displaystyle{ \Omega_{11}\times \Omega_{12} \times \Omega_{2} }[/math], where [math]\displaystyle{ X = (U,V) }[/math], and let [math]\displaystyle{ H_1 = H_{11} \otimes H_{12} }[/math] be the dirct product. It is assume that [math]\displaystyle{ \mathbb{E}_{Y|U} [g(Y)|U= \cdot] \in H_{11} }[/math] and [math]\displaystyle{ \mathbb{E}_{Y|X} [g(Y)|X= \cdot] \in H_{1} }[/math] for all [math]\displaystyle{ g \in H_2 }[/math]. Then we have
[math]\displaystyle{ \Sigma_{YY|U} \ge \Sigma_{YY|X} }[/math],
where the inequality refers to the order of self-adjoint operators. Furthermore, if [math]\displaystyle{ H_2 }[/math] is probability-deremining,
[math]\displaystyle{ \Sigma_{YY|X} = \Sigma_{YY|U} \Leftrightarrow Y \perp X|U }[/math].
Therefore, the effective subspace S can be found by minimizing the following function:
[math]\displaystyle{ \min_S\quad \Sigma_{YY|U} }[/math],[math]\displaystyle{ s.t. \quad U = \Pi_S X }[/math].

Note here for
[math]\displaystyle{ \Sigma_{YY|U} \ge \Sigma_{YY|X} }[/math],
in the sense of operator, the inequality means the variance of [math]\displaystyle{ Y }[/math] given data [math]\displaystyle{ U }[/math] is bigger than the variance of [math]\displaystyle{ Y }[/math] given data [math]\displaystyle{ X }[/math], which makes sense that [math]\displaystyle{ U }[/math] is just a part of the whole data [math]\displaystyle{ X }[/math].


Now that we have defined cross-covariance operators, we are finally ready to link the cross covariance operators to the central subspace. Consider any subspace [math]\displaystyle{ S \in X }[/math]. Then we can map this subspace to a RKHS [math]\displaystyle{ H_S }[/math] with a kernel function [math]\displaystyle{ K_S }[/math]. Furthermore, we define the conditional cross covariance operator as [math]\displaystyle{ \Sigma_{YY|S} }[/math] as if we were to regress [math]\displaystyle{ Y }[/math] on [math]\displaystyle{ S }[/math]. Then, intuitively, the residual error from [math]\displaystyle{ \Sigma_{YY|S} }[/math] should be greater than that from [math]\displaystyle{ \Sigma_{YY|S} }[/math]. Fukumizu etl al. (2006) <ref>Fukumizu, K., Bach, F. R., & Jordan, M. I. (2006). Kernel dimension reduction in regression (Technical Report). Department of Statistics, University of California, Berkeley.</ref> formalized that it would be trues unless [math]\displaystyle{ S }[/math] contains the central subspace. The intuition is formalized in the following theorem.

[math]\displaystyle{ \mathfrak{Theorem 1:-} }[/math] Suppose [math]\displaystyle{ Z = B^T B X \in S }[/math] where [math]\displaystyle{ B \in \mathbb{R}^{D \times d} }[/math] is a projection matrix such that [math]\displaystyle{ B^T B }[/math] is an identity matrix. Further assume Gaussian RBF kernels for [math]\displaystyle{ K_X, K_Y, and K_S }[/math]. Then

-) [math]\displaystyle{ \Sigma_{YY|X} \prec \Sigma_{YY|Z} }[/math] where [math]\displaystyle{ \prec }[/math] stands for "less than or equal to" in some operator partial ordering.

-) [math]\displaystyle{ \Sigma_{YY|X} = \Sigma_{YY|Z} }[/math] if and only if [math]\displaystyle{ Y \bot (X - B^T X)|B^T X }[/math], that is, [math]\displaystyle{ S }[/math] is a central subspace


One thing to note about the theorem specifically is that it doesn't impose any strong assumptions on the distribution of X, Y or their marginal distribution [math]\displaystyle{ \mathbb{P} }[/math](Y|X). (See Fukumizu et. al. 2006). This theorem leads to the new algorithm for estimating the central subspace characterized by B. Let [math]\displaystyle{ \{x_i,y_i\}_{i=1}^{N} }[/math] denote the N samples from the joint distribution of [math]\displaystyle{ \mathbb{P} }[/math](X, Y) and let [math]\displaystyle{ K_Y \in \mathbb{R}^{N \times N} }[/math] and [math]\displaystyle{ K_Z \in \mathbb{R}^{N \times N} }[/math] denote the Gram matrices comuted over {y_i} and {zi = BT xi}. Then Fukumizu et al. (2006) <ref>Fukumizu, K., Bach, F. R., & Jordan, M. I. (2006). Kernel dimension reduction in regression (Technical Report). Department of Statistics, University of California, Berkeley.</ref> show that this problem can be formulated as

min Tr [math]\displaystyle{ \mathbb{[} K_Y^C(K_Z^C + N \in I) \mathbb{]} }[/math]

such that [math]\displaystyle{ B^T B = I }[/math]

Manifold Learning

(this section will be updated shortly)

Manifold Kernel Dimension Reduction

(this section will be updated shortly)

Further Research

(this section will be updated shortly)

References

(this section will be updated shortly)