Regression on Manifold using Kernel Dimension Reduction
An algorithm for finding a new linear map for dimension reduction.
Introduction
This paper<ref>Jens Nilsson, Fei Sha, Michael I. Jordan (2007): Regression on Manifold using Kernel Dimension Reduction.</ref> introduces a new algorithm for discovering a manifold that best preserves the information relevant to a non-linear regression. The approach introduced by the authors combines the machinery of Kernel Dimension Reduction (KDR) with Laplacian Eigenmaps by optimizing cross-covariance operators in kernel feature space.
Two main challenges commonly encountered in supervised learning are the choice of a manifold to represent the covariate vectors and the choice of a function to represent the decision boundary (or regression surface). Because of these two complexities, most research in supervised dimension reduction has focused on learning linear manifolds. The authors introduce a new algorithm that builds on methodologies developed for Sufficient Dimension Reduction (SDR) and Kernel Dimension Reduction (KDR). The algorithm is called Manifold Kernel Dimension Reduction (mKDR).
Sufficient Dimension Reduction
The purpose of Sufficient Dimension Reduction (SDR) is to find a linear subspace S of the covariate space such that the response Y is conditionally independent of the covariates X given the projection of X onto S. More specifically, let [math]\displaystyle{ (X,B_X) }[/math] and [math]\displaystyle{ (Y,B_Y) }[/math] be measurable spaces of the covariates X and the response variable Y. SDR aims to find a linear subspace [math]\displaystyle{ S \subset X }[/math] such that [math]\displaystyle{ S }[/math] contains as much predictive information about the response [math]\displaystyle{ Y }[/math] as the original covariate space. As seen before in (Fukumizu, K., Bach, F. R., & Jordan, M. I. (2004))<ref>Fukumizu, K., Bach, F. R., & Jordan, M. I. (2004): Kernel Dimensionality Reduction for Supervised Learning</ref>, this can be written more formally as a conditional independence assertion:
[math]\displaystyle{ Y \perp X \mid B^T X, }[/math] where the columns of [math]\displaystyle{ B\, }[/math] form an orthonormal basis of [math]\displaystyle{ S\, }[/math].
The above statement says that the subspace [math]\displaystyle{ S\, }[/math] is such that the conditional probability density function [math]\displaystyle{ p_{Y|X}(y|x)\, }[/math] is preserved, in the sense that [math]\displaystyle{ p_{Y|X}(y|x) = p_{Y|B^T X}(y|B^T x)\, }[/math] for all [math]\displaystyle{ x \in X \, }[/math] and [math]\displaystyle{ y \in Y \, }[/math], where [math]\displaystyle{ B^T X\, }[/math] is the orthogonal projection of [math]\displaystyle{ X\, }[/math] onto [math]\displaystyle{ S\, }[/math]. The subspace [math]\displaystyle{ S\, }[/math] is referred to as a dimension reduction subspace. Note that [math]\displaystyle{ S\, }[/math] is not unique.
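To make the assertion concrete, here is a minimal NumPy sketch (not from the paper; the variable names, the nonlinear link, and the noise level are purely illustrative) of synthetic data in which the response depends on the covariates only through a one-dimensional projection, so that the span of B is a dimension reduction subspace.
<pre>
import numpy as np

# Minimal synthetic illustration (assumed setup: standard normal covariates,
# an arbitrary nonlinear link, Gaussian noise): Y depends on X only through
# the one-dimensional projection B^T X, so span(B) is a dimension reduction
# subspace and p(y | x) = p(y | B^T x).
rng = np.random.default_rng(0)
n, p = 2000, 5

X = rng.normal(size=(n, p))                      # covariates in R^5
B = np.array([[1.0], [1.0], [0.0], [0.0], [0.0]])
B /= np.linalg.norm(B)                           # unit vector spanning S

Z = X @ B                                        # projected covariate B^T X
Y = np.sin(Z[:, 0]) + 0.1 * rng.normal(size=n)   # response depends on X only via Z

# Any regression of Y on the single coordinate Z captures the same predictive
# information about Y as a regression on all five coordinates of X.
</pre>
Note that any subspace containing span(B) is also a dimension reduction subspace here, which is one way to see that S is not unique.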
We can define a minimal subspace as the intersection of all dimension reduction subspaces [math]\displaystyle{ S\, }[/math]. However, a minimal subspace will not necessarily satisfy the conditional independence assertion specified above; when it does, it is referred to as the central subspace.
Finding the central subspace is one of the primary goals of SDR. Several approaches have been introduced in the past, mostly based on inverse regression (Li, 1991)<ref>Li, K.-C. (1991): Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316–327.</ref> (Li, 1992)<ref>Li, K.-C. (1992): On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association, 86, 316–342.</ref>. The main intuition behind this approach is to estimate [math]\displaystyle{ \mathbb{E}[X|Y] }[/math]: if the conditional distribution of [math]\displaystyle{ X }[/math] given [math]\displaystyle{ Y }[/math] is concentrated in a subspace of [math]\displaystyle{ X }[/math], then [math]\displaystyle{ \mathbb{E}[X|Y] }[/math] should also lie in that subspace (see Li, 1991 for more details<ref>Li, K.-C. (1991): Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316–327.</ref>). Unfortunately, such an approach requires strong assumptions on the distribution of X (e.g. that the distribution is elliptical), and the methods of inverse regression fail if such assumptions are not satisfied. In order to overcome this problem, the authors turn to KDR, an approach to SDR that does not make such strong assumptions.
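For concreteness, the sketch below implements the basic sliced inverse regression recipe of Li (1991) in NumPy. It is an illustration rather than the authors' method; the function name and defaults are our own, and the estimator relies on exactly the kind of distributional assumption criticised above (roughly elliptical covariates with a full-rank covariance).
<pre>
import numpy as np

def sir_directions(X, Y, n_slices=10, n_components=1):
    """Rough sketch of sliced inverse regression (Li, 1991).

    Estimates directions spanning the central subspace, assuming the
    covariance of X is full rank and X is roughly elliptically distributed.
    """
    n, p = X.shape

    # Standardize the covariates: Z = (X - mean) Sigma^{-1/2}.
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ inv_sqrt

    # Slice the sorted response and average Z within each slice,
    # giving a crude estimate of the inverse regression curve E[Z | Y].
    order = np.argsort(Y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)

    # Leading eigenvectors of the slice-mean covariance span the estimate;
    # map them back to the original X coordinates.
    _, v = np.linalg.eigh(M)
    directions = inv_sqrt @ v[:, ::-1][:, :n_components]
    return directions / np.linalg.norm(directions, axis=0)
</pre>
Applied to the synthetic data from the previous sketch, sir_directions(X, Y, n_slices=20) should recover a direction closely aligned with the true B.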
Kernel Dimension Reduction
(this section will be updated shortly)
Manifold Learning
(this section will be updated shortly)
Manifold Kernel Dimension Reduction
(this section will be updated shortly)
Examples
(this section will be updated shortly)
Summary
(this section will be updated shortly)
Further Research
(this section will be updated shortly)
References
<references/>