Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces


Kernel Dimensionality Reduction (KDR) is a technique for dimensionality reduction in the setting of supervised learning.

Introduction

The paper presents a technique, Kernel Dimensionality Reduction (KDR), for dimensionality reduction in the setting of supervised learning. Before explaining KDR, it is useful to review supervised learning and establish notation.

Supervised learning techniques learn a function from training data, which consist of pairs of input objects and desired outputs. When the output takes continuous values, the supervised learning problem is called a "regression" problem; when the output takes discrete values, it is called a "classification" problem. Because the theory developed in the paper allows the output space to be either continuous or discrete, KDR is a dimensionality reduction technique for both regression and classification.

In its most generic form, a regression problem can be stated as follows: We are given data consisting of observations of $(X,Y)\,$ pairs, where $X \in \mathbb{R}^m$ is an $m\,$-dimensional explanatory variable and $Y \in \mathbb{R}^{\ell}$ is an $\ell\,$-dimensional response. The regression problem is to estimate the conditional probability density function $p_{Y|X}(y|x)\,$.

When $X\,$ is high-dimensional, estimating $p_{Y|X}(y|x)\,$ is computationally expensive. Life would be much better if there existed a low-dimensional subspace $S \subsetneq \mathbb{R}^m$ such that the conditional probability density function is preserved, in the sense that $p_{Y|X}(y|x) = p_{Y|\Pi_S X}(y|\Pi_S x)\,$ for all $x \,$ and $y \,$, where $\Pi_S \,$ is the orthogonal projection of $\mathbb{R}^m\,$ onto $S\,$. In the paper, as in much other work in the dimensionality reduction literature, the authors assume the existence and uniqueness of such a subspace $S \,$, henceforth referred to as the effective subspace of regression, and proceed to estimate it.

To put it succinctly, the paper treats the problem of finding the subspace $S \,$, given an i.i.d. sample $\{(x_1,y_1), \cdots, (x_n, y_n)\}$ from $p_X \,$ and $p_{Y|X} \,$, such that $p_{Y|X}(y|x) = p_{Y|\Pi_S X}(y|\Pi_S x)\,$ holds for all $x_i \,$ and $y_i \,$ in the training data.

Note that the equality $p_{Y|X}(y|x) = p_{Y|\Pi_S X}(y|\Pi_S x)\,$ is equivalent to saying that $Y \,$ depends on $X\,$ only through $\Pi_S X \,$ and not at all on $(\mathbf{I}-\Pi_S)X \,$, the component of $X\,$ in the orthogonal complement of $S\,$. The dimensionality reduction problem is thus equivalent to estimating the projection $\Pi_S \,$ which makes $Y \,$ and $(\mathbf I-\Pi_S) X \,$ conditionally independent given $\Pi_S X \,$.
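To make the effective-subspace idea concrete, here is a small synthetic sketch (the variable names, the choice of $\sin$ as the link function, and the noise level are illustrative assumptions, not taken from the paper): $Y$ depends on a three-dimensional $X$ only through the single direction $b$, so $S = \mathrm{span}(b)$ is the effective subspace.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data in R^3 whose response depends on X only
# through one direction b: here S = span(b) is the effective subspace.
n, m = 500, 3
X = rng.normal(size=(n, m))
b = np.array([1.0, 0.0, 0.0])          # true effective direction
Y = np.sin(X @ b) + 0.1 * rng.normal(size=n)

# The projected coordinate carries essentially all information about Y,
# while coordinates orthogonal to S carry (almost) none.
u = X @ b                              # the coordinate of Pi_S X along b
corr_effective = abs(np.corrcoef(np.sin(u), Y)[0, 1])
corr_orthogonal = abs(np.corrcoef(X[:, 1], Y)[0, 1])
print(corr_effective, corr_orthogonal)
```

Any method that recovers $b$ from the sample alone has solved this toy instance of the problem the paper addresses.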

One main result of the paper is the conversion of the dimensionality reduction problem of estimating $S \,$ into an optimization problem expressed in terms of covariance operators (see below).

Original Problem

Given a set of data $(X,Y)\,$, where $X \in \mathbb{R}^m$ is an $m$-dimensional explanatory variable and $Y \in \mathbb{R}^{\ell}$ is an $\ell$-dimensional response. $Y$ may be continuous or discrete.
We want to find an orthogonal projection $\Pi_S$ of $\mathbb{R}^n$ onto $S$, which lets conditional probability
$p_{Y|X}(y|x) = p_{Y|\Pi_S X}(y|\Pi_S x).$
The subspace $S \subseteq \mathbb{R}^m$ is called the effective subspace of regression.
Equivalently, given an i.i.d. sample $\{(X_1,Y_1), \cdots, (X_n, Y_n)\}$ from $p_X$ and $p_{Y|X}$, find a matrix whose columns span the effective subspace $S$ such that
$p_{Y|X}(y|x) = p_{Y|\Pi_SX}(y|\Pi_S x).$
The above is equivalent to finding a projection $\Pi_S$ which makes $Y$ and $(\mathbf I-\Pi_S)X$ conditionally independent given $\Pi_S X$. This dimensionality reduction problem is converted into an optimization problem by expressing it in terms of covariance operators (see below).

Decomposition of $X$

Let $W=(B,C)$ be an $m \times m$ orthogonal matrix whose first $k$ columns, forming $B \in \mathbb{R}^{m \times k}$, span the subspace $S$, and whose remaining columns form $C \in \mathbb{R}^{m \times (m-k)}$.
Define $U=B^T X$ and $V=C^T X$; then we have
$p_X(x) = p_{U,V}(u,v)$ and $p_{X,Y}(x,y)=p_{U,V,Y}(u,v,y)$
From the equations above, we can rewrite the conditional probability equation
$p_{Y|X}(y|x) = p_{Y|\Pi_S X}(y|\Pi_Sx)$
to
$p_{Y|U,V}(y|u,v) = p_{Y|U}(y|u),$
which means that $Y$ and $V$ are conditionally independent given $U$, denoted $Y \perp V|U$. In other words, our goal is to decompose $W$ into $(B, C)$ so that $U = B^T X$ retains all the information that $X$ carries about $Y$.
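The decomposition above can be sketched numerically as follows (a minimal example with illustrative sizes $m=3$, $k=1$; rows of `X` are samples, so $U=B^T X$ appears as `X @ B`):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the decomposition W = (B, C): take an m x m orthogonal W,
# let its first k columns be B (a candidate basis for S) and the
# remaining m - k columns be C.
m, k, n = 3, 1, 10
W, _ = np.linalg.qr(rng.normal(size=(m, m)))   # random orthogonal matrix
B, C = W[:, :k], W[:, k:]

X = rng.normal(size=(n, m))
U = X @ B          # U = B^T X for each sample (samples stored as rows)
V = X @ C          # V = C^T X

# B and C have orthonormal columns and span orthogonal subspaces.
print(np.allclose(B.T @ B, np.eye(k)), np.allclose(B.T @ C, 0.0))
```

Since $W$ is orthogonal, $(U,V)$ is just $X$ in a rotated coordinate system, which is why $p_X(x) = p_{U,V}(u,v)$ holds.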

Reproducing Kernel Hilbert Space

A Hilbert space is a (possibly infinite-dimensional) inner product space that is a complete metric space. Elements of a Hilbert space may be functions. A reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions on some set $T$ for which there exists a function $K$ (known as the reproducing kernel) on $T \times T$ such that for any $t \in T$, $K( \cdot , t )$ is in the RKHS and $f(t) = \langle f, K(\cdot, t)\rangle$ for all $f$ in the RKHS (the reproducing property).
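A minimal numerical sketch (the Gaussian kernel and bandwidth are illustrative choices): Gram matrices of a reproducing kernel are symmetric positive semidefinite, and by the Moore–Aronszajn theorem this property is exactly what makes a function $K$ the reproducing kernel of some RKHS.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# The Gram matrix K_ij = k(t_i, t_j) of a reproducing kernel is
# symmetric positive semidefinite for any choice of points t_i.
T = rng.normal(size=(20, 2))
K = gauss_kernel(T, T)
eigvals = np.linalg.eigvalsh(K)
print(K.shape, eigvals.min() >= -1e-8)   # PSD up to round-off
```

The Gaussian kernel is also the kernel used for KDR later in this article.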

Cross-Covariance Operators

Let $({H}_1, k_1)$ and $({H}_2, k_2)$ be RKHS over measurable spaces $(\Omega_1, {B}_1)$ and $(\Omega_2, {B}_2)$, respectively, with $k_1$ and $k_2$ measurable, and let $(X, Y)$ be a random vector on $\Omega_1 \times \Omega_2$. Using the Riesz representation theorem, one may show that there exists a unique operator $\Sigma_{YX}$ from $H_1$ to $H_2$ such that
$\langle g, \Sigma_{YX} f\rangle_{H_2} = \mathbb{E}_{XY} [f(X)g(Y)] - \mathbb{E}[f(X)]\mathbb{E}[g(Y)]$
holds for all $f \in H_1$ and $g \in H_2$; this operator is called the cross-covariance operator.
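The defining identity says that $\Sigma_{YX}$ is characterized by ordinary covariances of transformed variables. A Monte Carlo sketch (the data-generating process and the particular kernel sections $f$, $g$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# <g, Sigma_YX f> = Cov[f(X), g(Y)], so for dependent X and Y we expect
# a nonzero value for suitably chosen f and g.
n = 5000
X = rng.normal(size=n)
Y = X + 0.1 * rng.normal(size=n)           # Y depends strongly on X

f = lambda x: np.exp(-(x - 1.0) ** 2 / 2)  # f = k1(., 1), a kernel section
g = lambda y: np.exp(-(y - 1.0) ** 2 / 2)  # g = k2(., 1)

# Empirical version of E[f(X)g(Y)] - E[f(X)] E[g(Y)].
cov_fg = np.mean(f(X) * g(Y)) - np.mean(f(X)) * np.mean(g(Y))
print(cov_fg)
```

For independent $X$ and $Y$ the same quantity would concentrate around zero for every $f$ and $g$, which is the intuition behind using these operators to detect (conditional) independence.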

Conditional Covariance Operators

Let $(H_1, k_1)$, $(H_2, k_2)$ and $(H_3, k_3)$ be RKHS on $\Omega_1$, $\Omega_2$ and $\Omega_3$, respectively, and let $(X,Y,Z)$ be a random vector on the measurable space $\Omega_1 \times \Omega_2 \times \Omega_3$. The conditional cross-covariance operator of $(X,Y)$ given $Z$ is defined by
$\Sigma_{YX|Z}: = \Sigma_{YX} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZX}$.

Conditional Covariance Operators and Conditional Independence

Let $(H_{11}, k_{11})$, $(H_{12},k_{12})$ and $(H_2, k_2)$ be RKHS on measurable spaces $\Omega_{11}$, $\Omega_{12}$ and $\Omega_2$, respectively, with continuous and bounded kernels. Let $(X,Y)=((U,V),Y)$ be a random vector on $\Omega_{11}\times \Omega_{12} \times \Omega_{2}$, where $X = (U,V)$, and let $H_1 = H_{11} \otimes H_{12}$ be the direct product. It is assumed that $\mathbb{E}_{Y|U} [g(Y)|U= \cdot] \in H_{11}$ and $\mathbb{E}_{Y|X} [g(Y)|X= \cdot] \in H_{1}$ for all $g \in H_2$. Then we have
$\Sigma_{YY|U} \ge \Sigma_{YY|X}$,
where the inequality refers to the order of self-adjoint operators. Furthermore, if $H_2$ is probability-determining,
$\Sigma_{YY|X} = \Sigma_{YY|U} \Leftrightarrow Y \perp X|U$.
Therefore, the effective subspace $S$ can be found by minimizing the following objective:
$\min_S \quad \Sigma_{YY|U}, \quad \mathrm{s.t.} \quad U = \Pi_S X$.

Note that in the operator sense the inequality
$\Sigma_{YY|U} \ge \Sigma_{YY|X}$
means that the variance of $Y$ given $U$ is at least as large as the variance of $Y$ given $X$, which makes sense because $U$ carries only part of the information in $X$.

Conditional Covariance Operators Formulation

Given an i.i.d. sample $\{(X_1,Y_1), \cdots, (X_n,Y_n)\}$ from $p_X$ and $p_{Y|X}$, find a matrix $B$ whose columns span the effective subspace $S$ by solving
$\min_B\quad \Sigma_{YY|U}$
$s.t. \quad U = B^T X$.

Estimation of Conditional Covariance Operator

Define $\tilde{k}_1^{(i)} \in H_1$ by
$\tilde{k}_1^{(i)} = k_1(\cdot, Y_i) -\frac{1}{n}\sum_{j=1}^n k_1(\cdot, Y_j)$
and define $\tilde{k}_2^{(i)} \in H_2$ by
$\tilde{k}_2^{(i)} = k_2(\cdot, U_i) -\frac{1}{n}\sum_{j=1}^n k_2(\cdot, U_j)$.
By replacing the expectation with the empirical average, the covariance is estimated by
$\langle f, \Sigma_{YU} g\rangle_{H_1} \approx \frac{1}{n} \sum_{i=1}^n \langle \tilde{k}_1^{(i)}, f\rangle \langle \tilde{k}_2^{(i)}, g\rangle$.
Let
$(\hat{K}_Y)_{ij} = \langle \tilde{k}_1^{(i)}, \tilde{k}_1^{(j)} \rangle , \quad (\hat{K}_U)_{ij} = \langle \tilde{k}_2^{(i)}, \tilde{k}_2^{(j)} \rangle$.
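In practice the centered Gram matrices $\hat{K}_Y$ and $\hat{K}_U$ can be computed by double-centering a raw Gram matrix, since $\langle \tilde{k}^{(i)}, \tilde{k}^{(j)} \rangle$ expands to $(HKH)_{ij}$ with $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$. A short sketch (Gaussian kernel and bandwidth are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def centered_gram(Z, sigma=1.0):
    """Gram matrix of centered features k(., Z_i) - (1/n) sum_j k(., Z_j).

    Expanding the inner products <k_tilde^(i), k_tilde^(j)> via the
    kernel gives exactly H K H, with K the raw Gram matrix and
    H = I - (1/n) 11^T the centering matrix.
    """
    n = Z.shape[0]
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

n = 30
Y = rng.normal(size=(n, 1))
KY = centered_gram(Y)
print(np.allclose(KY.sum(axis=0), 0.0))   # rows/columns sum to zero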
We expand $f$ and $g$ in the span of the centered kernel functions plus orthogonal remainders,
$f = \sum_{\ell =1}^n a_{\ell} \tilde{k}_1^{(\ell)} + f^{\perp},$
$g = \sum_{m =1}^n b_{m} \tilde{k}_2^{(m)} + g^{\perp}.$
Therefore, we have
$\langle f, \Sigma_{YU}g\rangle _{{H}_1} \approx \frac{1}{n} \sum_{\ell=1}^n \sum_{m=1}^n a_{\ell} b_m (\hat{K}_Y \hat{K}_U)_{\ell m}$. We can estimate the operator by
$\hat{\Sigma}_{YU} = \hat{K}_Y \hat{K}_U$.
The conditional covariance matrix $\hat{\Sigma}_{YY|U}$ is
$\hat{\Sigma}_{YY|U} = \hat{\Sigma}_{YY} - \hat{\Sigma}_{YU}\hat{\Sigma}_{UU}^{-1}\hat{\Sigma}_{UY}$.
So, we have
$\hat{\Sigma}_{YY|U} = (\hat{K}_Y + \epsilon I_n)^2 - \hat{K}_Y\hat{K}_U(\hat{K}_U + \epsilon I_n)^{-2}\hat{K}_U \hat{K}_Y$,
where $\epsilon \gt 0$ is a regularization constant.
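Putting the pieces together, the regularized estimator above can be sketched in a few lines (the Gaussian kernel, bandwidth, $\epsilon$, and the toy data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def centered_gram(Z, sigma=1.0):
    """Centered Gaussian Gram matrix H K H with H = I - (1/n) 11^T."""
    n = Z.shape[0]
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def cond_cov(KY, KU, eps=0.1):
    """Regularized estimate of Sigma_{YY|U}, as in the formula above:
    (KY + eps I)^2 - KY KU (KU + eps I)^{-2} KU KY."""
    n = KY.shape[0]
    I = np.eye(n)
    inv = np.linalg.inv(KU + eps * I)
    return (KY + eps * I) @ (KY + eps * I) - KY @ KU @ inv @ inv @ KU @ KY

n = 40
X = rng.normal(size=(n, 2))
Y = np.sin(X[:, :1]) + 0.1 * rng.normal(size=(n, 1))
U = X[:, :1]                      # candidate U = B^T X with B = e_1
S = cond_cov(centered_gram(Y), centered_gram(U))
print(S.shape)
```

The result is an $n \times n$ symmetric matrix whose "size" (measured below by its determinant) quantifies how much uncertainty about $Y$ remains after conditioning on the candidate $U$.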

Evaluation of Size of Self-adjoint Operators

We can use different measures to evaluate the size of $\hat{\Sigma}_{YY|U}$:
first, the operator norm, the maximum eigenvalue;
second, the trace norm, the sum of the eigenvalues;
third, the determinant, the product of the eigenvalues.
Here we use the determinant to evaluate the size of $\hat{\Sigma}_{YY|U}$.
By the Schur complement formula for determinants
$\det(A - BC^{-1}B^T) = \frac{1}{\det C} \det \begin{bmatrix} A & B \\ B^T & C \end{bmatrix},$
we have
$\det \hat{\Sigma}_{YY|U} = \frac{\det \hat{\Sigma}_{[YU][YU]}}{\det \hat{\Sigma}_{UU}}$,
where
$\hat{\Sigma}_{[YU][YU]} = \begin{bmatrix}\hat{\Sigma}_{YY} & \hat{\Sigma}_{YU}\\ \hat{\Sigma}_{UY} & \hat{\Sigma}_{UU} \end{bmatrix}.$
Minimizing the above function with respect to the choice of subspace $S$ is called kernel dimensionality reduction (KDR).
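The determinant identity above is easy to verify numerically. A short check (matrix sizes and the SPD construction are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Numerical check of the Schur-complement determinant identity:
# det(A - B C^{-1} B^T) = det([[A, B], [B^T, C]]) / det(C).
p, q = 4, 3
A0 = rng.normal(size=(p, p))
A = A0 @ A0.T + p * np.eye(p)     # symmetric positive definite block
C0 = rng.normal(size=(q, q))
C = C0 @ C0.T + q * np.eye(q)     # symmetric positive definite block
B = rng.normal(size=(p, q))

lhs = np.linalg.det(A - B @ np.linalg.inv(C) @ B.T)
block = np.block([[A, B], [B.T, C]])
rhs = np.linalg.det(block) / np.linalg.det(C)
print(np.isclose(lhs, rhs))
```

Applied with $A = \hat{\Sigma}_{YY}$, $B = \hat{\Sigma}_{YU}$, $C = \hat{\Sigma}_{UU}$, this gives exactly the ratio $\det \hat{\Sigma}_{[YU][YU]} / \det \hat{\Sigma}_{UU}$ used in the KDR objective.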

Determinant Formulation

Given an i.i.d. sample $\{(X_1, Y_1), \cdots, (X_n,Y_n)\}$ from $p_X$ and $p_{Y|X}$, find a matrix $B$ whose columns span the effective subspace $S$ by solving
$\min_B \quad \frac{\det \hat{\Sigma}_{[YU][YU]}}{\det \hat{\Sigma}_{UU} \det \hat{\Sigma}_{YY}}$
$s.t. \quad U=B^T X$.
The problem of local minima may arise when minimizing the above function. To alleviate this problem, the scale parameter $\sigma$ of the Gaussian kernel is decreased gradually during the iterations of the optimization.

Gradient Descent

The matrix $B$ is updated iteratively according to
$B(t+1) = B(t) - \eta \frac{\partial \log \det \hat{\Sigma}_{YY|U}}{\partial B}$,
or equivalently, using $\frac{\partial \log \det A}{\partial B} = \mathrm{Tr}\left[A^{-1}\frac{\partial A}{\partial B}\right]$,
$B(t+1) = B(t) - \eta\, \mathrm{Tr}\left[\hat{\Sigma}_{YY|U}^{-1} \frac{\partial \hat{\Sigma}_{YY|U}}{\partial B}\right]$,
where $\eta$ is optimized through golden section search, and
$Tr[\hat{\Sigma}_{YY|U}^{-1}\frac{\partial \hat{\Sigma}_{YY|U}}{\partial B}] = 2 \epsilon Tr [\hat{\Sigma}_{YY|U}^{-1} \hat{K}_Y (\hat{K}_U + \epsilon I_n)^{-1} \frac{\partial \hat{K}_U}{\partial B}(\hat{K}_U + \epsilon I_n)^{-2}\hat{K}_U\hat{K}_Y]$.
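A hedged end-to-end sketch of the optimization loop (my own simplification, not the paper's implementation: it uses a finite-difference gradient of the log-determinant objective instead of the closed-form trace expression, a fixed step size instead of golden section search, fixed $\sigma$ and $\epsilon$, and re-orthonormalizes $B$ by QR after each step):

```python
import numpy as np

rng = np.random.default_rng(0)

def centered_gram(Z, sigma=1.0):
    """Centered Gaussian Gram matrix H K H with H = I - (1/n) 11^T."""
    n = Z.shape[0]
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def objective(B, X, KY, eps=0.1):
    """log det of the regularized conditional covariance for U = B^T X."""
    KU = centered_gram(X @ B)
    n = KU.shape[0]
    inv = np.linalg.inv(KU + eps * np.eye(n))
    S = ((KY + eps * np.eye(n)) @ (KY + eps * np.eye(n))
         - KY @ KU @ inv @ inv @ KU @ KY)
    return np.linalg.slogdet(S)[1]

# Toy data: Y depends on X only through its first coordinate.
n, m, k = 60, 2, 1
X = rng.normal(size=(n, m))
Y = np.sin(X[:, :1]) + 0.1 * rng.normal(size=(n, 1))
KY = centered_gram(Y)

# The objective should be smaller for the true direction e_1 than for e_2.
obj_true = objective(np.array([[1.0], [0.0]]), X, KY)
obj_wrong = objective(np.array([[0.0], [1.0]]), X, KY)

B = np.linalg.qr(rng.normal(size=(m, k)))[0]   # random orthonormal start
eta, h = 0.05, 1e-4
for _ in range(50):
    G = np.zeros_like(B)                        # finite-difference gradient
    for i in range(m):
        for j in range(k):
            E = np.zeros_like(B)
            E[i, j] = h
            G[i, j] = (objective(B + E, X, KY)
                       - objective(B - E, X, KY)) / (2.0 * h)
    B = np.linalg.qr(B - eta * G)[0]            # step, then re-orthonormalize

print(obj_true, obj_wrong, abs(B[0, 0]))
```

The re-orthonormalization keeps $B$ a valid basis of a candidate subspace; the paper's closed-form trace gradient plays the role of the finite-difference loop here.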

Summary

First, dimensionality reduction for regression is equivalent to a conditional independence statement.
Second, conditional covariance operators give the criterion for KDR.
Third, KDR is a general framework for dimensionality reduction in regression, since it requires no assumptions on $p(Y|X)$, $p(X)$, $p(Y)$, or the dimensionality of $Y$; furthermore, it is applicable to any type of data set.
Fourth, the repeated multiplication of $n\times n$ matrices is computationally expensive.
Last, gradient descent may converge to a local minimum.

Further research

First, analyzing the statistical consistency of the KDR-based estimator, i.e., whether the estimator converges to the true subspace when such a subspace really exists. Second, developing methods for choosing the dimensionality of the effective subspace.