# Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces


Kernel dimension reduction (KDR) is a technique for dimensionality reduction in the supervised learning setting.

### Original Problem

Given data $(X,Y)$, where $X \in \mathbb{R}^m$ is an $m$-dimensional explanatory variable and $Y \in \mathbb{R}^{\ell}$ is an $\ell$-dimensional response ($Y$ may be continuous or discrete),
we want to find an orthogonal projection $\Pi_S$ of $\mathbb{R}^m$ onto a subspace $S$ such that the conditional probability satisfies
$p_{Y|X}(y|x) = p_{Y|\Pi_S X}(y|\Pi_S x).$
The subspace $S \subseteq \mathbb{R}^m$ is called the effective subspace of regression.
Equivalently, given an i.i.d. sample $\{(X_1,Y_1), \cdots, (X_n, Y_n)\}$ from $p_X$ and $p_{Y|X}$, find a matrix whose columns span the effective subspace $S$, so that
$p_{Y|X}(y|x) = p_{Y|\Pi_S X}(y|\Pi_S x).$
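As an illustration, the following sketch generates synthetic data whose response depends on $X$ only through a one-dimensional projection; the direction $b$, the sample size, and the $\sin$ link are hypothetical choices, not part of the method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: m = 3 explanatory dimensions, but Y depends on X
# only through the 1-dimensional projection u = b^T x, so the effective
# subspace S is span{b}.
m, n = 3, 500
b = np.array([1.0, -1.0, 0.0]) / np.sqrt(2.0)   # unit basis vector of S
X = rng.normal(size=(n, m))
U = X @ b                                        # U = B^T X
Y = np.sin(U) + 0.1 * rng.normal(size=n)         # p(y|x) = p(y|b^T x)

# Directions orthogonal to S carry no information about Y: correlating Y
# with a projection outside S gives a value near zero.
v = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)     # v orthogonal to b
print(abs(np.corrcoef(Y, X @ b)[0, 1]))          # noticeably nonzero
print(abs(np.corrcoef(Y, X @ v)[0, 1]))          # near zero
```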

### Decomposition of $X$

Let $W=(B,C)$ be an $m \times m$ orthogonal matrix, where the column vectors of $B$ span the subspace $S$.
Define $U=B^T X$ and $V=C^T X$; then we have
$p_X(x) = p_{U,V}(u,v)$ and $p_{X,Y}(x,y)=p_{U,V,Y}(u,v,y).$
From the equations above, the conditional-probability condition
$p_{Y|X}(y|x) = p_{Y|\Pi_S X}(y|\Pi_S x)$
becomes
$p_{Y|U,V}(y|u,v) = p_{Y|U}(y|u),$
which means that $Y$ and $V$ are conditionally independent given $U$, denoted $Y \perp V \mid U$.

### Reproducing Kernel Hilbert Space

A Hilbert space is an inner product space that is complete with respect to the norm induced by its inner product; it may be finite- or infinite-dimensional, and its elements may be functions. A reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions on some set $T$ for which there exists a function $K$ (the reproducing kernel) on $T \times T$ such that for any $t \in T$, $K(\cdot, t)$ is in the RKHS and $\langle f, K(\cdot, t) \rangle = f(t)$ for all $f$ in the RKHS (the reproducing property).
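To make the reproducing property concrete, here is a small numerical sketch with a Gaussian kernel; the kernel width, expansion points, and coefficients are arbitrary illustrative choices. For a function written as a finite kernel expansion $f = \sum_i a_i K(\cdot, x_i)$, the RKHS inner product with a kernel section evaluates $f$ pointwise.

```python
import numpy as np

# Gaussian RBF kernel on T = R: K(s, t) = exp(-(s - t)^2 / (2 sigma^2)).
sigma = 1.0
def K(s, t):
    return np.exp(-(s - t) ** 2 / (2.0 * sigma ** 2))

# A function in the RKHS, written as a finite kernel expansion
# f = sum_i a_i K(., x_i). For such f, bilinearity of the inner product
# gives <f, K(., t)>_H = sum_i a_i K(x_i, t).
x = np.array([-1.0, 0.5, 2.0])   # illustrative expansion points
a = np.array([0.3, -1.2, 0.7])   # illustrative coefficients

def f(t):
    return np.sum(a * K(x, t))

t = 0.25
inner = np.sum(a * K(x, t))   # <f, K(., t)>_H via the reproducing property
print(inner, f(t))            # the two values coincide
```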

### Cross-Covariance Operators

Let $(H_1, k_1)$ and $(H_2, k_2)$ be RKHS over measurable spaces $(\Omega_1, B_1)$ and $(\Omega_2, B_2)$, respectively, with $k_1$ and $k_2$ measurable, and let $(X, Y)$ be a random vector on $\Omega_1 \times \Omega_2$. Then there exists a unique operator $\Sigma_{YX}$ from $H_1$ to $H_2$, called the cross-covariance operator, such that
$\langle g, \Sigma_{YX} f \rangle_{H_2} = \mathbb{E}_{XY} [f(X)g(Y)] - \mathbb{E}[f(X)]\,\mathbb{E}[g(Y)]$
holds for all $f \in H_1$ and $g \in H_2$.
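A quick numerical check of the defining identity, with a synthetic pair $(X, Y)$ and illustrative feature functions $f$ and $g$ (all hypothetical choices): the empirical version of $\mathbb{E}[f(X)g(Y)] - \mathbb{E}[f(X)]\mathbb{E}[g(Y)]$ matches the sample covariance of $f(X)$ and $g(Y)$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic pair (X, Y) and illustrative feature functions f, g.
n = 100_000
X = rng.normal(size=n)
Y = X + 0.5 * rng.normal(size=n)

f = np.sin    # stands in for some f in H_1
g = np.tanh   # stands in for some g in H_2

fx, gy = f(X), g(Y)
# Empirical version of <g, Sigma_YX f> = E[f(X)g(Y)] - E[f(X)] E[g(Y)]
lhs = np.mean(fx * gy) - np.mean(fx) * np.mean(gy)
rhs = np.cov(fx, gy)[0, 1]   # sample covariance of f(X) and g(Y)
print(lhs, rhs)              # the two estimates agree closely
```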

### Conditional Covariance Operators

Let $(H_1, k_1)$, $(H_2, k_2)$, and $(H_3, k_3)$ be RKHS on $\Omega_1$, $\Omega_2$, and $\Omega_3$, respectively, and let $(X,Y,Z)$ be a random vector on the measurable space $\Omega_1 \times \Omega_2 \times \Omega_3$. The conditional cross-covariance operator of $(X,Y)$ given $Z$ is defined by
$\Sigma_{YX|Z} := \Sigma_{YX} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZX}.$

### Conditional Covariance Operators and Conditional Independence

Let $(H_{11}, k_{11})$, $(H_{12},k_{12})$, and $(H_2, k_2)$ be RKHS on measurable spaces $\Omega_{11}$, $\Omega_{12}$, and $\Omega_2$, respectively, with continuous and bounded kernels. Let $(X,Y)=(U,V,Y)$ be a random vector on $\Omega_{11}\times \Omega_{12} \times \Omega_{2}$, where $X = (U,V)$, and let $H_1 = H_{11} \otimes H_{12}$ be the direct product. Assume that $\mathbb{E}_{Y|U} [g(Y)|U= \cdot] \in H_{11}$ and $\mathbb{E}_{Y|X} [g(Y)|X= \cdot] \in H_{1}$ for all $g \in H_2$. Then we have
$\Sigma_{YY|U} \ge \Sigma_{YY|X},$
where the inequality refers to the partial order of self-adjoint operators. Furthermore, if $H_2$ is probability-determining,
$\Sigma_{YY|X} = \Sigma_{YY|U} \Leftrightarrow Y \perp X \mid U.$

### Conditional Covariance Operators Formulation

Given an i.i.d. sample $\{(X_1,Y_1), \cdots, (X_n,Y_n)\}$ from $p_X$ and $p_{Y|X}$, find the matrix $B$ spanning the effective subspace $S$ by solving
$\min_B\quad \Sigma_{YY|U}$
$\text{s.t.} \quad U = B^T X,$
where the minimization is with respect to the partial order of self-adjoint operators.

### Estimation of Conditional Covariance Operator

Define $\tilde{k}_1^{(i)} \in H_1$ by
$\tilde{k}_1^{(i)} = k_1(\cdot, Y_i) -\frac{1}{n}\sum_{j=1}^n k_1(\cdot, Y_j)$
and $\tilde{k}_2^{(i)} \in H_2$ by
$\tilde{k}_2^{(i)} = k_2(\cdot, U_i) -\frac{1}{n}\sum_{j=1}^n k_2(\cdot, U_j).$
By replacing the expectation with the empirical average, the covariance is estimated by
$\langle f, \Sigma_{YU} g \rangle_{H_1} \approx \frac{1}{n} \sum_{i=1}^n \langle \tilde{k}_1^{(i)}, f \rangle \langle \tilde{k}_2^{(i)}, g \rangle.$
Let the centered Gram matrices be
$(\hat{K}_Y)_{ij} = \langle \tilde{k}_1^{(i)}, \tilde{k}_1^{(j)} \rangle, \quad (\hat{K}_U)_{ij} = \langle \tilde{k}_2^{(i)}, \tilde{k}_2^{(j)} \rangle.$
Expanding $f$ and $g$ over the span of the centered kernel sections,
$f = \sum_{\ell =1}^n a_{\ell} \tilde{k}_1^{(\ell)} + f^{\perp},$
$g = \sum_{m =1}^n b_{m} \tilde{k}_2^{(m)} + g^{\perp},$
we have
$\langle f, \Sigma_{YU} g \rangle_{H_1} \approx \frac{1}{n} \sum_{\ell=1}^n \sum_{m=1}^n a_{\ell} b_m (\hat{K}_Y \hat{K}_U)_{\ell m},$
so the operator can be estimated by
$\hat{\Sigma}_{YU} = \hat{K}_Y \hat{K}_U.$
The conditional covariance operator is then estimated by
$\hat{\Sigma}_{YY|U} = \hat{\Sigma}_{YY} - \hat{\Sigma}_{YU}\hat{\Sigma}_{UU}^{-1}\hat{\Sigma}_{UY}.$
With regularization, this becomes
$\hat{\Sigma}_{YY|U} = (\hat{K}_Y + \epsilon I_n)^2 - \hat{K}_Y\hat{K}_U(\hat{K}_U + \epsilon I_n)^{-2}\hat{K}_U \hat{K}_Y,$
where $\epsilon > 0$ is a regularization constant.
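The estimator can be sketched in a few lines of numerical code; the Gaussian kernel, its width, the data-generating model, and the regularization value are all hypothetical choices. A direction $B$ that captures the dependence of $Y$ on $X$ should yield a smaller $\hat{\Sigma}_{YY|U}$, here compared via the trace.

```python
import numpy as np

rng = np.random.default_rng(0)

def centered_gram(Z, sigma=1.0):
    """Centered Gaussian Gram matrix: entries <k~^(i), k~^(j)>. Z is (n, d)."""
    n = Z.shape[0]
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    return H @ K @ H

# Hypothetical data: Y depends on X only through u = b^T x.
n, m = 200, 3
b = np.array([1.0, 0.0, 0.0])
X = rng.normal(size=(n, m))
Y = np.sin(X @ b) + 0.1 * rng.normal(size=n)

eps = 0.1                                    # regularization constant
I = np.eye(n)
Ky = centered_gram(Y.reshape(-1, 1))

def cond_cov(B):
    """(K_Y + eps I)^2 - K_Y K_U (K_U + eps I)^{-2} K_U K_Y for U = B^T X."""
    Ku = centered_gram((X @ B).reshape(n, -1))
    R = np.linalg.inv(Ku + eps * I)
    return (Ky + eps * I) @ (Ky + eps * I) - Ky @ Ku @ R @ R @ Ku @ Ky

# The informative direction gives a smaller conditional covariance.
good = np.trace(cond_cov(b))
bad = np.trace(cond_cov(np.array([0.0, 1.0, 0.0])))
print(good < bad)
```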

### Evaluation of Size of Self-adjoint Operators

Several measures can be used to evaluate the size of $\hat{\Sigma}_{YY|U}$:
the operator norm, the maximum eigenvalue;
the trace norm, the sum of the eigenvalues;
the determinant, the product of the eigenvalues.
Here we use the determinant to evaluate the size of $\hat{\Sigma}_{YY|U}$.
By the determinant identity for the Schur complement,
$\det(A - BC^{-1}B^T) = \frac{1}{\det C} \det \begin{bmatrix} A & B \\ B^T & C \end{bmatrix},$
we have
$\det \hat{\Sigma}_{YY|U} = \frac{\det \hat{\Sigma}_{[YU][YU]}}{\det \hat{\Sigma}_{UU}}$,
where
$\hat{\Sigma}_{[YU][YU]} = \begin{bmatrix}\hat{\Sigma}_{YY} & \hat{\Sigma}_{YU}\\ \hat{\Sigma}_{UY} & \hat{\Sigma}_{UU} \end{bmatrix}.$
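The Schur-complement determinant identity can be checked numerically on a random positive definite block matrix (the block sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# Numerical check of det(A - B C^{-1} B^T) = det([[A, B], [B^T, C]]) / det(C)
# for a random symmetric positive definite block matrix.
p, q = 3, 4
M = rng.normal(size=(p + q, p + q))
M = M @ M.T + (p + q) * np.eye(p + q)   # symmetric positive definite
A, B, C = M[:p, :p], M[:p, p:], M[p:, p:]

lhs = np.linalg.det(A - B @ np.linalg.inv(C) @ B.T)
rhs = np.linalg.det(M) / np.linalg.det(C)
print(np.isclose(lhs, rhs))   # True
```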

### Determinant Formulation

Given an i.i.d. sample $\{(X_1, Y_1), \cdots, (X_n,Y_n)\}$ from $p_X$ and $p_{Y|X}$, find the matrix $B$ spanning the effective subspace $S$ by solving
$\min_B \quad \frac{\det \hat{\Sigma}_{[YU][YU]}}{\det \hat{\Sigma}_{UU} \det \hat{\Sigma}_{YY}}$
$\text{s.t.} \quad U=B^T X.$

### Gradient Descent

The matrix $B$ is updated iteratively according to
$B(t+1) = B(t) - \eta \frac{\partial \log \det \hat{\Sigma}_{YY|U}}{\partial B},$
or equivalently
$B(t+1) = B(t) - \eta\, \mathrm{Tr}\Big[\hat{\Sigma}_{YY|U}^{-1} \frac{\partial \hat{\Sigma}_{YY|U}}{\partial B}\Big],$
where $\eta$ is chosen by golden section search, and
$\mathrm{Tr}\Big[\hat{\Sigma}_{YY|U}^{-1}\frac{\partial \hat{\Sigma}_{YY|U}}{\partial B}\Big] = 2 \epsilon\, \mathrm{Tr}\Big[\hat{\Sigma}_{YY|U}^{-1} \hat{K}_Y (\hat{K}_U + \epsilon I_n)^{-1} \frac{\partial \hat{K}_U}{\partial B}(\hat{K}_U + \epsilon I_n)^{-2}\hat{K}_U\hat{K}_Y\Big].$
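The descent loop can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses a finite-difference gradient in place of the analytic trace formula, a crude search over a few fixed step sizes in place of golden section search, and QR re-orthonormalization to keep the columns of $B$ orthonormal. All data and parameter choices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: m = 3 explanatory dimensions, effective dimension d = 1.
n, m, d = 100, 3, 1
b_true = np.array([[1.0], [1.0], [0.0]]) / np.sqrt(2.0)
X = rng.normal(size=(n, m))
Y = np.sin((X @ b_true).ravel()) + 0.1 * rng.normal(size=n)

eps = 0.1
I = np.eye(n)
H = I - np.ones((n, n)) / n                   # centering matrix

def cgram(Z, sigma=1.0):
    """Centered Gaussian Gram matrix; Z is (n, d)."""
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return H @ np.exp(-d2 / (2.0 * sigma ** 2)) @ H

Ky = cgram(Y.reshape(-1, 1))

def objective(B):
    """log det of the regularized estimate of Sigma_{YY|U}, with U = B^T X."""
    Ku = cgram(X @ B)
    R = np.linalg.inv(Ku + eps * I)
    S = (Ky + eps * I) @ (Ky + eps * I) - Ky @ Ku @ R @ R @ Ku @ Ky
    return np.linalg.slogdet(S)[1]

B = np.linalg.qr(rng.normal(size=(m, d)))[0]  # random orthonormal start
obj0 = objective(B)
h = 1e-5
for _ in range(30):
    # Finite-difference gradient (stand-in for the analytic trace formula).
    G = np.zeros_like(B)
    for i in range(m):
        for j in range(d):
            E = np.zeros_like(B)
            E[i, j] = h
            G[i, j] = (objective(B + E) - objective(B - E)) / (2 * h)
    # Crude search over step sizes (stand-in for golden section search),
    # followed by QR re-orthonormalization of B; keeps the objective
    # non-increasing by construction.
    best_val, best_B = objective(B), B
    for eta in (1.0, 0.3, 0.1, 0.03, 0.01):
        Bn = np.linalg.qr(B - eta * G)[0]
        val = objective(Bn)
        if val < best_val:
            best_val, best_B = val, Bn
    B = best_B

# |<B, b_true>| close to 1 means the effective direction was recovered.
print(objective(B) <= obj0, abs(float(B.ravel() @ b_true.ravel())))
```

In practice the analytic gradient formula is much cheaper than finite differences, since it reuses the matrices already computed for the objective.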

### Summary

First, dimension reduction for regression is equivalent to a conditional independence condition.
Second, conditional covariance operators give the criterion for KDR.
Third, KDR is a general framework for dimension reduction in regression, since it requires no assumptions on $p(Y|X)$, $p(X)$, $p(Y)$, or the dimensionality of $Y$; it is applicable to any type of data set.
Fourth, multiplication of $n\times n$ matrices is computationally expensive.
Last, the method may converge to a local minimum, because it relies on gradient descent.