Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces

Kernel Dimension Reduction (KDR) is a technique for dimensionality reduction in the setting of supervised learning.

Original Problem

Given a set of data [math]\displaystyle{ (X,Y) }[/math], where [math]\displaystyle{ X \in \mathbb{R}^m }[/math] is an m-dimensional explanatory variable and [math]\displaystyle{ Y \in \mathbb{R}^{\ell} }[/math] is an [math]\displaystyle{ \ell }[/math]-dimensional response; [math]\displaystyle{ Y }[/math] may be continuous or discrete.
We want to find an orthogonal projection [math]\displaystyle{ \Pi_S }[/math] of [math]\displaystyle{ \mathbb{R}^m }[/math] onto a subspace [math]\displaystyle{ S }[/math] such that the conditional probability satisfies
[math]\displaystyle{ p_{Y|X}(y|x) = p_{Y|\Pi_S X}(y|\Pi_S x). }[/math]
The subspace [math]\displaystyle{ S \subset \mathbb{R}^m }[/math] is called the effective subspace of regression.
Equivalently, given an i.i.d. sample [math]\displaystyle{ \{(X_1,Y_1), \cdots, (X_n, Y_n)\} }[/math] from [math]\displaystyle{ p_X }[/math] and [math]\displaystyle{ p_{Y|X} }[/math], find a matrix whose columns span the effective subspace [math]\displaystyle{ S }[/math], i.e. such that
[math]\displaystyle{ p_{Y|X}(y|x) = p_{Y|\Pi_SX}(y|\Pi_S x). }[/math]
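
To make the setting concrete, here is a minimal synthetic sketch (not from the paper) in which Y depends on X only through a k-dimensional projection; the dimensions, the Gaussian design, the matrix B_true, and the link function are all illustrative choices of my own.

```python
import numpy as np

rng = np.random.default_rng(0)

m, k, n = 5, 2, 200                                 # ambient dimension, effective dimension, sample size
B_true = np.linalg.qr(rng.normal(size=(m, k)))[0]   # orthonormal basis of a hypothetical effective subspace S

X = rng.normal(size=(n, m))                         # explanatory variables (rows are observations)
U = X @ B_true                                      # effective coordinates, U_i = B^T X_i
Y = np.sin(U[:, 0]) + U[:, 1] ** 2 + 0.1 * rng.normal(size=n)   # Y depends on X only through U
```

Here the conditional law of Y given X coincides with its conditional law given the projection onto the column space of B_true, which is exactly the effective-subspace condition above.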

Decomposition of [math]\displaystyle{ X }[/math]

Let [math]\displaystyle{ W=(B,C) }[/math] be an [math]\displaystyle{ m \times m }[/math] orthogonal matrix whose first block of column vectors [math]\displaystyle{ B }[/math] spans the subspace [math]\displaystyle{ S }[/math]. If [math]\displaystyle{ S }[/math] has dimension [math]\displaystyle{ k }[/math], then [math]\displaystyle{ B }[/math] is an [math]\displaystyle{ m \times k }[/math] matrix and [math]\displaystyle{ C }[/math] is an [math]\displaystyle{ m \times (m-k) }[/math] matrix.
Define [math]\displaystyle{ U=B^T X }[/math] and [math]\displaystyle{ V=C^T X }[/math], then we have
[math]\displaystyle{ p_X(x) = p_{U,V}(u,v) }[/math] and [math]\displaystyle{ p_{X,Y}(x,y)=p_{U,V,Y}(u,v,y) }[/math]
From the equations above, we can rewrite the conditional probability condition
[math]\displaystyle{ p_{Y|X}(y|x) = p_{Y|\Pi_S X}(y|\Pi_Sx) }[/math]
to
[math]\displaystyle{ p_{Y|U,V}(y|u,v) = p_{Y|U}(y|u), }[/math]
which means that [math]\displaystyle{ Y }[/math] and [math]\displaystyle{ V }[/math] are conditionally independent given [math]\displaystyle{ U }[/math], denoted [math]\displaystyle{ Y \perp V|U }[/math]. In other words, our goal is to find the decomposition [math]\displaystyle{ W = (B, C) }[/math] of [math]\displaystyle{ \mathbb{R}^m }[/math] such that [math]\displaystyle{ U = B^T X }[/math] retains all of the information that [math]\displaystyle{ X }[/math] carries about [math]\displaystyle{ Y \in \mathbb{R}^{\ell} }[/math].
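
A small sketch of this change of variables, assuming a randomly generated orthogonal W; the dimensions and the QR construction are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 5, 2

# Build an m x m orthogonal matrix W = (B, C); the QR factorization is just one convenient way.
W, _ = np.linalg.qr(rng.normal(size=(m, m)))
B, C = W[:, :k], W[:, k:]        # columns of B span S; columns of C span its orthogonal complement

x = rng.normal(size=m)           # a single observation of X
u, v = B.T @ x, C.T @ x          # U = B^T X and V = C^T X

# Since W is orthogonal, (u, v) is just a rotated coordinate system for x and loses no information:
assert np.allclose(W @ np.concatenate([u, v]), x)
```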

Reproducing Kernel Hilbert Space

A Hilbert space is an inner product space (possibly infinite-dimensional) that is complete with respect to the metric induced by its inner product. Elements of a Hilbert space may be functions. A reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions on some set [math]\displaystyle{ T }[/math] such that there exists a function [math]\displaystyle{ K }[/math] (known as the reproducing kernel) on [math]\displaystyle{ T \times T }[/math], where for any [math]\displaystyle{ t \in T }[/math], [math]\displaystyle{ K( \cdot , t ) }[/math] is in the RKHS and [math]\displaystyle{ \lt f, K(\cdot, t)\gt = f(t) }[/math] for every [math]\displaystyle{ f }[/math] in the RKHS (the reproducing property).
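
As a concrete illustration (a sketch of my own, not part of the original text), the Gaussian kernel is a standard reproducing kernel: its Gram matrix on a sample is symmetric positive semi-definite, and for a function [math]\displaystyle{ f = \sum_i a_i k(\cdot, x_i) }[/math] the reproducing property gives its values at the sample points as [math]\displaystyle{ K a }[/math]. The kernel width and sample here are arbitrary.

```python
import numpy as np

def gauss_gram(X, sigma=1.0):
    """Gram matrix K_ij = k(x_i, x_j) for the Gaussian kernel, a standard reproducing kernel."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))          # six points in R^3 (illustrative)
K = gauss_gram(X)

# A reproducing kernel is symmetric and positive semi-definite.
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() > -1e-10

# For f = sum_i a_i k(., x_i), the reproducing property f(x_j) = <f, k(., x_j)> gives f(x_j) = (K a)_j.
a = rng.normal(size=6)
f_at_sample = K @ a
```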

Cross-Covariance Operators

Let [math]\displaystyle{ ({ H}_1, k_1) }[/math] and [math]\displaystyle{ ({H}_2, k_2) }[/math] be RKHS over [math]\displaystyle{ (\Omega_1, { B}_1) }[/math] and [math]\displaystyle{ (\Omega_2, {B}_2) }[/math], respectively, with [math]\displaystyle{ k_1 }[/math] and [math]\displaystyle{ k_2 }[/math] measurable, and let [math]\displaystyle{ (X, Y) }[/math] be a random vector on [math]\displaystyle{ \Omega_1 \times \Omega_2 }[/math]. Then there exists a unique operator [math]\displaystyle{ \Sigma_{YX} }[/math] from [math]\displaystyle{ H_1 }[/math] to [math]\displaystyle{ H_2 }[/math] such that
[math]\displaystyle{ \lt g, \Sigma_{YX} f\gt _{H_2} = \mathbb{E}_{XY} [f(X)g(Y)] - \mathbb{E}[f(X)]\mathbb{E}[g(Y)] }[/math]
holds for all [math]\displaystyle{ f \in H_1 }[/math] and [math]\displaystyle{ g \in H_2 }[/math]. This operator [math]\displaystyle{ \Sigma_{YX} }[/math] is called the cross-covariance operator.
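
A quick empirical illustration of the defining identity (an illustrative sketch; the data, the kernel sections chosen as f and g, and the bandwidth are arbitrary): for fixed f and g, the quantity [math]\displaystyle{ \lt g, \Sigma_{YX} f\gt }[/math] is just the covariance of f(X) and g(Y), which can be estimated by sample averages.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
X = rng.normal(size=n)
Y = X ** 2 + 0.1 * rng.normal(size=n)          # Y depends on X, so some covariances are nonzero

sigma = 1.0
f = lambda x: np.exp(-(x - 0.5) ** 2 / (2 * sigma ** 2))   # f = k_1(., 0.5), a kernel section in H_1
g = lambda y: np.exp(-(y - 1.0) ** 2 / (2 * sigma ** 2))   # g = k_2(., 1.0), a kernel section in H_2

# Empirical version of <g, Sigma_YX f> = E[f(X) g(Y)] - E[f(X)] E[g(Y)]
cov_fg = np.mean(f(X) * g(Y)) - np.mean(f(X)) * np.mean(g(Y))
print(cov_fg)                                  # nonzero, reflecting the dependence of Y on X
```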

Conditional Covariance Operators

Let [math]\displaystyle{ (H_1, k_1) }[/math], [math]\displaystyle{ (H_2, k_2) }[/math] and [math]\displaystyle{ (H_3, k_3) }[/math] be RKHS on [math]\displaystyle{ \Omega_1 }[/math], [math]\displaystyle{ \Omega_2 }[/math] and [math]\displaystyle{ \Omega_3 }[/math], respectively, and let [math]\displaystyle{ (X,Y,Z) }[/math] be a random vector on the measurable space [math]\displaystyle{ \Omega_1 \times \Omega_2 \times \Omega_3 }[/math]. The conditional cross-covariance operator of [math]\displaystyle{ (X,Y) }[/math] given [math]\displaystyle{ Z }[/math] is defined by
[math]\displaystyle{ \Sigma_{YX|Z}: = \Sigma_{YX} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZX} }[/math].
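
For intuition (an illustrative scalar sketch with made-up data, not the operator itself), the same formula applied to ordinary covariances gives the partial covariance; when X and Y are related only through Z, the partial covariance is close to zero even though the raw covariance is not.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
Z = rng.normal(size=n)
X = Z + 0.3 * rng.normal(size=n)      # X and Y are related only through Z
Y = 2 * Z + 0.3 * rng.normal(size=n)

def cov(a, b):
    return np.mean(a * b) - np.mean(a) * np.mean(b)

# Scalar analogue of Sigma_{YX|Z} = Sigma_{YX} - Sigma_{YZ} Sigma_{ZZ}^{-1} Sigma_{ZX}
partial = cov(Y, X) - cov(Y, Z) * cov(X, Z) / cov(Z, Z)
print(cov(Y, X), partial)             # cov(Y, X) is about 2, the partial covariance is near 0
```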

Conditional Covariance Operators and Conditional Independence

Let [math]\displaystyle{ (H_{11}, k_{11}) }[/math], [math]\displaystyle{ (H_{12},k_{12}) }[/math] and [math]\displaystyle{ (H_2, k_2) }[/math] be RKHS on measurable spaces [math]\displaystyle{ \Omega_{11} }[/math], [math]\displaystyle{ \Omega_{12} }[/math] and [math]\displaystyle{ \Omega_2 }[/math], respectively, with continuous and bounded kernels. Let [math]\displaystyle{ (X,Y)=(U,V,Y) }[/math] be a random vector on [math]\displaystyle{ \Omega_{11}\times \Omega_{12} \times \Omega_{2} }[/math], where [math]\displaystyle{ X = (U,V) }[/math], and let [math]\displaystyle{ H_1 = H_{11} \otimes H_{12} }[/math] be the direct product. It is assumed that [math]\displaystyle{ \mathbb{E}_{Y|U} [g(Y)|U= \cdot] \in H_{11} }[/math] and [math]\displaystyle{ \mathbb{E}_{Y|X} [g(Y)|X= \cdot] \in H_{1} }[/math] for all [math]\displaystyle{ g \in H_2 }[/math]. Then we have
[math]\displaystyle{ \Sigma_{YY|U} \ge \Sigma_{YY|X} }[/math],
where the inequality refers to the order of self-adjoint operators. Furthermore, if [math]\displaystyle{ H_2 }[/math] is probability-determining,
[math]\displaystyle{ \Sigma_{YY|X} = \Sigma_{YY|U} \Leftrightarrow Y \perp X|U }[/math].

Conditional Covariance Operators Formulation

Given an i.i.d. sample [math]\displaystyle{ \{(X_1,Y_1), \cdots, (X_n,Y_n)\} }[/math] from [math]\displaystyle{ p_X }[/math] and [math]\displaystyle{ p_{Y|X} }[/math], find a matrix whose columns span the effective subspace [math]\displaystyle{ S }[/math] by solving
[math]\displaystyle{ \min_B\quad \Sigma_{YY|U} }[/math]
[math]\displaystyle{ s.t. \quad U = B^T X }[/math].

Estimation of Conditional Covariance Operator

Define [math]\displaystyle{ \tilde{k}_1^{(i)} \in H_1 }[/math] by
[math]\displaystyle{ \tilde{k}_1^{(i)} = k_1(\cdot, Y_i) -\frac{1}{n}\sum_{j=1}^n k_1(\cdot, Y_j) }[/math]
and [math]\displaystyle{ \tilde{k}_2^{(i)} \in H_2 }[/math] by
[math]\displaystyle{ \tilde{k}_2^{(i)} = k_2(\cdot, U_i) -\frac{1}{n}\sum_{j=1}^n k_2(\cdot, U_j) }[/math].
By replacing the expectation with the empirical average, the covariance is estimated by
[math]\displaystyle{ \lt f, \Sigma_{YU} g\gt _{H_1} \approx \frac{1}{n} \sum_{i=1}^n \lt \tilde{k}_1^{(i)}, f\gt \lt \tilde{k}_2^{(i)}, g\gt }[/math].
Let
[math]\displaystyle{ (\hat{K}_Y)_{ij} = \lt \tilde{k}_1^{(i)}, \tilde{k}_1^{(j)} \gt , \quad (\hat{K}_U)_{ij} = \lt \tilde{k}_2^{(i)}, \tilde{k}_2^{(j)} \gt }[/math].
We decompose [math]\displaystyle{ f }[/math] and [math]\displaystyle{ g }[/math] as
[math]\displaystyle{ f = \sum_{\ell =1}^n a_{\ell} \tilde{k}_1^{(\ell)} + f^{\perp}, }[/math]
[math]\displaystyle{ g = \sum_{m =1}^n b_{m} \tilde{k}_2^{(m)} + g^{\perp}. }[/math]
Therefore, we have
[math]\displaystyle{ \lt f, \Sigma_{YU}g\gt _{{H}_1} \approx \frac{1}{n} \sum_{\ell=1}^n \sum_{m=1}^n a_{\ell} b_m (\hat{K}_Y \hat{K}_U)_{\ell m} }[/math]. We can estimate the operator by
[math]\displaystyle{ \hat{\Sigma}_{YU} = \hat{K}_Y \hat{K}_U }[/math].
The conditional covariance matrix [math]\displaystyle{ \hat{\Sigma}_{YY|U} }[/math] is
[math]\displaystyle{ \hat{\Sigma}_{YY|U} = \hat{\Sigma}_{YY} - \hat{\Sigma}_{YU}\hat{\Sigma}_{UU}^{-1}\hat{\Sigma}_{UY} }[/math].
So, we have
[math]\displaystyle{ \hat{\Sigma}_{YY|U} = (\hat{K}_Y + \epsilon I_n)^2 - \hat{K}_Y\hat{K}_U(\hat{K}_U + \epsilon I_n)^{-2}\hat{K}_U \hat{K}_Y }[/math],
where [math]\displaystyle{ \epsilon \gt 0 }[/math] is a regularization constant.
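
The estimator above can be written directly in terms of centered Gram matrices. The following is a minimal sketch assuming Gaussian kernels for Y and U; the function names (gauss_gram, centered_gram, cond_cov), the values of ε and the kernel width, and the toy data are my own illustrative choices, not from the paper.

```python
import numpy as np

def gauss_gram(A, sigma=1.0):
    """Gram matrix of the Gaussian kernel (an illustrative kernel choice)."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * A @ A.T
    return np.exp(-d2 / (2 * sigma ** 2))

def centered_gram(A, sigma=1.0):
    """(K)_ij = <k~^(i), k~^(j)> for the centered features k~^(i) = k(., A_i) - (1/n) sum_j k(., A_j)."""
    n = A.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ gauss_gram(A, sigma) @ H

def cond_cov(Y, U, eps=0.1, sigma=1.0):
    """Empirical conditional covariance operator Sigma^_{YY|U} as an n x n matrix."""
    n = Y.shape[0]
    I = np.eye(n)
    KY, KU = centered_gram(Y, sigma), centered_gram(U, sigma)
    KU_reg_inv = np.linalg.inv(KU + eps * I)
    return (KY + eps * I) @ (KY + eps * I) - KY @ KU @ KU_reg_inv @ KU_reg_inv @ KU @ KY

# Toy usage: project onto the first coordinate and evaluate the operator (all values illustrative).
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
Y = (np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)).reshape(-1, 1)
B = np.eye(5)[:, :1]
Sigma_YY_given_U = cond_cov(Y, X @ B)
```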

Evaluation of Size of Self-adjoint Operators

We can measure the size of [math]\displaystyle{ \hat{\Sigma}_{YY|U} }[/math] in several ways:
the operator norm, i.e. the largest eigenvalue;
the trace norm, i.e. the sum of the eigenvalues;
or the determinant, i.e. the product of the eigenvalues.
Here we use the determinant to evaluate the size of [math]\displaystyle{ \hat{\Sigma}_{YY|U} }[/math].
Using the determinant identity for the Schur complement,
[math]\displaystyle{ \det(A - BC^{-1}B^T) = \frac{1}{\det C} \det \begin{bmatrix} A & B \\ B^T & C \end{bmatrix}, }[/math]
we have
[math]\displaystyle{ \det \hat{\Sigma}_{YY|U} = \frac{\det \hat{\Sigma}_{[YU][YU]}}{\det \hat{\Sigma}_{UU}} }[/math],
where
[math]\displaystyle{ \hat{\Sigma}_{[YU][YU]} = \begin{bmatrix}\hat{\Sigma}_{YY} & \hat{\Sigma}_{YU}\\ \hat{\Sigma}_{UY} & \hat{\Sigma}_{UU} \end{bmatrix}. }[/math]
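
A quick numerical check of the determinant identity (a sketch in which an arbitrary positive-definite matrix stands in for [math]\displaystyle{ \hat{\Sigma}_{[YU][YU]} }[/math]; the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.normal(size=(7, 7))
S = M @ M.T + np.eye(7)               # an arbitrary positive-definite stand-in for Sigma^_[YU][YU]
A, B, C = S[:3, :3], S[:3, 3:], S[3:, 3:]

lhs = np.linalg.det(A - B @ np.linalg.inv(C) @ B.T)    # determinant of the Schur complement
rhs = np.linalg.det(S) / np.linalg.det(C)              # det Sigma_[YU][YU] / det Sigma_UU
assert np.isclose(lhs, rhs)
```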

Determinant Formulation

Given an i.i.d. sample [math]\displaystyle{ \{(X_1, Y_1), \cdots, (X_n,Y_n)\} }[/math] from [math]\displaystyle{ p_X }[/math] and [math]\displaystyle{ p_{Y|X} }[/math], find a matrix whose columns span the effective subspace [math]\displaystyle{ S }[/math] by solving
[math]\displaystyle{ \min_B \quad \frac{\det \hat{\Sigma}_{[YU][YU]}}{\det \hat{\Sigma}_{UU} \det \hat{\Sigma}_{YY}} }[/math]
[math]\displaystyle{ s.t. \quad U=B^T X }[/math].
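
In terms of the regularized Gram-matrix estimates above, this objective can be evaluated as a function of B. The sketch below reuses gauss_gram and centered_gram from the earlier sketch, works with log-determinants for numerical stability (the log is monotone, so the minimizer is unchanged), and uses illustrative default values; the name kdr_objective is my own.

```python
import numpy as np

def kdr_objective(B, X, Y, eps=0.1, sigma=1.0):
    """log det Sigma^_[YU][YU] - log det Sigma^_UU - log det Sigma^_YY for U = X B
    (centered_gram as in the earlier sketch; log-determinants are used for stability)."""
    n = X.shape[0]
    I = np.eye(n)
    KY = centered_gram(Y, sigma)
    KU = centered_gram(X @ B, sigma)
    S_YY = (KY + eps * I) @ (KY + eps * I)      # regularized Sigma^_YY
    S_UU = (KU + eps * I) @ (KU + eps * I)      # regularized Sigma^_UU
    S_YU = KY @ KU                              # Sigma^_YU (and Sigma^_UY = S_YU.T)
    S_joint = np.block([[S_YY, S_YU], [S_YU.T, S_UU]])
    return (np.linalg.slogdet(S_joint)[1]
            - np.linalg.slogdet(S_UU)[1]
            - np.linalg.slogdet(S_YY)[1])
```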

Gradient Descent

The matrix [math]\displaystyle{ B }[/math] is updated iteratively according to
[math]\displaystyle{ B(t+1) = B(t) - \eta \frac{\partial \log \det \hat{\Sigma}_{YY|U}}{\partial B} }[/math],
or, equivalently,
[math]\displaystyle{ B(t+1) = B(t) - \eta\, Tr[\hat{\Sigma}_{YY|U}^{-1} \frac{\partial \hat{\Sigma}_{YY|U}}{\partial B}] }[/math],
where [math]\displaystyle{ \eta }[/math] is optimized through golden section search, and
[math]\displaystyle{ Tr[\hat{\Sigma}_{YY|U}^{-1}\frac{\partial \hat{\Sigma}_{YY|U}}{\partial B}] = 2 \epsilon Tr [\hat{\Sigma}_{YY|U}^{-1} \hat{K}_Y (\hat{K}_U + \epsilon I_n)^{-1} \frac{\partial \hat{K}_U}{\partial B}(\hat{K}_U + \epsilon I_n)^{-2}\hat{K}_U\hat{K}_Y] }[/math].
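
A toy version of the optimization loop, reusing kdr_objective from the previous sketch. As simplifications of my own it uses a numerical (finite-difference) gradient instead of the analytic expression above, a fixed step size instead of the golden section search, and re-orthonormalizes B after each step; it is a sketch, not the authors' implementation.

```python
import numpy as np

def kdr_descent(X, Y, k, steps=50, eta=0.1, eps=0.1, sigma=1.0, h=1e-4):
    """Toy KDR optimization of B by gradient descent on kdr_objective (see the sketch above)."""
    rng = np.random.default_rng(0)
    m = X.shape[1]
    B = np.linalg.qr(rng.normal(size=(m, k)))[0]       # random orthonormal starting point
    for _ in range(steps):
        f0 = kdr_objective(B, X, Y, eps, sigma)
        grad = np.zeros_like(B)
        for i in range(m):                             # finite-difference gradient (simplification)
            for j in range(k):
                Bp = B.copy()
                Bp[i, j] += h
                grad[i, j] = (kdr_objective(Bp, X, Y, eps, sigma) - f0) / h
        B = np.linalg.qr(B - eta * grad)[0]            # gradient step, then re-orthonormalize columns
    return B
```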

Summary

First, dimension reduction for regression is equivalent to a conditional independence statement.
Second, conditional covariance operators give the criterion for KDR.
Third, KDR is a general framework of dimension reduction for regression, since it requires no assumptions on [math]\displaystyle{ p(Y|X) }[/math], [math]\displaystyle{ p(X) }[/math], [math]\displaystyle{ p(Y) }[/math], or the dimensionality of [math]\displaystyle{ Y }[/math]. Furthermore, it is applicable to any type of data set.
Fourth, multiplication of [math]\displaystyle{ n\times n }[/math] matrices is computationally expensive.
Last, the gradient descent procedure may converge to a local minimum.

Further research

First, analyzing the statistical consistency of the KDR-based estimator, that is, whether the estimator converges to the true subspace when such a subspace really exists. Second, investigating methods for choosing the dimensionality of the effective subspace.