Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces

Kernel Dimension Reduction (KDR) is a technique for dimensionality reduction in the setting of supervised learning.

Original Problem

Given a set of data [math]\displaystyle{ (X,Y) }[/math], where [math]\displaystyle{ X \in \mathbb{R}^m }[/math] is an m-dimensional explanatory variable and [math]\displaystyle{ Y \in \mathbb{R}^{\ell} }[/math] is an [math]\displaystyle{ \ell }[/math]-dimensional response; [math]\displaystyle{ Y }[/math] may be continuous or discrete.
We want to find an orthogonal projection [math]\displaystyle{ \Pi_S }[/math] of [math]\displaystyle{ \mathbb{R}^m }[/math] onto a subspace [math]\displaystyle{ S }[/math] such that the conditional probability satisfies
[math]\displaystyle{ p_{Y|X}(y|x) = p_{Y|\Pi_S X}(y|\Pi_S x). }[/math]
The subspace [math]\displaystyle{ S \subset \mathbb{R}^m }[/math] is called the effective subspace of regression.
Equivalently, given an i.i.d. sample [math]\displaystyle{ \{(X_1,Y_1), \cdots, (X_n, Y_n)\} }[/math] from [math]\displaystyle{ p_X }[/math] and [math]\displaystyle{ p_{Y|X} }[/math], find a matrix whose columns span the effective subspace [math]\displaystyle{ S }[/math], i.e. such that
[math]\displaystyle{ p_{Y|X}(y|x) = p_{Y|\Pi_SX}(y|\Pi_S x). }[/math]
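
To make the setting concrete, here is a minimal synthetic sketch (not from the paper) in which Y depends on X only through a k-dimensional projection; the dimensions, the Gaussian design, the matrix B_true, and the link function are all illustrative choices of my own.

```python
import numpy as np

rng = np.random.default_rng(0)

m, k, n = 5, 2, 200                                 # ambient dimension, effective dimension, sample size
B_true = np.linalg.qr(rng.normal(size=(m, k)))[0]   # orthonormal basis of a hypothetical effective subspace S

X = rng.normal(size=(n, m))                         # explanatory variables (rows are observations)
U = X @ B_true                                      # effective coordinates, U_i = B^T X_i
Y = np.sin(U[:, 0]) + U[:, 1] ** 2 + 0.1 * rng.normal(size=n)   # Y depends on X only through U
```

Here the conditional law of Y given X coincides with its conditional law given the projection onto the column space of B_true, which is exactly the effective-subspace condition above.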

Decomposition of [math]\displaystyle{ X }[/math]

Let [math]\displaystyle{ W=(B,C) }[/math] be an [math]\displaystyle{ m \times m }[/math] orthogonal matrix whose first block of column vectors [math]\displaystyle{ B }[/math] spans the subspace [math]\displaystyle{ S }[/math]. If [math]\displaystyle{ S }[/math] has dimension [math]\displaystyle{ k }[/math], then [math]\displaystyle{ B }[/math] is an [math]\displaystyle{ m \times k }[/math] matrix and [math]\displaystyle{ C }[/math] is an [math]\displaystyle{ m \times (m-k) }[/math] matrix.
Define [math]\displaystyle{ U=B^T X }[/math] and [math]\displaystyle{ V=C^T X }[/math], then we have
[math]\displaystyle{ p_X(x) = p_{U,V}(u,v) }[/math] and [math]\displaystyle{ p_{X,Y}(x,y)=p_{U,V,Y}(u,v,y) }[/math]
From the equations above, we can rewrite the conditional probability condition
[math]\displaystyle{ p_{Y|X}(y|x) = p_{Y|\Pi_S X}(y|\Pi_Sx) }[/math]
to
[math]\displaystyle{ p_{Y|U,V}(y|u,v) = p_{Y|U}(y|u), }[/math]
which means that [math]\displaystyle{ Y }[/math] and [math]\displaystyle{ V }[/math] are conditionally independent given [math]\displaystyle{ U }[/math], denoted [math]\displaystyle{ Y \perp V|U }[/math]. In other words, our goal is to find the decomposition [math]\displaystyle{ W = (B, C) }[/math] of [math]\displaystyle{ \mathbb{R}^m }[/math] such that [math]\displaystyle{ U = B^T X }[/math] retains all of the information that [math]\displaystyle{ X }[/math] carries about [math]\displaystyle{ Y \in \mathbb{R}^{\ell} }[/math].
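
A small sketch of this change of variables, assuming a randomly generated orthogonal W; the dimensions and the QR construction are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 5, 2

# Build an m x m orthogonal matrix W = (B, C); the QR factorization is just one convenient way.
W, _ = np.linalg.qr(rng.normal(size=(m, m)))
B, C = W[:, :k], W[:, k:]        # columns of B span S; columns of C span its orthogonal complement

x = rng.normal(size=m)           # a single observation of X
u, v = B.T @ x, C.T @ x          # U = B^T X and V = C^T X

# Since W is orthogonal, (u, v) is just a rotated coordinate system for x and loses no information:
assert np.allclose(W @ np.concatenate([u, v]), x)
```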

Reproducing Kernel Hilbert Space

A Hilbert space is an inner product space (possibly infinite-dimensional) that is complete with respect to the metric induced by its inner product. Elements of a Hilbert space may be functions. A reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions on some set [math]\displaystyle{ T }[/math] such that there exists a function [math]\displaystyle{ K }[/math] (known as the reproducing kernel) on [math]\displaystyle{ T \times T }[/math], where for any [math]\displaystyle{ t \in T }[/math], [math]\displaystyle{ K( \cdot , t ) }[/math] is in the RKHS and [math]\displaystyle{ \lt f, K(\cdot, t)\gt = f(t) }[/math] for every [math]\displaystyle{ f }[/math] in the RKHS (the reproducing property).
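
As a concrete illustration (a sketch of my own, not part of the original text), the Gaussian kernel is a standard reproducing kernel: its Gram matrix on a sample is symmetric positive semi-definite, and for a function [math]\displaystyle{ f = \sum_i a_i k(\cdot, x_i) }[/math] the reproducing property gives its values at the sample points as [math]\displaystyle{ K a }[/math]. The kernel width and sample here are arbitrary.

```python
import numpy as np

def gauss_gram(X, sigma=1.0):
    """Gram matrix K_ij = k(x_i, x_j) for the Gaussian kernel, a standard reproducing kernel."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))          # six points in R^3 (illustrative)
K = gauss_gram(X)

# A reproducing kernel is symmetric and positive semi-definite.
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() > -1e-10

# For f = sum_i a_i k(., x_i), the reproducing property f(x_j) = <f, k(., x_j)> gives f(x_j) = (K a)_j.
a = rng.normal(size=6)
f_at_sample = K @ a
```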

Cross-Covariance Operators

Let [math]\displaystyle{ ({ H}_1, k_1) }[/math] and [math]\displaystyle{ ({H}_2, k_2) }[/math] be RKHS over [math]\displaystyle{ (\Omega_1, { B}_1) }[/math] and [math]\displaystyle{ (\Omega_2, {B}_2) }[/math], respectively, with [math]\displaystyle{ k_1 }[/math] and [math]\displaystyle{ k_2 }[/math] measurable, and let [math]\displaystyle{ (X, Y) }[/math] be a random vector on [math]\displaystyle{ \Omega_1 \times \Omega_2 }[/math]. Then there exists a unique operator [math]\displaystyle{ \Sigma_{YX} }[/math] from [math]\displaystyle{ H_1 }[/math] to [math]\displaystyle{ H_2 }[/math] such that
[math]\displaystyle{ \lt g, \Sigma_{YX} f\gt _{H_2} = \mathbb{E}_{XY} [f(X)g(Y)] - \mathbb{E}[f(X)]\mathbb{E}[g(Y)] }[/math]
holds for all [math]\displaystyle{ f \in H_1 }[/math] and [math]\displaystyle{ g \in H_2 }[/math]. This operator [math]\displaystyle{ \Sigma_{YX} }[/math] is called the cross-covariance operator.
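
A quick empirical illustration of the defining identity (an illustrative sketch; the data, the kernel sections chosen as f and g, and the bandwidth are arbitrary): for fixed f and g, the quantity [math]\displaystyle{ \lt g, \Sigma_{YX} f\gt }[/math] is just the covariance of f(X) and g(Y), which can be estimated by sample averages.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
X = rng.normal(size=n)
Y = X ** 2 + 0.1 * rng.normal(size=n)          # Y depends on X, so some covariances are nonzero

sigma = 1.0
f = lambda x: np.exp(-(x - 0.5) ** 2 / (2 * sigma ** 2))   # f = k_1(., 0.5), a kernel section in H_1
g = lambda y: np.exp(-(y - 1.0) ** 2 / (2 * sigma ** 2))   # g = k_2(., 1.0), a kernel section in H_2

# Empirical version of <g, Sigma_YX f> = E[f(X) g(Y)] - E[f(X)] E[g(Y)]
cov_fg = np.mean(f(X) * g(Y)) - np.mean(f(X)) * np.mean(g(Y))
print(cov_fg)                                  # nonzero, reflecting the dependence of Y on X
```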

Conditional Covariance Operators

Let [math]\displaystyle{ (H_1, k_1) }[/math], [math]\displaystyle{ (H_2, k_2) }[/math] and [math]\displaystyle{ (H_3, k_3) }[/math] be RKHS on [math]\displaystyle{ \Omega_1 }[/math], [math]\displaystyle{ \Omega_2 }[/math] and [math]\displaystyle{ \Omega_3 }[/math], respectively, and let [math]\displaystyle{ (X,Y,Z) }[/math] be a random vector on the measurable space [math]\displaystyle{ \Omega_1 \times \Omega_2 \times \Omega_3 }[/math]. The conditional cross-covariance operator of [math]\displaystyle{ (X,Y) }[/math] given [math]\displaystyle{ Z }[/math] is defined by
[math]\displaystyle{ \Sigma_{YX|Z}: = \Sigma_{YX} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZX} }[/math].
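
For intuition (an illustrative scalar sketch with made-up data, not the operator itself), the same formula applied to ordinary covariances gives the partial covariance; when X and Y are related only through Z, the partial covariance is close to zero even though the raw covariance is not.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
Z = rng.normal(size=n)
X = Z + 0.3 * rng.normal(size=n)      # X and Y are related only through Z
Y = 2 * Z + 0.3 * rng.normal(size=n)

def cov(a, b):
    return np.mean(a * b) - np.mean(a) * np.mean(b)

# Scalar analogue of Sigma_{YX|Z} = Sigma_{YX} - Sigma_{YZ} Sigma_{ZZ}^{-1} Sigma_{ZX}
partial = cov(Y, X) - cov(Y, Z) * cov(X, Z) / cov(Z, Z)
print(cov(Y, X), partial)             # cov(Y, X) is about 2, the partial covariance is near 0
```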

Conditional Covariance Operators and Conditional Independence

Let [math]\displaystyle{ (H_{11}, k_{11}) }[/math], [math]\displaystyle{ (H_{12},k_{12}) }[/math] and [math]\displaystyle{ (H_2, k_2) }[/math] be RKHS on measurable spaces [math]\displaystyle{ \Omega_{11} }[/math], [math]\displaystyle{ \Omega_{12} }[/math] and [math]\displaystyle{ \Omega_2 }[/math], respectively, with continuous and bounded kernels. Let [math]\displaystyle{ (X,Y)=(U,V,Y) }[/math] be a random vector on [math]\displaystyle{ \Omega_{11}\times \Omega_{12} \times \Omega_{2} }[/math], where [math]\displaystyle{ X = (U,V) }[/math], and let [math]\displaystyle{ H_1 = H_{11} \otimes H_{12} }[/math] be the direct product. It is assumed that [math]\displaystyle{ \mathbb{E}_{Y|U} [g(Y)|U= \cdot] \in H_{11} }[/math] and [math]\displaystyle{ \mathbb{E}_{Y|X} [g(Y)|X= \cdot] \in H_{1} }[/math] for all [math]\displaystyle{ g \in H_2 }[/math]. Then we have
[math]\displaystyle{ \Sigma_{YY|U} \ge \Sigma_{YY|X} }[/math],
where the inequality refers to the order of self-adjoint operators. Furthermore, if [math]\displaystyle{ H_2 }[/math] is probability-determining,
[math]\displaystyle{ \Sigma_{YY|X} = \Sigma_{YY|U} \Leftrightarrow Y \perp X|U }[/math].

Conditional Covariance Operators Formulation

Given an i.i.d. sample [math]\displaystyle{ \{(X_1,Y_1), \cdots, (X_n,Y_n)\} }[/math] from [math]\displaystyle{ p_X }[/math] and [math]\displaystyle{ p_{Y|X} }[/math], find a matrix whose columns span the effective subspace [math]\displaystyle{ S }[/math] by solving
[math]\displaystyle{ \min_B\quad \Sigma_{YY|U} }[/math]
[math]\displaystyle{ s.t. \quad U = B^T X }[/math].

Estimation of Conditional Covariance Operator

Define [math]\displaystyle{ \tilde{k}_1^{(i)} \in H_1 }[/math] by
[math]\displaystyle{ \tilde{k}_1^{(i)} = k_1(\cdot, Y_i) -\frac{1}{n}\sum_{j=1}^n k_1(\cdot, Y_j) }[/math]
and [math]\displaystyle{ \tilde{k}_2^{(i)} \in H_2 }[/math] by
[math]\displaystyle{ \tilde{k}_2^{(i)} = k_2(\cdot, U_i) -\frac{1}{n}\sum_{j=1}^n k_2(\cdot, U_j) }[/math].
By replacing the expectation with the empirical average, the covariance is estimated by
[math]\displaystyle{ \lt f, \Sigma_{YU} g\gt _{H_1} \approx \frac{1}{n} \sum_{i=1}^n \lt \tilde{k}_1^{(i)}, f\gt \lt \tilde{k}_2^{(i)}, g\gt }[/math].
Let
[math]\displaystyle{ (\hat{K}_Y)_{ij} = \lt \tilde{k}_1^{(i)}, \tilde{k}_1^{(j)} \gt , \quad (\hat{K}_U)_{ij} = \lt \tilde{k}_2^{(i)}, \tilde{k}_2^{(j)} \gt }[/math].
We decompose [math]\displaystyle{ f }[/math] and [math]\displaystyle{ g }[/math] as
[math]\displaystyle{ f = \sum_{\ell =1}^n a_{\ell} \tilde{k}_1^{(\ell)} + f^{\perp}, }[/math]
[math]\displaystyle{ g = \sum_{m =1}^n b_{m} \tilde{k}_2^{(m)} + g^{\perp}. }[/math]
Therefore, we have
[math]\displaystyle{ \lt f, \Sigma_{YU}g\gt _{{H}_1} \approx \frac{1}{n} \sum_{\ell=1}^n \sum_{m=1}^n a_{\ell} b_m (\hat{K}_Y \hat{K}_U)_{\ell m} }[/math]. We can estimate the operator by
[math]\displaystyle{ \hat{\Sigma}_{YU} = \hat{K}_Y \hat{K}_U }[/math].
The conditional covariance matrix [math]\displaystyle{ \hat{\Sigma}_{YY|U} }[/math] is
[math]\displaystyle{ \hat{\Sigma}_{YY|U} = \hat{\Sigma}_{YY} - \hat{\Sigma}_{YU}\hat{\Sigma}_{UU}^{-1}\hat{\Sigma}_{UY} }[/math].
So, we have
[math]\displaystyle{ \hat{\Sigma}_{YY|U} = (\hat{K}_Y + \epsilon I_n)^2 - \hat{K}_Y\hat{K}_U(\hat{K}_U + \epsilon I_n)^{-2}\hat{K}_U \hat{K}_Y }[/math],
where [math]\displaystyle{ \epsilon \gt 0 }[/math] is a regularization constant.
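
The estimator above can be written directly in terms of centered Gram matrices. The following is a minimal sketch assuming Gaussian kernels for Y and U; the function names (gauss_gram, centered_gram, cond_cov), the values of ε and the kernel width, and the toy data are my own illustrative choices, not from the paper.

```python
import numpy as np

def gauss_gram(A, sigma=1.0):
    """Gram matrix of the Gaussian kernel (an illustrative kernel choice)."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * A @ A.T
    return np.exp(-d2 / (2 * sigma ** 2))

def centered_gram(A, sigma=1.0):
    """(K)_ij = <k~^(i), k~^(j)> for the centered features k~^(i) = k(., A_i) - (1/n) sum_j k(., A_j)."""
    n = A.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ gauss_gram(A, sigma) @ H

def cond_cov(Y, U, eps=0.1, sigma=1.0):
    """Empirical conditional covariance operator Sigma^_{YY|U} as an n x n matrix."""
    n = Y.shape[0]
    I = np.eye(n)
    KY, KU = centered_gram(Y, sigma), centered_gram(U, sigma)
    KU_reg_inv = np.linalg.inv(KU + eps * I)
    return (KY + eps * I) @ (KY + eps * I) - KY @ KU @ KU_reg_inv @ KU_reg_inv @ KU @ KY

# Toy usage: project onto the first coordinate and evaluate the operator (all values illustrative).
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
Y = (np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)).reshape(-1, 1)
B = np.eye(5)[:, :1]
Sigma_YY_given_U = cond_cov(Y, X @ B)
```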

Evaluation of Size of Self-adjoint Operators

We can measure the size of [math]\displaystyle{ \hat{\Sigma}_{YY|U} }[/math] in several ways:
the operator norm, i.e. the largest eigenvalue;
the trace norm, i.e. the sum of the eigenvalues;
or the determinant, i.e. the product of the eigenvalues.
Here we use the determinant to evaluate the size of [math]\displaystyle{ \hat{\Sigma}_{YY|U} }[/math].
Using the determinant identity for the Schur complement,
[math]\displaystyle{ \det(A - BC^{-1}B^T) = \frac{1}{\det C} \det \begin{bmatrix} A & B \\ B^T & C \end{bmatrix}, }[/math]
we have
[math]\displaystyle{ \det \hat{\Sigma}_{YY|U} = \frac{\det \hat{\Sigma}_{[YU][YU]}}{\det \hat{\Sigma}_{UU}} }[/math],
where
[math]\displaystyle{ \hat{\Sigma}_{[YU][YU]} = \begin{bmatrix}\hat{\Sigma}_{YY} & \hat{\Sigma}_{YU}\\ \hat{\Sigma}_{UY} & \hat{\Sigma}_{UU} \end{bmatrix}. }[/math]
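
A quick numerical check of the determinant identity (a sketch in which an arbitrary positive-definite matrix stands in for [math]\displaystyle{ \hat{\Sigma}_{[YU][YU]} }[/math]; the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.normal(size=(7, 7))
S = M @ M.T + np.eye(7)               # an arbitrary positive-definite stand-in for Sigma^_[YU][YU]
A, B, C = S[:3, :3], S[:3, 3:], S[3:, 3:]

lhs = np.linalg.det(A - B @ np.linalg.inv(C) @ B.T)    # determinant of the Schur complement
rhs = np.linalg.det(S) / np.linalg.det(C)              # det Sigma_[YU][YU] / det Sigma_UU
assert np.isclose(lhs, rhs)
```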

Determinant Formulation

Given an i.i.d. sample [math]\displaystyle{ \{(X_1, Y_1), \cdots, (X_n,Y_n)\} }[/math] from [math]\displaystyle{ p_X }[/math] and [math]\displaystyle{ p_{Y|X} }[/math], find a matrix whose columns span the effective subspace [math]\displaystyle{ S }[/math] by solving
[math]\displaystyle{ \min_B \quad \frac{\det \hat{\Sigma}_{[YU][YU]}}{\det \hat{\Sigma}_{UU} \det \hat{\Sigma}_{YY}} }[/math]
[math]\displaystyle{ s.t. \quad U=B^T X }[/math].
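
In terms of the regularized Gram-matrix estimates above, this objective can be evaluated as a function of B. The sketch below reuses gauss_gram and centered_gram from the earlier sketch, works with log-determinants for numerical stability (the log is monotone, so the minimizer is unchanged), and uses illustrative default values; the name kdr_objective is my own.

```python
import numpy as np

def kdr_objective(B, X, Y, eps=0.1, sigma=1.0):
    """log det Sigma^_[YU][YU] - log det Sigma^_UU - log det Sigma^_YY for U = X B
    (centered_gram as in the earlier sketch; log-determinants are used for stability)."""
    n = X.shape[0]
    I = np.eye(n)
    KY = centered_gram(Y, sigma)
    KU = centered_gram(X @ B, sigma)
    S_YY = (KY + eps * I) @ (KY + eps * I)      # regularized Sigma^_YY
    S_UU = (KU + eps * I) @ (KU + eps * I)      # regularized Sigma^_UU
    S_YU = KY @ KU                              # Sigma^_YU (and Sigma^_UY = S_YU.T)
    S_joint = np.block([[S_YY, S_YU], [S_YU.T, S_UU]])
    return (np.linalg.slogdet(S_joint)[1]
            - np.linalg.slogdet(S_UU)[1]
            - np.linalg.slogdet(S_YY)[1])
```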

Gradient Descent

The matrix [math]\displaystyle{ B }[/math] is updated iteratively according to
[math]\displaystyle{ B(t+1) = B(t) - \eta \frac{\partial \log \det \hat{\Sigma}_{YY|U}}{\partial B} }[/math],
or, equivalently,
[math]\displaystyle{ B(t+1) = B(t) - \eta\, Tr[\hat{\Sigma}_{YY|U}^{-1} \frac{\partial \hat{\Sigma}_{YY|U}}{\partial B}] }[/math],
where [math]\displaystyle{ \eta }[/math] is optimized through golden section search, and
[math]\displaystyle{ Tr[\hat{\Sigma}_{YY|U}^{-1}\frac{\partial \hat{\Sigma}_{YY|U}}{\partial B}] = 2 \epsilon Tr [\hat{\Sigma}_{YY|U}^{-1} \hat{K}_Y (\hat{K}_U + \epsilon I_n)^{-1} \frac{\partial \hat{K}_U}{\partial B}(\hat{K}_U + \epsilon I_n)^{-2}\hat{K}_U\hat{K}_Y] }[/math].
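
A toy version of the optimization loop, reusing kdr_objective from the previous sketch. As simplifications of my own it uses a numerical (finite-difference) gradient instead of the analytic expression above, a fixed step size instead of the golden section search, and re-orthonormalizes B after each step; it is a sketch, not the authors' implementation.

```python
import numpy as np

def kdr_descent(X, Y, k, steps=50, eta=0.1, eps=0.1, sigma=1.0, h=1e-4):
    """Toy KDR optimization of B by gradient descent on kdr_objective (see the sketch above)."""
    rng = np.random.default_rng(0)
    m = X.shape[1]
    B = np.linalg.qr(rng.normal(size=(m, k)))[0]       # random orthonormal starting point
    for _ in range(steps):
        f0 = kdr_objective(B, X, Y, eps, sigma)
        grad = np.zeros_like(B)
        for i in range(m):                             # finite-difference gradient (simplification)
            for j in range(k):
                Bp = B.copy()
                Bp[i, j] += h
                grad[i, j] = (kdr_objective(Bp, X, Y, eps, sigma) - f0) / h
        B = np.linalg.qr(B - eta * grad)[0]            # gradient step, then re-orthonormalize columns
    return B
```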

Summary

First, dimension reduction for regression is equivalent to a conditional independence statement.
Second, conditional covariance operators give the criterion for KDR.
Third, KDR is a general framework of dimension reduction for regression, since it requires no assumptions on [math]\displaystyle{ p(Y|X) }[/math], [math]\displaystyle{ p(X) }[/math], [math]\displaystyle{ p(Y) }[/math], or the dimensionality of [math]\displaystyle{ Y }[/math]. Furthermore, it is applicable to any type of data set.
Fourth, multiplication of [math]\displaystyle{ n\times n }[/math] matrices is computationally expensive.
Last, the gradient descent procedure may converge to a local minimum.

Further research

First, analyzing the statistical consistency of the KDR-based estimator, that is, whether the estimator converges to the true subspace when such a subspace really exists. Second, investigating methods for choosing the dimensionality of the effective subspace.