Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces


Kernel dimension reduction (KDR) is a technique for dimensionality reduction in the setting of supervised learning.

Original Problem

Consider a pair of random variables [math]\displaystyle{ (X,Y) }[/math], where [math]\displaystyle{ X \in \mathbb{R}^m }[/math] is an m-dimensional explanatory variable and [math]\displaystyle{ Y \in \mathbb{R}^{\ell} }[/math] is an [math]\displaystyle{ \ell }[/math]-dimensional response; [math]\displaystyle{ Y }[/math] may be continuous or discrete.
We want to find an orthogonal projection [math]\displaystyle{ \Pi_S }[/math] of [math]\displaystyle{ \mathbb{R}^m }[/math] onto a subspace [math]\displaystyle{ S }[/math] such that the conditional probability satisfies
[math]\displaystyle{ p_{Y|X}(y|x) = p_{Y|\Pi_S X}(y|\Pi_S x). }[/math]
The subspace [math]\displaystyle{ S \subseteq \mathbb{R}^m }[/math] is called the effective subspace of regression.
Equivalently, given an i.i.d. sample [math]\displaystyle{ \{(X_1,Y_1), \cdots, (X_n, Y_n)\} }[/math] from [math]\displaystyle{ p_X }[/math] and [math]\displaystyle{ p_{Y|X} }[/math], we want to estimate a matrix whose columns span the effective subspace [math]\displaystyle{ S }[/math], so that
[math]\displaystyle{ p_{Y|X}(y|x) = p_{Y|\Pi_SX}(y|\Pi_S x). }[/math]

Decomposition of [math]\displaystyle{ X }[/math]

Let [math]\displaystyle{ W=(B,C) }[/math] be an [math]\displaystyle{ m \times m }[/math] orthogonal matrix, partitioned so that the column vectors of [math]\displaystyle{ B \in \mathbb{R}^{m \times k} }[/math] span the effective subspace [math]\displaystyle{ S }[/math] (with [math]\displaystyle{ k = \dim S }[/math]) and [math]\displaystyle{ C \in \mathbb{R}^{m \times (m-k)} }[/math] spans its orthogonal complement.
Define [math]\displaystyle{ U=B^T X }[/math] and [math]\displaystyle{ V=C^T X }[/math]. Since [math]\displaystyle{ W }[/math] is orthogonal, we have
[math]\displaystyle{ p_X(x) = p_{U,V}(u,v) }[/math] and [math]\displaystyle{ p_{X,Y}(x,y)=p_{U,V,Y}(u,v,y) }[/math]
Using these identities, we can rewrite the conditional-probability condition
[math]\displaystyle{ p_{Y|X}(y|x) = p_{Y|\Pi_S X}(y|\Pi_Sx) }[/math]
to
[math]\displaystyle{ p_{Y|U,V}(y|u,v) = p_{Y|U}(y|u), }[/math]
which means that [math]\displaystyle{ Y }[/math] and [math]\displaystyle{ V }[/math] are conditionally independent given [math]\displaystyle{ U }[/math], denoted [math]\displaystyle{ Y \perp V|U }[/math]. In other words, the goal is to find an orthogonal decomposition [math]\displaystyle{ W = (B, C) }[/math] of [math]\displaystyle{ \mathbb{R}^m }[/math] such that [math]\displaystyle{ U = B^T X }[/math] carries all of the information that [math]\displaystyle{ X }[/math] contains about [math]\displaystyle{ Y }[/math].
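The following is a minimal NumPy sketch of this decomposition (the orthogonal matrix, the dimensions and the data are invented for illustration). It checks that, because [math]\displaystyle{ W }[/math] is orthogonal, [math]\displaystyle{ (U, V) }[/math] is just a rotated coordinate system for [math]\displaystyle{ X }[/math] and loses no information.
<syntaxhighlight lang="python">
# A minimal sketch of the decomposition X -> (U, V); the orthogonal matrix W,
# the dimensions and the data are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 5, 2, 100                       # ambient dim, effective dim, sample size

# Build a random m x m orthogonal matrix W = (B, C) via a QR decomposition.
W, _ = np.linalg.qr(rng.standard_normal((m, m)))
B, C = W[:, :k], W[:, k:]                 # columns of B span S, columns of C its complement

X = rng.standard_normal((n, m))           # rows are samples of X
U = X @ B                                 # U_i = B^T X_i
V = X @ C                                 # V_i = C^T X_i

# Since W is orthogonal, (U, V) is just a rotation of X and loses no information.
print(np.allclose(X, U @ B.T + V @ C.T))  # True
</syntaxhighlight>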

Reproducing Kernel Hilbert Space

A Hilbert space is an inner product space that is complete with respect to the metric induced by its inner product; its elements may be functions. A reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions on some set [math]\displaystyle{ T }[/math] for which there exists a function [math]\displaystyle{ K }[/math] (the reproducing kernel) on [math]\displaystyle{ T \times T }[/math] such that, for every [math]\displaystyle{ t \in T }[/math], [math]\displaystyle{ K( \cdot , t ) }[/math] belongs to the RKHS and the reproducing property [math]\displaystyle{ f(t) = \lt f, K(\cdot, t)\gt }[/math] holds for every [math]\displaystyle{ f }[/math] in the RKHS.
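As a concrete, purely illustrative example, the Gaussian kernel below is one common choice of reproducing kernel; the bandwidth and the data are arbitrary, and the helper gram_matrix is introduced here, not taken from the text. The snippet uses the reproducing property to evaluate a function of the form [math]\displaystyle{ f = \sum_i a_i K(\cdot, X_i) }[/math] at the sample points.
<syntaxhighlight lang="python">
# A small sketch of a reproducing kernel: the Gaussian (RBF) kernel
# K(x, y) = exp(-||x - y||^2 / (2 sigma^2)); the bandwidth and data are arbitrary,
# and gram_matrix is a helper introduced here for illustration.
import numpy as np

def gram_matrix(X, sigma=1.0):
    """Gram matrix K_ij = K(X_i, X_j) for the rows of X."""
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.clip(D2, 0.0, None) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
K = gram_matrix(X)

# For f = sum_i a_i K(., X_i) in the RKHS, the reproducing property gives
# f(X_j) = <f, K(., X_j)> = sum_i a_i K(X_i, X_j), i.e. the j-th entry of K @ a.
a = rng.standard_normal(4)
print(K @ a)
</syntaxhighlight>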

Cross-Covariance Operators

Let [math]\displaystyle{ ({ H}_1, k_1) }[/math] and [math]\displaystyle{ ({H}_2, k_2) }[/math] be RKHSs over measurable spaces [math]\displaystyle{ (\Omega_1, { B}_1) }[/math] and [math]\displaystyle{ (\Omega_2, {B}_2) }[/math], respectively, with [math]\displaystyle{ k_1 }[/math] and [math]\displaystyle{ k_2 }[/math] measurable, and let [math]\displaystyle{ (X, Y) }[/math] be a random vector on [math]\displaystyle{ \Omega_1 \times \Omega_2 }[/math]. Then there exists a unique operator [math]\displaystyle{ \Sigma_{YX} }[/math] from [math]\displaystyle{ H_1 }[/math] to [math]\displaystyle{ H_2 }[/math] such that
[math]\displaystyle{ \lt g, \Sigma_{YX} f\gt _{H_2} = \mathbb{E}_{XY} [f(X)g(Y)] - \mathbb{E}[f(X)]\mathbb{E}[g(Y)] }[/math]
holds for all [math]\displaystyle{ f \in H_1 }[/math] and [math]\displaystyle{ g \in H_2 }[/math]; this operator is called the cross-covariance operator.

Conditional Covariance Operators

Let [math]\displaystyle{ (H_1, k_1) }[/math], [math]\displaystyle{ (H_2, k_2) }[/math] and [math]\displaystyle{ (H_3, k_3) }[/math] be RKHSs on measurable spaces [math]\displaystyle{ \Omega_1 }[/math], [math]\displaystyle{ \Omega_2 }[/math] and [math]\displaystyle{ \Omega_3 }[/math], respectively, and let [math]\displaystyle{ (X,Y,Z) }[/math] be a random vector on the measurable space [math]\displaystyle{ \Omega_1 \times \Omega_2 \times \Omega_3 }[/math]. The conditional cross-covariance operator of [math]\displaystyle{ (X,Y) }[/math] given [math]\displaystyle{ Z }[/math] is defined by
[math]\displaystyle{ \Sigma_{YX|Z} := \Sigma_{YX} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZX} }[/math].
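As a rough finite-dimensional sketch (ordinary sample covariance matrices stand in for the RKHS operators, and the simulated linear-Gaussian data are invented), the same formula reduces to the usual partial covariance, which is close to zero when [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] are related only through [math]\displaystyle{ Z }[/math].
<syntaxhighlight lang="python">
# A finite-dimensional sketch: ordinary sample covariance matrices stand in for the
# RKHS operators, and the linear-Gaussian data below are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
Z = rng.standard_normal((n, 2))
X = Z @ rng.standard_normal((2, 3)) + 0.1 * rng.standard_normal((n, 3))
Y = Z @ rng.standard_normal((2, 1)) + 0.1 * rng.standard_normal((n, 1))

def cov(A, B):
    """Sample cross-covariance of the rows of A and B."""
    Ac, Bc = A - A.mean(axis=0), B - B.mean(axis=0)
    return Ac.T @ Bc / (len(A) - 1)

# Sigma_{YX|Z} = Sigma_YX - Sigma_YZ Sigma_ZZ^{-1} Sigma_ZX
S_YX_given_Z = cov(Y, X) - cov(Y, Z) @ np.linalg.solve(cov(Z, Z), cov(Z, X))
print(S_YX_given_Z)   # close to zero: X and Y are related only through Z
</syntaxhighlight>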

Conditional Covariance Operators and Conditional Independence

Let [math]\displaystyle{ (H_{11}, k_{11}) }[/math], [math]\displaystyle{ (H_{12},k_{12}) }[/math] and [math]\displaystyle{ (H_2, k_2) }[/math] be RKHSs on measurable spaces [math]\displaystyle{ \Omega_{11} }[/math], [math]\displaystyle{ \Omega_{12} }[/math] and [math]\displaystyle{ \Omega_2 }[/math], respectively, with continuous and bounded kernels. Let [math]\displaystyle{ (X,Y)=(U,V,Y) }[/math] be a random vector on [math]\displaystyle{ \Omega_{11}\times \Omega_{12} \times \Omega_{2} }[/math], where [math]\displaystyle{ X = (U,V) }[/math], and let [math]\displaystyle{ H_1 = H_{11} \otimes H_{12} }[/math] be the direct product. It is assumed that [math]\displaystyle{ \mathbb{E}_{Y|U} [g(Y)|U= \cdot] \in H_{11} }[/math] and [math]\displaystyle{ \mathbb{E}_{Y|X} [g(Y)|X= \cdot] \in H_{1} }[/math] for all [math]\displaystyle{ g \in H_2 }[/math]. Then we have
[math]\displaystyle{ \Sigma_{YY|U} \ge \Sigma_{YY|X} }[/math],
where the inequality refers to the partial order of self-adjoint operators. Furthermore, if [math]\displaystyle{ H_2 }[/math] is probability-determining,
[math]\displaystyle{ \Sigma_{YY|X} = \Sigma_{YY|U} \Leftrightarrow Y \perp X|U }[/math].

Conditional Covariance Operators Formulation

Given an i.i.d. sample [math]\displaystyle{ \{(X_1,Y_1), \cdots, (X_n,Y_n)\} }[/math] from [math]\displaystyle{ p_X }[/math] and [math]\displaystyle{ p_{Y|X} }[/math], the effective subspace [math]\displaystyle{ S }[/math] is estimated by finding a matrix [math]\displaystyle{ B }[/math] that solves
[math]\displaystyle{ \min_B\quad \Sigma_{YY|U} }[/math]
[math]\displaystyle{ s.t. \quad U = B^T X }[/math].

Estimation of Conditional Covariance Operator

Define [math]\displaystyle{ \tilde{k}_1^{(i)} \in H_1 }[/math] by
[math]\displaystyle{ \tilde{k}_1^{(i)} = k_1(\cdot, Y_i) -\frac{1}{n}\sum_{j=1}^n k_1(\cdot, Y_j) }[/math]
and [math]\displaystyle{ \tilde{k}_2^{(i)} \in H_2 }[/math] by
[math]\displaystyle{ \tilde{k}_2^{(i)} = k_2(\cdot, U_i) -\frac{1}{n}\sum_{j=1}^n k_2(\cdot, U_j) }[/math].
By replacing the expectation with the empirical average, the covariance is estimated by
[math]\displaystyle{ \lt f, \Sigma_{YU}g\gt _{H_1} \approx \frac{1}{n} \sum_{i=1}^n \lt \tilde{k}_1^{(i)}, f\gt \lt \tilde{k}_2^{(i)}, g\gt }[/math].
Let
[math]\displaystyle{ (\hat{K}_Y)_{ij} = \lt \tilde{k}_1^{(i)}, \tilde{k}_1^{(j)} \gt , \quad (\hat{K}_U)_{ij} = \lt \tilde{k}_2^{(i)}, \tilde{k}_2^{(j)} \gt }[/math].
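These matrices are the Gram matrices of the kernels, centered in feature space; a sketch of how they can be computed is given below (the Gaussian kernel, the bandwidth, and the helper names gram and centered_gram are illustrative assumptions, not prescribed by the text).
<syntaxhighlight lang="python">
# A sketch of computing the centered Gram matrices K_Y and K_U; the Gaussian kernel
# and bandwidth sigma are illustrative choices, not prescribed by the text.
import numpy as np

def gram(X, sigma=1.0):
    """Gaussian Gram matrix K_ij = exp(-||X_i - X_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.clip(D2, 0.0, None) / (2.0 * sigma**2))

def centered_gram(X, sigma=1.0):
    """(K_hat)_ij = <k~^(i), k~^(j)>, i.e. the Gram matrix centered in feature space: H K H."""
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ gram(X, sigma) @ H

rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 1))     # sample responses Y_i
U = rng.standard_normal((50, 2))     # projected explanatory variables U_i = B^T X_i
K_Y, K_U = centered_gram(Y), centered_gram(U)
</syntaxhighlight>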
We decompose [math]\displaystyle{ f }[/math] and [math]\displaystyle{ g }[/math] into the span of the centered kernel functions plus orthogonal remainders,
[math]\displaystyle{ f = \sum_{\ell =1}^n a_{\ell} \tilde{k}_1^{(\ell)} + f^{\perp}, }[/math]
[math]\displaystyle{ g = \sum_{m =1}^n b_{m} \tilde{k}_2^{(m)} + g^{\perp}. }[/math]
Therefore, we have
[math]\displaystyle{ \lt f, \Sigma_{YU}g\gt _{{H}_1} \approx \frac{1}{n} \sum_{\ell=1}^n \sum_{m=1}^n a_{\ell} b_m (\hat{K}_Y \hat{K}_U)_{\ell m} }[/math]. Dropping the constant factor [math]\displaystyle{ 1/n }[/math], the operator can therefore be represented by the matrix
[math]\displaystyle{ \hat{\Sigma}_{YU} = \hat{K}_Y \hat{K}_U }[/math].
The conditional covariance matrix [math]\displaystyle{ \hat{\Sigma}_{YY|U} }[/math] is
[math]\displaystyle{ \hat{\Sigma}_{YY|U} = \hat{\Sigma}_{YY} - \hat{\Sigma}_{YU}\hat{\Sigma}_{UU}^{-1}\hat{\Sigma}_{UY} }[/math].
So, we have
[math]\displaystyle{ \hat{\Sigma}_{YY|U} = (\hat{K}_Y + \epsilon I_n)^2 - \hat{K}_Y\hat{K}_U(\hat{K}_U + \epsilon I_n)^{-2}\hat{K}_U \hat{K}_Y }[/math],
where [math]\displaystyle{ \epsilon \gt 0 }[/math] is a regularization constant.
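The regularized estimate above translates directly into a few lines of code. The sketch below assumes centered Gram matrices computed as in the previous snippet; the helper name conditional_covariance and the default value of [math]\displaystyle{ \epsilon }[/math] are arbitrary choices for illustration.
<syntaxhighlight lang="python">
# A sketch of the regularized empirical conditional covariance
# Sigma_hat_{YY|U} = (K_Y + eps I)^2 - K_Y K_U (K_U + eps I)^{-2} K_U K_Y.
# K_Y and K_U are centered Gram matrices as in the previous snippet.
import numpy as np

def conditional_covariance(K_Y, K_U, eps=0.1):
    """Regularized empirical conditional covariance matrix of Y given U."""
    n = K_Y.shape[0]
    I = np.eye(n)
    R = np.linalg.inv(K_U + eps * I)            # (K_U + eps I)^{-1}
    return (K_Y + eps * I) @ (K_Y + eps * I) - K_Y @ K_U @ R @ R @ K_U @ K_Y
</syntaxhighlight>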

Evaluation of Size of Self-adjoint Operators

There are several ways to measure the size of [math]\displaystyle{ \hat{\Sigma}_{YY|U} }[/math].
First, the operator norm, i.e. the largest eigenvalue;
Second, the trace norm, i.e. the sum of the eigenvalues;
Third, the determinant, i.e. the product of the eigenvalues.
Here we use the determinant to evaluate the size of [math]\displaystyle{ \hat{\Sigma}_{YY|U} }[/math].
By the determinant formula for the Schur complement,
[math]\displaystyle{ \det(A - BC^{-1}B^T) = \frac{1}{\det C} \det \begin{bmatrix} A & B \\ B^T & C \end{bmatrix}, }[/math]
we have
[math]\displaystyle{ \det \hat{\Sigma}_{YY|U} = \frac{\det \hat{\Sigma}_{[YU][YU]}}{\det \hat{\Sigma}_{UU}} }[/math],
where
[math]\displaystyle{ \hat{\Sigma}_{[YU][YU]} = \begin{bmatrix}\hat{\Sigma}_{YY} & \hat{\Sigma}_{YU}\\ \hat{\Sigma}_{UY} & \hat{\Sigma}_{UU} \end{bmatrix}. }[/math]

Determinant Formulation

Given an i.i.d. sample [math]\displaystyle{ \{(X_1, Y_1), \cdots, (X_n,Y_n)\} }[/math] from [math]\displaystyle{ p_X }[/math] and [math]\displaystyle{ p_{Y|X} }[/math], estimate a matrix [math]\displaystyle{ B }[/math] whose columns span the effective subspace [math]\displaystyle{ S }[/math] by solving
[math]\displaystyle{ \min_B \quad \frac{\det \hat{\Sigma}_{[YU][YU]}}{\det \hat{\Sigma}_{UU} \det \hat{\Sigma}_{YY}} }[/math]
[math]\displaystyle{ s.t. \quad U=B^T X }[/math].
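A sketch of evaluating this criterion numerically is given below. By the Schur-complement identity above, the objective equals [math]\displaystyle{ \det \hat{\Sigma}_{YY|U} / \det \hat{\Sigma}_{YY} }[/math], computed here through log-determinants for numerical stability. The function relies on the centered_gram and conditional_covariance helpers from the earlier snippets, and the kernel choice, bandwidth and [math]\displaystyle{ \epsilon }[/math] remain illustrative assumptions.
<syntaxhighlight lang="python">
# A sketch of the determinant criterion as a function of B (smaller is better),
# using the identity det(Sigma_[YU][YU]) / (det(Sigma_UU) det(Sigma_YY))
#                  = det(Sigma_YY|U) / det(Sigma_YY).
# Relies on centered_gram and conditional_covariance from the snippets above.
import numpy as np

def kdr_objective(B, X, K_Y, eps=0.1, sigma=1.0):
    """log det Sigma_hat_{YY|U} - log det Sigma_hat_{YY}, where U_i = B^T X_i."""
    U = X @ B                                   # rows of X are the samples X_i
    K_U = centered_gram(U, sigma)
    n = K_Y.shape[0]
    I = np.eye(n)
    S_YY = (K_Y + eps * I) @ (K_Y + eps * I)    # regularized Sigma_hat_{YY}
    S_YYU = conditional_covariance(K_Y, K_U, eps)
    return np.linalg.slogdet(S_YYU)[1] - np.linalg.slogdet(S_YY)[1]
</syntaxhighlight>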

Gradient Descent

The matrix [math]\displaystyle{ B }[/math] is updated iteratively according to
[math]\displaystyle{ B(t+1) = B(t) - \eta \frac{\partial \log \det \hat{\Sigma}_{YY|U}}{\partial B} }[/math],
Since [math]\displaystyle{ \frac{\partial \log \det \hat{\Sigma}_{YY|U}}{\partial B} = Tr[\hat{\Sigma}_{YY|U}^{-1} \frac{\partial \hat{\Sigma}_{YY|U}}{\partial B}] }[/math] (taken elementwise with respect to the entries of [math]\displaystyle{ B }[/math]), the update can equivalently be written as
[math]\displaystyle{ B(t+1) = B(t) - \eta Tr[\hat{\Sigma}_{YY|U}^{-1} \frac{\partial \hat{\Sigma}_{YY|U}}{\partial B}] }[/math],
where [math]\displaystyle{ \eta }[/math] is optimized through golden section search, and
[math]\displaystyle{ Tr[\hat{\Sigma}_{YY|U}^{-1}\frac{\partial \hat{\Sigma}_{YY|U}}{\partial B}] = 2 \epsilon Tr [\hat{\Sigma}_{YY|U}^{-1} \hat{K}_Y (\hat{K}_U + \epsilon I_n)^{-1} \frac{\partial \hat{K}_U}{\partial B}(\hat{K}_U + \epsilon I_n)^{-2}\hat{K}_U\hat{K}_Y] }[/math].
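A schematic version of this optimization loop is sketched below. For brevity it replaces the analytic derivative and the golden-section line search with a numerical (finite-difference) gradient of the kdr_objective function from the previous snippet and a fixed step size, and it re-orthonormalizes the columns of [math]\displaystyle{ B }[/math] after each step; all tuning constants are illustrative.
<syntaxhighlight lang="python">
# A simplified sketch of the KDR update loop: finite-difference gradient and a fixed
# step size eta, instead of the analytic gradient and golden-section search above.
# Relies on kdr_objective (and hence centered_gram, conditional_covariance) from
# the earlier snippets.
import numpy as np

def kdr_gradient_descent(X, K_Y, k, n_iter=50, eta=0.1, eps=0.1, sigma=1.0, h=1e-4):
    """Return an m x k matrix B whose columns estimate a basis of the effective subspace."""
    m = X.shape[1]
    rng = np.random.default_rng(0)
    B, _ = np.linalg.qr(rng.standard_normal((m, k)))   # random orthonormal start
    for _ in range(n_iter):
        f0 = kdr_objective(B, X, K_Y, eps, sigma)
        G = np.zeros_like(B)
        for i in range(m):                             # finite-difference gradient
            for j in range(k):
                Bp = B.copy()
                Bp[i, j] += h
                G[i, j] = (kdr_objective(Bp, X, K_Y, eps, sigma) - f0) / h
        B = B - eta * G                                # gradient step on B
        B, _ = np.linalg.qr(B)                         # keep the columns orthonormal
    return B
</syntaxhighlight>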

Summary

First, dimension reduction for regression is equivalent to a conditional independence statement.
Second, conditional covariance operators give the criterion for KDR.
Third, KDR is a general framework for dimension reduction in regression, since it requires no assumptions on [math]\displaystyle{ p(Y|X) }[/math], [math]\displaystyle{ p(X) }[/math], [math]\displaystyle{ p(Y) }[/math], or the dimensionality of [math]\displaystyle{ Y }[/math]. Furthermore, it is applicable to any type of data set.
Fourth, the method is computationally expensive, since it requires multiplying and inverting [math]\displaystyle{ n\times n }[/math] Gram matrices.
Last, the gradient-descent optimization may converge to a local minimum, since the objective is non-convex.

Further Research

First, analyzing the statistical consistency of the KDR-based estimator, i.e. whether the estimator converges to the true subspace when such a subspace really exists. Second, developing methods for choosing the dimensionality of the effective subspace.