Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces

Kernel Dimension Reduction (KDR) is a technique for dimensionality reduction in the setting of supervised learning.

Original Problem

Given a set of data [math]\displaystyle{ (X,Y) }[/math], where [math]\displaystyle{ X \in \mathbb{R}^m }[/math] is an m-dimensional explanatory variable and [math]\displaystyle{ Y \in \mathbb{R}^{\ell} }[/math] is an [math]\displaystyle{ \ell }[/math]-dimensional response; [math]\displaystyle{ Y }[/math] may be continuous or discrete.
We want to find an orthogonal projection [math]\displaystyle{ \Pi_S }[/math] of [math]\displaystyle{ \mathbb{R}^m }[/math] onto a subspace [math]\displaystyle{ S }[/math] such that the conditional probability satisfies
[math]\displaystyle{ p_{Y|X}(y|x) = p_{Y|\Pi_S X}(y|\Pi_S x). }[/math]
The subspace [math]\displaystyle{ S \subseteq \mathbb{R}^m }[/math] is called the effective subspace for regression.
Equivalently, given an i.i.d. sample [math]\displaystyle{ \{(X_1,Y_1), \cdots, (X_n, Y_n)\} }[/math] from [math]\displaystyle{ p_X }[/math] and [math]\displaystyle{ p_{Y|X} }[/math], find a matrix whose columns span the effective subspace [math]\displaystyle{ S }[/math] such that
[math]\displaystyle{ p_{Y|X}(y|x) = p_{Y|\Pi_SX}(y|\Pi_S x). }[/math]

Decomposition of [math]\displaystyle{ X }[/math]

Let [math]\displaystyle{ W=(B,C) }[/math] be an [math]\displaystyle{ m \times m }[/math] orthogonal matrix, where the column vectors of [math]\displaystyle{ B }[/math] span the subspace [math]\displaystyle{ S }[/math].
Define [math]\displaystyle{ U=B^T X }[/math] and [math]\displaystyle{ V=C^T X }[/math], then we have
[math]\displaystyle{ p_X(x) = p_{U,V}(u,v) }[/math] and [math]\displaystyle{ p_{X,Y}(x,y)=p_{U,V,Y}(u,v,y) }[/math]
From the equations above, the conditional-probability condition changes from
[math]\displaystyle{ p_{Y|X}(y|x) = p_{Y|\Pi_S X}(y|\Pi_Sx) }[/math]
to
[math]\displaystyle{ p_{Y|U,V}(y|u,v) = p_{Y|U}(y|u), }[/math]
which means that [math]\displaystyle{ Y }[/math] and [math]\displaystyle{ V }[/math] are conditionally independent given [math]\displaystyle{ U }[/math], denoted as [math]\displaystyle{ Y \perp V|U }[/math].
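
To make the decomposition concrete, here is a minimal NumPy (Python) sketch; the dimensions, the random orthogonal matrix [math]\displaystyle{ W }[/math], and the simulated sample are illustrative assumptions, not part of the original construction.
<pre>
import numpy as np

rng = np.random.default_rng(0)
m, d, n = 5, 2, 100                         # ambient dimension, dim(S), sample size

# Random m x m orthogonal matrix W = (B, C); its first d columns play the role of B.
W, _ = np.linalg.qr(rng.standard_normal((m, m)))
B, C = W[:, :d], W[:, d:]

X = rng.standard_normal((n, m))             # rows are samples of X
U = X @ B                                   # U = B^T X, coordinates inside S
V = X @ C                                   # V = C^T X, coordinates orthogonal to S
</pre>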

Reproducing Kernel Hilbert Space

Cross-Covariance Operators

Let [math]\displaystyle{ ({ H}_1, k_1) }[/math] and [math]\displaystyle{ ({H}_2, k_2) }[/math] be RKHSs over the measurable spaces [math]\displaystyle{ (\Omega_1, { B}_1) }[/math] and [math]\displaystyle{ (\Omega_2, {B}_2) }[/math], respectively, with [math]\displaystyle{ k_1 }[/math] and [math]\displaystyle{ k_2 }[/math] measurable. For a random vector [math]\displaystyle{ (X, Y) }[/math] on [math]\displaystyle{ \Omega_1 \times \Omega_2 }[/math], there exists a unique operator [math]\displaystyle{ \Sigma_{YX} }[/math] from [math]\displaystyle{ H_1 }[/math] to [math]\displaystyle{ H_2 }[/math] such that
[math]\displaystyle{ \langle g, \Sigma_{YX} f\rangle_{H_2} = \mathbb{E}_{XY} [f(X)g(Y)] - \mathbb{E}[f(X)]\mathbb{E}[g(Y)] }[/math]
holds for all [math]\displaystyle{ f \in H_1 }[/math] and [math]\displaystyle{ g \in H_2 }[/math]; this operator is called the cross-covariance operator.
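
For a fixed pair [math]\displaystyle{ f \in H_1 }[/math] and [math]\displaystyle{ g \in H_2 }[/math], the right-hand side is simply the covariance of [math]\displaystyle{ f(X) }[/math] and [math]\displaystyle{ g(Y) }[/math], which can be estimated from a sample. The Python sketch below does exactly that; the Gaussian-kernel sections chosen as [math]\displaystyle{ f }[/math] and [math]\displaystyle{ g }[/math] and the toy data model are illustrative assumptions.
<pre>
import numpy as np

rng = np.random.default_rng(1)
n = 5000
X = rng.standard_normal(n)
Y = X + 0.5 * rng.standard_normal(n)        # toy model: Y depends on X

# f = k_1(., 0.3) and g = k_2(., -0.7): particular elements of the RKHSs
f = lambda x: np.exp(-(x - 0.3) ** 2)
g = lambda y: np.exp(-(y + 0.7) ** 2)

# Empirical version of E[f(X)g(Y)] - E[f(X)] E[g(Y)]
cov_fg = np.mean(f(X) * g(Y)) - np.mean(f(X)) * np.mean(g(Y))
print(cov_fg)                               # nonzero, since Y is not independent of X
</pre>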

Conditional Covariance Operators

Let [math]\displaystyle{ (H_1, k_1) }[/math], [math]\displaystyle{ (H_2, k_2) }[/math] and [math]\displaystyle{ (H_3, k_3) }[/math] be RKHSs on [math]\displaystyle{ \Omega_1 }[/math], [math]\displaystyle{ \Omega_2 }[/math] and [math]\displaystyle{ \Omega_3 }[/math], respectively, and let [math]\displaystyle{ (X,Y,Z) }[/math] be a random vector on the measurable space [math]\displaystyle{ \Omega_1 \times \Omega_2 \times \Omega_3 }[/math]. The conditional cross-covariance operator of [math]\displaystyle{ (X,Y) }[/math] given [math]\displaystyle{ Z }[/math] is defined by
[math]\displaystyle{ \Sigma_{YX|Z}: = \Sigma_{YX} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZX} }[/math].

Conditional Covariance Operators and Conditional Independence

Let [math]\displaystyle{ (H_{11}, k_{11}) }[/math], [math]\displaystyle{ (H_{12},k_{12}) }[/math] and [math]\displaystyle{ (H_2, k_2) }[/math] be RKHSs on the measurable spaces [math]\displaystyle{ \Omega_{11} }[/math], [math]\displaystyle{ \Omega_{12} }[/math] and [math]\displaystyle{ \Omega_2 }[/math], respectively, with continuous and bounded kernels. Let [math]\displaystyle{ (X,Y)=(U,V,Y) }[/math] be a random vector on [math]\displaystyle{ \Omega_{11}\times \Omega_{12} \times \Omega_{2} }[/math], where [math]\displaystyle{ X = (U,V) }[/math], and let [math]\displaystyle{ H_1 = H_{11} \otimes H_{12} }[/math] be the direct product. It is assumed that [math]\displaystyle{ \mathbb{E}_{Y|U} [g(Y)|U= \cdot] \in H_{11} }[/math] and [math]\displaystyle{ \mathbb{E}_{Y|X} [g(Y)|X= \cdot] \in H_{1} }[/math] for all [math]\displaystyle{ g \in H_2 }[/math]. Then we have
[math]\displaystyle{ \Sigma_{YY|U} \ge \Sigma_{YY|X} }[/math],
where the inequality refers to the order of self-adjoint operators. Furthermore, if [math]\displaystyle{ H_2 }[/math] is probability-determining,
[math]\displaystyle{ \Sigma_{YY|X} = \Sigma_{YY|U} \Leftrightarrow Y \perp X|U }[/math].

Conditional Covariance Operators Formulation

Given an i.i.d. sample [math]\displaystyle{ \{(X_1,Y_1), \cdots, (X_n,Y_n)\} }[/math] from [math]\displaystyle{ p_X }[/math] and [math]\displaystyle{ p_{Y|X} }[/math], find a matrix [math]\displaystyle{ B }[/math] whose columns span the effective subspace [math]\displaystyle{ S }[/math] by solving
[math]\displaystyle{ \min_B\quad \Sigma_{YY|U} }[/math]
[math]\displaystyle{ s.t. \quad U = B^T X }[/math].

Estimation of Conditional Covariance Operator

Define [math]\displaystyle{ \tilde{k}_1^{(i)} \in H_1 }[/math] by
[math]\displaystyle{ \tilde{k}_1^{(i)} = k_1(\cdot, Y_i) -\frac{1}{n}\sum_{j=1}^n k_1(\cdot, Y_j) }[/math]
and [math]\displaystyle{ \tilde{k}_2^{(i)} \in H_2 }[/math] by
[math]\displaystyle{ \tilde{k}_2^{(i)} = k_2(\cdot, U_i) -\frac{1}{n}\sum_{j=1}^n k_2(\cdot, U_j) }[/math].
By replacing the expectation with the empirical average, the covariance is estimated by
[math]\displaystyle{ \langle f, \Sigma_{YU} g\rangle_{H_1} \approx \frac{1}{n} \sum_{i=1}^n \langle \tilde{k}_1^{(i)}, f\rangle \langle \tilde{k}_2^{(i)}, g\rangle }[/math].
Let
[math]\displaystyle{ (\hat{K}_Y)_{ij} = \langle \tilde{k}_1^{(i)}, \tilde{k}_1^{(j)} \rangle , \quad (\hat{K}_U)_{ij} = \langle \tilde{k}_2^{(i)}, \tilde{k}_2^{(j)} \rangle }[/math].
We decompose [math]\displaystyle{ f }[/math] and [math]\displaystyle{ g }[/math] into the span of the centered kernel functions plus orthogonal components,
[math]\displaystyle{ f = \sum_{\ell =1}^n a_{\ell} \tilde{k}_1^{(\ell)} + f^{\perp}, }[/math]
[math]\displaystyle{ g = \sum_{m =1}^n b_{m} \tilde{k}_2^{(m)} + g^{\perp}. }[/math]
Therefore, we have
[math]\displaystyle{ \langle f, \Sigma_{YU}g\rangle_{{H}_1} \approx \frac{1}{n} \sum_{\ell=1}^n \sum_{m=1}^n a_{\ell} b_m (\hat{K}_Y \hat{K}_U)_{\ell m} }[/math]. We can estimate the operator by
[math]\displaystyle{ \hat{\Sigma}_{YU} = \hat{K}_Y \hat{K}_U }[/math].
The conditional covariance matrix [math]\displaystyle{ \hat{\Sigma}_{YY|U} }[/math] is
[math]\displaystyle{ \hat{\Sigma}_{YY|U} = \hat{\Sigma}_{YY} - \hat{\Sigma}_{YU}\hat{\Sigma}_{UU}^{-1}\hat{\Sigma}_{UY} }[/math].
So, we have
[math]\displaystyle{ \hat{\Sigma}_{YY|U} = (\hat{K}_Y + \epsilon I_n)^2 - \hat{K}_Y\hat{K}_U(\hat{K}_U + \epsilon I_n)^{-2}\hat{K}_U \hat{K}_Y }[/math],
where [math]\displaystyle{ \epsilon \gt 0 }[/math] is a regularization constant.
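
The empirical quantities above can be assembled directly from the data. The Python sketch below builds centered Gram matrices [math]\displaystyle{ \hat{K}_Y }[/math] and [math]\displaystyle{ \hat{K}_U }[/math] (centering with [math]\displaystyle{ H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T }[/math] reproduces the inner products of the centered kernel functions) and forms the regularized estimate [math]\displaystyle{ \hat{\Sigma}_{YY|U} }[/math]; the Gaussian kernel, its bandwidth, and the value of [math]\displaystyle{ \epsilon }[/math] are illustrative choices.
<pre>
import numpy as np

def centered_gram(Z, sigma=1.0):
    """Centered Gaussian Gram matrix: K_tilde = H K H with H = I - (1/n) 1 1^T."""
    Z = np.asarray(Z, dtype=float).reshape(len(Z), -1)
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    n = len(Z)
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def conditional_covariance(Y, U, eps=0.1, sigma=1.0):
    """Sigma_hat_{YY|U} = (K_Y + eps I)^2 - K_Y K_U (K_U + eps I)^{-2} K_U K_Y."""
    KY = centered_gram(Y, sigma)
    KU = centered_gram(U, sigma)
    n = KY.shape[0]
    I = np.eye(n)
    KU_reg_inv = np.linalg.inv(KU + eps * I)
    return (KY + eps * I) @ (KY + eps * I) - KY @ KU @ KU_reg_inv @ KU_reg_inv @ KU @ KY
</pre>
These two helpers are reused in the sketches for the objective and the gradient descent below.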

Evaluation of Size of Self-adjoint Operators

We can use several measures to evaluate the size of [math]\displaystyle{ \hat{\Sigma}_{YY|U} }[/math]:
the operator norm, i.e., the maximum eigenvalue;
the trace norm, i.e., the sum of the eigenvalues;
the determinant, i.e., the product of the eigenvalues.
Here we use the determinant to evaluate the size of [math]\displaystyle{ \hat{\Sigma}_{YY|U} }[/math].
By the determinant identity for the Schur complement,
[math]\displaystyle{ \det(A - BC^{-1}B^T) = \frac{1}{\det C} \det \begin{bmatrix} A & B \\ B^T & C \end{bmatrix}, }[/math]
we have
[math]\displaystyle{ \det \hat{\Sigma}_{YY|U} = \frac{\det \hat{\Sigma}_{[YU][YU]}}{\det \hat{\Sigma}_{UU}} }[/math],
where
[math]\displaystyle{ \hat{\Sigma}_{[YU][YU]} = \begin{bmatrix}\hat{\Sigma}_{YY} & \hat{\Sigma}_{YU}\\ \hat{\Sigma}_{UY} & \hat{\Sigma}_{UU} \end{bmatrix}. }[/math]
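
The identity can be checked numerically; in the Python sketch below, the random symmetric positive-definite matrix and its block partition are purely illustrative.
<pre>
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 6))
S = M @ M.T + 6 * np.eye(6)                 # symmetric positive definite
A, Bl, C = S[:3, :3], S[:3, 3:], S[3:, 3:]  # blocks [[A, Bl], [Bl^T, C]]

lhs = np.linalg.det(A - Bl @ np.linalg.inv(C) @ Bl.T)
rhs = np.linalg.det(S) / np.linalg.det(C)
print(np.isclose(lhs, rhs))                 # True
</pre>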

Determinant Formulation

Given an i.i.d. sample [math]\displaystyle{ \{(X_1, Y_1), \cdots, (X_n,Y_n)\} }[/math] from [math]\displaystyle{ p_X }[/math] and [math]\displaystyle{ p_{Y|X} }[/math], find a matrix [math]\displaystyle{ B }[/math] whose columns span the effective subspace [math]\displaystyle{ S }[/math] by solving
[math]\displaystyle{ \min_B \quad \frac{\det \hat{\Sigma}_{[YU][YU]}}{\det \hat{\Sigma}_{UU} \det \hat{\Sigma}_{YY}} }[/math]
[math]\displaystyle{ s.t. \quad U=B^T X }[/math].
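
Since [math]\displaystyle{ \det \hat{\Sigma}_{YY} }[/math] does not depend on [math]\displaystyle{ B }[/math], minimizing this ratio is equivalent to minimizing [math]\displaystyle{ \log\det \hat{\Sigma}_{YY|U} }[/math]. The Python sketch below evaluates this objective for a candidate [math]\displaystyle{ B }[/math], reusing conditional_covariance from the earlier sketch; the function name kdr_objective is our own label.
<pre>
import numpy as np

def kdr_objective(B, X, Y, eps=0.1, sigma=1.0):
    """log det Sigma_hat_{YY|U} with U = B^T X (rows of X are samples)."""
    U = X @ B
    S = conditional_covariance(Y, U, eps=eps, sigma=sigma)   # defined in the earlier sketch
    sign, logdet = np.linalg.slogdet(S)
    return logdet
</pre>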

Gradient Descent

The matrix [math]\displaystyle{ B }[/math] is updated iteratively according to
[math]\displaystyle{ B(t+1) = B(t) - \eta \frac{\partial \log \det \hat{\Sigma}_{YY|U}}{\partial B} }[/math],
or equivalently, since [math]\displaystyle{ \partial \log \det A = Tr[A^{-1} \partial A] }[/math],
[math]\displaystyle{ B(t+1) = B(t) - \eta Tr[\hat{\Sigma}_{YY|U}^{-1} \frac{\partial \hat{\Sigma}_{YY|U}}{\partial B}] }[/math],
where [math]\displaystyle{ \eta }[/math] is optimized through golden section search, and
[math]\displaystyle{ Tr[\hat{\Sigma}_{YY|U}^{-1}\frac{\partial \hat{\Sigma}_{YY|U}}{\partial B}] = 2 \epsilon Tr [\hat{\Sigma}_{YY|U}^{-1} \hat{K}_Y (\hat{K}_U + \epsilon I_n)^{-1} \frac{\partial \hat{K}_U}{\partial B}(\hat{K}_U + \epsilon I_n)^{-2}\hat{K}_U\hat{K}_Y] }[/math].
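
The Python sketch below shows the overall optimization loop. To keep it short, it replaces the analytic trace formula above with a finite-difference approximation of the gradient, uses a fixed step size instead of golden section search, and re-orthonormalizes [math]\displaystyle{ B }[/math] by a QR step after each update; these are simplifying assumptions, not part of the original procedure. It reuses kdr_objective from the previous sketch.
<pre>
import numpy as np

def kdr_fit(X, Y, d, eps=0.1, sigma=1.0, eta=0.05, iters=100, h=1e-4):
    """Gradient descent on B using a numerical gradient of kdr_objective."""
    rng = np.random.default_rng(0)
    m = X.shape[1]
    B, _ = np.linalg.qr(rng.standard_normal((m, d)))          # random start with orthonormal columns
    for _ in range(iters):
        grad = np.zeros_like(B)
        base = kdr_objective(B, X, Y, eps, sigma)
        for i in range(m):
            for j in range(d):
                Bp = B.copy()
                Bp[i, j] += h
                grad[i, j] = (kdr_objective(Bp, X, Y, eps, sigma) - base) / h
        B, _ = np.linalg.qr(B - eta * grad)                    # update and re-orthonormalize
    return B
</pre>
In practice the analytic gradient above is much cheaper than the m·d finite differences per iteration used here, since every objective evaluation involves [math]\displaystyle{ n \times n }[/math] matrix operations.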

Summary

First, Dimension Reduction for Regression is equivalent to Conditional Independence.
Second, Conditional Covariance Operators give the criterion for KDR.
Third, KDR is a general framework of dimension reduction for regression, since it does not need assumptions on [math]\displaystyle{ p(Y|X) }[/math], [math]\displaystyle{ p(X) }[/math], [math]\displaystyle{ p(Y) }[/math], or the dimensionality of [math]\displaystyle{ Y }[/math]. Furthermore, it is applicable to any type of data set.
Fourth, multiplication of [math]\displaystyle{ n\times n }[/math] matrices is computationally expensive.
Last, gradient descent may converge to a local minimum.