Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces

Revision as of 12:11, 25 June 2009

Kernel Dimension Reduction (KDR) is a technique for dimensionality reduction in the setting of supervised learning.

Original Problem

We are given data [math](X,Y)[/math], where [math]X \in \mathbb{R}^m[/math] is an m-dimensional explanatory variable and [math]Y \in \mathbb{R}^{\ell}[/math] is an [math]\ell[/math]-dimensional response; [math]Y[/math] may be continuous or discrete.
We want to find an orthogonal projection [math]\Pi_S[/math] of [math]\mathbb{R}^m[/math] onto a subspace [math]S[/math] such that the conditional probability satisfies
[math]p_{Y|X}(y|x) = p_{Y|\Pi_S X}(y|\Pi_S x).[/math]
The subspace [math]S \subset \mathbb{R}^m[/math] is called the effective subspace of regression.
Equivalently, given an i.i.d. sample [math]\{(X_1,Y_1), \cdots, (X_n, Y_n)\}[/math] from [math]p_X[/math] and [math]p_{Y|X}[/math], find a matrix whose columns span the effective subspace [math]S[/math], i.e. such that
[math]p_{Y|X}(y|x) = p_{Y|\Pi_SX}(y|\Pi_S x).[/math]

Decomposition of [math]X[/math]

Let [math](B,C)[/math] be an [math]m \times m[/math] orthogonal matrix, where the column vectors of [math]B[/math] span the subspace [math]S[/math].
Define [math]U=B^T X[/math] and [math]V=C^T X[/math]; then we have
[math]p_X(x) = p_{U,V}(u,v)[/math] and [math]p_{X,Y}(x,y)=p_{U,V,Y}(u,v,y).[/math]
From the equations above, the condition on the conditional probability,
[math]p_{Y|X}(y|x) = p_{Y|\Pi_S X}(y|\Pi_Sx),[/math]
becomes
[math]p_{Y|U,V}(y|u,v) = p_{Y|U}(y|u),[/math]
which means that [math]Y[/math] and [math]V[/math] are conditionally independent given [math]U[/math], denoted as [math]Y \perp V|U[/math].
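The decomposition above can be sketched numerically. A minimal NumPy example (all names and dimensions are illustrative), checking that the orthogonal split [math](U,V)=(B^T X, C^T X)[/math] loses no information:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 2                        # ambient dimension m, dim(S) = d
X = rng.standard_normal((100, m))  # rows are samples x_i

# Build an m x m orthogonal matrix via QR and split its columns into (B, C);
# the d columns of B play the role of a basis of the effective subspace S.
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
B, C = Q[:, :d], Q[:, d:]

U = X @ B   # u_i = B^T x_i, the effective coordinates
V = X @ C   # v_i = C^T x_i, the complementary coordinates

# Since (B, C) is orthogonal, (U, V) carries exactly the same information:
assert np.allclose(X, U @ B.T + V @ C.T)
```
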

Reproducing Kernel Hilbert Space

Cross-Covariance Operators

Let [math]({H}_1, k_1)[/math] and [math]({H}_2, k_2)[/math] be RKHSs over measurable spaces [math](\Omega_1, {B}_1)[/math] and [math](\Omega_2, {B}_2)[/math], respectively, with [math]k_1[/math] and [math]k_2[/math] measurable. Consider a random vector [math](X, Y)[/math] on [math]\Omega_1 \times \Omega_2[/math]. Then there exists a unique operator [math]\Sigma_{YX}[/math] from [math]H_1[/math] to [math]H_2[/math] such that
[math]\lt g, \Sigma_{YX} f\gt _{H_2} = \mathbb{E}_{XY} [f(X)g(Y)] - \mathbb{E}[f(X)]\mathbb{E}[g(Y)][/math]
holds for all [math]f \in H_1[/math] and [math]g \in H_2[/math], which is called the cross-covariance operator.
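For intuition, the defining identity can be checked empirically for particular functions; a small NumPy sketch (the choices of f, g, and the data model are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
X = rng.standard_normal(n)
Y = X + 0.1 * rng.standard_normal(n)

f = np.sin          # a function in H_1 (arbitrary choice)
g = np.square       # a function in H_2 (arbitrary choice)

# Empirical version of <g, Sigma_YX f> = E[f(X)g(Y)] - E[f(X)] E[g(Y)]
cov_fg = np.mean(f(X) * g(Y)) - np.mean(f(X)) * np.mean(g(Y))

# It coincides with the (biased) sample covariance of f(X) and g(Y).
assert np.isclose(cov_fg, np.cov(f(X), g(Y), bias=True)[0, 1])
```
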

Conditional Covariance Operators

Let [math](H_1, k_1)[/math], [math](H_2, k_2)[/math] and [math](H_3, k_3)[/math] be RKHSs on [math]\Omega_1[/math], [math]\Omega_2[/math] and [math]\Omega_3[/math], respectively, and let [math](X,Y,Z)[/math] be a random vector on the measurable space [math]\Omega_1 \times \Omega_2 \times \Omega_3[/math]. The conditional cross-covariance operator of [math](X,Y)[/math] given [math]Z[/math] is defined by
[math]\Sigma_{YX|Z}: = \Sigma_{YX} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZX}[/math].
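A scalar analogue may make the definition concrete: for real random variables the same formula gives the partial covariance of X and Y given Z, which vanishes when X and Y are related only through Z. A NumPy sketch (the data-generating model is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
Z = rng.standard_normal(n)
X = Z + 0.5 * rng.standard_normal(n)        # X depends on Z
Y = 2.0 * Z + 0.5 * rng.standard_normal(n)  # Y depends on X only through Z

def cov(a, b):
    return np.mean(a * b) - np.mean(a) * np.mean(b)

# Scalar analogue of Sigma_{YX|Z} = Sigma_YX - Sigma_YZ Sigma_ZZ^{-1} Sigma_ZX
sigma_yx_z = cov(Y, X) - cov(Y, Z) * cov(X, Z) / cov(Z, Z)

# The population value is 2 - 2*1/1 = 0, so the estimate is near zero.
assert abs(sigma_yx_z) < 0.05
```
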

Conditional Covariance Operators and Conditional Independence

Let [math](H_{11}, k_{11})[/math], [math](H_{12},k_{12})[/math] and [math](H_2, k_2)[/math] be RKHSs on measurable spaces [math]\Omega_{11}[/math], [math]\Omega_{12}[/math] and [math]\Omega_2[/math], respectively, with continuous and bounded kernels. Let [math](X,Y)=(U,V,Y)[/math] be a random vector on [math]\Omega_{11}\times \Omega_{12} \times \Omega_{2}[/math], where [math]X = (U,V)[/math], and let [math]H_1 = H_{11} \otimes H_{12}[/math] be the direct product. It is assumed that [math]\mathbb{E}_{Y|U} [g(Y)|U= \cdot] \in H_{11}[/math] and [math]\mathbb{E}_{Y|X} [g(Y)|X= \cdot] \in H_{1}[/math] for all [math]g \in H_2[/math]. Then we have
[math]\Sigma_{YY|U} \ge \Sigma_{YY|X}[/math],
where the inequality refers to the order of self-adjoint operators. Furthermore, if [math]H_2[/math] is probability-determining,
[math]\Sigma_{YY|X} = \Sigma_{YY|U} \Leftrightarrow Y \perp X|U [/math].

Conditional Covariance Operators Formulation

Given an i.i.d. sample [math]\{(X_1,Y_1), \cdots, (X_n,Y_n)\}[/math] from [math]p_X[/math] and [math]p_{Y|X}[/math], find a matrix [math]B[/math] whose columns span the effective subspace [math]S[/math] by solving
[math]\min_B\quad \Sigma_{YY|U}[/math]
[math]s.t. \quad U = B^T X,[/math]
where the minimization is with respect to the partial order of self-adjoint operators.

Estimation of Conditional Covariance Operator

Define [math]\tilde{k}_1^{(i)} \in H_1[/math] by
[math]\tilde{k}_1^{(i)} = k_1(\cdot, Y_i) -\frac{1}{n}\sum_{j=1}^n k_1(\cdot, Y_j)[/math]
and [math]\tilde{k}_2^{(i)} \in H_2[/math] by
[math]\tilde{k}_2^{(i)} = k_2(\cdot, U_i) -\frac{1}{n}\sum_{j=1}^n k_2(\cdot, U_j)[/math].
By replacing the expectation with the empirical average, the covariance is estimated by
[math]\lt f, \Sigma_{YU} g\gt _{H_1} \approx \frac{1}{n} \sum_{i=1}^n \lt \tilde{k}_1^{(i)}, f\gt \lt \tilde{k}_2^{(i)}, g\gt [/math].
Define the centered Gram matrices
[math](\hat{K}_Y)_{ij} = \lt \tilde{k}_1^{(i)}, \tilde{k}_1^{(j)} \gt , \quad (\hat{K}_U)_{ij} = \lt \tilde{k}_2^{(i)}, \tilde{k}_2^{(j)} \gt .[/math]
We decompose [math]f[/math] and [math]g[/math] as
[math]f = \sum_{\ell =1}^n a_{\ell} \tilde{k}_1^{(\ell)} + f^{\perp},[/math]
[math]g = \sum_{m =1}^n b_{m} \tilde{k}_2^{(m)} + g^{\perp}.[/math]
Therefore, we have
[math]\lt f, \Sigma_{YU}g\gt _{{H}_1} \approx \frac{1}{n} \sum_{\ell=1}^n \sum_{m=1}^n a_{\ell} b_m (\hat{K}_Y \hat{K}_U)_{\ell m}[/math]. We can estimate the operator by
[math]\hat{\Sigma}_{YU} = \hat{K}_Y \hat{K}_U[/math].
The conditional covariance matrix [math]\hat{\Sigma}_{YY|U}[/math] is
[math]\hat{\Sigma}_{YY|U} = \hat{\Sigma}_{YY} - \hat{\Sigma}_{YU}\hat{\Sigma}_{UU}^{-1}\hat{\Sigma}_{UY}[/math].
So, we have
[math]\hat{\Sigma}_{YY|U} = (\hat{K}_Y + \epsilon I_n)^2 - \hat{K}_Y\hat{K}_U(\hat{K}_U + \epsilon I_n)^{-2}\hat{K}_U \hat{K}_Y [/math],
where [math]\epsilon \gt 0[/math] is a regularization constant.

Evaluation of Size of Self-adjoint Operators

We can use several measures of the size of [math]\hat{\Sigma}_{YY|U}[/math]:
first, the operator norm, i.e. the maximum eigenvalue;
second, the trace norm, i.e. the sum of the eigenvalues;
third, the determinant, i.e. the product of the eigenvalues.
Here we use the determinant to evaluate the size of [math]\hat{\Sigma}_{YY|U}[/math].
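These three size measures are just functionals of the eigenvalues, which a short NumPy check makes explicit (the test matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((4, 4))
S = M @ M.T                 # positive semidefinite, like Sigma_hat_{YY|U}
w = np.linalg.eigvalsh(S)   # eigenvalues in ascending order

assert np.isclose(w[-1], np.linalg.norm(S, 2))   # operator norm = max eigenvalue
assert np.isclose(w.sum(), np.trace(S))          # trace norm = sum (PSD case)
assert np.isclose(np.prod(w), np.linalg.det(S))  # determinant = product
```
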
By the determinant formula for the Schur complement,
[math]\det(A - BC^{-1}B^T) = \frac{1}{\det C} \det \begin{bmatrix} A & B \\ B^T & C \end{bmatrix}, [/math]
we have
[math]\det \hat{\Sigma}_{YY|U} = \frac{\det \hat{\Sigma}_{[YU][YU]}}{\det \hat{\Sigma}_{UU}}[/math],
[math]\hat{\Sigma}_{[YU][YU]} = \begin{bmatrix}\hat{\Sigma}_{YY} & \hat{\Sigma}_{YU}\\ \hat{\Sigma}_{UY} & \hat{\Sigma}_{UU} \end{bmatrix}. [/math]
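The determinant identity can be verified numerically on a random positive-definite matrix; a small NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((6, 6))
S = M @ M.T + np.eye(6)          # random positive-definite block matrix
A, B_, C = S[:2, :2], S[:2, 2:], S[2:, 2:]

# det(A - B C^{-1} B^T) = det(S) / det(C)
lhs = np.linalg.det(A - B_ @ np.linalg.inv(C) @ B_.T)
rhs = np.linalg.det(S) / np.linalg.det(C)
assert np.isclose(lhs, rhs)
```
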

Determinant Formulation

Given an i.i.d. sample [math]\{(X_1, Y_1), \cdots, (X_n,Y_n)\}[/math] from [math]p_X[/math] and [math]p_{Y|X}[/math], find a matrix [math]B[/math] whose columns span the effective subspace [math]S[/math] by solving
[math]\min_B \quad \frac{\det \hat{\Sigma}_{[YU][YU]}}{\det \hat{\Sigma}_{UU} \det \hat{\Sigma}_{YY}}[/math]
[math]s.t. \quad U=B^T X[/math].

Gradient Descent

The matrix [math]B[/math] is updated iteratively according to
[math]B(t+1) = B(t) - \eta \frac{\partial \log \det \hat{\Sigma}_{YY|U}}{\partial B}[/math],
or equivalently
[math]B(t+1) = B(t) - \eta\, Tr[\hat{\Sigma}_{YY|U}^{-1} \frac{\partial \hat{\Sigma}_{YY|U}}{\partial B}][/math],
where [math]\eta[/math] is optimized through golden section search, and
[math] Tr[\hat{\Sigma}_{YY|U}^{-1}\frac{\partial \hat{\Sigma}_{YY|U}}{\partial B}] = 2 \epsilon Tr [\hat{\Sigma}_{YY|U}^{-1} \hat{K}_Y (\hat{K}_U + \epsilon I_n)^{-1} \frac{\partial \hat{K}_U}{\partial B}(\hat{K}_U + \epsilon I_n)^{-2}\hat{K}_U\hat{K}_Y][/math].
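The whole procedure can be sketched as a toy optimization loop. This is only an illustration under assumptions made here: a Gaussian kernel, a finite-difference gradient in place of the analytic expression above, and a QR re-orthonormalization to keep the columns of B orthonormal; the helper names are hypothetical.

```python
import numpy as np

def centered_gram(Z, sigma=1.0):
    """Centered Gaussian Gram matrix H K H (bandwidth chosen arbitrarily)."""
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def objective(B, X, Y, eps=0.1):
    """log det Sigma_hat_{YY|U} with U = B^T X (row-wise: X @ B)."""
    n = len(X)
    KY, KU = centered_gram(Y), centered_gram(X @ B)
    I = np.eye(n)
    inv = np.linalg.inv(KU + eps * I)
    Sigma = (KY + eps * I) @ (KY + eps * I) - KY @ KU @ inv @ inv @ KU @ KY
    return np.linalg.slogdet(Sigma)[1]   # Sigma is positive definite here

def kdr_step(B, X, Y, eta=0.5, h=1e-5):
    """One descent step: finite-difference gradient, then QR retraction."""
    G = np.zeros_like(B)
    for idx in np.ndindex(*B.shape):
        E = np.zeros_like(B)
        E[idx] = h
        G[idx] = (objective(B + E, X, Y) - objective(B - E, X, Y)) / (2 * h)
    Bn, _ = np.linalg.qr(B - eta * G)    # restore B^T B = I
    return Bn

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 3))
Y = np.sin(X[:, :1]) + 0.1 * rng.standard_normal((30, 1))
B, _ = np.linalg.qr(rng.standard_normal((3, 1)))
for _ in range(5):
    B = kdr_step(B, X, Y)
```

The QR step is a simple stand-in for optimization on the Stiefel manifold; the analytic gradient formula above would replace the finite differences in a serious implementation.
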


Summary

First, dimension reduction for regression is equivalent to conditional independence.
Second, conditional covariance operators give the criterion for KDR.
Third, KDR is a general framework of dimension reduction for regression, since it makes no assumptions on [math]p(Y|X)[/math], [math]p(X)[/math], [math]p(Y)[/math], or the dimensionality of [math]Y[/math]; furthermore, it is applicable to any type of data set.
Fourth, multiplication of [math]n\times n[/math] matrices is computationally expensive.
Last, gradient descent may converge only to a local minimum.