measuring Statistical Dependence with Hilbert-Schmidt Norm: Difference between revisions

From statwiki
Jump to navigation Jump to search
No edit summary
Line 6: Line 6:
If instead of "independence" we were looking for "uncorrelation" the situation would be much easier to handle. Because "correlation" of two random variables involves only ''expected values'', as opposed to concept of independence between two random variables.  
If instead of "independence" we were looking for "uncorrelation" the situation would be much easier to handle. Because "correlation" of two random variables involves only ''expected values'', as opposed to concept of independence between two random variables.  
However, fortunately, based on the following theorem we still can effectively tackle the problem of independence between two random variables:<br><br>
However, fortunately, based on the following theorem we still can effectively tackle the problem of independence between two random variables:<br><br>
'''Theorem:''' Two random variables <math>x</math> and <math>y</math> are independent if and only if any (measurable) function of the two random variables are uncorrelated. <br><br>
'''Theorem:''' Two random variables <math>\,\!x</math> and <math>\,\!y</math> are independent if and only if any (measurable) function of the two random variables are uncorrelated. <br><br>
For example, for random variables <math>x</math> and <math>y</math> to be independent, all pairs <math>(x,y), (x^2,y),(x,e^y), (\sin(x),\log(y)), ...</math> have to be uncorrelated. '''That is why we need to deal with functions of random variables when examining the independence property.''' and That is why Hilbert spaces and concepts from "Functional Analysis" are involved in this article (Of course, no one likes to torture him/herself by using complicated mathematical expressions ;)).<br>
For example, for random variables <math>\,\!x</math> and <math>\,\!y</math> to be independent, all pairs <math>(x,y),\, (x^2,y),\,(x,e^y), \,(\sin(x),\log(y)),\, ...</math> have to be uncorrelated. '''That is why we need to deal with functions of random variables when examining the independence property.''' and That is why Hilbert spaces and concepts from "Functional Analysis" are involved in this article (Of course, no one likes to torture him/herself by using complicated mathematical expressions ;)).<br>
In the following, we first give an introductory note on the concepts from functional analysis and Hilbert spaces required in this paper. (No prior knowledge of these concepts is assumed in this article). Then we introduce the independence measure itself, namely, Hilbert-Schmidt Independence Criterion and explain why we expect it to work properly. Thirdly, we introduce an statistics which may be used to estimate the HSIC and discuss its convergence behavior. Finally, through some examples, we demonstrate how this criterion may be effectively helped to solve many well-known problems.
In the following, we first give an introductory note on the concepts from functional analysis and Hilbert spaces required in this paper. (No prior knowledge of these concepts is assumed here). Then we introduce the independence measure, namely, Hilbert-Schmidt Independence Criterion and explain why we expect it to work properly. Thirdly, we introduce an statistics which may be used to estimate the HSIC and discuss its convergence behavior. Finally, through some examples, we demonstrate how this criterion may be effectively employed to solve many well-known problems.
 
==Simple introduction to Hilbert Spaces==
==Simple introduction to Hilbert Spaces==



Revision as of 16:06, 28 June 2009

"Hilbert-Schmist Norm of the Cross-Covariance operator" is proposed as an independence criterion in reproducing kernel Hilbert spaces (RKHSs). The measure is refereed to as Hilbert-Schmidt Independence Criterion, or HSIC. An empirical estimate of this measure is introduced which may be well used in many practical application such as independent Component Analysis (ICA), Maximum Variance Unfolding (MVU), feature extraction, feature selection, ... .

Introduction

In this article, a criterion is introduced which is used to measure the dependence between two multivariate random variables. More specifically, we are looking for an appropriate function of two random variables whose output is zero if the involving random variables are independent and achieves a high value if they are statistically dependent. If instead of "independence" we were looking for "uncorrelation" the situation would be much easier to handle. Because "correlation" of two random variables involves only expected values, as opposed to concept of independence between two random variables. However, fortunately, based on the following theorem we still can effectively tackle the problem of independence between two random variables:

Theorem: Two random variables [math]\displaystyle{ \,\!x }[/math] and [math]\displaystyle{ \,\!y }[/math] are independent if and only if any (measurable) function of the two random variables are uncorrelated.

For example, for random variables [math]\displaystyle{ \,\!x }[/math] and [math]\displaystyle{ \,\!y }[/math] to be independent, all pairs [math]\displaystyle{ (x,y),\, (x^2,y),\,(x,e^y), \,(\sin(x),\log(y)),\, ... }[/math] have to be uncorrelated. That is why we need to deal with functions of random variables when examining the independence property. and That is why Hilbert spaces and concepts from "Functional Analysis" are involved in this article (Of course, no one likes to torture him/herself by using complicated mathematical expressions ;)).
In the following, we first give an introductory note on the concepts from functional analysis and Hilbert spaces required in this paper. (No prior knowledge of these concepts is assumed here). Then we introduce the independence measure, namely, Hilbert-Schmidt Independence Criterion and explain why we expect it to work properly. Thirdly, we introduce an statistics which may be used to estimate the HSIC and discuss its convergence behavior. Finally, through some examples, we demonstrate how this criterion may be effectively employed to solve many well-known problems.

Simple introduction to Hilbert Spaces

Introduction

One of the methods to find dependence between functions in RKHS, as proposed by Gretton et al ,the constarined covariance (COCO) was to use the largest singular value of the cross-covariance operator, which behaves identically to the correlation operator at independence and no regularization is required.
In this paper, the concept of COCO is extended by using the entire spectrum of the cross-covariance operator to determine when all its singular values are zero, rather than looking only at the largest singular value.

RKHS Theory

Let [math]\displaystyle{ \mathcal{F} }[/math] be a Hilbert space from [math]\displaystyle{ \mathcal{X} }[/math] to [math]\displaystyle{ \mathbb{R} }[/math]. We assume [math]\displaystyle{ \mathcal{F} }[/math] is a Reproducing Kernel Hilbert Space,i.e., for all [math]\displaystyle{ x\in \mathcal{X} }[/math], the corresponding Dirac evaluation operator [math]\displaystyle{ \delta_x:\mathcal{F} \rightarrow \mathbb{R} }[/math] is a bounded (or equivalently continuous) linear functional. We denote the kernel of this operator by [math]\displaystyle{ k(x,x')=\langle \phi(x)\phi(x') \rangle_{\mathcal{F}} }[/math] where [math]\displaystyle{ k:\mathcal{X}\times \mathcal{X}\rightarrow \mathbb{R} }[/math] is a positive definite function and [math]\displaystyle{ \,\!\phi }[/math] is the feature map of [math]\displaystyle{ \mathcal{F} }[/math]. Similarly, we consider another RKHS named [math]\displaystyle{ \mathcal{G} }[/math] with Domain [math]\displaystyle{ \mathcal{Y} }[/math], kernel [math]\displaystyle{ l(\cdot,\cdot) }[/math] and feature map [math]\displaystyle{ \,\!\psi }[/math]. We assume both [math]\displaystyle{ \mathcal{F} }[/math] and [math]\displaystyle{ \mathcal{G} }[/math] are separable, i.e., they have a complete orthogonal bases.

Hilbert-Schmidt Norm

For a linear operator [math]\displaystyle{ C:\mathcal{G}\rightarrow \mathcal{F} }[/math], provided the sum converges, the Hilbert-Schmidt (HS) norm is defined as:
[math]\displaystyle{ \|C\|^2_{HS}:=\sum_{i,j}\langle Cv_i,u_j \rangle^2_{\mathcal{F}} }[/math]
where [math]\displaystyle{ \,\!u_j }[/math] and [math]\displaystyle{ \,\!v_i }[/math] are orthogonal bases of [math]\displaystyle{ \mathcal{F} }[/math] and [math]\displaystyle{ \mathcal{G} }[/math], respectively. It is easy to see that the Frobenius norm on matrices may be considered a spacial case of this norm.

Hilbert-Schmidt Operator

A Hilbert-Schmidt Operator is a linear operator for which the Hilbert-Schmidt norm (introduced above) exists.

Tensor Product Operator

We may employ any [math]\displaystyle{ f\in \mathcal{F} }[/math] and [math]\displaystyle{ g\in \mathcal{G} }[/math] to define a tensor product operator [math]\displaystyle{ f\otimes g:\mathcal{G}\rightarrow\mathcal{F} }[/math] as follows:
[math]\displaystyle{ (f\otimes g)h:=f\langle g,h\rangle_{\mathcal{G}} \quad }[/math] for all [math]\displaystyle{ h\in\mathcal{G} }[/math]
Using the definition of HS norm introduced above, we can simply show the norm of [math]\displaystyle{ f\otimes g }[/math] equals
[math]\displaystyle{ \|f\otimes g\|^2_{HS}=\|f\|^2_{\mathcal{F}}\; \|g\|^2_{\mathcal{G}} }[/math]

Cross-Covariance Operator

Mean

Mean elements of [math]\displaystyle{ \mathcal{F} }[/math] and [math]\displaystyle{ \mathcal{G} }[/math] are defined as those elements of these spaces for which

[math]\displaystyle{ \langle\mu_x,f \rangle_{\mathcal{F}}=\mathbf{E}_x[\langle\phi(x),f \rangle_{\mathcal{F}}]=\mathbf{E}_x[f(x)] }[/math]
[math]\displaystyle{ \langle\mu_y,g \rangle_{\mathcal{G}}=\mathbf{E}_y[\langle\psi(y),g \rangle_{\mathcal{G}}]=\mathbf{E}_y[g(x)] }[/math]
Based on this, [math]\displaystyle{ \|\mu_x\|^2_{\mathcal{F}} }[/math] may be calculated by applying expectation twice as follows:
[math]\displaystyle{ \|\mu_x\|^2_{\mathcal{F}}=\mathbf{E}_{x,x'}[\langle \phi(x),\phi(x')\rangle_{\mathcal{F}}]=\mathbf{E}_{x,x'}[k(x,x')] }[/math]

Cross-covariance Operator

Now we are in aposition to define the cross-covariance operator as follows
[math]\displaystyle{ C_{xy}:=\underbrace{\mathbf{E}_{x,y}[\phi(x)\otimes\psi(y)]}_{:=\tilde{C}_{xy}}-\underbrace{\mu_x\otimes\mu_y}_{:=M_{xy}} }[/math]
We will use [math]\displaystyle{ \tilde{C}_{xy} }[/math] and [math]\displaystyle{ \,\!M_{xy} }[/math] as the basis of our measure of dependence.

Hilbert-Schmidt Independence Criterion

Definition (HSIC) Given separable RKHSs [math]\displaystyle{ \mathcal{F} }[/math], [math]\displaystyle{ \mathcal{G} }[/math] and a joint probability [math]\displaystyle{ p_{x\,\!y} }[/math], we define the Hilbert-Schmidt Independence Criterion(HSIC) as the squared HS-norm of the associated cross-covariance operator [math]\displaystyle{ C_{x\,\!y} }[/math]
[math]\displaystyle{ \text{HSIC}(p_{xy},\mathcal{F},\mathcal{G}):=\|C_{xy}\|^2_{HS} }[/math]

HSIC in terms of kernels

To compute HSIC we need to express it in terms of kernel functions. It can be shown that this can be achieved via the following identity:
[math]\displaystyle{ \text{HSIC}(p_{xy},\mathcal{F},\mathcal{G})=\mathbf{E}_{x,x',y,y'}[k(x,x')l(y,y')]+\mathbf{E}_{x,x'}[k(x,x')]\mathbf{E}_{y,y'}[l(y,y')]-2\mathbf{E}_{x,y}[\mathbf{E}_{x'}[k(x,x')]\mathbf{E}_{y'}[l(y,y')]] }[/math]

Because the data is random, the kernels are random so we use expected values.

Empirical Criterion

Definition (Empirical HSIC)

Let [math]\displaystyle{ Z:=\{(x_1,y_1),\cdots,(x_m,y_m)\}\subseteq \mathcal{X}\times\mathcal{Y} }[/math] be a series of [math]\displaystyle{ \,\!m }[/math] independent observations drawn from [math]\displaystyle{ p_{x\,\!y} }[/math]. An estimator of HSIC, is given by
[math]\displaystyle{ \text{HSIC}(Z,\mathcal{F},\mathcal{G}):=(m-1)^{-2}\textbf{tr}(KHLH) }[/math]
where [math]\displaystyle{ H, K, L\in\mathbb{R}^{m\times m}, K_{ij}:=k(x_i,x_j), L_{ij}:=l(y_i,y_j) \,\, \text{and} \, \, H_{ij}:=\delta_{ij}-m^{-1} }[/math].

Bias of Estimator

It may bee shown that the bias of the above empirical estimation is of the order [math]\displaystyle{ \,\!O(m^{-1}) }[/math]:
Theorem: Let [math]\displaystyle{ \mathbf{E}_Z }[/math] denote the expectation taken over [math]\displaystyle{ \,\!m }[/math] independent copies [math]\displaystyle{ \,\!(x_i,y_i) }[/math] drawn from [math]\displaystyle{ p_{\,\!xy} }[/math]. Then
[math]\displaystyle{ \text{HSIC}(p_{xy},\mathcal{F},\mathcal{G})=\mathbf{E}_Z[\text{HSIC}(Z,\mathcal{F},\mathcal{G})]+O(m^{-1}) }[/math].

Bound on Empirical HSIC

Theorem: Assume that [math]\displaystyle{ \,\!k }[/math] and [math]\displaystyle{ \,\!l }[/math] are bounded almost everywhere by 1, and are non-negative. Then for [math]\displaystyle{ \,\!m\gt 1 }[/math] and all [math]\displaystyle{ \,\!\delta\gt 0 }[/math], with probablity at least [math]\displaystyle{ \,\!1-\delta }[/math], for all [math]\displaystyle{ \,\!p_{xy} }[/math] :
[math]\displaystyle{ |\text{HSIC}(p_{xy},\mathcal{F},\mathcal{G})-\text{HSIC}(Z,\mathcal{F},\mathcal{G})|\leq\sqrt{\frac{log(6/\delta)}{\alpha^2m}}+\frac{C}{m} }[/math]
where [math]\displaystyle{ \,\!\alpha\gt 0.24 }[/math] and [math]\displaystyle{ \,\!C }[/math] are constants.

Independence Test using HSIC

Theorem: Denote by [math]\displaystyle{ \mathcal{F} }[/math], [math]\displaystyle{ \mathcal{G} }[/math] RKHSs with universal kernels [math]\displaystyle{ \,\!k }[/math], [math]\displaystyle{ \,\!l }[/math] on the compact domains [math]\displaystyle{ \mathcal{X} }[/math] and [math]\displaystyle{ \mathcal{Y} }[/math] respectively. We assume without loss of generality that [math]\displaystyle{ \|f\|_{\infty}\leq 1 }[/math] and [math]\displaystyle{ \|g\|_{\infty}\leq 1 }[/math] for all [math]\displaystyle{ f \in \mathcal{F} }[/math] and [math]\displaystyle{ g \in \mathcal{G} }[/math]. Then [math]\displaystyle{ \|C_{xy}\|_{HS} =0 }[/math] if and only if [math]\displaystyle{ \,\!x }[/math] and [math]\displaystyle{ \,\!y }[/math] are independent.
Based on this result, to maximize the dependence between two kernels we need to increase the value of the empirical estimate, i.e., [math]\displaystyle{ \,\!\textbf{tr}(KHLH) }[/math].
It can be shown that if even one of the kernels [math]\displaystyle{ \,\!K }[/math] or [math]\displaystyle{ \,\!L }[/math] is already centered, we may drop the centering matrices [math]\displaystyle{ \,\!H }[/math] and simply use the objective function [math]\displaystyle{ \,\!\textbf{tr}(KL) }[/math].

Kernel Canonical Correlation Analysis (CCA)

In Classical CCA linear mapping [math]\displaystyle{ a^{T}X }[/math] and [math]\displaystyle{ b^{T}Y }[/math] are obtained, respectively for random variables [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math], such that the linear projections have maximum correlation.
Kernel CCA extends this method to estimate functions [math]\displaystyle{ f\in \mathcal{F} }[/math] and [math]\displaystyle{ g\in \mathcal{G} }[/math] such that the correlation of [math]\displaystyle{ f(X) }[/math] and [math]\displaystyle{ g(Y) }[/math] is maximized.
to find [math]\displaystyle{ f }[/math] and [math]\displaystyle{ g }[/math] the following optimization problem must be solved:

[math]\displaystyle{ max_{(f \in \mathcal{F} , g \in \mathcal{G})(f \neq 0 , g \neq 0)} \frac{Cov(f(x),g(Y))}{(Var(f(X))_{1/2} (Var(g(y))_{1/2}} }[/math]

For finite independent sample [math]\displaystyle{ {(X_{1},Y_{1}),(X_{2},Y_{2}),...,(X_{n},Y_{n})} }[/math] estimators for functions [math]\displaystyle{ f }[/math] and [math]\displaystyle{ g }[/math] can be obtained by optimizing the empirical version of the optimization problem:

[math]\displaystyle{ max_{(f \in \mathcal{F} , g \in \mathcal{G})(f \neq 0 , g \neq 0)} \frac{\widehat{Cov}(f(x),g(Y))}{(\widehat{Var}(f(X)+\epsilon_{n}||f||_{\mathcal{F}}^{2})_{1/2} (\widehat{Var}(g(y)+\epsilon_{n}||g||_{\mathcal{G}}^{2})_{1/2}} }[/math]

for which:

[math]\displaystyle{ \widehat{Cov}(f(x),g(Y))= \frac {1}{n} \sum_{i=1}^{n} \left( f(X_{i})- \sum_{j=1}^{n}f(X_{j}) \right) \left( g(Y_{i})- \sum_{j=1}^{n}g(Y_{j}) \right) }[/math]
[math]\displaystyle{ \widehat{Var}(f(X))= \frac {1}{n} \sum_{i=1}^{n} \left( f(X_{i})- \sum_{j=1}^{n}f(X_{j}) \right)^{2} }[/math] [math]\displaystyle{ \widehat{Var}(g(Y))= \frac {1}{n} \sum_{i=1}^{n} \left( g(Y_{i})- \sum_{j=1}^{n}g(Y_{j}) \right)^{2} }[/math]

Representation of Kernel CCA by Cross-Covariance Operator

Optimization problem for kernel CCA can be formulated using Cross-Covariance Operator for [math]\displaystyle{ (X,Y) }[/math]

[math]\displaystyle{ Sup_{(f \in \mathcal{F} , g \in \mathcal{G})} \langle g,\Sigma_{XY} f \rangle_{\mathcal{G}} }[/math] subject to [math]\displaystyle{ \langle f,\Sigma_{XX} f \rangle_{\mathcal{F}}=1 }[/math] and [math]\displaystyle{ \langle g,\Sigma_{YY} g \rangle_{\mathcal{G}}=1 }[/math]

Solution of this problem is the eigenfunction of the largest eigenvalue of the following eigenproblem

[math]\displaystyle{ \Sigma_{YX} f = \rho_{1} \Sigma_{YY} g }[/math]

[math]\displaystyle{ \Sigma_{YX} g = \rho_{1} \Sigma_{XX} f }[/math]