Measuring Statistical Dependence with Hilbert-Schmidt Norm

"Hilbert-Schmidt norm of the cross-covariance operator" is proposed as an independence criterion in reproducing kernel Hilbert spaces (RKHSs). The measure is referred to as the Hilbert-Schmidt Independence Criterion, or HSIC. An empirical estimate of this measure is introduced, which may be used in many practical applications such as Independent Component Analysis (ICA), Maximum Variance Unfolding (MVU), feature extraction, and feature selection.

==Introduction==


In this article <ref>Gretton, A., Bousquet, O., Smola, A. J., and Schölkopf, B., Measuring Statistical Dependence with Hilbert-Schmidt Norms, Proceedings of Algorithmic Learning Theory, S. Jain and W.-S. Lee (Eds.), 2005.</ref>, a criterion is introduced for measuring the dependence between two ''multivariate'' random variables. More specifically, we are looking for an appropriate function of two random variables whose output is zero if the two random variables are independent and large if they are statistically dependent.
If, instead of "independence", we were looking for "uncorrelatedness", the situation would be much easier to handle, because the "correlation" of two random variables involves only ''expected values'', as opposed to the concept of independence between two random variables.
Fortunately, based on the following theorem, we can still effectively tackle the problem of independence between two random variables:<br><br>
'''Theorem:''' Two random variables <math>\,\!x</math> and <math>\,\!y</math> are independent if and only if every pair of bounded continuous functions of the two random variables is uncorrelated (i.e., has zero covariance).<br><br>
For example, for the random variables <math>\,\!x</math> and <math>\,\!y</math> to be independent, all pairs <math>(x,y),\, (x^2,y),\,(x,e^y), \,(\sin(x),\log(y)),\, \dots</math> have to be uncorrelated. That is why we need to deal with ''functions'' of random variables when examining the independence property, and that is why Hilbert spaces and concepts from functional analysis are involved in this article.
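As a quick numerical illustration of this theorem (a minimal sketch of our own, not taken from the paper), the snippet below draws <math>\,\!x</math> uniformly on <math>[-1,1]</math> and sets <math>\,\!y=x^2</math>: the pair <math>(x,y)</math> is uncorrelated, but the nonlinear pair <math>(x^2,y)</math> is not, which exposes the dependence.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100_000)   # symmetric around zero
y = x ** 2                                 # fully determined by x, hence dependent on x

# Covariance of (x, y) is (numerically) zero, since E[x^3] = E[x]E[x^2] = 0:
print(np.cov(x, y)[0, 1])          # ~0  ->  x and y are uncorrelated

# Covariance of the nonlinear pair (x^2, y) is far from zero -> dependence revealed:
print(np.cov(x ** 2, y)[0, 1])     # ~0.09 = Var(x^2) > 0
</syntaxhighlight>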
===How the rest of this note is organized===
In the following, we first give an introductory note on the concepts from functional analysis and Hilbert spaces required in this paper (no prior knowledge of these concepts is assumed). Then we introduce the independence measure, namely the Hilbert-Schmidt Independence Criterion, and explain why we expect it to work properly. Thirdly, we introduce a statistic which may be used to estimate the HSIC and discuss its convergence behavior. Finally, through some examples, we demonstrate how this criterion may be effectively employed to solve many well-known problems.
==Simple introduction to Reproducing Kernel Hilbert Spaces==
===Basic concepts===
Let <math>\mathcal{F}</math> denote the set containing all real-valued functions defined on a domain <math>\,\!\mathcal{X}</math>. By real-valued, we mean that the output of these functions is always a real number. (Note that this definition makes good sense considering the methodology mentioned in the introduction section.) We may note that this space is ''linear'', as for any <math>f,g\in\mathcal{F}</math> and <math>a,b\in\mathbb{R}</math>, <math>\,\!af+bg</math> will be a member of <math>\mathcal{F}</math> as well. We may also define a ''real-valued'' inner product between the elements of <math>\mathcal{F}</math>. The simplest and most familiar choice is probably:<br>
<math>\langle f,g\rangle=\int_{}^{}{f(t)g(t)dt}</math><br>
However, other definitions of the inner product are also possible. The only required conditions are that:<br>
1. <math>\langle f,f\rangle\geq0,\,\forall\, f\in\mathcal{F} </math> with <math>\langle f,f\rangle=0</math> iff <math>\,\!f=0</math>.<br>
2. <math>\langle f,g\rangle=\langle g,f\rangle,\, \forall \,f,g\in \mathcal{F}</math>.<br>
3. <math>\langle af+bg,h\rangle=a\langle f,h\rangle+b\langle g,h\rangle,\,\forall\,f,g,h\in\mathcal{F}</math> and all real <math>\,\!a</math> and <math>\,\!b</math>.<br>


Endowed with this inner product, the set <math>\mathcal{F}</math> is, in fact, a ''Hilbert space''. (In general, a Hilbert space is a complete linear space endowed with an inner product satisfying the above properties. However, in this article we only deal with Hilbert spaces whose elements are real-valued functions on <math>\mathcal{X}</math>.) <br>
A mapping from the Hilbert space to the real numbers, <math>L:\mathcal{F}\rightarrow \mathbb{R}</math>, is called a '''functional'''. A functional is called ''linear'' if the underlying mapping is linear, i.e.:<br>
<math>\,\!L(af+bg)=aL(f)+bL(g)</math><br>
for all <math>\,\!f</math> and <math>\,\!g</math> in <math>\mathcal{F}</math> and all real <math>\,\!a</math> and <math>\,\!b</math>. A simple (and yet very important) example of a functional is the function which takes an element of <math>\mathcal{F}</math> (which is a real-valued function on <math>\mathcal{X}</math>) as input and returns its value at a specific fixed point <math>\,\!x</math>. This functional is called the Dirac evaluation functional and is denoted by <math>\delta_x:\mathcal{F} \rightarrow \mathbb{R}</math>. Thus we have <math>\,\!\delta_x(f)=f(x)</math>. It can easily be seen that <math>\,\!\delta_x</math> is a linear functional.<br>
There is a theorem on linear functionals, named the "Riesz representation theorem", which plays a key role in functional analysis:<br>
'''Theorem (Riesz representation):''' If <math>\,\!L</math> is a bounded linear functional on a Hilbert space <math>\mathcal{F}</math>, then there exists a ''unique'' element <math>\,\!h_f</math> in <math>\mathcal{F}</math> such that <math>L(f)=\langle f,h_f\rangle</math> for all <math>\,\!f</math> in <math>\mathcal{F}</math>.<br>
As mentioned above, <math>\,\!\delta_x</math> is a linear functional, and thus, when it is bounded, the Riesz theorem guarantees an element of <math>\mathcal{F}</math>, denoted by <math>\,\!k_x</math>, such that:<br>
<math>\delta_x(f)=f(x)=\langle f,k_x\rangle</math> for all <math>f\in\mathcal{F}</math>.<br>
For example, for the inner product defined in the integral form above, <math>k_x</math> would simply be a Dirac delta function peaked at the point <math>x</math>.


===Reproducing Kernel Hilbert Spaces (RKHS)===
Reproducing Kernel Hilbert Spaces are a subtype of the Hilbert spaces introduced above. A Hilbert space is an RKHS if for all <math>x\in \mathcal{X}</math>, the corresponding Dirac evaluation operator <math>\delta_x:\mathcal{F} \rightarrow \mathbb{R}</math> is a continuous (equivalently, bounded) linear functional. Based on this, Reproducing Kernel Hilbert Spaces in fact represent a family of Hilbert spaces with smooth real-valued functions as their elements. <br>
To every RKHS we may assign a symmetric positive definite function <math>k:\,\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}</math>, named the '''kernel'''. Using the Dirac evaluation functional, the kernel function is defined as:<br>
<math>\,\!k(x,y):=k_x(y)</math><br>
It is not hard to see that:<br>
<math>k(x,y)=k_x(y)=\langle k_x,k_y\rangle</math><br>
It may be shown that there is a one-to-one correspondence between an RKHS and its kernel. In other words, given such a Hilbert space, its kernel function is identified uniquely; and conversely, if <math>\,\!k</math> is a symmetric positive definite function of the form <math>k:\mathcal{X}\times \mathcal{X}\rightarrow \mathbb{R}</math>, then there is a unique Hilbert space of functions on <math>\mathcal{X}</math> for which <math>\,\!k</math> is a reproducing kernel.<br>
The elements of the RKHSs we consider here are always the same, i.e., smooth real-valued functions on a set <math>\mathcal{X}</math>. However, they may have different inner products defined on them and thus different kernels.
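To make these definitions concrete, here is a small numerical sketch (using a Gaussian RBF kernel purely as an illustrative assumption) showing that a kernel Gram matrix is symmetric positive semi-definite, and that a function built from kernel sections <math>k_{x_i}</math> is evaluated through the kernel itself, in line with the reproducing property <math>f(x)=\langle f,k_x\rangle</math>:
<syntaxhighlight lang="python">
import numpy as np

def rbf(a, b, sigma=1.0):
    """Gaussian RBF kernel k(a, b) = exp(-(a - b)^2 / (2 sigma^2))."""
    return np.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))

pts = np.array([-1.0, 0.0, 0.5, 2.0])
K = rbf(pts[:, None], pts[None, :])              # Gram matrix K_ij = k(x_i, x_j)
print(np.allclose(K, K.T))                       # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-12))   # positive semi-definite

# A function in the RKHS spanned by kernel sections: f = sum_i alpha_i k_{x_i}
alpha = np.array([0.3, -1.0, 2.0, 0.7])
x_new = 0.25
# Reproducing property: f(x_new) = <f, k_{x_new}> = sum_i alpha_i <k_{x_i}, k_{x_new}>
#                                = sum_i alpha_i k(x_i, x_new)
print(alpha @ rbf(pts, x_new))
</syntaxhighlight>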
==The independence measure==
Now we turn our attention back to the problem of measuring independence between two (generally multivariate) random variables <math>\,\!x</math> and <math>\,\!y</math>. Guided by the idea described in the introduction section, we define an RKHS <math>\mathcal{F}</math> from <math>\mathcal{X}</math> to <math>\mathbb{R}</math> containing all continuous bounded real-valued functions of <math>\,\!x</math>, and an RKHS <math>\mathcal{G}</math> from <math>\mathcal{Y}</math> to <math>\mathbb{R}</math> containing all continuous bounded real-valued functions of <math>\,\!y</math>. Here, <math>\mathcal{X}</math> and <math>\mathcal{Y}</math> denote the support (set of possible values) of the random variables <math>\,\!x</math> and <math>\,\!y</math> respectively. As mentioned before, we are particularly interested in the cross-covariance between elements of <math>\mathcal{F}</math> and <math>\mathcal{G}</math>:<br>
<math>\text{cov}(f(x),g(y))=\mathbf{E}_{x,y}[f(x)g(y)]-\mathbf{E}_{x}[f(x)]\mathbf{E}_{y}[g(y)]</math><br>
In functional analysis, an ''operator'' is a mapping from the elements of one Hilbert space to the elements of another. It may be shown that there exists a unique operator <math>\,\!C_{x,y}:\mathcal{G}\rightarrow \mathcal{F}</math> mapping elements of <math>\mathcal{G}</math> to elements of <math>\mathcal{F}</math> such that:<br>
<math>\langle f,C_{x,y}(g)\rangle_{\mathcal{F}}=\text{cov}(f,g)</math><br>
for all <math>f\in \mathcal{F}</math> and <math>g\in \mathcal{G}</math>. This operator is called the '''cross-covariance operator'''. <br>
We may define the concept of a norm for an operator. For example, consider an operator in the form of a matrix <math>C_{m\times n}</math> mapping vectors of size <math>n\times 1</math> in <math>\mathbb{R}^n</math> to vectors of size <math>m\times 1</math> in <math>\mathbb{R}^m</math>. Then the Frobenius norm of this matrix may be taken as the norm of the corresponding operator.<br> There are different norms defined for operators. One of them is called the '''Hilbert-Schmidt (HS) norm''' and is defined as follows:<br>
<math>\|C\|^2_{HS}:=\sum_{i,j}\langle Cv_i,u_j \rangle^2_{\mathcal{F}}</math><br>
where <math>\,\!u_j</math> and <math>\,\!v_i</math> are orthonormal bases of <math>\mathcal{F}</math> and <math>\mathcal{G}</math>, respectively. It is easy to see that the Frobenius norm on matrices may be considered a special case of this norm.<br><br>
Now, we are in a position to define our measure of the dependence of two random variables. The measure is the '''"Hilbert-Schmidt norm of the cross-covariance operator"''':<br>
<math>\text{HSIC}(p_{xy},\mathcal{F},\mathcal{G}):=\|C_{xy}\|^2_{HS}</math><br>
'''We may note that if <math>\|C_{xy}\|^2_{HS}</math> is zero, then the value of <math>\langle f,C_{x,y}(g)\rangle</math>, i.e., <math>\,\!\text{cov}(f,g)</math>, will always be zero for any <math>f\in \mathcal{F}</math> and <math>g\in \mathcal{G}</math>, and thus, based on the theorem given in the introduction section, the random variables <math>\,\!x</math> and <math>\,\!y</math> will be independent.''' We will return to this important fact in the section "Independence Test using HSIC" below.


==HSIC advantages==
HSIC has several advantages:<br>
First, compared to other measures, it is more straightforward to compute and no extra regularization terms are needed.<br>
Second, as the sample size increases, any existing dependence will be detected with high probability. This holds for none of the other kernel dependence tests, with the exception of the constrained covariance (COCO).<br>
Third, the finite sample bias of the estimate is negligible, being of order <math> m^{-1} </math>. This is not always true for the other kernel dependence tests.<br>
Last, it has been shown in ICA experiments that this independence criterion outperforms the competing kernel dependence measures and is competitive with the best dedicated ICA methods.


==HSIC in terms of kernels==
To compute HSIC we need to express it in terms of kernel functions. It can be shown that this can be achieved via the following identity:<br>
<math>\text{HSIC}(p_{xy},\mathcal{F},\mathcal{G})=\mathbf{E}_{x,x',y,y'}[k(x,x')l(y,y')]+\mathbf{E}_{x,x'}[k(x,x')]\mathbf{E}_{y,y'}[l(y,y')]-2\mathbf{E}_{x,y}[\mathbf{E}_{x'}[k(x,x')]\mathbf{E}_{y'}[l(y,y')]]</math><br>
where <math>\,\!k</math> and <math>\,\!l</math> denote the kernels of <math>\mathcal{F}</math> and <math>\mathcal{G}</math>, and <math>\,\!(x',y')</math> denotes an independent copy of the pair <math>\,\!(x,y)</math>.


Because the data are random, the kernel evaluations are random as well, which is why the expression involves expected values.<br> The proof of this identity follows. However, the reader may skip the proof and follow the rest of the article without losing the main idea of HSIC.
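As an informal sanity check of the identity (a sketch of our own; Gaussian kernels are an assumption here, not something prescribed by the paper), each expectation can be approximated by a sample average over the Gram matrices of a paired sample:
<syntaxhighlight lang="python">
import numpy as np

def rbf_gram(z, sigma=1.0):
    d2 = (z[:, None] - z[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic_from_expectations(x, y, sigma=1.0):
    """Plug sample averages into the kernel expression of HSIC (biased estimate)."""
    K, L = rbf_gram(x, sigma), rbf_gram(y, sigma)
    term1 = (K * L).mean()                             # E_{x,x',y,y'}[k(x,x') l(y,y')]
    term2 = K.mean() * L.mean()                        # E_{x,x'}[k] * E_{y,y'}[l]
    term3 = (K.mean(axis=1) * L.mean(axis=1)).mean()   # E_{x,y}[ E_{x'}[k] E_{y'}[l] ]
    return term1 + term2 - 2.0 * term3

rng = np.random.default_rng(1)
x = rng.normal(size=500)
print(hsic_from_expectations(x, x + 0.1 * rng.normal(size=500)))  # dependent: clearly > 0
print(hsic_from_expectations(x, rng.normal(size=500)))            # independent: close to 0
</syntaxhighlight>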


===Proof===
First we need to introduce some concepts.
====Mean====
The mean elements of <math>\mathcal{F}</math> and <math>\mathcal{G}</math> are defined as those elements of these spaces for which:<br>


<math>\langle\mu_x,f \rangle_{\mathcal{F}}=\mathbf{E}_x[\langle\phi(x),f \rangle_{\mathcal{F}}]=\mathbf{E}_x[f(x)]</math><br>
<math>\langle\mu_y,g \rangle_{\mathcal{G}}=\mathbf{E}_y[\langle\psi(y),g \rangle_{\mathcal{G}}]=\mathbf{E}_y[g(y)]</math><br>
Here <math>\,\!\phi(x):=k_x</math> and <math>\,\!\psi(y):=l_y</math> are the feature maps of <math>\mathcal{F}</math> and <math>\mathcal{G}</math>, so that <math>\langle\phi(x),\phi(x')\rangle_{\mathcal{F}}=k(x,x')</math> and <math>\langle\psi(y),\psi(y')\rangle_{\mathcal{G}}=l(y,y')</math>. Note that, according to the Riesz representation theorem, these mean elements always exist and are unique. One may easily show that <math>\|\mu_x\|^2_{\mathcal{F}}</math> can be calculated by applying the expectation twice, as follows:<br>
<math>\|\mu_x\|^2_{\mathcal{F}}=\mathbf{E}_{x,x'}[\langle \phi(x),\phi(x')\rangle_{\mathcal{F}}]=\mathbf{E}_{x,x'}[k(x,x')]</math>


====Tensor Product Operator====
We may employ any <math>f\in \mathcal{F}</math> and <math>g\in \mathcal{G}</math> to define a tensor product operator <math>f\otimes g:\mathcal{G}\rightarrow\mathcal{F}</math> as follows:<br>
<math>(f\otimes g)h:=f\langle g,h\rangle_{\mathcal{G}} \quad</math> for all <math>h\in\mathcal{G}</math>
 
====Hilbert-Schmidt operators====
A Hilbert-Schmidt (HS) operator is a linear operator <math>C:\mathcal{G}\rightarrow\mathcal{F}</math> for which the HS norm exists (i.e., is finite). We may define an inner product between two HS operators as follows:<br>
<math>\langle C,D\rangle_{HS}:=\sum_{i,j}{\langle Cv_i,u_j\rangle_{\mathcal{F}}\langle Dv_i,u_j\rangle_{\mathcal{F}}}</math><br>
One may easily prove the following lemma for the inner product of two tensor product operators:<br>
'''Lemma:''' For any <math>f_1,f_2\in \mathcal{F}</math> and <math>g_1,g_2 \in \mathcal{G}</math> the following equation holds:<br> <math>\langle f_1\otimes g_1,f_2\otimes g_2\rangle_{HS}=\langle f_1,f_2\rangle_{\mathcal{F}}\langle g_1,g_2\rangle_{\mathcal{G}}</math><br>
<br>Using this lemma, one can easily show that the norm of <math>f\otimes g</math> equals<br>
<math>\|f\otimes g\|^2_{HS}=\|f\|^2_{\mathcal{F}}\; \|g\|^2_{\mathcal{G}}</math>
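In a finite-dimensional setting the tensor product operator <math>f\otimes g</math> is just the outer product matrix <math>fg^T</math>, and the HS norm reduces to the Frobenius norm, so both the lemma and the norm identity can be verified numerically (a small sketch for intuition only):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
f1, f2 = rng.normal(size=4), rng.normal(size=4)   # elements of a 4-dimensional "F"
g1, g2 = rng.normal(size=3), rng.normal(size=3)   # elements of a 3-dimensional "G"

T1 = np.outer(f1, g1)   # (f1 ⊗ g1) acts on h as f1 <g1, h>, i.e. as the matrix f1 g1^T
T2 = np.outer(f2, g2)

# Lemma: <f1⊗g1, f2⊗g2>_HS = <f1, f2> <g1, g2>
print(np.allclose(np.sum(T1 * T2), (f1 @ f2) * (g1 @ g2)))

# Norm identity: ||f⊗g||_HS^2 = ||f||^2 ||g||^2 (Frobenius norm of the outer product)
print(np.allclose(np.sum(T1 ** 2), (f1 @ f1) * (g1 @ g1)))
</syntaxhighlight>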
 
====Cross-covariance Operator====
The cross-covariance operator, introduced above, may be equivalently expressed as follows:<br>
<math>C_{xy}:=\underbrace{\mathbf{E}_{x,y}[\phi(x)\otimes\psi(y)]}_{:=\tilde{C}_{xy}}-\underbrace{\mu_x\otimes\mu_y}_{:=M_{xy}}</math><br>
This expression follows from the linearity of expectation. The auxiliary operators <math>\tilde{C}_{xy}</math> and <math>\,\!M_{xy}</math> will be used as the basis of our measure of dependence.
 
====Body of the proof====
Expanding <math>\,\!C_{xy}</math> using the above formula, we have:<br>
<math>\|C_{xy}\|^2_{HS}=\langle \tilde{C}_{xy}-M_{xy},\tilde{C}_{xy}-M_{xy}\rangle_{HS}=\mathbf{E}_{x,y,x',y'}[\langle \phi(x)\otimes\psi(y),\phi(x')\otimes\psi(y')\rangle_{HS}]</math><br><math>-2\mathbf{E}_{x,y}[\langle \phi(x)\otimes\psi(y),\mu_x\otimes\mu_y\rangle_{HS}]+\langle \mu_x\otimes\mu_y,\mu_x\otimes\mu_y\rangle_{HS}</math><br>
Applying the above lemma and using the definitions of <math>\,\!\mu_x</math> and <math>\,\!\mu_y</math> together with the fact that <math>\langle \phi(x),\phi(x')\rangle_{\mathcal{F}}=k(x,x')</math>, we arrive at the kernel representation of HSIC.
 
==Independence Test using HSIC==
In the previous section, we saw that if two RKHSs (containing ''all'' continuous real-valued functions on <math>\mathcal{X}</math>) are given along with an inner product defined on their elements, we may use their corresponding kernels to evaluate the HSIC. In particular, if the HSIC obtained from the kernel representation equals zero, we may conclude that the underlying random variables are independent. In this section, we address another question: if two kernel functions are already given, can we use the HSIC expressed in terms of kernel functions to decide whether the underlying random variables are independent?<br>
The answer to this question is not hard. As we mentioned in the "Basic concepts" section, each kernel of the form <math>\mathcal{X}\times \mathcal{X} \rightarrow \mathbb{R}</math> has a one-to-one correspondence with an RKHS. Based on this, all we need from the applied kernel is that its corresponding RKHS must contain (almost) all continuous real-valued functions on <math>\mathcal{X}</math>. That is, in fact, the concept of a ''universal'' kernel. For more details see <ref>Gretton, A., Smola, A., Bousquet, O., Herbrich, R., Belitski, A., Augath, M., Murayama, Y., Pauls, J., Schölkopf, B., and Logothetis, N., Kernel Constrained Covariance for Dependence Measurement, AISTATS, vol. 10, 2005.</ref>. Now we present a theorem stating this fact in an exact mathematical framework:<br>
 
'''Theorem:''' Denote by <math>\mathcal{F}</math>, <math>\mathcal{G}</math> RKHSs with universal
kernels <math>\,\!k</math>, <math>\,\!l</math> on the compact domains <math>\mathcal{X}</math> and <math>\mathcal{Y}</math> respectively. We assume without
loss of generality that <math>\|f\|_{\infty}\leq 1</math> and <math>\|g\|_{\infty}\leq 1</math> for all <math>f \in \mathcal{F}</math> and <math>g \in \mathcal{G}</math>. Then
<math>\|C_{xy}\|_{HS} =0</math> if and only if <math>\,\!x</math> and <math>\,\!y</math> are independent.<br>
 
==Empirical Criterion==
===Definition (Empirical HSIC)===
Let <math>Z:=\{(x_1,y_1),\cdots,(x_m,y_m)\}\subseteq \mathcal{X}\times\mathcal{Y}</math> be a series of <math>\,\!m</math> independent observations drawn from <math>p_{x\,\!y}</math>. An estimator of HSIC is given by<br>
<math>\text{HSIC}(Z,\mathcal{F},\mathcal{G}):=(m-1)^{-2}\textbf{tr}(KHLH)</math><br>
where <math>H, K, L\in\mathbb{R}^{m\times m}, K_{ij}:=k(x_i,x_j), L_{ij}:=l(y_i,y_j) \,\, \text{and} \, \, H_{ij}:=\delta_{ij}-m^{-1}</math>.<br>
Based on this result, to maximize the dependence between two kernels we need to increase the value of the empirical estimate, i.e., <math>\,\!\textbf{tr}(KHLH)</math>. We will see some applications of this formulation in the application section.<br>
It is interesting to note that if one of the kernels <math>\,\!K</math> or <math>\,\!L</math> is already centered, say <math>\,L</math>, then <math>\,HLH=L</math> and thus we may simply use the objective function <math>\,\!\textbf{tr}(KL)</math> which no longer includes the centering matrix <math>\,H</math>. Similarly, if <math>\,HKH=K</math> we may rewrite <math>\,\text{Tr}[KHLH]=\text{Tr}[HKHL]</math> and arrive at the same result.
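A minimal sketch of this estimator is given below (Gaussian kernels and the bandwidth are our own illustrative choices); it implements <math>(m-1)^{-2}\textbf{tr}(KHLH)</math> literally and behaves as expected on dependent versus independent samples.
<syntaxhighlight lang="python">
import numpy as np

def rbf_gram(z, sigma=1.0):
    z = z.reshape(len(z), -1)
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def empirical_hsic(x, y, sigma=1.0):
    """HSIC(Z) = (m-1)^{-2} tr(K H L H), with H the centering matrix."""
    m = len(x)
    K, L = rbf_gram(x, sigma), rbf_gram(y, sigma)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

rng = np.random.default_rng(3)
x = rng.normal(size=300)
print(empirical_hsic(x, np.sin(2 * x) + 0.1 * rng.normal(size=300)))  # dependent: relatively large
print(empirical_hsic(x, rng.normal(size=300)))                        # independent: close to zero
</syntaxhighlight>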
 
==Bias of Estimator==
It may be shown that the bias of the above empirical estimate is of the order <math>\,\!O(m^{-1})</math>:<br>
'''Theorem:''' Let <math>\mathbf{E}_Z</math> denote the expectation taken over <math>\,\!m</math> independent copies <math>\,\!(x_i,y_i)</math> drawn from <math>p_{\,\!xy}</math>. Then<br>
<math>\text{HSIC}(p_{xy},\mathcal{F},\mathcal{G})=\mathbf{E}_Z[\text{HSIC}(Z,\mathcal{F},\mathcal{G})]+O(m^{-1})</math>.
 
==Bound on Empirical HSIC==
'''Theorem:''' Assume that <math>\,\!k</math> and <math>\,\!l</math> are bounded almost everywhere by 1, and are non-negative. Then for <math>\,\!m>1</math> and all <math>\,\!\delta>0</math>, with probability at least <math>\,\!1-\delta</math>, for all <math>\,\!p_{xy}</math>:<br>
<math>|\text{HSIC}(p_{xy},\mathcal{F},\mathcal{G})-\text{HSIC}(Z,\mathcal{F},\mathcal{G})|\leq\sqrt{\frac{\log(6/\delta)}{\alpha^2m}}+\frac{C}{m}</math><br>
where <math>\,\!\alpha>0.24</math> and <math>\,\!C</math> are constants.
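In practice, instead of evaluating the constants in this bound, a common way to turn the empirical HSIC into an independence test is a permutation test (a standard device, not something taken from this paper): shuffling one of the samples destroys any dependence and thereby simulates the null distribution. A sketch, reusing the <code>empirical_hsic</code> function from the earlier sketch, e.g. <code>hsic_permutation_test(x, y, empirical_hsic)</code>:
<syntaxhighlight lang="python">
import numpy as np

def hsic_permutation_test(x, y, hsic_fn, n_perm=200, seed=0):
    """Approximate p-value for H0: x and y are independent."""
    rng = np.random.default_rng(seed)
    observed = hsic_fn(x, y)
    null_stats = [hsic_fn(x, y[rng.permutation(len(y))]) for _ in range(n_perm)]
    # Fraction of permuted statistics at least as large as the observed one.
    return (1 + sum(s >= observed for s in null_stats)) / (n_perm + 1)
</syntaxhighlight>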
 
 
 
==Efficient computation in HSIC==
Computational cost is a factor in choosing HSIC. Using a low-rank decomposition of the Gram matrices via an incomplete Cholesky decomposition results in a cost saving, according to the following lemma:<br />
''Lemma 2 (Efficient approximation to HSIC): Let'' <math>K\approx AA^T</math> ''and'' <math>L \approx BB^T</math>, ''where'' <math>A\in \mathbb{R}^{m\times d_f}</math> ''and'' <math>B\in \mathbb{R}^{m\times d_g}</math>. ''Then we may approximate'' <math>\textbf{tr}(HKHL)</math> ''in'' <math>O(m({d_f}^2+{d_g}^2))</math> ''time.''
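The saving comes from never forming the <math>m\times m</math> Gram matrices. A sketch (assuming the low-rank factors <math>A</math> and <math>B</math> are already available, e.g. from an incomplete Cholesky routine that is not implemented here) based on the identity <math>\textbf{tr}(HAA^THBB^T)=\|(HA)^TB\|_F^2</math>:
<syntaxhighlight lang="python">
import numpy as np

def tr_HKHL_lowrank(A, B):
    """tr(H A A^T H B B^T) = ||(H A)^T B||_F^2, where H = I - (1/m) 1 1^T.
    Centering A by its column means applies H implicitly, so nothing of size
    m x m is ever formed; the cost is linear in m for fixed ranks d_f and d_g."""
    A_c = A - A.mean(axis=0, keepdims=True)   # H A: subtract column means
    M = A_c.T @ B                             # d_f x d_g matrix, O(m * d_f * d_g)
    return float(np.sum(M * M))

# Quick consistency check against the dense computation.
rng = np.random.default_rng(4)
m, df, dg = 200, 5, 4
A, B = rng.normal(size=(m, df)), rng.normal(size=(m, dg))
K, L = A @ A.T, B @ B.T
H = np.eye(m) - np.ones((m, m)) / m
print(np.allclose(tr_HKHL_lowrank(A, B), np.trace(H @ K @ H @ L)))
</syntaxhighlight>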
==Applications==
===Colored Maximum Variance Unfolding (Colored MVU)===
In the original version of MVU, a kernel <math>\,K</math> is learned by maximizing <math>\,\text{Tr}[K]</math> subject to some constraints. Now assume we have some similarity/dissimilarity relationships between data points stored in a matrix <math>B</math>. We may modify the original form of MVU by maximizing <math>\,\text{Tr}[KB]</math> instead of <math>\,\text{Tr}[K]</math> subject to the same constraints. In this way, while keeping the pairwise distances between data points (as in the original form of MVU), we simultaneously maximize the dependence between the learned kernel <math>\,\!K</math> and the kernel <math>\,\!B</math> storing our side information. Note that, in this formulation, the matrix <math>\,\!K</math> is centered via one of the constraints of the problem, and thus we can exclude the centering matrix <math>\,\!H</math> from the empirical estimate of HSIC.
===Principal Component Analysis (PCA)===


As is well known, in the PCA formulation <math>X\approx WH</math>, the matrix <math>\,\!H</math> may be thought of as a low-dimensional representation of <math>\,\!X</math>. The objective function in PCA may be written as follows:<br>
<math>\max_H \,\text{Tr}[HX^TXH^T]=\text{Tr}[(X^TX)(H^TH)]\,\,</math>    subject to <math>\,HH^T=I</math>.<br> As marked on the right-hand side of the equation above, this expression may be interpreted as finding a matrix <math>\,\!H</math> such that the linear kernel obtained from it has maximum dependence with the kernel obtained from the original data matrix <math>\,\!X</math>. Note that here we again assume that the data matrix <math>\,\!X</math> is already centered, so we do not need to include the centering matrix in our HSIC estimate.
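As a quick numerical check of this interpretation (a sketch of our own; the optimal <math>\,\!H</math> is obtained here via an SVD rather than any particular PCA routine), the trace objective evaluated at the optimal orthonormal <math>\,\!H</math> equals the sum of the top squared singular values of the centered data matrix:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(5)
d, n, k = 6, 200, 2
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)        # centre the data matrix

# The optimal H (k x n, with H H^T = I) consists of the top-k right singular vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
H = Vt[:k]

obj = np.trace((X.T @ X) @ (H.T @ H))        # Tr[(X^T X)(H^T H)]
print(np.allclose(obj, np.sum(s[:k] ** 2)))  # equals the sum of the top-k squared singular values
print(np.allclose(H @ H.T, np.eye(k)))       # the orthonormality constraint holds
</syntaxhighlight>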


===Non-negative matrix factorization (NMF)===
NMF with an orthogonality constraint on <math>\,\!H</math> may be expressed as <math>\,\! \max_{H\geq 0} \text{Tr}((X^{T}X)(H^{T}H))</math> subject to <math>\,HH^T=I</math>.


==Related work==
One of the previously proposed methods for measuring dependence between functions in RKHSs, the constrained covariance (COCO) of Gretton et al., uses the largest singular value of the cross-covariance operator, which behaves identically to the correlation operator at independence, while requiring no regularization.
====This paper (HSIC)====
The paper is indeed an extension of '''COCO''', in the sense that it looks at the entire spectrum of the cross-covariance operator to determine whether all of its singular values are zero, instead of looking at only the largest singular value, which is what '''COCO''' does. It is thus intuitively clear that a more robust indication of independence than '''COCO''' can be obtained. More specifically, the measure of independence is the '''sum of the squared singular values of the cross-covariance operator''' (its squared Hilbert-Schmidt norm), which the authors termed HSIC (Hilbert-Schmidt Independence Criterion).


== Kernel Canonical Correlation Analysis (CCA)==

<math> \Sigma_{YX} g = \rho_{1} \Sigma_{XX} f </math>
====References====
<references/>

Latest revision as of 08:45, 30 August 2017

"Hilbert-Schmist Norm of the Cross-Covariance operator" is proposed as an independence criterion in reproducing kernel Hilbert spaces (RKHSs). The measure is refereed to as Hilbert-Schmidt Independence Criterion, or HSIC. An empirical estimate of this measure is introduced which may be well used in many practical application such as independent Component Analysis (ICA), Maximum Variance Unfolding (MVU), feature extraction, feature selection, ... .

Introduction

In this article <ref> Gretton, A., Bousquet, O., Smola, A. J., and Schölkopf, B., Measuring Statistical Dependence with Hilbert-Schmidt Norms, Proceedings Algorithmic Learning Theory, S. Jain and W.-S. Lee (Eds.), 2005.</ref>, a criterion is introduced which is used to measure the dependence between two multivariate random variables. More specifically, we are looking for an appropriate function of two random variables whose output is zero if the involving random variables are independent and achieves a high value if they are statistically dependent. If instead of "independence" we were looking for "uncorrelation" the situation would be much easier to handle. Because "correlation" of two random variables involves only expected values, as opposed to concept of independence between two random variables. However, fortunately, based on the following theorem we still can effectively tackle the problem of independence between two random variables:

Theorem: Two random variables [math]\displaystyle{ \,\!x }[/math] and [math]\displaystyle{ \,\!y }[/math] are independent if and only if any bounded continuous function of the two random variables are uncorrelated. (have zero covariance).

For example, for random variables [math]\displaystyle{ \,\!x }[/math] and [math]\displaystyle{ \,\!y }[/math] to be independent, all pairs [math]\displaystyle{ (x,y),\, (x^2,y),\,(x,e^y), \,(\sin(x),\log(y)),\, ... }[/math] have to be uncorrelated. That is why we need to deal with functions of random variables when examining the independence property. and That is why Hilbert spaces and concepts from "Functional Analysis" are involved in this article (Of course, no one likes to torture him/herself by using complicated mathematical expressions ;)).

How the rest of this note is organized

In the following, we first give an introductory note on the concepts from functional analysis and Hilbert spaces required in this paper. (No prior knowledge of these concepts is assumed here). Then we introduce the independence measure, namely, Hilbert-Schmidt Independence Criterion and explain why we expect it to work properly. Thirdly, we introduce an statistics which may be used to estimate the HSIC and discuss its convergence behavior. Finally, through some examples, we demonstrate how this criterion may be effectively employed to solve many well-known problems.

Simple introduction to Reproducing Kernel Hilbert Spaces

Basic concepts

Let [math]\displaystyle{ \mathcal{F} }[/math] denote the set containing all real-valued functions defined on domain [math]\displaystyle{ \,\!\mathcal{X} }[/math]. By real-valued, we mean the output of these functions is always a real number. (Note that this definition makes good sense considering the methodology mentioned in the introduction section.) We may note that this space is linear, as for any [math]\displaystyle{ f,g\in\mathcal{F} }[/math] and [math]\displaystyle{ a,b\in\mathbb{R} }[/math], [math]\displaystyle{ \,\!af+bg }[/math] will be a member of [math]\displaystyle{ \mathcal{F} }[/math], as well. On the other hand, we may define the concept of real-valued inner product between the elements of [math]\displaystyle{ \mathcal{F} }[/math]. The simplest way to do that probably is the familiar:
[math]\displaystyle{ \langle a,b\rangle=\int_{}^{}{f(t)g(t)dt} }[/math]
However, other definitions for the inner product is also possible. The only required conditions are that:
1. [math]\displaystyle{ \langle f,f\rangle\geq0,\,\forall\, f\in\mathcal{F} }[/math] with [math]\displaystyle{ \langle f,f\rangle=0 }[/math] iff [math]\displaystyle{ \,\!f=0 }[/math].
2. [math]\displaystyle{ \langle f,g\rangle=\langle g,f\rangle,\, \forall \,f,g\in \mathcal{F} }[/math].
3. [math]\displaystyle{ \langle af+bg,h\rangle=a\langle f,h\rangle+b\langle g,h\rangle,\,\forall\,f,g,h\in\mathcal{F} }[/math] and all real [math]\displaystyle{ \,\!a }[/math] and [math]\displaystyle{ \,\!b }[/math].

Endowed with the concept of inner product, the set [math]\displaystyle{ \mathcal{F} }[/math] here is, in fact, a Hilbert space. (In general, a Hilbert space is a (complete) linear space endowed with an inner product defined on the elements with the above properties. However, in this article we only deal with those Hilbert spaces whose elements are real-valued functions on [math]\displaystyle{ \mathcal{X} }[/math].)
A mapping from the Hilbert space to real numbers [math]\displaystyle{ L:\mathcal{F}\rightarrow \mathbb{R} }[/math] is called a functional. A functional is called linear if the underlying mapping is linear, i.e.:
[math]\displaystyle{ \,\!L(af+bg)=aL(f)+bL(g) }[/math]
for all [math]\displaystyle{ \,\!f }[/math] and [math]\displaystyle{ \,\!g }[/math] in [math]\displaystyle{ \mathcal{F} }[/math] and all real [math]\displaystyle{ \,\!a }[/math] and [math]\displaystyle{ \,\!b }[/math]. A simple (and yet very important) example of a functional is a function which gets an element of [math]\displaystyle{ \mathcal{F} }[/math] (which is a real-valued function on [math]\displaystyle{ \mathcal{X} }[/math]) as input and returns its value at a specific fixed point [math]\displaystyle{ \,\!x }[/math]. This functional is called Dirac evaluation functional and is denoted by [math]\displaystyle{ \delta_x:\mathcal{F} \rightarrow \mathbb{R} }[/math]. Thus we have [math]\displaystyle{ \,\!\delta_x(f)=f(x) }[/math]. It can be easily seen that [math]\displaystyle{ \,\!\delta_x }[/math] is a linear functional.
There is theorem on linear functionals named "Riesz representation Theorem" which plays a key role in functional analysis:
Theorem (Riesz representation): If [math]\displaystyle{ \,\!L }[/math] is a bounded linear functional on a Hilbert space [math]\displaystyle{ \mathcal{F} }[/math], then there exists a unique element [math]\displaystyle{ \,\!h_f }[/math] in [math]\displaystyle{ \mathcal{F} }[/math] such that [math]\displaystyle{ L(f)=\langle f,h_f\rangle }[/math] for all [math]\displaystyle{ \,\!f }[/math] in [math]\displaystyle{ \mathcal{F} }[/math].
As mentioned above, [math]\displaystyle{ \,\!\delta_x }[/math] is linear functional and thus according to Riesz theorem there is an element of [math]\displaystyle{ \mathcal{F} }[/math], denoted by [math]\displaystyle{ \,\!k_x }[/math] such that:
[math]\displaystyle{ \delta_x(f)=f(x)=\langle f,k_x\rangle }[/math] for all [math]\displaystyle{ f\in\mathcal{F} }[/math].
For example, for the inner product defined in the integral form above, [math]\displaystyle{ k_x }[/math] would be simply a delta Dirac function which peaks at point [math]\displaystyle{ x }[/math].

Reproducing Kernel Hilbert Spaces (RKHS)

Reproducing Kernel Hilbert Spaces are a subtype of Hilbert spaces introduced above. A Hilbert space is a RKHS if for all [math]\displaystyle{ x\in \mathcal{X} }[/math], the corresponding Dirac evaluation operator [math]\displaystyle{ \delta_x:\mathcal{F} \rightarrow \mathbb{R} }[/math] is a continuous linear functional. Based on this, Reproducing Kernel Hilbert Spaces, in fact, represent a family of Hilbert spaces with smooth real-valued functions as their elements.
To every RKHS we may assign a symmetric positive definite function [math]\displaystyle{ k:\,\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R} }[/math], named kernel. Using the Dirac evaluation functional, the kernel function is defined as:
[math]\displaystyle{ \,\!k(x,y):=k_x(y) }[/math]
It is not hard to see that:
[math]\displaystyle{ k(x,y)=k_x(y)=\langle k_x,k_y\rangle }[/math]
It may be shown that there is a unique correspondence between a RKHS and its kernel. In other words, given a Hilbert space, its kernel function would be identified uniquely; and inversely, if [math]\displaystyle{ \,\!k }[/math] is symmetric positive definite function of the form [math]\displaystyle{ k:\mathcal{X}\times \mathcal{X}\rightarrow \mathbb{R} }[/math], then there is a unique Hilbert space of functions on [math]\displaystyle{ \mathcal{X} }[/math] for which [math]\displaystyle{ \,\!k }[/math] is a reproducing kernel.
The elements of the RKHSs we consider here are always the same, i.e., the smooth real-valued functions on a set [math]\displaystyle{ \mathcal{X} }[/math]. However, they may have different inner products defined and thus different types of kernels.

The independence measure

Now we turn our attention back to the the problem of measuring independence between two (generally multivariate) random variables [math]\displaystyle{ \,\!x }[/math] and [math]\displaystyle{ \,\!y }[/math]. Induced by the idea described in the introduction section, we define RKHS [math]\displaystyle{ \mathcal{F} }[/math] from [math]\displaystyle{ \mathcal{X} }[/math] to [math]\displaystyle{ \mathbb{R} }[/math] containing all continuous bounded real-valued functions of [math]\displaystyle{ \,\!x }[/math], and RKHS [math]\displaystyle{ \mathcal{G} }[/math] from [math]\displaystyle{ \mathcal{Y} }[/math] to [math]\displaystyle{ \mathbb{R} }[/math] containing all continuous bounded real-valued functions of [math]\displaystyle{ \,\!y }[/math]. Here, [math]\displaystyle{ \mathcal{X} }[/math] and [math]\displaystyle{ \mathcal{Y} }[/math] denotes the support (set of possible values) of random variables [math]\displaystyle{ \,\!x }[/math] and [math]\displaystyle{ \,\!y }[/math] respectively. As mentioned before, we are particularly interested in the cross-covariance between elements of [math]\displaystyle{ \mathcal{F} }[/math] and [math]\displaystyle{ \mathcal{G} }[/math]:
[math]\displaystyle{ \text{cov}(f(x),g(y))=\mathbf{E}_{x,y}[f(x)g(y)]-\mathbf{E}_{x}[f(x)]\mathbf{E}_{y}[g(y)] }[/math]
In functional Analysis, an operator is a name given to a mapping which maps elements from one Hilbert space to elements of the other one. It may be shown that there exist a unique operator [math]\displaystyle{ \,\!C_{x,y}:\mathcal{G}\rightarrow \mathcal{F} }[/math] mapping elements of [math]\displaystyle{ \mathcal{G} }[/math] to elements of [math]\displaystyle{ \mathcal{F} }[/math] such that:
[math]\displaystyle{ \langle f,C_{x,y}(g)\rangle_{\mathcal{F}}=\text{cov}(f,g) }[/math]
for all [math]\displaystyle{ f\in \mathcal{F} }[/math] and [math]\displaystyle{ g\in \mathcal{G} }[/math]. This operator is called Cross-covariance operator.
We may define the concept of norm for an operator. For example, consider an operator in the form of a matrix [math]\displaystyle{ C_{m\times n} }[/math] mapping vectors of size [math]\displaystyle{ n\times 1 }[/math] in [math]\displaystyle{ \mathbb{R}^n }[/math] to vectors of size [math]\displaystyle{ m\times 1 }[/math] in [math]\displaystyle{ \mathbb{R}^m }[/math]. Then the Frobenius norm of this matrix may be defined as the norm of the corresponding operator.
There are different norms defined for operators. One of them is called Hilbert-Schmidt (HS) norm and is defined as follows:
[math]\displaystyle{ \|C\|^2_{HS}:=\sum_{i,j}\langle Cv_i,u_j \rangle^2_{\mathcal{F}} }[/math]
where [math]\displaystyle{ \,\!u_j }[/math] and [math]\displaystyle{ \,\!v_i }[/math] are orthogonal bases of [math]\displaystyle{ \mathcal{F} }[/math] and [math]\displaystyle{ \mathcal{G} }[/math], respectively. It is easy to see that the Frobenius norm on matrices may be considered a spacial case of this norm.

Now, we are in a position to define our measure for dependence of two random variables. The measure is the "Hilbert-Schmidt norm of the cross-covariance operator":
[math]\displaystyle{ \text{HSIC}(p_{xy},\mathcal{F},\mathcal{G}):=\|C_{xy}\|^2_{HS} }[/math]
We may note that if [math]\displaystyle{ \|C_{xy}\|^2_{HS} }[/math] is zero, then the value of [math]\displaystyle{ \langle f,C_{x,y}(g)\rangle }[/math], i.e., [math]\displaystyle{ \,\!\text{cov}(f,g) }[/math], will be always zero for any [math]\displaystyle{ f\in\ \mathcal{F} }[/math] and [math]\displaystyle{ g\in \mathcal{G} }[/math] and thus based on the theorem given in the introduction section, the random variables [math]\displaystyle{ \,\!x }[/math] and [math]\displaystyle{ \,\!y }[/math] will be independent. We will return to this important fact again in the section "Independence test using HSIC" below.

HSIC advantages

HSIC has several advantages: First, Compared to the other measures, it is more straightforward and extra regularization terms are not needed. Second, as the sample size increases, any existing dependence with high probability will be detected. This is not true in the case of any other kernel dependence test and happens just for the covariance constraints. Third, the finite sample bias of the estimate is negligible and has the order of [math]\displaystyle{ m^{-1} }[/math]. This is not always true for the other kernel dependence tests. Last, It has been shown in an ICA experiment that this new independence test is better than the others competing with the best of the ICA methods.

HSIC in terms of kernels

To compute HSIC we need to express it in terms of kernel functions. It can be shown that this can be achieved via the following identity:
[math]\displaystyle{ \text{HSIC}(p_{xy},\mathcal{F},\mathcal{G})=\mathbf{E}_{x,x',y,y'}[k(x,x')l(y,y')]+\mathbf{E}_{x,x'}[k(x,x')]\mathbf{E}_{y,y'}[l(y,y')]-2\mathbf{E}_{x,y}[\mathbf{E}_{x'}[k(x,x')]\mathbf{E}_{y'}[l(y,y')]] }[/math]

Because the data is random, the kernels are random so we use expected values.
Following is the proof of this theorem. However, the reader may escape this proof and keep track of the rest of the article without losing the main idea of HSIC.

Proof

First we need to introduce some concepts:

Mean

Mean elements of [math]\displaystyle{ \mathcal{F} }[/math] and [math]\displaystyle{ \mathcal{G} }[/math] are defined as those elements of these spaces for which:

[math]\displaystyle{ \langle\mu_x,f \rangle_{\mathcal{F}}=\mathbf{E}_x[\langle\phi(x),f \rangle_{\mathcal{F}}]=\mathbf{E}_x[f(x)] }[/math]
[math]\displaystyle{ \langle\mu_y,g \rangle_{\mathcal{G}}=\mathbf{E}_y[\langle\psi(y),g \rangle_{\mathcal{G}}]=\mathbf{E}_y[g(x)] }[/math]
Note that according to Reisz representation theorem this means always uniquely exists. One may simply show that [math]\displaystyle{ \|\mu_x\|^2_{\mathcal{F}} }[/math] can be calculated by applying expectation twice as follows:
[math]\displaystyle{ \|\mu_x\|^2_{\mathcal{F}}=\mathbf{E}_{x,x'}[\langle \phi(x),\phi(x')\rangle_{\mathcal{F}}]=\mathbf{E}_{x,x'}[k(x,x')] }[/math]

Tensor Product Operator

We may employ any [math]\displaystyle{ f\in \mathcal{F} }[/math] and [math]\displaystyle{ g\in \mathcal{G} }[/math] to define a tensor product operator [math]\displaystyle{ f\otimes g:\mathcal{G}\rightarrow\mathcal{F} }[/math] as follows:
[math]\displaystyle{ (f\otimes g)h:=f\langle g,h\rangle_{\mathcal{G}} \quad }[/math] for all [math]\displaystyle{ h\in\mathcal{G} }[/math]

Hilbert-Schmidt operators

A Hilbert-Schmidt (HS) operator is a linear operator [math]\displaystyle{ C:\mathcal{G}\rightarrow\mathcal{F} }[/math] if its HS norm exists. We may define an inner product between two HS operator as follows:
[math]\displaystyle{ \langle C,D\rangle_{HS}:=\sum_{i,j}{\langle Cv_i,u_i\rangle_{\mathcal{F}}\langle Dv_i,u_i\rangle_{\mathcal{F}}} }[/math]
We may simply proof the following lemma for inner product of two tensor product operators:
Lemma: For any [math]\displaystyle{ f_1,f_2\in \mathcal{F} }[/math] and [math]\displaystyle{ g_1,g_2 \in \mathcal{G} }[/math] the following equation holds:
[math]\displaystyle{ \langle f_1\otimes g_1,f_2\otimes g_2\rangle_{HS}=\langle f_1,f_2\rangle_F\langle g_1,g_2\rangle_G }[/math]

Using this lemma, one can simply show the norm of [math]\displaystyle{ f\otimes g }[/math] equals
[math]\displaystyle{ \|f\otimes g\|^2_{HS}=\|f\|^2_{\mathcal{F}}\; \|g\|^2_{\mathcal{G}} }[/math]

Cross-covariance Operator

The cross-covariance operator, introduced above, may be equivalently expressed as follows:
[math]\displaystyle{ C_{xy}:=\underbrace{\mathbf{E}_{x,y}[\phi(x)\otimes\psi(y)]}_{:=\tilde{C}_{xy}}-\underbrace{\mu_x\otimes\mu_y}_{:=M_{xy}} }[/math]
This expression comes from the linearity of expectation. Auxiliary variables [math]\displaystyle{ \tilde{C}_{xy} }[/math] and [math]\displaystyle{ \,\!M_{xy} }[/math] will be used as the basis of our measure of dependence.

Body of the proof

Expanding [math]\displaystyle{ \,\!C_{xy} }[/math] using the above formula, we have:
[math]\displaystyle{ \|C_{xy}\|^2_{HS}=\langle \tilde{C}_{xy}-M_{xy},\tilde{C}_{xy}-M_{xy}\rangle_{HS}=\mathbf{E}_{x,y,x',y'}[\langle \phi(x)\otimes\psi(y),\phi(x')\otimes\psi(y')\rangle_{HS}] }[/math]
[math]\displaystyle{ -2\mathbf{E}_{x,y}[\langle \mu_x\otimes\mu_y,\phi(x')\otimes\psi(y')\rangle_{HS}]+\langle \mu_x\otimes\mu_y,\mu_x\otimes\mu_y\rangle_{HS} }[/math]
applying the above lemma and using definitions of [math]\displaystyle{ \,\!\mu_x }[/math] and [math]\displaystyle{ \,\!\mu_y }[/math] and the fact that [math]\displaystyle{ \langle \phi(x)\phi(x')\rangle_F=k(x,x') }[/math], we arrive at the kernel representation of HSIC.

Independence Test using HSIC

In the previous section, we saw that if two RKHS (containing all continuous real-valued functions on [math]\displaystyle{ \mathcal{X} }[/math]) are given along with an inner product defined on their elements, we may use their corresponding kernels to evaluate the HSIC. In particular, if the HSIC obtained from kernel representation equals zero we may make sure that the underlying random variables are independent. In this section, we address another question: If two kernel functions are already given, can we use the HSIC given in terms of kernel functions to decide weather the underlying random variables are independent.
The answer to this question is not hard. As we mentioned in the "Basic concepts" sections each kernel of the form [math]\displaystyle{ \mathcal{X}\times \mathcal{X} \rightarrow \mathbb{R} }[/math] has a one-to-one correspondense with a RKHS. Based on this all we need from the applied kernel is that the RKHS corresponding to it must contain (almost) all continuous real-valued functions on [math]\displaystyle{ \mathcal{X} }[/math]. That is, in fact, the concept of a universal kernel. For more details see this<ref> A. Gretton, A. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B. Sch¨olkopf, and N. Logothetis, Kernel constrained covariance for dependence measurement, AISTATS, vol. 10, 2005.</ref>. Now we present a theorem indicating this fact via an exact mathematical framework:

Theorem: Denote by [math]\displaystyle{ \mathcal{F} }[/math], [math]\displaystyle{ \mathcal{G} }[/math] RKHSs with universal kernels [math]\displaystyle{ \,\!k }[/math], [math]\displaystyle{ \,\!l }[/math] on the compact domains [math]\displaystyle{ \mathcal{X} }[/math] and [math]\displaystyle{ \mathcal{Y} }[/math] respectively. We assume without loss of generality that [math]\displaystyle{ \|f\|_{\infty}\leq 1 }[/math] and [math]\displaystyle{ \|g\|_{\infty}\leq 1 }[/math] for all [math]\displaystyle{ f \in \mathcal{F} }[/math] and [math]\displaystyle{ g \in \mathcal{G} }[/math]. Then [math]\displaystyle{ \|C_{xy}\|_{HS} =0 }[/math] if and only if [math]\displaystyle{ \,\!x }[/math] and [math]\displaystyle{ \,\!y }[/math] are independent.

Empirical Criterion

Definition (Empirical HSIC)

Let [math]\displaystyle{ Z:=\{(x_1,y_1),\cdots,(x_m,y_m)\}\subseteq \mathcal{X}\times\mathcal{Y} }[/math] be a series of [math]\displaystyle{ \,\!m }[/math] independent observations drawn from [math]\displaystyle{ p_{x\,\!y} }[/math]. An estimator of HSIC, is given by
[math]\displaystyle{ \text{HSIC}(Z,\mathcal{F},\mathcal{G}):=(m-1)^{-2}\textbf{tr}(KHLH) }[/math]
where [math]\displaystyle{ H, K, L\in\mathbb{R}^{m\times m}, K_{ij}:=k(x_i,x_j), L_{ij}:=l(y_i,y_j) \,\, \text{and} \, \, H_{ij}:=\delta_{ij}-m^{-1} }[/math].
Based on this result, to maximize the dependence between two kernels we need to increase the value of the empirical estimate, i.e., [math]\displaystyle{ \,\!\textbf{tr}(KHLH) }[/math]. We will see some applications of this formulation in the application section.
It is interesting to note that if one of the kernels [math]\displaystyle{ \,\!K }[/math] or [math]\displaystyle{ \,\!L }[/math] is already centered, say [math]\displaystyle{ \,L }[/math], then [math]\displaystyle{ \,HLH=L }[/math] and thus we may simply use the objective function [math]\displaystyle{ \,\!\textbf{tr}(KL) }[/math] which no longer includes the centering matrix [math]\displaystyle{ \,H }[/math]. Similarly, if [math]\displaystyle{ \,HKH=K }[/math] we may rewrite [math]\displaystyle{ \,\text{Tr}[KHLH]=\text{Tr}[HKHL] }[/math] and arrive at the same result.

Bias of Estimator

It may bee shown that the bias of the above empirical estimation is of the order [math]\displaystyle{ \,\!O(m^{-1}) }[/math]:
Theorem: Let [math]\displaystyle{ \mathbf{E}_Z }[/math] denote the expectation taken over [math]\displaystyle{ \,\!m }[/math] independent copies [math]\displaystyle{ \,\!(x_i,y_i) }[/math] drawn from [math]\displaystyle{ p_{\,\!xy} }[/math]. Then
[math]\displaystyle{ \text{HSIC}(p_{xy},\mathcal{F},\mathcal{G})=\mathbf{E}_Z[\text{HSIC}(Z,\mathcal{F},\mathcal{G})]+O(m^{-1}) }[/math].

===Bound on Empirical HSIC===

Theorem: Assume that [math]\displaystyle{ \,\!k }[/math] and [math]\displaystyle{ \,\!l }[/math] are bounded almost everywhere by 1, and are non-negative. Then for [math]\displaystyle{ \,\!m\gt 1 }[/math] and all [math]\displaystyle{ \,\!\delta\gt 0 }[/math], with probability at least [math]\displaystyle{ \,\!1-\delta }[/math], for all [math]\displaystyle{ \,\!p_{xy} }[/math]:
[math]\displaystyle{ |\text{HSIC}(p_{xy},\mathcal{F},\mathcal{G})-\text{HSIC}(Z,\mathcal{F},\mathcal{G})|\leq\sqrt{\frac{\log(6/\delta)}{\alpha^2 m}}+\frac{C}{m} }[/math]
where [math]\displaystyle{ \,\!\alpha\gt 0.24 }[/math] and [math]\displaystyle{ \,\!C }[/math] are constants.
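As a quick numerical illustration (a sketch under the assumptions of the theorem; the constant [math]\displaystyle{ \,\!C }[/math] is unspecified, so the [math]\displaystyle{ \,\!C/m }[/math] term is dropped), the dominant term of the bound can be evaluated for a given sample size and confidence level:

<pre>
import math

def hsic_deviation_bound(m, delta, alpha=0.24):
    """Dominant term of the high-probability bound above; the additive C/m
    term is omitted because C is an unspecified constant."""
    return math.sqrt(math.log(6.0 / delta) / (alpha ** 2 * m))

# e.g. m = 1000 observations at confidence level 95% (delta = 0.05)
print(hsic_deviation_bound(1000, 0.05))   # roughly 0.29
</pre>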


==Efficient Computation of HSIC==

Computational cost is a factor in choosing HSIC. Using a low-rank decomposition of the Gram matrices via an incomplete Cholesky decomposition results in a cost saving, as stated in the following lemma:
Lemma 2 (Efficient approximation to HSIC): Let [math]\displaystyle{ K\approx AA^{T} }[/math] and [math]\displaystyle{ L \approx BB^{T} }[/math], where [math]\displaystyle{ A\in \mathbb{R}^{m\times d_f} }[/math] and [math]\displaystyle{ B\in \mathbb{R}^{m\times d_g} }[/math]. Then we may approximate [math]\displaystyle{ \textbf{tr}(HKHL) }[/math] in [math]\displaystyle{ O(m({d_f}^2+{d_g}^2)) }[/math] time.
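The saving can be seen directly: with [math]\displaystyle{ K=AA^{T} }[/math] and [math]\displaystyle{ L=BB^{T} }[/math], and using the fact that the centring matrix [math]\displaystyle{ H }[/math] is idempotent, [math]\displaystyle{ \textbf{tr}(HKHL)=\|(HA)^{T}(HB)\|_{F}^{2} }[/math], so the full Gram matrices never need to be formed. The sketch below assumes the low-rank factors are already given rather than performing the incomplete Cholesky decomposition itself:

<pre>
import numpy as np

def hsic_lowrank(A, B):
    """Approximate tr(HKHL) where K ~ A A^T and L ~ B B^T.

    Subtracting the column means of A and B applies H implicitly, and
    tr(H A A^T H B B^T) = ||(HA)^T (HB)||_F^2, which costs O(m d_f d_g)
    for this step instead of forming the m x m Gram matrices."""
    A_c = A - A.mean(axis=0, keepdims=True)   # H A
    B_c = B - B.mean(axis=0, keepdims=True)   # H B
    M = A_c.T @ B_c                           # d_f x d_g matrix
    return np.sum(M ** 2)

# Sanity check against the exact trace on a small problem
rng = np.random.default_rng(0)
m, df, dg = 300, 5, 4
A = rng.standard_normal((m, df))
B = rng.standard_normal((m, dg))
K, L = A @ A.T, B @ B.T
H = np.eye(m) - np.ones((m, m)) / m
print(np.trace(H @ K @ H @ L), hsic_lowrank(A, B))   # the two values agree
</pre>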

==Applications==

===Colored Maximum Variance Unfolding (Colored MVU)===

In the original version of MVU, a kernel [math]\displaystyle{ \,K }[/math] is learned by maximizing [math]\displaystyle{ \,\text{Tr}[K] }[/math] subject to some constraints. Now assume we have some similarity/dissimilarity relationships between data points stored in a matrix [math]\displaystyle{ B }[/math]. We may modify the original form of MVU by maximizing [math]\displaystyle{ \,\text{Tr}[KB] }[/math] instead of [math]\displaystyle{ \,\text{Tr}[K] }[/math], subject to the same constraints. In this way, while preserving the pairwise distances between data points (as in the original form of MVU), we simultaneously maximize the dependence between the learned kernel [math]\displaystyle{ \,\!K }[/math] and the kernel [math]\displaystyle{ \,\!B }[/math] storing our side information. Note that, in this formulation, matrix [math]\displaystyle{ \,\!K }[/math] is centred via one of the constraints of the problem, so the centring matrix [math]\displaystyle{ \,\!H }[/math] can be dropped from the empirical estimate of HSIC.
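A hedged sketch of this formulation, using the cvxpy modelling package, is given below. The neighbourhood list ''neighbours'', the matrix of squared distances ''d2'', and the side-information kernel [math]\displaystyle{ B }[/math] are assumed inputs; the constraint set is simply the usual MVU centring and local-isometry constraints, and an SDP-capable solver (e.g. SCS) is assumed to be installed.

<pre>
import cvxpy as cp
import numpy as np

def colored_mvu(B, d2, neighbours):
    """Learn a centred PSD kernel K that maximises tr(KB), i.e. the empirical
    HSIC with the side-information kernel B (up to the (m-1)^{-2} factor),
    while preserving distances between neighbouring points."""
    m = B.shape[0]
    K = cp.Variable((m, m), PSD=True)
    constraints = [cp.sum(K) == 0]                     # centring: sum_ij K_ij = 0
    for i, j in neighbours:                            # local isometry constraints
        constraints.append(K[i, i] + K[j, j] - 2 * K[i, j] == d2[i, j])
    problem = cp.Problem(cp.Maximize(cp.trace(K @ B)), constraints)
    problem.solve()
    return K.value
</pre>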

===Principal Components Analysis (PCA)===

As is well known, in the PCA formulation [math]\displaystyle{ X\approx WH }[/math], the matrix [math]\displaystyle{ \,\!H }[/math] may be thought of as a low-dimensional representation of [math]\displaystyle{ \,\!X }[/math]. The objective function of PCA may be written as follows:
[math]\displaystyle{ \max_H \,\text{Tr}[HX^TXH^T]=\text{Tr}[(X^TX)(H^TH)]\,\, }[/math] subject to [math]\displaystyle{ \,HH^T=I }[/math].
As shown on the right-hand side of the equation above, this expression may be interpreted as finding a matrix [math]\displaystyle{ \,\!H }[/math] such that the linear kernel obtained from it has maximum dependence with the kernel obtained from the original data matrix [math]\displaystyle{ \,\!X }[/math]. Note that here we again assume that the data matrix [math]\displaystyle{ \,\!X }[/math] is already centred, so the centring matrix [math]\displaystyle{ \,\!H }[/math] need not appear in the HSIC estimate.
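A small numerical check of this interpretation (a sketch: the rows of [math]\displaystyle{ H }[/math] are taken to be the top-[math]\displaystyle{ k }[/math] eigenvectors of [math]\displaystyle{ X^{T}X }[/math], and the data matrix is assumed to be centred, as stated above):

<pre>
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 10, 200, 2
X = rng.standard_normal((d, n))
X = X - X.mean(axis=1, keepdims=True)      # centre the data (assumed above)

G = X.T @ X                                # linear kernel of the data (n x n)
eigvals, eigvecs = np.linalg.eigh(G)       # eigenvalues in ascending order
H_pca = eigvecs[:, -k:].T                  # top-k eigenvectors as rows, H H^T = I

def objective(H):
    """Tr[(X^T X)(H^T H)] = Tr[H (X^T X) H^T]."""
    return np.trace(H @ G @ H.T)

# Any other orthonormal H (here a random one) gives a smaller value
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
H_rand = Q.T
print(objective(H_pca), objective(H_rand))   # the PCA solution attains the larger value
</pre>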

===Non-negative matrix factorization (NMF)===

NMF with an orthogonality constraint on [math]\displaystyle{ H }[/math] may be expressed as [math]\displaystyle{ \,\! \max_{H\geq 0} \text{Tr}[(X^{T}X)(H^{T}H)] }[/math].

==Related work==

One earlier method for measuring dependence between functions in RKHSs, the constrained covariance (COCO) proposed by Gretton et al., uses the largest singular value of the cross-covariance operator, which behaves identically to the correlation operator at independence while requiring no regularization.
In this paper, the concept of COCO is extended by using the entire spectrum of the cross-covariance operator to determine whether all its singular values are zero, rather than looking only at the largest singular value.

===More detailed explanation===

The above theorem shows that correlation can be used to study independence. In other words, for two random variables [math]\displaystyle{ X \, }[/math] and [math]\displaystyle{ Y \, }[/math], the functional correlation operator (the word functional here refers to the fact that the correlation is computed between a function of X and a function of Y, instead of between X and Y) sheds light on whether [math]\displaystyle{ X \, }[/math] and [math]\displaystyle{ Y \, }[/math] are independent (or how independent they are). Similarly, the functional cross-covariance operator also gives us information about the degree of dependence between [math]\displaystyle{ X \, }[/math] and [math]\displaystyle{ Y \, }[/math].

Below are some examples of how the functional correlation operator and the functional cross-covariance operator are used as statistics to test independence. For easier reading we omit the word ''functional'' in the following examples; readers should keep in mind that the operators indeed act on functions of the random variables.

===Kernel Canonical Correlation (KCC)===

In this approach, a regularized correlation operator is derived from the covariance operator of [math]\displaystyle{ X \, }[/math], the covariance operator of [math]\displaystyle{ Y \, }[/math] and the cross-covariance operator between [math]\displaystyle{ X \, }[/math] and [math]\displaystyle{ Y \, }[/math].

===Constrained Covariance (COCO)===

In this approach, the largest singular value of the cross-covariance operator is used to test independence. Compared to KCC, this method has the advantage that no regularization is needed.

===This paper (HSIC)===

This paper is an extension of COCO, in the sense that it looks at the entire spectrum of the cross-covariance operator to determine whether all its singular values are zero, instead of looking only at the largest singular value as COCO does. Intuitively, this yields a more robust indication of independence than COCO. More specifically, the dependence measure is the sum of the squared singular values of the cross-covariance operator (its squared Hilbert-Schmidt norm), which the authors term HSIC (Hilbert-Schmidt Independence Criterion).
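The contrast between the two statistics can be seen from the spectrum of the product of centred Gram matrices, whose eigenvalues are (up to a factor [math]\displaystyle{ m^{-2} }[/math]) the squared singular values of the empirical cross-covariance operator. The sketch below is illustrative only: the HSIC normalisation follows the empirical definition given earlier, while the normalisation of the COCO-style statistic is an assumption and may differ from the convention used in the COCO paper.

<pre>
import numpy as np

def cross_spectrum(K, L):
    """Eigenvalues of (HKH)(HLH), sorted in decreasing order; up to a factor
    m^{-2} these are the squared singular values of the empirical
    cross-covariance operator."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    ev = np.linalg.eigvals(H @ K @ H @ L @ H)
    return np.sort(ev.real)[::-1]          # product of PSD matrices: real, >= 0

def coco_like(K, L):
    """COCO-style statistic: only the largest singular value is used
    (the 1/m normalisation here is indicative)."""
    m = K.shape[0]
    return np.sqrt(max(cross_spectrum(K, L)[0], 0.0)) / m

def hsic_from_spectrum(K, L):
    """HSIC uses the whole spectrum: the sum of the eigenvalues above,
    scaled by (m-1)^{-2}, equals (m-1)^{-2} tr(KHLH)."""
    m = K.shape[0]
    return cross_spectrum(K, L).sum() / (m - 1) ** 2
</pre>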

==Kernel Canonical Correlation Analysis (CCA)==

In classical CCA, linear mappings [math]\displaystyle{ a^{T}X }[/math] and [math]\displaystyle{ b^{T}Y }[/math] are obtained for the random variables [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math], respectively, such that the linear projections have maximum correlation.
Kernel CCA extends this method to estimate functions [math]\displaystyle{ f\in \mathcal{F} }[/math] and [math]\displaystyle{ g\in \mathcal{G} }[/math] such that the correlation of [math]\displaystyle{ f(X) }[/math] and [math]\displaystyle{ g(Y) }[/math] is maximized.
To find [math]\displaystyle{ f }[/math] and [math]\displaystyle{ g }[/math], the following optimization problem must be solved:

[math]\displaystyle{ \max_{f \in \mathcal{F},\, g \in \mathcal{G},\, f \neq 0,\, g \neq 0} \frac{\operatorname{Cov}(f(X),g(Y))}{\operatorname{Var}(f(X))^{1/2}\, \operatorname{Var}(g(Y))^{1/2}} }[/math]

For a finite sample of independent observations [math]\displaystyle{ (X_{1},Y_{1}),(X_{2},Y_{2}),\ldots,(X_{n},Y_{n}) }[/math], estimators of the functions [math]\displaystyle{ f }[/math] and [math]\displaystyle{ g }[/math] can be obtained by solving the regularized empirical version of the optimization problem:

[math]\displaystyle{ \max_{f \in \mathcal{F},\, g \in \mathcal{G},\, f \neq 0,\, g \neq 0} \frac{\widehat{\operatorname{Cov}}(f(X),g(Y))}{\left(\widehat{\operatorname{Var}}(f(X))+\epsilon_{n}\|f\|_{\mathcal{F}}^{2}\right)^{1/2} \left(\widehat{\operatorname{Var}}(g(Y))+\epsilon_{n}\|g\|_{\mathcal{G}}^{2}\right)^{1/2}} }[/math]

for which:

[math]\displaystyle{ \widehat{\operatorname{Cov}}(f(X),g(Y))= \frac {1}{n} \sum_{i=1}^{n} \left( f(X_{i})- \frac{1}{n}\sum_{j=1}^{n}f(X_{j}) \right) \left( g(Y_{i})- \frac{1}{n}\sum_{j=1}^{n}g(Y_{j}) \right) }[/math]

[math]\displaystyle{ \widehat{\operatorname{Var}}(f(X))= \frac {1}{n} \sum_{i=1}^{n} \left( f(X_{i})- \frac{1}{n}\sum_{j=1}^{n}f(X_{j}) \right)^{2}, \qquad \widehat{\operatorname{Var}}(g(Y))= \frac {1}{n} \sum_{i=1}^{n} \left( g(Y_{i})- \frac{1}{n}\sum_{j=1}^{n}g(Y_{j}) \right)^{2} }[/math]

===Representation of Kernel CCA by Cross-Covariance Operator===

The optimization problem for kernel CCA can be formulated using the cross-covariance operator of [math]\displaystyle{ (X,Y) }[/math]:

[math]\displaystyle{ \sup_{f \in \mathcal{F},\, g \in \mathcal{G}} \langle g,\Sigma_{YX} f \rangle_{\mathcal{G}} }[/math] subject to [math]\displaystyle{ \langle f,\Sigma_{XX} f \rangle_{\mathcal{F}}=1 }[/math] and [math]\displaystyle{ \langle g,\Sigma_{YY} g \rangle_{\mathcal{G}}=1 }[/math]

The solution of this problem is given by the eigenfunctions corresponding to the largest eigenvalue [math]\displaystyle{ \rho_{1} }[/math] of the following eigenproblem:

[math]\displaystyle{ \Sigma_{YX} f = \rho_{1} \Sigma_{YY} g }[/math]

[math]\displaystyle{ \Sigma_{XY} g = \rho_{1} \Sigma_{XX} f }[/math]
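An empirical, regularized counterpart of this eigenproblem can be written in terms of centred Gram matrices. The sketch below makes several assumptions for illustration: the Gram matrices are assumed to be centred already, the parameter ''eps'' plays the role of [math]\displaystyle{ \epsilon_{n} }[/math] above, and a small ridge term is added purely to keep the generalized eigenproblem numerically well-posed.

<pre>
import numpy as np
from scipy.linalg import eigh

def kernel_cca_first_correlation(Kx, Ky, eps=1e-3):
    """First kernel canonical correlation from centred Gram matrices Kx, Ky.

    Empirically, the population problem above becomes: maximise
    alpha^T Kx Ky beta subject to alpha^T (Kx^2 + n*eps*Kx) alpha = 1 and
    beta^T (Ky^2 + n*eps*Ky) beta = 1, a generalized eigenproblem."""
    n = Kx.shape[0]
    Rx = Kx @ Kx + n * eps * Kx + 1e-10 * np.eye(n)   # regularized "variance" block
    Ry = Ky @ Ky + n * eps * Ky + 1e-10 * np.eye(n)
    A = np.block([[np.zeros((n, n)), Kx @ Ky],
                  [Ky @ Kx, np.zeros((n, n))]])
    B = np.block([[Rx, np.zeros((n, n))],
                  [np.zeros((n, n)), Ry]])
    vals = eigh(A, B, eigvals_only=True)              # generalized eigenvalues, ascending
    return vals[-1]                                   # largest = first canonical correlation
</pre>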

==References==

<references/>