Measuring and testing dependence by correlation of distances


Background

dCov (distance covariance) is a measure of dependence between two random vectors. For all random vectors with finite first moments, the dCov coefficient generalizes the idea of correlation in two ways. First, dCov can be applied to any two random vectors, even when the two have different dimensions. Second, dCov is equal to zero if and only if the two vectors are independent. These features are very similar to those of HSIC (the Hilbert-Schmidt Independence Criterion). The relationship between the two coefficients had been an open question; recent research [2] shows that HSIC is an extension of dCov.

Definition

The dCov coefficient is defined as a weighted L2 distance between the joint characteristic function of the random vectors and the product of their marginal characteristic functions. The choice of the weights is crucial and is what ensures the independence property.
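Concretely, for [math]\displaystyle{ X\in\mathbb{R}^p }[/math] and [math]\displaystyle{ Y\in\mathbb{R}^q }[/math] with marginal characteristic functions [math]\displaystyle{ f_X, f_Y }[/math] and joint characteristic function [math]\displaystyle{ f_{X,Y} }[/math], the weighted L2 form from [1] is

[math]\displaystyle{ \mathcal{V}^2(X,Y)=\int_{\mathbb{R}^{p+q}}\frac{|f_{X,Y}(t,s)-f_X(t)f_Y(s)|^2}{c_p c_q |t|_p^{1+p}|s|_q^{1+q}}\,dt\,ds, \qquad c_d=\frac{\pi^{(1+d)/2}}{\Gamma((1+d)/2)}. }[/math]

The dCov coefficient can also be written in terms of expectations of Euclidean distances, which is easier to interpret: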

[math]\displaystyle{ \mathcal{V}^2(X,Y)=E[|X_1-X_2||Y_1-Y_2|]+E[|X_1-X_2|]E[|Y_1-Y_2|]-2E[|X_1-X_2||Y_1-Y_3|] }[/math] in which [math]\displaystyle{ (X_1,Y_1),(X_2,Y_2),(X_3,Y_3) }[/math] are independent copies of the random pair [math]\displaystyle{ (X,Y) }[/math], and [math]\displaystyle{ |X_1-X_2| }[/math] denotes the Euclidean distance. A straightforward empirical estimate [math]\displaystyle{ \mathcal{V}_n^2 }[/math] is known as [math]\displaystyle{ dCov_n^2(X,Y) }[/math]:

[math]\displaystyle{ dCov_n^2(X,Y)=\frac{1}{n^2}\sum_{i,j=1}^n d_{ij}^X d_{ij}^Y+d_{..}^X d_{..}^Y-\frac{2}{n}\sum_{i=1}^n d_{i.}^X d_{i.}^Y\\ =\frac{1}{n^2}\sum_{i,j=1}^n (d_{ij}^X-d_{i.}^X-d_{.j}^X+d_{..}^X)(d_{ij}^Y-d_{i.}^Y-d_{.j}^Y+d_{..}^Y) }[/math]

where [math]\displaystyle{ d_{ij}^X=|X_i-X_j| }[/math], [math]\displaystyle{ d_{i.}^X }[/math] and [math]\displaystyle{ d_{.j}^X }[/math] are the row and column means of the distance matrix, and [math]\displaystyle{ d_{..}^X }[/math] is its grand mean (and similarly for Y).
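As an illustration, the estimator can be computed in a few lines of NumPy/SciPy. This is a minimal sketch of the double-centering form above; the function name dcov2 and its interface are illustrative, not from [1]:

<pre>
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dcov2(X, Y):
    # Empirical squared distance covariance dCov_n^2(X, Y).
    # X: (n, p) array, Y: (n, q) array, same number of rows n.
    dX = squareform(pdist(X))   # d_ij^X = |X_i - X_j|
    dY = squareform(pdist(Y))   # d_ij^Y = |Y_i - Y_j|
    # Double-center each distance matrix: d_ij - d_i. - d_.j + d_..
    A = dX - dX.mean(axis=1, keepdims=True) - dX.mean(axis=0, keepdims=True) + dX.mean()
    B = dY - dY.mean(axis=1, keepdims=True) - dY.mean(axis=0, keepdims=True) + dY.mean()
    # (1/n^2) * sum_{i,j} A_ij B_ij
    return (A * B).mean()
</pre>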

Once [math]\displaystyle{ dCov_n^2(X,Y) }[/math] is defined, the correlation coefficient can be defined by normalization: [math]\displaystyle{ dCor_n^2(X,Y)=\frac{\langle C\Delta_XC,\,C\Delta_YC\rangle}{\|C\Delta_XC\|\,\|C\Delta_YC\|} }[/math] where [math]\displaystyle{ \Delta_X }[/math] and [math]\displaystyle{ \Delta_Y }[/math] are the [math]\displaystyle{ n\times n }[/math] Euclidean distance matrices of the two samples, [math]\displaystyle{ C=I-\frac{1}{n}\mathbf{1}\mathbf{1}^T }[/math] is the centering matrix, and the inner product and norm are the Frobenius ones. Because the distances used in dCor are Euclidean rather than squared Euclidean, dCor is able to detect non-linear dependence, not just linear correlation.
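A short sketch of this normalized statistic, reusing the illustrative dcov2 from the block above (the 1/n^2 factors cancel, so this equals the inner-product form), followed by a usage example where Pearson correlation fails but dCor does not:

<pre>
def dcor2(X, Y):
    # Empirical squared distance correlation: dCov_n^2 normalized
    # by the distance variances dCov_n^2(X,X) and dCov_n^2(Y,Y).
    num = dcov2(X, Y)
    den = np.sqrt(dcov2(X, X) * dcov2(Y, Y))
    return num / den if den > 0 else 0.0

# Usage sketch: dCor detects the non-linear relation y = x^2,
# for which the ordinary Pearson correlation is near zero.
rng = np.random.default_rng(0)
x = rng.standard_normal((500, 1))
y = x ** 2
print(dcor2(x, y))   # clearly positive
</pre>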


Experiment

References

[1] Székely, Gábor J., Maria L. Rizzo, and Nail K. Bakirov. "Measuring and testing dependence by correlation of distances." The Annals of Statistics 35.6 (2007): 2769-2794.

[2] Sejdinovic, Dino, et al. "Equivalence of distance-based and RKHS-based statistics in hypothesis testing." arXiv preprint arXiv:1207.6076 (2012).

[3] Fukumizu, Kenji, Francis R. Bach, and Michael I. Jordan. "Kernel dimension reduction in regression." The Annals of Statistics 37.4 (2009): 1871-1905.