kernelized Sorting: Difference between revisions

From statwiki
Jump to navigation Jump to search
No edit summary
Line 108: Line 108:
entropy <math>h(X,Y)</math>.
entropy <math>h(X,Y)</math>.


optimization
Optimization
 
<math>
<math>
\pi^{\ast}=argmin_{\pi \in \Pi_{m}}|\log HJ(\pi)H|,
\pi^{\ast}=argmin_{\pi \in \Pi_{m}}|\log HJ(\pi)H|,
</math>
</math>
where <math>J_{ij}=K_{ij}L_{\pi(i),\pi(j)}</math>. This is related to the
where <math>J_{ij}=K_{ij}L_{\pi(i),\pi(j)}</math>. This is related to the
optimization criterion proposed by Jebara(2004) in the context of
optimization criterion proposed by Jebara(2004) in the context of
aligning bags of observations by sorting via minimum volume PCA.
aligning bags of observations by sorting via minimum volume PCA.


==Optimization==
==Optimization==

Revision as of 13:10, 12 July 2009

Object matching is a fundamental operation in data analysis. It typically requires the definition of a similarity measure between the classes of objects to be matched. Instead, we develop an approach which is able to perform matching by requiring a similarity measure only within each of the classes. This is achieved by maximizing the dependency between matched pairs of observations by means of the Hilbert Schmidt Independence Criterion. This problem can be cast as one of maximizing a quadratic assignment problem with special structure and we present a simple algorithm for finding a locally optimal solution.

Introduction

Problem Statement

Assume we are given two collections of documents purportedly covering the same content, written in two different languages. Can we determine the correspondence between these two sets of documents without using a dictionary?

Sorting and Matching

(Formal) problem formulation:

Given two sets of observations [math]\displaystyle{ X= \{ x_{1},..., x_{m} \} }[/math] and [math]\displaystyle{ Y=\{ y_{1},..., y_{m}\} }[/math]

Find a permutation matrix [math]\displaystyle{ \pi \in \Pi_{m} }[/math],

[math]\displaystyle{ \Pi_{m}:= \{ \pi | \pi \in \{0,1\}^{m \times m} where \pi 1_{m}=1_{m}, \pi^{T}1_{m}=1_{m}\} }[/math]

such that [math]\displaystyle{ \{ (x_{i},y_{\pi (i)}) for 1 \leqslant i \leqslant m \} }[/math] is maximally dependent. Here [math]\displaystyle{ 1_{m} \in \mathbb{R}^{m} }[/math] is the vector of all ones.

Denote by [math]\displaystyle{ D(Z(\pi)) }[/math] a measure of the dependence between x and y. Then we define nonparametric sorting of X and Y as follows

[math]\displaystyle{ \pi^{\ast}:=argmax_{\pi \in \prod_{m}}D(Z(\pi)). }[/math]


Hilbert Schmidt Independence Criterion

Let sets of observations X and Y be drawn jointly from some probability distribution [math]\displaystyle{ Pr_{xy} }[/math]. The Hilbert Schmidt Independence Criterion (HSIC) measures the dependence between x and y by computing the norm of the cross-covariance operator over the domain [math]\displaystyle{ \mathcal X \times \mathcal Y }[/math] in Hilbert Space.

let [math]\displaystyle{ \mathcal {F} }[/math] be the Reproducing Kernel Hilbert Space (RKHS) on [math]\displaystyle{ \mathcal {X} }[/math] with associated kernel [math]\displaystyle{ k: \mathcal X \times \mathcal X \rightarrow \mathbb{R} }[/math] and feature map [math]\displaystyle{ \phi: \mathcal X \rightarrow \mathcal {F} }[/math]. Let [math]\displaystyle{ \mathcal {G} }[/math] be the RKHS on [math]\displaystyle{ \mathcal Y }[/math] with kernel [math]\displaystyle{ l }[/math] and feature map [math]\displaystyle{ \psi }[/math]. The cross-covariance operator [math]\displaystyle{ C_{xy}:\mathcal {G}\rightarrow \mathcal {F} }[/math] is defined by

[math]\displaystyle{ C_{xy}=\mathbb{E}_{xy}[(\phi(x)-\mu_{x})\otimes (\psi(y)-\mu_{y})], }[/math]

where [math]\displaystyle{ \mu_{x}=\mathbb{E}[\phi(x)] }[/math], [math]\displaystyle{ \mu_{y}=\mathbb{E}[\psi(y)] }[/math].

HSIC is the square of the Hilbert-Schmidt norm of the cross covariance operator [math]\displaystyle{ C_{xy} }[/math] [math]\displaystyle{ D(\mathcal {F},\mathcal {G},Pr_{xy}):=\parallel C_{xy} \parallel_{HS}^{2}. }[/math]

In term of kernels, HSIC can be expressed as [math]\displaystyle{ \mathbb{E}_{xx'yy'}[k(x,x')l(y,y')]+\mathbb{E}_{xx'}[k(x,x')]\mathbb{E}_{yy'}[l(y,y')]-2\mathbb{E}_{xy}[\mathbb{E}_{x'}[k(x,x')]\mathbb{E}_{y}[l(y,y')]]. }[/math] where [math]\displaystyle{ \mathbb{E}_{xx'yy'} }[/math] is the expectation over both [math]\displaystyle{ (x, y) }[/math] ~ [math]\displaystyle{ Pr_{xy} }[/math] and an additional pair of variables [math]\displaystyle{ (x', y') }[/math] ~ [math]\displaystyle{ Pr_{xy} }[/math] drawn independently according to the same law.

A biased estimator of HSIC given finite sample [math]\displaystyle{ Z = \{(x_{i}, y_{i})\}_{i=1}^{m} }[/math] drawn from [math]\displaystyle{ Pr_{xy} }[/math] is [math]\displaystyle{ D(\mathcal {F},\mathcal {G},Z)=(m-1)^{-2}tr HKHL = (m-1)^{-2} tr KL }[/math] where [math]\displaystyle{ K,L\in \mathbb{R}^{m\times m} }[/math] are the kernel matrices for the data and the labels respectively, [math]\displaystyle{ H_{ij}=\delta_{ij}-m^{-1} }[/math] centers the data and the labels in feature space, [math]\displaystyle{ K:=HKH }[/math] and [math]\displaystyle{ L:=HLH }[/math] denote the centered versions [math]\displaystyle{ K }[/math] and [math]\displaystyle{ L }[/math] respectively.

Advantages of HSIC are:

Computing HSIC is simple: only the kernel matrices K and L are needed;

HSIC satisfies concentration of measure conditions, i.e. for random draws of observation from [math]\displaystyle{ Pr_{xy} }[/math], HSIC provides values which are very similar;

Incorporating prior knowledge into the dependence estimation can be done via kernels.

Kernelized Sorting

Optimization problem

[math]\displaystyle{ \pi^{\ast}=argmax_{\pi \in \Pi_{m}}[tr K \pi^{T}L\pi] }[/math]



Sorting as a special case

For scalar [math]\displaystyle{ x_{i} }[/math] and [math]\displaystyle{ y_{i} }[/math] and a linear kernel on both sets, we can rewrite the optimization problem [math]\displaystyle{ \pi^{\ast}=argmax_{\pi \in \Pi_{m}}[X^{T} \pi Y]^{2} }[/math] This is maximized by sorting [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math].


Related Work

Mutual Information is defined as, [math]\displaystyle{ I(X,Y)=h(X)+h(Y)-h(X,Y) }[/math]. We can approximate MI maximization by maximizing its lower bound. This then corresponds to minimizing an upper bound on the joint entropy [math]\displaystyle{ h(X,Y) }[/math].

Optimization

[math]\displaystyle{ \pi^{\ast}=argmin_{\pi \in \Pi_{m}}|\log HJ(\pi)H|, }[/math]

where [math]\displaystyle{ J_{ij}=K_{ij}L_{\pi(i),\pi(j)} }[/math]. This is related to the optimization criterion proposed by Jebara(2004) in the context of aligning bags of observations by sorting via minimum volume PCA.

Optimization

Convex Objective and Convex Domain

Define [math]\displaystyle{ \pi }[/math] as a doubly stochastic matrix,

[math]\displaystyle{ P_{m}:=\{\pi \in \mathbb{R}^{m \times m} where \pi_{ij}\geqslant 0 and \sum_{i}\pi_{ij}=1 and \sum_{j}\pi_{ij}=1\} }[/math]

The objective function [math]\displaystyle{ tr K \pi^{T}L\pi }[/math] is convex in [math]\displaystyle{ \pi }[/math] .