metric and Kernel Learning Using a Linear Transformation

From statwiki
Revision as of 15:50, 5 August 2013 by F33li (talk | contribs)
Jump to navigation Jump to search

Introduction

The motivation of the papers is to combine the advantages of Mahalanobis distance learning and kernel learning. Mahalanobis distance learning can deal with out-of-sample points and avoid overfitting, while it is merely linear transformation and the number of parameters grows quadratically with the dimensionality. Kernel learning using non-linear kernels enable implicit non-linear transformation, while it is not easy to be applied to out-of-sample points. The paper shows metric learning with linear transformations is equivalent to learning a linear transformation kernel. They use LogDet divergence to formulate a metric learning problem. The LogDet divergence can make the optimization simpler and have a precedence in optimization and statistics. They claim that their algorithm is scalable to large data sets and allow learning of a linear transformation in the kernel space so that it can handle out-of-sample extensions. Also, even though the matrix to be learned may be infinite-dimensional, it can be fully represtend in terms of the constrained data points.


LogDet Divergence Based Metric Learning

The original Mahalanobis distance metric is as follows:

[math]\displaystyle{ d_W(x_i,x_j)=(x_i-x_j)^TW(x_i-x_j) }[/math]

In order to deal with the nonlinear case, we use a nonlinear mapping of the data points:

[math]\displaystyle{ d_W(\phi(x_i),\phi(x_j))=(\phi(x_i)-\phi(x_j))^TW(\phi(x_i)-\phi(x_j)) }[/math]

The main challenge is the computation of W. The paper uses the LogDet divergence. The LogDet divergence between two PSD matrices W and $W_0\in\Re^{d \times d}$ is defined as:

[math]\displaystyle{ D_ld(W,W_0)=tr(WW_0^{-1})-logdet(WW_0^{-1})-d }[/math]

Given the similarity constrains S and dissimilarity constrains D, the following problem is proposed in [1]:

File:itml formulation.png

We can also get the kernelized form of the above formulation [2]:

File:itml kernelized.png

The paper proves that there are connections between the optimal solutions of the above two problems. Suppose the optimal learned kernel is K* and the optimal learned Mahalanobis matrix is W*, they are related as follows:

File:W K relation.png

The identical matrix I can be generalized to any arbitrary matrix W0 and the theory holds. To compute the out-of-sample extension, we can represent the inner product of the points after linear transformation by the inner product of the points before transformation: