Metric and Kernel Learning Using a Linear Transformation
Revision as of 16:50, 5 August 2013
== Introduction ==
The motivation of the paper is to combine the advantages of Mahalanobis distance learning and kernel learning. Mahalanobis distance learning can handle out-of-sample points and avoid overfitting, but it is only a linear transformation and its number of parameters grows quadratically with the dimensionality. Kernel learning with non-linear kernels enables implicit non-linear transformations, but it is not easy to apply to out-of-sample points. The paper shows that metric learning with linear transformations is equivalent to learning a linear transformation kernel. The authors formulate the metric learning problem using the LogDet divergence, which simplifies the optimization and has precedents in optimization and statistics. They claim that their algorithm scales to large data sets and allows learning a linear transformation in the kernel space, so that it can handle out-of-sample extensions. Moreover, even though the matrix to be learned may be infinite-dimensional, it can be fully represented in terms of the constrained data points.
== LogDet Divergence Based Metric Learning ==
The original Mahalanobis distance metric is as follows:
[math]\displaystyle{ d_W(x_i,x_j)=(x_i-x_j)^TW(x_i-x_j) }[/math]
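As a minimal numerical sketch of the definition above (the function name is ours, not the paper's), the Mahalanobis distance with a PSD matrix W can be computed directly from the quadratic form; with W = I it reduces to the squared Euclidean distance:

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, W):
    """Squared Mahalanobis distance d_W(x_i, x_j) = (x_i - x_j)^T W (x_i - x_j)."""
    diff = x_i - x_j
    return float(diff @ W @ diff)

x_i = np.array([1.0, 2.0])
x_j = np.array([4.0, 6.0])
# With W = I the quadratic form is just ||x_i - x_j||^2 = 3^2 + 4^2 = 25.
d = mahalanobis_sq(x_i, x_j, np.eye(2))
```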
In order to deal with the nonlinear case, we use a nonlinear mapping of the data points:
[math]\displaystyle{ d_W(\phi(x_i),\phi(x_j))=(\phi(x_i)-\phi(x_j))^TW(\phi(x_i)-\phi(x_j)) }[/math]
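The point of the nonlinear mapping is that φ never needs to be formed explicitly. As a hedged illustration (using an RBF kernel of our own choosing, not one prescribed by the paper), with W = I the feature-space distance expands to K<sub>ii</sub> − 2K<sub>ij</sub> + K<sub>jj</sub>, computable purely from the kernel matrix:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) for rows of X."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel(X)
# With W = I, the squared distance between phi(x_0) and phi(x_1) is
# K[0,0] - 2*K[0,1] + K[1,1], with no explicit feature map needed.
d01 = K[0, 0] - 2 * K[0, 1] + K[1, 1]
```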
The main challenge is the computation of W. The paper uses the LogDet divergence. The LogDet divergence between two PSD matrices [math]\displaystyle{ W, W_0\in\Re^{d \times d} }[/math] is defined as:
[math]\displaystyle{ D_{\ell d}(W,W_0)=\operatorname{tr}(WW_0^{-1})-\log\det(WW_0^{-1})-d }[/math]
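The divergence above is straightforward to evaluate numerically. A minimal sketch (function name ours): note that D(W, W) = 0, and for diagonal W with W<sub>0</sub> = I the divergence reduces to a sum over eigenvalues λ of λ − log λ − 1:

```python
import numpy as np

def logdet_divergence(W, W0):
    """LogDet divergence D_ld(W, W0) = tr(W W0^{-1}) - log det(W W0^{-1}) - d."""
    d = W.shape[0]
    M = W @ np.linalg.inv(W0)
    sign, logdet = np.linalg.slogdet(M)  # numerically stable log-determinant
    return float(np.trace(M) - logdet - d)

W0 = np.eye(3)
W = np.diag([1.0, 2.0, 0.5])
# tr = 3.5, log det = log(1 * 2 * 0.5) = 0, d = 3, so D_ld(W, I) = 0.5.
div = logdet_divergence(W, W0)
```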
Given the similarity constraints S and dissimilarity constraints D, the following problem is proposed in [1]:

[[File:itml_formulation.png]]
We can also obtain the kernelized form of the above formulation [2]:

[[File:itml_kernelized.png]]
The paper proves that the optimal solutions of the above two problems are connected. Suppose the optimal learned kernel is K<sup>*</sup> and the optimal learned Mahalanobis matrix is W<sup>*</sup>; they are related as follows:

[[File:W_K_relation.png]]
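The basic mechanism behind such a connection can be checked numerically. A hedged sketch (our own toy data, and assuming the columns-as-points convention): for a linear transformation kernel with W = L<sup>T</sup>L, the kernel matrix X<sup>T</sup>WX equals the Gram matrix of the explicitly transformed points LX, so the learned metric and the learned kernel carry the same information:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))  # d = 4 features, n = 6 points stored as columns
L = rng.standard_normal((4, 4))  # an arbitrary linear transformation
W = L.T @ L                      # the induced PSD Mahalanobis matrix

K_learned = X.T @ W @ X          # kernel matrix induced by the metric W
G = (L @ X).T @ (L @ X)          # Gram matrix of the explicitly transformed points
# The two coincide entry for entry: learning W and learning this kernel agree.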
The identity matrix I can be generalized to an arbitrary matrix W<sub>0</sub> and the theory still holds. To compute the out-of-sample extension, we can represent the inner product of the points after the linear transformation in terms of the inner product of the points before the transformation: