Metric and Kernel Learning Using a Linear Transformation
Introduction
The motivation of the paper is to combine the advantages of Mahalanobis distance learning and kernel learning. Mahalanobis distance learning handles out-of-sample points and is less prone to overfitting, but it learns only a linear transformation and the number of parameters grows quadratically with the dimensionality of the data. Kernel learning with non-linear kernels enables an implicit non-linear transformation, but it does not easily extend to out-of-sample points. The paper shows that metric learning with linear transformations is equivalent to learning a linear transformation kernel. The metric learning problem is formulated using the LogDet divergence, which simplifies the optimization and has precedents in optimization and statistics. The authors claim that their algorithm scales to large data sets and allows learning of a linear transformation in the kernel space, so it can handle out-of-sample extensions. Moreover, even though the matrix to be learned may be infinite-dimensional, it can be fully represented in terms of the constrained data points.
LogDet Divergence Based Metric Learning
In the input space, the Mahalanobis distance parameterized by a positive semi-definite matrix W is defined as:
[math]\displaystyle{ d_W(x_i,x_j)=(x_i-x_j)^TW(x_i-x_j) }[/math]
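As a concrete illustration, here is a minimal NumPy sketch of this distance computation; the matrix W and the sample points are arbitrary stand-ins, not values from the paper.

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, W):
    """d_W(x_i, x_j) = (x_i - x_j)^T W (x_i - x_j)."""
    diff = x_i - x_j
    return float(diff @ W @ diff)

# Toy example: W must be positive semi-definite, e.g. W = A^T A for some matrix A.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
W = A.T @ A
x_i, x_j = rng.normal(size=3), rng.normal(size=3)
print(mahalanobis_sq(x_i, x_j, W))       # equals ||A(x_i - x_j)||^2
```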
In order to deal with the nonlinear case, we use a nonlinear mapping of the data points:
[math]\displaystyle{ d_W(\phi(x_i),\phi(x_j))=(\phi(x_i)-\phi(x_j))^TW(\phi(x_i)-\phi(x_j)) }[/math]
The main challenge is computing W. To do so, the paper uses the LogDet divergence, which regularizes W towards a given prior matrix W0. The LogDet divergence between two positive semi-definite matrices [math]\displaystyle{ W, W_0 \in \Re^{d \times d} }[/math] is defined as:
[math]\displaystyle{ D_{\ell d}(W,W_0)=\mathrm{tr}(WW_0^{-1})-\log\det(WW_0^{-1})-d }[/math]
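As a sanity check, here is a minimal NumPy sketch of this divergence; the matrices are assumed positive definite so that the inverse and log-determinant are well defined, and the example inputs are arbitrary.

```python
import numpy as np

def logdet_divergence(W, W0):
    """D_ld(W, W0) = tr(W W0^{-1}) - log det(W W0^{-1}) - d."""
    d = W.shape[0]
    M = W @ np.linalg.inv(W0)
    _, logdet = np.linalg.slogdet(M)
    return float(np.trace(M) - logdet - d)

# The divergence is zero iff W == W0 and nonnegative for positive definite inputs.
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
W0 = A @ A.T + 4 * np.eye(4)                  # a positive definite prior
print(logdet_divergence(W0, W0))              # ~0
print(logdet_divergence(W0 + np.eye(4), W0))  # > 0
```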
Given the similarity constraints S and dissimilarity constraints D, the following problem is proposed in [1]; it keeps W close to the identity in LogDet divergence while forcing similar pairs to lie within distance u and dissimilar pairs to be at least distance l apart:
[math]\displaystyle{ \min_{W \succeq 0} \; D_{\ell d}(W, I) \quad \text{s.t.} \quad d_W(\phi(x_i),\phi(x_j)) \le u \;\; \forall (i,j)\in S, \qquad d_W(\phi(x_i),\phi(x_j)) \ge l \;\; \forall (i,j)\in D }[/math]
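To make the optimization concrete, the sketch below performs cyclic Bregman projections under the LogDet divergence, projecting onto one violated constraint at a time via the closed-form rank-one update W ← W + β W v v^T W with v = x_i − x_j. This is a simplified illustration (hard constraints, identity prior, no slack variables, explicit features), not the full algorithm from the paper.

```python
import numpy as np

def project_onto_constraint(W, v, bound):
    """LogDet (Bregman) projection of W onto {W : v^T W v = bound} via a rank-one update."""
    p = float(v @ W @ v)
    beta = (bound - p) / (p ** 2)
    Wv = W @ v
    return W + beta * np.outer(Wv, Wv)

def logdet_metric_learning(X, S, D, u, l, n_iters=50):
    """Cyclic projections: similar pairs pushed to distance <= u, dissimilar pairs to >= l."""
    W = np.eye(X.shape[1])
    for _ in range(n_iters):
        for (i, j) in S:
            v = X[i] - X[j]
            if v @ W @ v > u:                 # violated similarity constraint
                W = project_onto_constraint(W, v, u)
        for (i, j) in D:
            v = X[i] - X[j]
            if v @ W @ v < l:                 # violated dissimilarity constraint
                W = project_onto_constraint(W, v, l)
    return W
```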
We can also obtain the kernelized form of the above formulation [2]. Writing [math]\displaystyle{ \Phi = [\phi(x_1),\dots,\phi(x_n)] }[/math], [math]\displaystyle{ K_0 = \Phi^T\Phi }[/math] for the input kernel matrix, [math]\displaystyle{ K = \Phi^T W \Phi }[/math] for the learned kernel matrix, and [math]\displaystyle{ e_i }[/math] for the i-th standard basis vector, the problem becomes:
[math]\displaystyle{ \min_{K \succeq 0} \; D_{\ell d}(K, K_0) \quad \text{s.t.} \quad (e_i-e_j)^T K (e_i-e_j) \le u \;\; \forall (i,j)\in S, \qquad (e_i-e_j)^T K (e_i-e_j) \ge l \;\; \forall (i,j)\in D }[/math]
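A kernel-space analogue of the projection above replaces the difference vector x_i − x_j with e_i − e_j and updates the n × n kernel matrix directly. The sketch below is again a simplified, hard-constraint illustration rather than the algorithm as given in the paper:

```python
import numpy as np

def kernel_projection(K, i, j, bound):
    """LogDet projection in kernel space: K <- K + beta * K (e_i - e_j)(e_i - e_j)^T K."""
    e = np.zeros(K.shape[0])
    e[i], e[j] = 1.0, -1.0
    p = float(e @ K @ e)              # current learned distance between points i and j
    beta = (bound - p) / (p ** 2)
    Ke = K @ e
    return K + beta * np.outer(Ke, Ke)
```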
The paper proves that the optimal solutions of the above two problems are connected. If K* denotes the optimal learned kernel matrix and W* the optimal learned Mahalanobis matrix, they are related as follows:
[math]\displaystyle{ K^* = \Phi^T W^* \Phi, \qquad W^* = I + \Phi K_0^{-1}(K^*-K_0)K_0^{-1}\Phi^T }[/math]
The identity matrix I can be generalized to an arbitrary positive definite prior W0 and the theory still holds. To compute the out-of-sample extension for new points z_1 and z_2, the inner product of the points after the learned linear transformation can be expressed entirely through inner products (kernel values) of the points before the transformation:
[math]\displaystyle{ \phi(z_1)^T W^* \phi(z_2) = \phi(z_1)^T\phi(z_2) + k_1^T K_0^{-1}(K^*-K_0)K_0^{-1} k_2 }[/math]
where [math]\displaystyle{ k_m = \Phi^T\phi(z_m) }[/math] is the vector of input kernel values between [math]\displaystyle{ z_m }[/math] and the training points.
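Below is a minimal sketch of this out-of-sample computation, assuming a learned kernel matrix K_star and the input kernel matrix K0 are already available from a previous learning step; all example values are arbitrary stand-ins.

```python
import numpy as np

def learned_kernel(k_z1_z2, k1, k2, K0, K_star):
    """Out-of-sample learned inner product:
    k(z1, z2) + k1^T K0^{-1} (K* - K0) K0^{-1} k2,
    where k1, k2 hold input kernel values between z1, z2 and the training points."""
    K0_inv = np.linalg.inv(K0)
    M = K0_inv @ (K_star - K0) @ K0_inv
    return float(k_z1_z2 + k1 @ M @ k2)

# Example with a linear kernel on toy data (so phi(x) = x and K0 = X X^T):
rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))
K0 = X @ X.T
K_star = K0 + 0.1 * np.eye(5)         # stand-in for a learned kernel matrix
z1, z2 = rng.normal(size=3), rng.normal(size=3)
print(learned_kernel(z1 @ z2, X @ z1, X @ z2, K0, K_star))
```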