positive Semidefinite Metric Learning Using Boosting-like Algorithms

From statwiki
Jump to navigation Jump to search

The BoostMetric algorithm proposed in the paper aims to learn a Mahlanobis distance. It is claimed to be faster than the previous metric learning algorithms because it uses a special optimization technique to solve the semi-definite programming (SDP) problem.


1. Mathematical model for the problem

The optimization problem of BoostMetric is almost the same as the large margin nearest neighbor algorithm (LMNN [1]). In the preprocessing step, the labeled training samples are required to be transformed into "triplets" (a i, a j, a k), where a i and a j are in the same class, but a i and a k are in different classes. Let us denote dist i,j and disti,k as the distance between ai and aj and the distance between ai and ak separately. The goal is to maximize the difference between these two distances.

Here the distance is Mahlanobis matrix represented as follows:

[math]\displaystyle{ dist_{ij}^{2}=\left \| L^Ta_i-L^Ta_j \right \|_2^2=(a_i-a_j)^TLL^T(a_i-a_j)=(a_i-a_j)^TX(a_i-a_j). }[/math]

where we can see that the matrix X is a positive semi definite (p.s.d.) matrix.

We try to maximize the difference of two distances:

[math]\displaystyle{ dist_{ij}^2-dist_{ik}^2=\left \langle A_r,X \right \rangle }[/math]

where

[math]\displaystyle{ A_r=(a_i-a_k)(a_i-a_k)^T-(a_i-a_j)(a_i-a_j)^T }[/math]

for all triplets (i,j,k).

Here X is unknown. To find the optimal X, we need to solve a semi-definite programming problem: to minimize [math]\displaystyle{ \sum_{r}F(\left \langle A_r,X \right \rangle)+vTr(X) }[/math] where [math]\displaystyle{ F(\cdot ) }[/math] is a convex loss function. The paper provides three loss functions for modeling: hinge loss, exponential loss and logistic loss. Also the trace norm regularization can be replaced by the Frobenius norm.

2. Optimization problems for three loss functions

For the hinge loss function, the optimization problem is formulated as:

[math]\displaystyle{ max_{\rho ,\omega, \xi }\rho-v\sum_{r=1}^{\left | I \right |}\xi_r, s.t.: \left \langle A_r, X \right \rangle \geq \rho-\xi_r, \forall r; X \succeq 0, Tr(X)=1; \xi \geq 0. }[/math]

where the constraint [math]\displaystyle{ Tr(X)=1 }[/math] is to remove the "scale ambiguity".

Based on some theories, X can be decomposed into a convex combination of rank-one matrices. Also their traces are constrained to one:

[math]\displaystyle{ X=\sum_{j=1}^{J}w_jZ_j }[/math], with [math]\displaystyle{ w_j\gt 0 }[/math], [math]\displaystyle{ Rank(Z_j)=1 }[/math] and [math]\displaystyle{ Tr(Z_j)=1, \forall j. }[/math]

Substituting X, we can get

[math]\displaystyle{ \left \langle A_r,X\right \rangle=\left \langle A_r,\sum_{j=1}^{J}w_jZ_j\right \rangle=\sum_{j=1}^{J}w_j\left \langle A_r,Z_j\right \rangle=\sum_{j=1}^Jw_jH_{rj}=H_{r:}w, \forall r. }[/math]

Then the optimization problem can be reformulated as:

[math]\displaystyle{ max_{\rho ,\omega, \xi }\rho-v\sum_{r=1}^{\left | I \right |}\xi_r, s.t.: H_{r:}\omega\geq \rho-\xi_r, (r=1,...,\left | I \right |); \omega \geq 0, 1^T\omega=1;\xi \geq 0. }[/math]

Its Lagrange dual problem is

[math]\displaystyle{ min_{\pi, u}\pi s.t.: \sum_{r=1}^{\left | I \right |}u_rH_{r:} \leq \pi\mathbf{1}^T; \mathbf{1}^Tu=1, 0 \leq u \leq v\mathbf{1}. }[/math]