positive Semidefinite Metric Learning Using Boosting-like Algorithms

From statwiki
Revision as of 20:14, 4 July 2013 by F33li (talk | contribs)
Jump to navigation Jump to search

The BoostMetric algorithm proposed in the paper aims to learn a Mahlanobis distance. It is claimed to be faster than the previous metric learning algorithms because it uses a special optimization technique to solve the semi-definite programming (SDP) problem.


1. Mathematical model for the problem

The optimization problem of BoostMetric is almost the same as the large margin nearest neighbor algorithm (LMNN [1]). In the preprocessing step, the labeled training samples are required to be transformed into "triplets" (a i, a j, a k), where a i and a j are in the same class, but a i and a k are in different classes. Let us denote dist i,j and disti,k as the distance between ai and aj and the distance between ai and ak separately. The goal is to maximize the difference between these two distances.

Here the distance is Mahlanobis matrix represented as follows:

[math]\displaystyle{ dist_{ij}^{2}=\left \| L^Ta_i-L^Ta_j \right \|_2^2=(a_i-a_j)^TLL^T(a_i-a_j)=(a_i-a_j)^TX(a_i-a_j). }[/math]

where we can see that the matrix X is a positive semi definite (p.s.d.) matrix.

We try to maximize the difference of two distances:

[math]\displaystyle{ dist_{ij}^2-dist_{ik}^2=\left \langle A_r,X \right \rangle }[/math]

where

[math]\displaystyle{ A_r=(a_i-a_k)(a_i-a_k)^T-(a_i-a_j)(a_i-a_j)^T }[/math]

for all triplets (i,j,k).

Here X is unknown. To find the optimal X, we need to solve a semi-definite programming problem: to minimize [math]\displaystyle{ \sum_{r}F(\left \langle A_r,X \right \rangle)+vTr(X) }[/math] where [math]\displaystyle{ F(\cdot ) }[/math] is a convex loss function. The paper provides three loss functions for modeling: hinge loss, exponential loss and logistic loss. Also the trace norm regularization can be replaced by the Frobenius norm.

2. Optimization problems for three loss functions

For the hinge loss function, the optimization problem is formulated as:

[math]\displaystyle{ max_{\rho ,\omega, \xi }\rho-v\sum_{r=1}^{\left | I \right |}\xi_r, s.t.: \left \langle A_r, X \right \rangle \geq \rho-\xi_r, \forall r; X \succeq 0, Tr(X)=1; \xi \geq 0. }[/math]

where the constraint [math]\displaystyle{ Tr(X)=1 }[/math] is to remove the "scale ambiguity".

Based on some theories, X can be decomposed into a convex combination of rank-one matrices. Also their traces are constrained to one:

[math]\displaystyle{ X=\sum_{j=1}^{J}w_jZ_j }[/math], with [math]\displaystyle{ w_j\gt 0 }[/math], [math]\displaystyle{ Rank(Z_j)=1 }[/math] and [math]\displaystyle{ Tr(Z_j)=1, \forall j. }[/math]

Substituting X, we can get

[math]\displaystyle{ \left \langle A_r,X\right \rangle=\left \langle A_r,\sum_{j=1}^{J}w_jZ_j\right \rangle=\sum_{j=1}^{J}w_j\left \langle A_r,Z_j\right \rangle=\sum_{j=1}^Jw_jH_{rj}=H_{r:}w, \forall r. }[/math]

Then the optimization problem can be reformulated as:

[math]\displaystyle{ max_{\rho ,\omega, \xi }\rho-v\sum_{r=1}^{\left | I \right |}\xi_r, s.t.: H_{r:}\omega\geq \rho-\xi_r, (r=1,...,\left | I \right |); \omega \geq 0, 1^T\omega=1;\xi \geq 0. }[/math]

For exponential loss function, the optimization problem after substituting X is:

[math]\displaystyle{ min_{\rho, \omega}log (\sum_{r=1}^{\left | I \right |}exp(-\rho_r))+v\mathbf{1}^T\mathbf{\omega} }[/math]

[math]\displaystyle{ s.t.: \rho_r=H_{r:}\omega, r=1,...,|I|; \omega \geq 0. }[/math]

For logistic loss function, the optimization problem is:

[math]\displaystyle{ min_{\rho,\omega}\sum_{r=1}^{|I|}logit(\rho_r)+v\mathbf{1}^T\mathbf{\omega} }[/math]

[math]\displaystyle{ s.t.: \rho_r=H_{r:}\omega, r=1,...,|I|; \omega \geq 0. }[/math]


3. Optimization strategies

Solving the indefinite number of rank one matrices is difficult. But if we assume that those matrices are known as a priori, the problem will turn out to be an LP problem. However, the choices of the one rank matrices are infinite. So the column generation (CG) method which is often used to solve large-scale optimization problems is used here. The basic idea of CG is to find the variables with negative reduced costs without explicitly searching all variables. At the beginning, we can only consider a small subset of Z, which is called "restricted master problem (RMP)". Then we turn to the dual problem and see if any constraint that has not been added is violated. If there is such a constraint, we need to add it into RMP and re-optimize it. If no constraint is violated, the solution of the restricted problem is equivalent to the solution of the original problem.

In our case, we can set the primal coefficients to be zero at the beginning, and find the optimal Z in each iteration until there is no violation. This is the so-called base learning algorithm. Take the hinge loss optimization problem as an example. According to the dual problem, we want to find a new Z' such that

[math]\displaystyle{ \sum_{r=1}^{|S|}\left \langle A^r,Z' \right \rangle \omega_r \gt \tilde{\pi} }[/math]

To make convergence fast, the optimal base is the one that maximizes the LHS. It can be solved by eigen-decomposition. In each iteration, such rank-one matrix is found. If it is smaller than [math]\displaystyle{ \pi }[/math], the solution is already optimal. If not, the new base is added into the RMP.

The paper provides two boosting strategy. One is called "stage-wise boosting", which is more similar to the "AdaBoost"[2] algorithm for classification. The weights are fixed once they are calculated, and the new weights are calculated using the previous weights. The bisection method is used to search the optimal weight. The other strategy is called "totally corrective boosting", in which the weights are optimized in each iteration. It is similar to the "TotalBoost"[3] classification algorithm.