Positive Semidefinite Metric Learning Using Boosting-like Algorithms

A Mahalanobis distance metric learning method is proposed in the paper. The training samples are arranged into "triplets" (a<sub>i</sub>, a<sub>j</sub>, a<sub>k</sub>), where a<sub>i</sub> and a<sub>j</sub> are in the same class, but a<sub>i</sub> and a<sub>k</sub> are in different classes. The goal is to find a positive semidefinite matrix X such that, for every triplet, the distance dist(i,k) exceeds dist(i,j) by as large a margin as possible in the transformed space. This can be formulated as a semidefinite programming (SDP) problem. The authors propose to replace X by a convex combination of rank-one p.s.d. matrices. Instead of solving for these matrices directly, one can pretend that all possible rank-one matrices are known in advance; however, there are infinitely many of them, so the dual problem would have far too many constraints to solve. The authors therefore use column generation: in each iteration, the most violated constraint of the dual problem is found and added to the optimization, and the weight for the corresponding rank-one matrix is computed. This resembles the classification method "boosting", which turns many weak classifiers into a strong classifier; here the rank-one matrices play the role of the weak classifiers. When no more constraints can be added, the final learned matrix is the weighted combination of the rank-one matrices collected so far.

Three types of objective functions are proposed in the paper: hinge loss, exponential loss and logistic loss. In each iteration, a new base (rank-one matrix) is computed using an eigenvalue decomposition. There are two strategies for updating the weights. One is "stage-wise boosting", in which the weight for the current base is computed while all previous weights are held fixed; the bisection method is used to search for the optimal weight. The other is "totally corrective boosting", in which all weights are re-optimized in each iteration. The experimental results show that this method is faster than the LMNN algorithm as the input dimension increases, while its test error is even slightly lower than that of LMNN.
The proposed algorithm, called BoostMetric, is claimed to be faster than previous metric learning algorithms because it uses a special, boosting-like optimization technique to solve the semidefinite programming (SDP) problem instead of relying on a generic SDP solver.
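
The stage-wise weight update mentioned above reduces to a one-dimensional search that the paper solves by bisection. Below is a minimal, generic sketch of such a bisection step; the callable <code>dobj</code> (the derivative of the one-dimensional objective in the weight) is a placeholder, since its exact form depends on the chosen loss and is not reproduced here.

<syntaxhighlight lang="python">
def bisection_weight(dobj, lo=0.0, hi=10.0, tol=1e-6, max_iter=100):
    """Find a weight w in [lo, hi] with dobj(w) = 0 by bisection.

    Assumes dobj is monotone on [lo, hi] and changes sign there,
    which is the setting in which a bisection line search applies.
    """
    f_lo = dobj(lo)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = dobj(mid)
        if abs(f_mid) < tol or hi - lo < tol:
            return mid
        # keep the half-interval on which the sign change occurs
        if (f_lo < 0) == (f_mid < 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
</syntaxhighlight>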
 
 
== 1. Mathematical model for the problem ==
 
 
The optimization problem of BoostMetric is almost the same as that of the large margin nearest neighbor algorithm (LMNN [1]). In the preprocessing step, the labeled training samples are transformed into "triplets" (a<sub>i</sub>, a<sub>j</sub>, a<sub>k</sub>), where a<sub>i</sub> and a<sub>j</sub> are in the same class, but a<sub>i</sub> and a<sub>k</sub> are in different classes. Let dist<sub>ij</sub> denote the distance between a<sub>i</sub> and a<sub>j</sub>, and dist<sub>ik</sub> the distance between a<sub>i</sub> and a<sub>k</sub>. The goal is to make dist<sub>ik</sub> larger than dist<sub>ij</sub> by as large a margin as possible.
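
A minimal sketch of this triplet-generation step. The paper builds triplets from nearest neighbors (as in LMNN); purely for illustration, the sketch pairs each sample with its nearest same-class neighbor and its nearest different-class neighbor under the Euclidean distance, which is an assumption rather than the paper's exact protocol.

<syntaxhighlight lang="python">
import numpy as np

def build_triplets(A, y):
    """Return index triplets (i, j, k): (i, j) same class, (i, k) different classes."""
    n = A.shape[0]
    # pairwise squared Euclidean distances
    D = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    triplets = []
    for i in range(n):
        same = np.where((y == y[i]) & (np.arange(n) != i))[0]
        diff = np.where(y != y[i])[0]
        if len(same) == 0 or len(diff) == 0:
            continue
        j = same[np.argmin(D[i, same])]   # closest same-class sample
        k = diff[np.argmin(D[i, diff])]   # closest different-class sample
        triplets.append((i, j, k))
    return triplets
</syntaxhighlight>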
 
Here the distance is a Mahalanobis distance, defined as follows:
 
<math>dist_{ij}^{2}=\left \| L^Ta_i-L^Ta_j \right \|_2^2=(a_i-a_j)^TLL^T(a_i-a_j)=(a_i-a_j)^TX(a_i-a_j).</math>
 
where the matrix <math>X=LL^T</math> is positive semidefinite (p.s.d.).
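
As a quick numerical check of this identity, the following snippet (with example data only) computes the squared distance both through the projection L and through X = LL<sup>T</sup>.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d = 5
L = rng.standard_normal((d, 3))          # projection matrix
X = L @ L.T                              # induced p.s.d. matrix
a_i, a_j = rng.standard_normal(d), rng.standard_normal(d)

dist_proj = np.sum((L.T @ a_i - L.T @ a_j) ** 2)   # ||L^T a_i - L^T a_j||^2
dist_maha = (a_i - a_j) @ X @ (a_i - a_j)          # (a_i - a_j)^T X (a_i - a_j)
assert np.isclose(dist_proj, dist_maha)
</syntaxhighlight>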
 
To simplify the calculation, we maximize the difference of the squared distances, <math>dist_{ik}^2-dist_{ij}^2</math>, instead. This difference can be written as an inner product with X:

<math>dist_{ik}^2-dist_{ij}^2=\left \langle A_r,X \right \rangle</math>
 
where
 
<math>A_r=(a_i-a_k)(a_i-a_k)^T-(a_i-a_j)(a_i-a_j)^T</math>
 
for all triplets (i,j,k).
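
The matrix inner product here is <A, X> = Tr(A<sup>T</sup>X). A short, self-contained sketch that builds A<sub>r</sub> for one triplet and verifies that <A<sub>r</sub>, X> equals the squared-distance difference:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
d = 5
L = rng.standard_normal((d, 3))
X = L @ L.T
a_i, a_j, a_k = rng.standard_normal((3, d))

A_r = np.outer(a_i - a_k, a_i - a_k) - np.outer(a_i - a_j, a_i - a_j)
margin = np.trace(A_r.T @ X)                        # <A_r, X>

dist_ik = (a_i - a_k) @ X @ (a_i - a_k)
dist_ij = (a_i - a_j) @ X @ (a_i - a_j)
assert np.isclose(margin, dist_ik - dist_ij)
</syntaxhighlight>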
 
Here X is unknown. To find the optimal X, we need to solve a semidefinite programming problem: minimize <math>\sum_{r}F(\left \langle A_r,X \right \rangle)+v\,Tr(X)</math>, where <math>F(\cdot)</math> is a convex loss function and <math>v</math> is a regularization parameter. The paper provides three loss functions for modeling: hinge loss, exponential loss and logistic loss. The trace-norm regularization can also be replaced by the Frobenius norm.
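
For illustration, here are common convex surrogates applied to the margin z = <A<sub>r</sub>, X>, together with the regularized objective. The exact scaling and form used in the paper (for example, a log-sum-exp version of the exponential cost) may differ, so these should be read as generic representatives of the three loss families rather than the paper's definitions.

<syntaxhighlight lang="python">
import numpy as np

def hinge(z):        return np.maximum(0.0, 1.0 - z)   # hinge loss
def exponential(z):  return np.exp(-z)                  # exponential loss
def logistic(z):     return np.log1p(np.exp(-z))        # logistic loss

def objective(X, A_list, loss, v):
    """Sum of losses on the margins <A_r, X> plus trace regularization."""
    margins = np.array([np.trace(A.T @ X) for A in A_list])
    return loss(margins).sum() + v * np.trace(X)
</syntaxhighlight>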
 
 
== 2. Optimization problems for three loss functions ==
 
 
For the hinge loss function, the optimization problem is formulated as:
 
<math>\max_{\rho, X, \xi}\;\rho-v\sum_{r=1}^{\left | I \right |}\xi_r, \quad s.t.: \left \langle A_r, X \right \rangle \geq \rho-\xi_r, \forall r; \; X \succeq 0, \; Tr(X)=1; \; \xi \geq 0.</math>
 
where <math>\left | I \right |</math> is the number of triplets and the constraint <math>Tr(X)=1</math> removes the "scale ambiguity" of X.
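
For reference, the hinge-loss primal above can be handed directly to a generic SDP solver. A sketch using the cvxpy modelling library follows; this is not what the paper does (avoiding generic SDP solvers is precisely its point), but it makes the formulation concrete.

<syntaxhighlight lang="python">
import cvxpy as cp

def solve_hinge_sdp(A_list, v):
    """Solve the hinge-loss formulation with a generic SDP solver (for comparison only)."""
    d = A_list[0].shape[0]
    X = cp.Variable((d, d), PSD=True)
    rho = cp.Variable()
    xi = cp.Variable(len(A_list), nonneg=True)
    constraints = [cp.trace(A_list[r] @ X) >= rho - xi[r] for r in range(len(A_list))]
    constraints.append(cp.trace(X) == 1)
    problem = cp.Problem(cp.Maximize(rho - v * cp.sum(xi)), constraints)
    problem.solve()
    return X.value, rho.value
</syntaxhighlight>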
   
Any trace-one p.s.d. matrix can be decomposed into a convex combination of rank-one matrices whose traces are also one, so X can be written as:
 
<math>X=\sum_{j=1}^{J}w_jZ_j</math>, with <math>w_j>0</math>, <math>Rank(Z_j)=1</math> and <math>Tr(Z_j)=1, \forall j.</math>
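
Concretely, such a decomposition can be read off the eigendecomposition of X: the eigenvectors give trace-one rank-one matrices Z<sub>j</sub> = u<sub>j</sub>u<sub>j</sub><sup>T</sup> and the eigenvalues give the weights. A small sketch:

<syntaxhighlight lang="python">
import numpy as np

def rank_one_decomposition(X, tol=1e-10):
    """Decompose a p.s.d. matrix X into sum_j w_j Z_j with rank-one, trace-one Z_j."""
    eigvals, eigvecs = np.linalg.eigh(X)
    ws, Zs = [], []
    for lam, u in zip(eigvals, eigvecs.T):
        if lam > tol:
            ws.append(lam)                 # weight w_j; the weights sum to Tr(X)
            Zs.append(np.outer(u, u))      # rank-one, trace-one base Z_j
    return np.array(ws), Zs

# with Tr(X) = 1 the weights sum to one, i.e. the combination is convex
</syntaxhighlight>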
 
Since <math>Tr(X)=1</math> and <math>Tr(Z_j)=1</math>, the weights satisfy <math>\sum_{j}w_j=1</math>. Substituting this decomposition of X, we get
 
<math>\left \langle A_r,X\right \rangle=\left \langle A_r,\sum_{j=1}^{J}w_jZ_j\right \rangle=\sum_{j=1}^{J}w_j\left \langle A_r,Z_j\right \rangle=\sum_{j=1}^Jw_jH_{rj}=H_{r:}w, \forall r. </math>

where <math>H_{rj}=\left \langle A_r,Z_j \right \rangle</math> and <math>H_{r:}</math> denotes the r-th row of the matrix H.
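
A sketch that builds H from the triplet matrices A<sub>r</sub> and the bases Z<sub>j</sub>, and notes how H w reproduces the margins:

<syntaxhighlight lang="python">
import numpy as np

def build_H(A_list, Z_list):
    """H[r, j] = <A_r, Z_j> = Tr(A_r^T Z_j)."""
    return np.array([[np.trace(A.T @ Z) for Z in Z_list] for A in A_list])

# given a decomposition X = sum_j w[j] * Z_list[j],
# the margins <A_r, X> are recovered row by row as
#   margins = build_H(A_list, Z_list) @ w
</syntaxhighlight>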
 
Then the optimization problem can be reformulated as:
 
<math>\max_{\rho, w, \xi}\;\rho-v\sum_{r=1}^{\left | I \right |}\xi_r, \quad s.t.: H_{r:}w\geq \rho-\xi_r, \; (r=1,...,\left | I \right |); \; w \geq 0, \; \mathbf{1}^Tw=1; \; \xi \geq 0.</math>
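
Once the bases Z<sub>j</sub> (and hence H) are fixed, this is a linear program in (ρ, w, ξ). A sketch using scipy.optimize.linprog, with the maximization rewritten as a minimization; this is for illustration only, since the paper updates the weights inside its boosting loop rather than by calling a generic LP solver on all bases at once.

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import linprog

def solve_weight_lp(H, v):
    """Maximize rho - v*sum(xi) s.t. H w >= rho - xi, w >= 0, 1^T w = 1, xi >= 0."""
    n, J = H.shape                                    # n = |I| triplets, J bases
    # variable vector x = [rho, w_1..w_J, xi_1..xi_n]
    c = np.concatenate(([-1.0], np.zeros(J), v * np.ones(n)))   # minimize -(rho - v*sum(xi))
    # rho - H_{r:} w - xi_r <= 0 for every triplet r
    A_ub = np.hstack([np.ones((n, 1)), -H, -np.eye(n)])
    b_ub = np.zeros(n)
    # sum_j w_j = 1
    A_eq = np.concatenate(([0.0], np.ones(J), np.zeros(n)))[None, :]
    b_eq = np.array([1.0])
    bounds = [(None, None)] + [(0, None)] * (J + n)   # rho free, w and xi nonnegative
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    rho, w = res.x[0], res.x[1:1 + J]
    return rho, w
</syntaxhighlight>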
 
Its Lagrange dual problem is
 
<math>\min_{\pi, u}\;\pi, \quad s.t.: \sum_{r=1}^{\left | I \right |}u_rH_{r:} \leq \pi\mathbf{1}^T; \; \mathbf{1}^Tu=1, \; 0 \leq u \leq v\mathbf{1}.</math>
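
In the boosting view, each dual variable u<sub>r</sub> weights one triplet, and finding the most violated dual constraint amounts to finding the rank-one, trace-one Z that maximizes <Σ<sub>r</sub> u<sub>r</sub>A<sub>r</sub>, Z>; since <Σ<sub>r</sub> u<sub>r</sub>A<sub>r</sub>, vv<sup>T</sup>> = v<sup>T</sup>(Σ<sub>r</sub> u<sub>r</sub>A<sub>r</sub>)v, that maximizer is the outer product of the leading eigenvector, which is why each new base is obtained from an eigenvalue decomposition. A sketch of this column-generation step:

<syntaxhighlight lang="python">
import numpy as np

def next_base(A_list, u):
    """Return the base Z = v v^T maximizing <sum_r u_r A_r, Z> over rank-one, trace-one Z."""
    A_hat = sum(u_r * A_r for u_r, A_r in zip(u, A_list))
    eigvals, eigvecs = np.linalg.eigh(A_hat)
    v = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue
    violation = eigvals[-1]                # value of <A_hat, v v^T>
    return np.outer(v, v), violation

# boosting stops when this largest eigenvalue no longer exceeds the dual value pi,
# i.e. when the new base no longer violates any dual constraint
</syntaxhighlight>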
