positive Semidefinite Metric Learning Using Boosting-like Algorithms
The BoostMetric algorithm proposed by Shen et al. [1] aims to learn a Mahlanobis distance. It is claimed to be faster than the previous metric learning algorithms because it uses a special optimization technique to solve the semi-definite programming (SDP) problem.
1. Intention
Learning a valid Mahalanobis distance metric requires the matrix parameter of the metric to be positive semi-definite. Semi-definite programming is difficult to implement and does not scale well. Based on observation that any positive semi-definite matrix can be decomposed into a linear combination of trace-one rank-one matrices, this paper introduces the BoostMetric method, which uses rank-one positive semi-definite matrices as weak learner within an efficient and scalable boosting-based learning process.
2. Mathematical model of the problem
The optimization problem of BoostMetric is similar to the large margin nearest neighbor algorithm (LMNN [4]). In the preprocessing step, the labeled training samples are required to be transformed into "triplets" (a i, a j, a k), where a i and a j are in the same class, but a i and a k are in different classes. Let us denote dist i,j and disti,k as the distance between ai and aj and the distance between ai and ak separately. The goal is to maximize the difference between these two distances.
Here the distance is Mahalanobis matrix represented as follows:
[math]\displaystyle{ dist_{ij}^{2}=\left \| L^Ta_i-L^Ta_j \right \|_2^2=(a_i-a_j)^TLL^T(a_i-a_j)=(a_i-a_j)^TX(a_i-a_j). }[/math]
where [math]\displaystyle{ L \in \mathbb{R}^{D \times d} }[/math] is a projection matrix. We can see that the matrix X is a positive semi definite (p.s.d.) matrix.
We try to maximize the difference of two distances:
[math]\displaystyle{ dist_{ij}^2-dist_{ik}^2=\left \langle A_r,X \right \rangle }[/math]
where
[math]\displaystyle{ A_r=(a_i-a_k)(a_i-a_k)^T-(a_i-a_j)(a_i-a_j)^T }[/math]
for all triplets (i,j,k).
Here X is unknown. To find the optimal X, we need to solve a semi-definite programming problem: to minimize [math]\displaystyle{ \sum_{r}F(\left \langle A_r,X \right \rangle)+vTr(X) }[/math] where [math]\displaystyle{ F(\cdot ) }[/math] is a convex loss function. The paper provides three loss functions for modeling: hinge loss, exponential loss and logistic loss. Also the trace norm regularization can be replaced by the Frobenius norm.
3. Optimization problems for three loss functions
For the hinge loss function, the optimization problem is formulated as:
[math]\displaystyle{ max_{\rho ,\omega, \xi }\rho-v\sum_{r=1}^{\left | I \right |}\xi_r, s.t.: \left \langle A_r, X \right \rangle \geq \rho-\xi_r, \forall r; X \succeq 0, Tr(X)=1; \xi \geq 0. }[/math]
where the constraint [math]\displaystyle{ Tr(X)=1 }[/math] is to remove the "scale ambiguity".
Based on some theories, X can be decomposed into a convex combination of rank-one matrices. Also their traces are constrained to one:
[math]\displaystyle{ X=\sum_{j=1}^{J}w_jZ_j }[/math], with [math]\displaystyle{ w_j\gt 0 }[/math], [math]\displaystyle{ Rank(Z_j)=1 }[/math] and [math]\displaystyle{ Tr(Z_j)=1, \forall j. }[/math]
Substituting X, we can get
[math]\displaystyle{ \left \langle A_r,X\right \rangle=\left \langle A_r,\sum_{j=1}^{J}w_jZ_j\right \rangle=\sum_{j=1}^{J}w_j\left \langle A_r,Z_j\right \rangle=\sum_{j=1}^Jw_jH_{rj}=H_{r:}w, \forall r. }[/math]
Then the optimization problem can be reformulated as:
[math]\displaystyle{ max_{\rho ,\omega, \xi }\rho-v\sum_{r=1}^{\left | I \right |}\xi_r, s.t.: H_{r:}\omega\geq \rho-\xi_r, (r=1,...,\left | I \right |); \omega \geq 0, 1^T\omega=1;\xi \geq 0. }[/math]
It is very difficult to solve this problem directly. We could consider about its dual problem. The Lagrange dual problem of the above linear programming can be derived as:
[math]\displaystyle{ \underset{\pi , u}{\text{min}} \ \pi \ \text{s.t.:} {\sum}_{r=1}^{|I|} u_r H_{r:} \leq \pi 1^T ; 1^T u=1, 0 \leq u \leq v1 . }[/math]
For exponential loss function, the optimization problem after substituting X is:
[math]\displaystyle{ min_{\rho, \omega}log (\sum_{r=1}^{\left | I \right |}exp(-\rho_r))+v\mathbf{1}^T\mathbf{\omega} }[/math]
[math]\displaystyle{ s.t.: \rho_r=H_{r:}\omega, r=1,...,|I|; \omega \geq 0. }[/math]
For logistic loss function, the optimization problem is:
[math]\displaystyle{ min_{\rho,\omega}\sum_{r=1}^{|I|}logit(\rho_r)+v\mathbf{1}^T\mathbf{\omega} }[/math]
[math]\displaystyle{ s.t.: \rho_r=H_{r:}\omega, r=1,...,|I|; \omega \geq 0. }[/math]
4. Optimization strategies
Solving the indefinite number of rank one matrices is difficult. But if we assume that those matrices are known as a priori, the problem will turn out to be an LP problem. However, the choices of the one rank matrices are infinite. So the column generation (CG) method which is often used to solve large-scale optimization problems is used here. The basic idea of CG is to find the variables with negative reduced costs without explicitly searching all variables. At the beginning, we can only consider a small subset of Z, which is called "restricted master problem (RMP)". Then we turn to the dual problem and see if any constraint that has not been added is violated. If there is such a constraint, we need to add it into RMP and re-optimize it. If no constraint is violated, the solution of the restricted problem is equivalent to the solution of the original problem.
In our case, we can set the primal coefficients to be zero at the beginning, and find the optimal Z in each iteration until there is no violation. This is the so-called base learning algorithm. Take the hinge loss optimization problem as an example. According to the dual problem, we want to find a new Z' such that
[math]\displaystyle{ \sum_{r=1}^{|S|}\left \langle A^r,Z' \right \rangle \omega_r \gt \tilde{\pi} }[/math]
To make convergence fast, the optimal base is the one that maximizes the LHS. It can be solved by eigen-decomposition. In each iteration, such rank-one matrix is found. If it is smaller than [math]\displaystyle{ \pi }[/math], the solution is already optimal. If not, the new base is added into the RMP.
The proposed method is similar to the boosting methods because it tries to approximate the target matrix (strong classifier) by a weighted combination of the rank-one matrices (weak classifiers). The key of boosting is to calculate the weights. The paper provides two strategies. One is called "stage-wise boosting", which is more similar to the "AdaBoost"[2] algorithm for classification. The weights are fixed once they are calculated, and the new weights are calculated using the previous weights. The bisection method is used to search the optimal weight. The other strategy is called "totally corrective boosting", in which the weights are optimized in each iteration. It is similar to the "TotalBoost"[3] classification algorithm.
The whole algorithm is as follows:
Input: Training set triplets [math]\displaystyle{ (a_i , a_j , a_k) \in I }[/math];
[math]\displaystyle{ M }[/math]: maximum number of iterations;
threshold [math]\displaystyle{ \pi }[/math];
1. Initialize by assigning uniform weights: [math]\displaystyle{ w_r=\frac{1}{|S|}, r= 1,...,|S| }[/math]
2. For j=1 to M
(a) Find an optimal new base Z' using eigen-decomposition;
(b) if the largest eigenvalue found in step (a) [math]\displaystyle{ \lt \pi }[/math] then break;
(c) Add Z' to the restricted master problem;
(d) Compute [math]\displaystyle{ {\omega}_i }[/math] using either stage-wise boosting or totally corrective boosting and update other parameters;
end
Output: the leaned p.s.d. matrix [math]\displaystyle{ X\in \mathbb{R}^{D\times D}, X=\sum_{i=1}^M{\omega}_i Z^i. }[/math]
5. Test results
The test results show that this method is faster than the LMNN algorithm as the input dimensions increase (Figure 1), while the test error is even slight lower than LMNN (Figure 2). The overall testing results are in Table 1.
Figure 1. Comparison on running time
Figure 2. Comparison on accuracy
Table 1. Overall test results
5. Discussion
The BoostMetric algorithm has following properties:
(1). BoostMetric is efficient and scalable. In constructing the Mahalanobis metric (parameterized by matrix [math]\displaystyle{ X }[/math]), the matrix [math]\displaystyle{ X }[/math] is enforced to be p.s.d. and thus we can get rid of p.s.d. constraint in the optimization. This fact leads to a efficient computation, but the price is that the solution is only locally optimal (not necessarily globally optimal).
(2). BoostMetric does not involve extra tuning parameters. User only need to specify the stopping criterion for the iterative computation.
(3). BoostMetric, as shown above, is compatible with different types of constraints, which allows an easy extension in other learning problems.
References
[1] Shen, Chunhua, et al. "Positive semidefinite metric learning using boosting-like algorithms." The Journal of Machine Learning Research 98888 (2012): 1007-1036.
[2] Freund, Yoav, and Robert E. Schapire. "A desicion-theoretic generalization of on-line learning and an application to boosting." Computational learning theory. Springer Berlin Heidelberg, 1995.
[3] Warmuth, Manfred K., Jun Liao, and Gunnar Rätsch. "Totally corrective boosting algorithms that maximize the margin." Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
[4] Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.