parametric Local Metric Learning for Nearest Neighbor Classification
Overview
This paper addresses the problem of supervised local metric learning for nearest neighbor (or generally K-NN) classifiers. Here, by local, we mean that separate metrics are found for each data point. These metrics are formed by a linear combination of some "basis metrics". In the algorithm, both the basis metrics and the local weights of each data point are learned. While basis metrics are learned in a supervised way, the local weights are learned in an unsupervised manner, which reduces the risk of overfitting. Furthermore, the authors use the idea of manifold regularization to further overcome overfitting. Also, large margin triplet constraints are used to find basis metrics, which further improves the results.
Prepared Knowledge
Definition A vector-valued function [math]\displaystyle{ f(x) }[/math] on [math]\displaystyle{ {\mathbb{R}}^d }[/math] is a [math]\displaystyle{ (\alpha , \beta , p) }[/math]-Lipschitz smooth function with respect to a vector norm [math]\displaystyle{ \| . \| }[/math] if [math]\displaystyle{ \| f(x) - f(x^') \| \leq \alpha \| x-x^' \| }[/math] and [math]\displaystyle{ \| f(x) - f(x^') - \nabla {f(x^')}^T (x-x^') \| \leq \beta \| x-x^' \|^{1+p} }[/math], where [math]\displaystyle{ \nabla {f(x^')}^T }[/math] is the derivative of the f function on [math]\displaystyle{ x^' }[/math]. We assume [math]\displaystyle{ \alpha , \beta \gt 0 }[/math] and [math]\displaystyle{ p \in (0,1) }[/math].
Introduction
The squared Mahalanobis distance between two instances is given by
[math]\displaystyle{ d^{2}_{M} = (x_{i} - x_{j})^{T}M(x_{i} - x_{j}) }[/math]
One way of having local metrics, is to define different distance metrics for different training samples.
[math]\displaystyle{ d^{2}_{M_i} = (x_{i} - x_{j})^{T}M_i(x_{i} - x_{j}) }[/math]
In fact, this way of defining local metrics does not satisfy the symmetric property of a metric. However, although not a metric, in the paper it is called metric learning. Nevertheless, this abuse of notation can give intuitions.
We assume that each local metric [math]\displaystyle{ M_i }[/math] is a linear combination of the basis metrics, [math]\displaystyle{ M_{b_k} }[/math]
[math]\displaystyle{ {M_i} = \sum_{b_k}W_{i,b_k}M_{b_k}, W_{i,b_k} \geq 0, \sum_{b_k}W_{i,b_k}=1 }[/math]
As can be seen, the basis metrics [math]\displaystyle{ M_{b_k} }[/math] and the weights [math]\displaystyle{ W_{i,b_k} }[/math] should be learned; once they are known, we will have the local metric (well, if we can call it a metric).
Furthermore, each basis metric corresponds to an anchor point in the input space. These anchor points are found using k-means clustering (other methods could also be used).
Learning the Weights
The weights are learned in an unsupervised way. In fact, the weights do not even depend on the basis metrics [math]\displaystyle{ M_{b_k} }[/math]. We compute the weights in a way that
- Each training point is approximated as a weighted combination of the anchor points.
- The weights should be local; if a training sample is far from an anchor point, the corresponding weight should be small.
- The weights of nearby training points should be similar (i.e., manifold regularization)
Therefore, the optimization problem is defined as follows
[math]\displaystyle{ min_{w}g(W) = \Vert X-WU \Vert^2_F + \lambda_{1}tr(WG) + \lambda_{2}tr(W^{T}LW) }[/math]
[math]\displaystyle{ s.t. W_{i,b_k} \geq 0, \sum_{b_k}W_{i,b_k}=1 }[/math]
where [math]\displaystyle{ G }[/math] is the distance matrix between the anchor points and the training points. [math]\displaystyle{ L }[/math] is the Laplacian matrix over the training points which can be computed used k-nearest-neighbor graph.
This optimization problem is quadratic (and also convex) and has a unique solution. It can be solved using gradient based methods. The authors use a projected gradient method and improve it by first order FISTA method.
Algorithm: Smooth local linear weights learning
Input: initial weight matrix [math]\displaystyle{ W^{0} }[/math], and [math]\displaystyle{ X,U,G,L,\lambda_{1},\lambda_{2} }[/math]
Output: the weights [math]\displaystyle{ W }[/math]
Initialize:
1). Define [math]\displaystyle{ \tilde{g}_{\beta,Y}(W)=g(Y)+tr\left\{ \nabla g(Y)^{T}(W-Y) \right\}+\frac{\beta}{2}\Vert W-Y \Vert^2_F }[/math]
2). initialize [math]\displaystyle{ t_{1}=1,\beta=1,Y^{1}=W }[/math] and [math]\displaystyle{ i=0 }[/math]
3). set [math]\displaystyle{ i=i+1 }[/math], [math]\displaystyle{ W^{i}=Proj((Y^{i}-\frac{1}{\beta} \nabla g(Y^{i}))) }[/math]
4). IF [math]\displaystyle{ g(W^{i})\gt \tilde{g}_{\beta,Y^{i}}(W^{i}) }[/math], set [math]\displaystyle{ \beta=2\beta }[/math], [math]\displaystyle{ W^{i}=Proj((Y^{i}-\frac{1}{\beta} \nabla g(Y^{i}))) }[/math]
5). [math]\displaystyle{ t_{i+1}=\frac{1+\sqrt{1+4t_{i}^{2}}}{2} , Y^{i+1}=W^{i}+\frac{t_{i}-1}{t_{i+1}}(W^{i}-W^{i-1}) }[/math]
6). Repeat 3 -- 5 until converges
Learning the Basis Metrics
The basis metrics, which correspond to the anchor points, are learned in a supervised way. In fact, they are learned in a way that the triplet constraints are satisfied. However, this might not be possible all the time. Therefore, they use a soft margin constraint.
[math]\displaystyle{ min_{M_{b_k}, \epsilon} \alpha_{1}\sum_{b_l}\Vert M_{b_l} \Vert^2_F + \sum_{i,j,k}\epsilon_{i,j,k} +\alpha_{2}\sum_{i,j}\sum_{b_l}W_{i,b_l}d_{M_{b_l}}^{2}(x_i,x_j) }[/math]
[math]\displaystyle{ s.t. \sum_{b_l}(W_{i,b_l}d_{M_{b_l}}^{2}(x_i,x_j)-W_{i,b_l}d_{M_{b_l}}^{2}(x_i,x_j)) \geq 1-\epsilon_{i,j,k} , \epsilon_{i,j,k} \geq 0, M_{b_l}\succeq 0 }[/math]
The soft margin triplet constraints for each sample are generated in a way that k1 nearest neighbors of the same class be closer than the k2 nearest neighbor of the other classes.
The above optimization problem can be solved efficiently by taking into account the following Lagrangian dual problem. The benefit of the dual problem is that with a fixed Lagrangian multiplier, the problem has a closed form solution. Therefore, we just need to find the Lagrange multipliers. The authors use a Box projection operator to find the solution.
[math]\displaystyle{ min_{\gamma} g(\gamma)=-\sum_{ijk}\gamma_{ijk}+\sum_{b_l}\frac{1}{4\alpha_1}\Vert(K_{b_l})_+-K_{b_l}\Vert_F^2 }[/math]
[math]\displaystyle{ s.t. 1\geq \gamma_{ijk}\geq 0; \forall i,j,k }[/math]
The optimal condition for [math]\displaystyle{ M_{b_l} }[/math] is [math]\displaystyle{ M_{b_l}^*=\frac{1}{2\alpha_1}((K_{b_l}^*)_+-K_{b_l}^*) }[/math]. At each iteration, [math]\displaystyle{ \gamma^{i+1}=BoxProj(\gamma^i-\eta\nabla g(\gamma^i)) }[/math].
Experiments
The authors compare their method with a number of other methods, including LMNN and LMNN-MM. They argue that they have acquired superior results on a number of different datasets. However, their method has a lot of parameters and it is not clear how much parameter tuning is necessary. Nevertheless, they report that their method is better than the state of the art, according to McNemar's test.
Furthermore, they experimentally verify that the method does not overfit, even if the number of basis metrics is large. They argue that this is because of manifold regularization. The performance of CBLML gets worse when the number of basis metrics is large which provides further evidence that CBLML does indeed overfit the learning problems, where as PLML doesn't, demonstrating the utility of the manifold regularization.
The experimental results show that PLML outperforms the state of the art metric learning methods and it has a performance which is significantly better or equivalent to that of SVM with automatic kernel selection.
Conclusions
The authors proposed a method for local metric learning for KNN classifiers. In the method, for each training sample a different metric is learned. However, the metrics are a linear combination of basis metrics and the weights are learned unsupervised. Therefore, the risk of overfitting is low. Furthermore, the authors propose to use a regularization method to make the metric more smooth to avoid overfitting. Afterwards, the basis metrics are learned supervisedly, with triplet constraints. Both of the optimization problems are convex and are solved using some kinds of gradient projection methods. There are a lot of parameters for the algorithm, but the authors discuss the usefulness of the method in practice.
Discussion
The paper proposes local distance metric, which is a way to perform non-linear transformation based on Mahalanobis distance metric other than kernelization. It does not actually calculate the local Mahalanobis matrix for each data point, but represent the data points in the new space by a weighted combination of the anchor points. Even though this can avoid overfitting, the method still seems more easy to overfit than using a single metric. However, using the a weighted linear combination of anchor points of local metrics should provide more robust result without overlooking too much information,given that the anchor points are proply selected. But the number of anchor points is difficult to be determined, however this is a general problem whenever anchor points or landmark points method are used.