neighbourhood Components Analysis: Difference between revisions
Mirzazadeh (talk | contribs) |
Mirzazadeh (talk | contribs) |
||
Line 37: | Line 37: | ||
<math>g(A)=\sum_i \log(\sum_{j \in C_i} p_{ij})=\sum_i \log(p_i)</math> | <math>g(A)=\sum_i \log(\sum_{j \in C_i} p_{ij})=\sum_i \log(p_i)</math> | ||
with gradient: | |||
<math> \frac{\partial g}{\partial A} = 2A \sum_i \left( \sum_k p_{ik} x_{ik} x_{ik}^T - \frac{\sum_{j \in C_i} p_{ij} x_{ij} x_{ij}^T}{\sum_{j \in C_i} p_{ij} } \right)</math> | |||
The scale of <math>A</math> found by NCA, and relative row directions can provide an estimate to the optimal number of neighbours to use for KNN. In Stochastic NN, a larger scale A tends to consult average over fewer neighbours, and vice versa. | The scale of <math>A</math> found by NCA, and relative row directions can provide an estimate to the optimal number of neighbours to use for KNN. In Stochastic NN, a larger scale A tends to consult average over fewer neighbours, and vice versa. |
Revision as of 23:14, 5 July 2009
Introduction
Neighbourhood Components Analysis (NCA) is a method for learning a Mahalnobis distance measure for k-nearest neighbours (KNN). In particular, it finds a distance metric that maximises the leave one out (LOO) error on the training set for a stochastic variant of KNN. NCA can also learn a low-dimensional linear embedding of labelled data for data visualisation and for improved KNN classification speed. Unlike other methods, this classification model is non-parametric without any assumption on the shape of the class distributions or the boundaries between them.
k-Nearest Neighbours
k-Nearest neighbours is a simple classification technique that determines a test point's label by looking at the labels of the [math]\displaystyle{ k }[/math] training points that are nearest the test point. This is a surprisingly effective method that has a non-linear decision surface that is non-parametric, except for the parameter [math]\displaystyle{ k }[/math].
However, KNN suffers from two problems. First, it can be computationally expensive to classify points, as they must be compared to the entire training set. There is also the problem of determining which distance metric is to be used to define "nearest" points.
Stochastic Nearest Neighbours
NCA attacks the above two problems of KNN. It finds a quadratic distance metric that defines which points are nearest. It can restrict this distance metric to be low rank, reducing the dimensionality of the data and thus reducing storage and search times.
Starting with a labeled dataset [math]\displaystyle{ x_1, x_2, ..., x_n \in R^D }[/math] and corresponding class labels [math]\displaystyle{ \,\!c_1, c_2, ..., c_n }[/math], we are looking for a distance metric that maximizes the performance of nearest neighbor classification. As we do not knw the actual future data, optimizing Leave-one-out performance on the training data can be substituted.
NCA finds the matrix [math]\displaystyle{ A }[/math] where [math]\displaystyle{ Q=A^TA }[/math] and distance between two points is defined as: [math]\displaystyle{ d(x,y) = (x - y)^TQ(x-y) = (Ax - Ay)^T(Ax - Ay) }[/math]
The basic idea of NCA is to find the distance metric [math]\displaystyle{ A }[/math] that maximises the KNN classification on test data. As test data is not available during training, the method uses Leave One Out (LOO) performance. However, KNN has a discontinuous error function as points can suddenly jump class as they cross class boundaries. Instead, NCA optimises using stochastic neighbour assignment. Here, a test point is assigned a neighbour (and the neighbour's class) according to probability [math]\displaystyle{ p_{ij} }[/math]. This probability decreases as the distance between points increases: [math]\displaystyle{ p_{ij} = \frac{exp(-\parallel Ax_i - Ax_j \parallel ^2)}{\sum_{k \neq i}exp(-\parallel Ax_i - Ax_k \parallel ^2)} }[/math] , [math]\displaystyle{ \textstyle p_{ii} = 0 }[/math].
Here, for a single point every point in the training set is considered to be it's neighbour but with different probabilities and the probability of that point being the neighbour of itself is considered to be zero. And, in this sense it becomes a leave one out classification.
This stochastic selection rule helps us find the probability [math]\displaystyle{ p_i }[/math] that point [math]\displaystyle{ i }[/math] will be correctly classified:
[math]\displaystyle{ p_i = \sum_{j\in c_i} p_{ij} }[/math] where [math]\displaystyle{ c_i=\{j|c_i=c_j\} }[/math]
NCA maximises the expected number of correctly classified points according to this objective function:
[math]\displaystyle{ f(A) = \sum_i \sum_{j \in C_i} p_{ij} = \sum_i p_i }[/math]
The gradient for which is easily found:
[math]\displaystyle{ \frac{\partial F}{\partial A} = 2A \sum_i \left( p_i \sum_k p_{ik} x_{ik} x_{ik}^T - \sum_{j \in C_i} p_{ij} x_{ij} x_{ij}^T \right) }[/math]
NCA then finds A using a gradient-based optimiser. Note, however, that the cost function is not convex, so the optimisation must handle local minima. The authors comment that they never observed any effects of overfitting in this method, and that performance never decreased as the training set size increased.
It is possible to use a cost function based on KL-divergence, and the authors note that performance is very similar to that achieved through the above cost function. The cost function based on KL divergence is:
[math]\displaystyle{ g(A)=\sum_i \log(\sum_{j \in C_i} p_{ij})=\sum_i \log(p_i) }[/math]
with gradient:
[math]\displaystyle{ \frac{\partial g}{\partial A} = 2A \sum_i \left( \sum_k p_{ik} x_{ik} x_{ik}^T - \frac{\sum_{j \in C_i} p_{ij} x_{ij} x_{ij}^T}{\sum_{j \in C_i} p_{ij} } \right) }[/math]
The scale of [math]\displaystyle{ A }[/math] found by NCA, and relative row directions can provide an estimate to the optimal number of neighbours to use for KNN. In Stochastic NN, a larger scale A tends to consult average over fewer neighbours, and vice versa.
Low Rank Distance Metrics and Nonsquare Projection
NCA allows for linear dimensionality reduction of the input data. This provides computational advantages and is relatively immune to overfitting.
During the optimisation procedure in NCA, the matrix [math]\displaystyle{ A }[/math] can be restricted to size [math]\displaystyle{ d \times D }[/math], where [math]\displaystyle{ d \lt D }[/math]. This forces the learned metric to be low rank, and maps all inputs to [math]\displaystyle{ \mathbb{R}^d }[/math].
If [math]\displaystyle{ d \ll D }[/math], the storage and search times for KNN will be much lower. Storage can be reduced from [math]\displaystyle{ O(DN) }[/math] to [math]\displaystyle{ O(dN) + dD }[/math] by storing the training set in the lower-dimensional projection. If [math]\displaystyle{ d }[/math] is 2 or 3, then projected points can be easily plotted for data visualisation.
The NCA method remains the same as before, optimise the above cost function with respect to [math]\displaystyle{ A }[/math], however now [math]\displaystyle{ A }[/math] is non-square. Note that NCA itself does not provide suggestions on the optimal size of [math]\displaystyle{ d }[/math].
Experimental Results
The NCA method is evaluated using data from the UCI repository, as well as generated data. It is compared to other Mahalanobis measures for KNN, where it consistently performed better on training data, and test data performance was always at or near the top.
The method was also compared to PCA and LDA for dimensionality reduction. These are rather basic methods, and NCA produced better 2D projections of the data. Using these 2D projections for classification, performed better consistently for both training and test data.
No timing data was provided by the authors.
Figures are available in the original paper.
Extensions to Continuous Labels and Semi-Supervised Learning
NCA finds a linear transformation to maximise the classification performance on discrete classes. The method can be extended to find a non-linear transformation function [math]\displaystyle{ A }[/math] using the same gradient-based optimisation technique. However, the transformation function must be trainable by gradient methods, such as multilayer perceptrons.
It is also possible to adapt the method to support real-valued, continuous class labelling/targets. Using the above objective function and gradients, define the set [math]\displaystyle{ C_i }[/math] to be a set of correct matches, that is points with similar class labels to [math]\displaystyle{ i }[/math]. The objective function is a weighted sum of points, based on similarity of classes. This allows semi-supervised learning, as target classes need to be explicitly specified, just similarity.
Relationship to Other Methods
Here are other Mahalanobis distance metric learning methods for labelled data. Relevant component analysis (RCA) finds a distance metric, but assumes the classes have Gaussian distributions whereas NCA makes no assumption about class distribution. Xing et al's method finds a transformation that minimises the pairwise distances between points of the same class. This assumes that all points of the one class form a single, compact and connected set. Lowe's method is very similar to NCA, but further constrains the distance metric to be diagonal.
For low rank, linear transformation of data, LDA will be optimal if the classes are Gaussian with common covariance. This is not going to be true in general. LDA also has difficulties with mathematical instability when the training set is small, but the dimensionality of the training data is high. There are newer variants of LDA that improve this instability, and also improve robustness to outliers.