neighbourhood Components Analysis
Introduction
Neighbourhood Components Analysis (NCA) is a method for learning a Mahalnobis distance measure for k-nearest neighbours (KNN). In particular, it finds a distance metric that maximises the leave one out (LOO) error on the training set for a stochastic variant of KNN. NCA can also learn a low-dimensional linear embedding of labelled data for data visualisation and for improved KNN classification speed.
k-Nearest Neighbours
k-Nearest neighbours is a simple classification technique that determines a test point's label by looking at the labels of the [math]\displaystyle{ k }[/math] training points that are nearest the test point. This is a surprisingly effective method that has a non-linear decision surface that is non-parametric, except for the parameter [math]\displaystyle{ k }[/math].
However, KNN suffers from two problems. First, it can be computationally expensive to classify points, as they must be compared to the entire training set. There is also the problem of determining which distance metric to define "nearest" points.
NCA attacks the above two problems of KNN. It finds a distance metric that defines which points are nearest. It can restrict this distance metric to be low rank, reducing the dimensionality of the data and thus reducing storage and search times.
NCA finds the matrix [math]\displaystyle{ A }[/math] where [math]\displaystyle{ Q=A^TA }[/math] and distance between two points is defined as: [math]\displaystyle{ d(x,y) = (x - y)^TQ(x-y) = (Ax - Ay)^T(Ax - Ay) }[/math]
The basic idea of NCA is to find the distance metric [math]\displaystyle{ A }[/math] that maximises the KNN classification on test data. As test data is not available during training, the method uses Leave One Out (LOO) performance. However, KNN has a discontinuous error function as points can suddenly jump class as they cross class boundaries. Instead, NCA optimises using stochastic neighbour assignment. Here, a test point is assigned a neighbour (and the neighbour's class) according to probability [math]\displaystyle{ p_{ij} }[/math]. This probability decreases as the distance between points increases.
NCA maximises the expected number of correctly classified points according to this objective function:
[math]\displaystyle{ f(A) = \sum_i \sum_{j \in C_i} p_{ij} = \sum_i p_i }[/math]
The gradient for which is easily found:
[math]\displaystyle{ \frac{\partial F}{\partial A} = 2A \sum_i \left( p_i \sum_k p_{ik} x_{ik} x_{ik}^T - \sum_{j \in C_i} p_{ij} x_{ij} x_{ij}^T \right) }[/math]
NCA then finds A using a gradient-based optimiser. Note, however, that the cost function is not convex, so the optimisation must handle local minima. The authors comment that they never observed any effects of overfitting in this method, and that performance never decreased as the training set size increased.
It is possible to use a cost function based on KL-divergence, and the authors note that performance is very similar to that achieved through the above cost function.
The scale of [math]\displaystyle{ A }[/math] found by NCA, and relative row directions can provide an estimate to the optimal number of neighbours to use for KNN. In Stochastic NN, a larger scale A tends to consult average over fewer neighbours, and vice versa.
Low Rank Distance Metrics and Nonsquare Projection
NCA allows for linear dimensionality reduction of the input data. This provides computational advantages and is relatively immune to overfitting.
During the optimisation procedure in NCA, the matrix [math]\displaystyle{ A }[/math] can be restricted to size [math]\displaystyle{ d \times D }[/math], where [math]\displaystyle{ d \lt D }[/math]. This forces the learned metric to be low rank, and maps all inputs to [math]\displaystyle{ \mathbb{R}^d }[/math].
If [math]\displaystyle{ d \ll D }[/math], the storage and search times for KNN will be much lower. Storage can be reduced from [math]\displaystyle{ O(DN) }[/math] to [math]\displaystyle{ O(dN) + dD }[/math] by storing the training set in the lower-dimensional projection. If [math]\displaystyle{ d }[/math] is 2 or 3, then projected points can be easily plotted for data visualisation.
The NCA method remains the same as before, optimise the above cost function with respect to [math]\displaystyle{ A }[/math], however now [math]\displaystyle{ A }[/math] is non-square. Note that NCA itself does not provide suggestions on the optimal size of [math]\displaystyle{ d }[/math].