Neighbourhood Components Analysis


Introduction

Neighbourhood Components Analysis (NCA) is a method for learning a Mahalanobis distance measure for k-nearest neighbours (KNN). In particular, it finds the distance metric that maximises leave-one-out (LOO) classification accuracy on the training set for a stochastic variant of KNN. NCA can also learn a low-dimensional linear embedding of labelled data, which is useful for data visualisation and for improving KNN classification speed.

k-Nearest Neighbours

k-Nearest neighbours is a simple classification technique that determines a test point's label by a majority vote over the labels of the [math]\displaystyle{ k }[/math] training points nearest to it. The method is surprisingly effective: it produces a non-linear decision surface and is non-parametric apart from the choice of [math]\displaystyle{ k }[/math].
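To make this concrete, below is a minimal sketch of a KNN classifier using plain Euclidean distance. The function name knn_classify and the NumPy array conventions are illustrative assumptions, not part of this article.

import numpy as np

def knn_classify(X_train, y_train, x_test, k=3):
    # Squared Euclidean distance from the test point to every training point.
    dists = np.sum((X_train - x_test) ** 2, axis=1)
    # Indices of the k nearest training points.
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]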

However, KNN suffers from two problems. First, it can be computationally expensive to classify a point, since it must be compared against the entire training set. Second, it is unclear which distance metric should be used to define the "nearest" points.

NCA and Stochastic Nearest Neighbours

NCA addresses both of these problems. It learns the distance metric that defines which points are nearest, and it can restrict this metric to be low rank, reducing the dimensionality of the data and thereby reducing storage and search times.

NCA finds a matrix [math]\displaystyle{ A }[/math], where [math]\displaystyle{ Q=A^TA }[/math] and the distance between two points is defined as [math]\displaystyle{ d(x,y) = (x - y)^TQ(x-y) = (Ax - Ay)^T(Ax - Ay) }[/math]. When [math]\displaystyle{ A }[/math] is a nonsquare (rectangular) matrix with fewer rows than columns, [math]\displaystyle{ Q }[/math] is low rank and [math]\displaystyle{ Ax }[/math] is a low-dimensional projection of the data.
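As an illustration, the sketch below computes this distance and evaluates the stochastic LOO objective that NCA maximises, in which point [math]\displaystyle{ i }[/math] selects point [math]\displaystyle{ j }[/math] as its neighbour with softmax probability proportional to [math]\displaystyle{ \exp(-\|Ax_i - Ax_j\|^2) }[/math]. The function names and array layout are assumptions for this sketch, not part of the article.

import numpy as np

def nca_distance(A, x, y):
    # d(x, y) = (Ax - Ay)^T (Ax - Ay): squared Euclidean distance
    # in the space projected by A.
    diff = A @ (x - y)
    return diff @ diff

def nca_loo_objective(A, X, y):
    # Project all points: row i of Z is A x_i (X is n x d, A is r x d).
    Z = X @ A.T
    # Pairwise squared distances ||A x_i - A x_j||^2.
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    # A point never selects itself as its own neighbour.
    np.fill_diagonal(sq, np.inf)
    # Softmax neighbour-selection probabilities p_ij.
    P = np.exp(-sq)
    P /= P.sum(axis=1, keepdims=True)
    # p_i: probability that point i selects a neighbour with its own label,
    # i.e. the probability that it is classified correctly.
    same_label = (y[:, None] == y[None, :])
    p_i = (P * same_label).sum(axis=1)
    # NCA maximises the expected number of correctly classified points.
    return p_i.sum()

Optimising this objective with respect to [math]\displaystyle{ A }[/math] (e.g. by gradient ascent) yields the learned metric; choosing [math]\displaystyle{ A }[/math] with fewer rows than columns gives the low-rank case discussed below.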

Low Rank Distance Metrics and Nonsquare Projection

Experimental Results

Extensions to Continuous Labels and Semi-Supervised Learning

Relationship to Other Methods