learning a Nonlinear Embedding by Preserving Class Neighborhood Structure
Introduction
The paper <ref>Salakhutdinov, R., & Hinton, G. E. (2007). Learning a nonlinear embedding by preserving class neighbourhood structure. AI and Statistics.</ref> presented here describes a method to learn a nonlinear transformation from the input space to a low-dimensional feature space in which K-nearest neighbour classification performs well. As the performance of algorithms like K-nearest neighbours (KNN) that are based on computing distances, the main objective of the proposed algorithm is to learn a good similarity measure that can provide insight into how high-dimensional data is organized. The nonlinear transformation is learned by pre-training and fine-tuning a multilayer neural network. The authors also show how to enhance the performance of non-linear transformation further using unlabeled data. Experimental results on a widely used version of the MNIST handwritten digit recognition task show that proposed algorithm achieves a much lower error rate than SVM or standard backpropagation.
Background and Related Work
Learning a similarity measure (or distance metric) over the input space [math]\displaystyle{ {\mathbf X} }[/math] is an important task in machine learning, and is closely related to the feature extraction problem. A distance metric [math]\displaystyle{ \mathbf D }[/math] (e. g. Euclidean) measures the similarity between two input vectors [math]\displaystyle{ {\mathbf x}^a, {\mathbf x}^b \in {\mathbf X} }[/math] by computing [math]\displaystyle{ \mathbf D[{\mathbf f}(x^a|W),{\mathbf f}(x^b|W)] }[/math], where [math]\displaystyle{ {\mathbf f}(x|W) }[/math] represents the mapping function from input vector [math]\displaystyle{ {\mathbf X} }[/math] to feature space [math]\displaystyle{ {\mathbf Y} }[/math] parametrized by [math]\displaystyle{ {\mathbf W} }[/math]. Previous work studied this problem where [math]\displaystyle{ \mathbf D }[/math] is the Euclidean distance and [math]\displaystyle{ {\mathbf f} }[/math] is simple linear projection, i.e. [math]\displaystyle{ {\mathbf f}(x|W)=Wx }[/math]. For example Linear discriminant analysis (LDA) learns the matrix [math]\displaystyle{ W }[/math] that minimizes the within-class distances to between-class distances ratio.
Globerson and Roweis <ref> A. Globerson and S. T. Roweis. Metric learning by collapsing classes. In NIPS, 2005 </ref> proposed a method for learning the matrix [math]\displaystyle{ W }[/math] such that the input vectors from the same class are mapped to a tight cluster. Also, Weinberger et.al. <ref> K. Q.Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS, 2005. </ref> also learned [math]\displaystyle{ W }[/math] with the goals of both making the K-nearest neighbours belong to the same class and making examples from different classes be separated by a large margin. All these methods rely on linear transformation, which has a limited number of parameters and thus cannot model higher-order correlations between the original data dimensions.
Proposed Method
In this paper, authors show that a nonlinear transformation function, with many more parameters, enables us to discover low-dimensional representations of high-dimensional data that perform much better than existing linear methods provided the dataset is large enough to allow for the parameters to be estimated. Regarding the digit recognition application considered in the paper and adopting a probabilistic approach, authors learn the non-linear transformation by maximizing the log probability of the pairs that occur in the training set. The probability distribution over all possible pairs of images [math]\displaystyle{ \mathbf x^a, \mathbf x^b }[/math], is defined using the squared distances between their codes, [math]\displaystyle{ {\mathbf f}(x^a),{\mathbf f}(x^b) }[/math]:
This formulation is an attempt to model the structure in the pairings, not the structure in the individual images or the mutual information between the code vectors.
Neighborhood Component Analysis
Pre-training step
Fine-tuning step
Regularized Nonlinear NCA
Splitting codes into class-relevant and class-irrelevant parts
Experiments
References
<references/>