learning a Nonlinear Embedding by Preserving Class Neighborhood Structure
Introduction
The paper <ref>Salakhutdinov, R., & Hinton, G. E. (2007). Learning a nonlinear embedding by preserving class neighbourhood structure. AI and Statistics.</ref> presented here describes a method to learn a nonlinear transformation from the input space to a low-dimensional feature space in which K-nearest neighbour classification performs well. As the performance of algorithms like K-nearest neighbours (KNN) that are based on computing distances, the main objective of the proposed algorithm is to learn a good similarity measure that can provide insight into how high-dimensional data is organized. The nonlinear transformation is learned by pre-training and fine-tuning a multilayer neural network. The authors also show how to enhance the performance of non-linear transformation further using unlabeled data. Experimental results on a widely used version of the MNIST handwritten digit recognition task show that proposed algorithm achieves a much lower error rate than SVM or standard backpropagation.
Background and Related Work
Learning a similarity measure (or distance metric) over the input space [math]\displaystyle{ {\mathbf X} }[/math] is an important task in machine learning, and is closely related to the feature extraction problem. A distance metric [math]\displaystyle{ \mathbf D }[/math] (e. g. Euclidean) measures the similarity between two input vectors [math]\displaystyle{ {\mathbf x}^a, {\mathbf x}^b \in {\mathbf X} }[/math] by computing [math]\displaystyle{ \mathbf D[{\mathbf f}(x^a|W),{\mathbf f}(x^b|W)] }[/math], where [math]\displaystyle{ {\mathbf f}(x|W) }[/math] represents the mapping function from input vector [math]\displaystyle{ {\mathbf X} }[/math] to feature space [math]\displaystyle{ {\mathbf Y} }[/math] parametrized by [math]\displaystyle{ {\mathbf W} }[/math]. Previous work studied this problem where [math]\displaystyle{ \mathbf D }[/math] is the Euclidean distance and [math]\displaystyle{ {\mathbf f} }[/math] is simple linear projection, i.e. [math]\displaystyle{ {\mathbf f}(x|W)=Wx }[/math]. For example Linear discriminant analysis (LDA) learns the matrix [math]\displaystyle{ W }[/math] that minimizes the within-class distances to between-class distances ratio.
Globerson and Roweis <ref> A. Globerson and S. T. Roweis. Metric learning by collapsing classes. In NIPS, 2005 </ref> proposed a method for learning the matrix [math]\displaystyle{ W }[/math] such that the input vectors from the same class are mapped to a tight cluster. Also, Weinberger et.al. <ref> K. Q.Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS, 2005. </ref> also learned [math]\displaystyle{ W }[/math] with the goals of both making the K-nearest neighbours belong to the same class and making examples from different classes be separated by a large margin. All these methods rely on linear transformation, which has a limited number of parameters and thus cannot model higher-order correlations between the original data dimensions.
Neighborhood Component Analysis
Nonlinear NCA
Pre-training step
Fine-tuning step
Regularized Nonlinear NCA
Splitting codes into class-relevant and class-irrelevant parts
Experiments
References
<references/>