learning a Nonlinear Embedding by Preserving Class Neighborhood Structure

From statwiki
Jump to navigation Jump to search

Introduction

The paper <ref>Salakhutdinov, R., & Hinton, G. E. (2007). Learning a nonlinear embedding by preserving class neighbourhood structure. AI and Statistics.</ref> presented here describes a method to learn a nonlinear transformation from the input space to a low-dimensional feature space in which K-nearest neighbour classification performs well. As the performance of algorithms like K-nearest neighbours (KNN) that are based on computing distances, the main objective of the proposed algorithm is to learn a good similarity measure that can provide insight into how high-dimensional data is organized. The nonlinear transformation is learned by pre-training and fine-tuning a multilayer neural network. The authors also show how to enhance the performance of non-linear transformation further using unlabeled data. Experimental results on a widely used version of the MNIST handwritten digit recognition task show that proposed algorithm achieves a much lower error rate than SVM or standard backpropagation.

Background and Related Work

Learning a similarity measure (or distance metric) over the input space [math]\displaystyle{ {\mathbf X} }[/math] is an important task in machine learning, and is closely related to the feature extraction problem. A distance metric [math]\displaystyle{ \mathbf D }[/math] (e. g. Euclidean) measures the similarity between two input vectors [math]\displaystyle{ {\mathbf x}^a, {\mathbf x}^b \in {\mathbf X} }[/math] by computing [math]\displaystyle{ \mathbf D[{\mathbf f}(x^a|W),{\mathbf f}(x^b|W)] }[/math], where [math]\displaystyle{ {\mathbf f}(x|W) }[/math] represents the mapping function from input vector [math]\displaystyle{ {\mathbf X} }[/math] to feature space [math]\displaystyle{ {\mathbf Y} }[/math] parametrized by [math]\displaystyle{ {\mathbf W} }[/math]. Previous work studied this problem where [math]\displaystyle{ \mathbf D }[/math] is the Euclidean distance and [math]\displaystyle{ {\mathbf f} }[/math] is simple linear projection, i.e. [math]\displaystyle{ {\mathbf f}(x|W)=Wx }[/math]. For example Linear discriminant analysis (LDA) learns the matrix [math]\displaystyle{ W }[/math] that minimizes the within-class distances to between-class distances ratio.

Globerson and Roweis <ref> A. Globerson and S. T. Roweis. Metric learning by collapsing classes. In NIPS, 2005 </ref> proposed a method for learning the matrix [math]\displaystyle{ W }[/math] such that the input vectors from the same class are mapped to a tight cluster. Also, Weinberger et.al. <ref> K. Q.Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS, 2005. </ref> also learned [math]\displaystyle{ W }[/math] with the goals of both making the K-nearest neighbours belong to the same class and making examples from different classes be separated by a large margin. All these methods rely on linear transformation, which has a limited number of parameters and thus cannot model higher-order correlations between the original data dimensions.

Proposed Method

In this paper, authors show that a nonlinear transformation function, with many more parameters, enables us to discover low-dimensional representations of high-dimensional data that perform much better than existing linear methods provided the dataset is large enough to allow for the parameters to be estimated. Regarding the digit recognition application considered in the paper and adopting a probabilistic approach, one can learn the non-linear transformation by maximizing the log probability of the pairs that occur in the training set. The probability distribution over all possible pairs of images [math]\displaystyle{ \mathbf x^a, \mathbf x^b }[/math], is defined using the squared distances between their codes, [math]\displaystyle{ {\mathbf f}(x^a),{\mathbf f}(x^b) }[/math]:


[math]\displaystyle{ \mathbf p(x^a,x^b)= \frac{||f(x^a)-f(x^b)||^2}{\sum_{k\lt l} ||f(x^k)-f(x^l)||^2} }[/math]

This formulation is an attempt to model the structure in the pairings, not the structure in the individual images or the mutual information between the code vectors. An alternative approach used here is based on a recent discovery of an effective and unsupervised algorithm for training a multi-layer, non-linear ”encoder” network that transforms the input data vector [math]\displaystyle{ \mathbf X }[/math] into a low-dimensional feature representation [math]\displaystyle{ \mathbf f(x|W) }[/math] capturing a lot of the structure in the input data <ref> G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006. </ref>.

They proposed algorithm performs two steps: First, the recently discovered unsupervised algorithm is used in a pre-training stage, i.e.to initialize the parameter vector [math]\displaystyle{ W }[/math] that defines the mapping from input vectors to their low-dimensional representation. Next, the initial parameters are finetuned by performing gradient descent in the objective function defined by Neighbourhood Component Analysis (NCA) method <ref> J. Goldberger, S. T. Roweis, G. E. Hinton, and Ruslan Salakhutdinov. Neighbourhood components analysis. In NIPS, 2004 </ref>. The resultant learning algorithm is a non-linear transformation of the input space optimized to make KNN perform well in the low-dimensional feature space.

Neighborhood Component Analysis

Assuming a given set of N labeled training cases [math]\displaystyle{ (x^a,c^a),\ a=1, 2, 3, \ldots, N }[/math], where [math]\displaystyle{ x^a \in R^d }[/math], and [math]\displaystyle{ c^a \in \{1,2, \ldots, C\} }[/math]. For each training vector [math]\displaystyle{ \mathbf x^a }[/math], the probability that point [math]\displaystyle{ \mathbf a }[/math] selects one of its neighbours [math]\displaystyle{ \mathbf b }[/math] in the transformed feature space as, is defined as below:


[math]\displaystyle{ p_{ab}=\frac{exp(-d_{ab})}{\sum_{z \neq a} exp(-d_{az})} }[/math]

Assuming Euclidean distance metric we have:


[math]\displaystyle{ \mathbf d_{ab}=||f(x^a|W)-f(x^b|W) ||^2 }[/math]


If [math]\displaystyle{ \mathbf f(x|W)=Wx }[/math] is constrained to be a linear transformation, we get linear NCA. However, here authors define [math]\displaystyle{ \mathbf f(x|W) }[/math] using a multi-layer, nonlinear neural network, parametrized by the weight vector [math]\displaystyle{ W }[/math]. The probability that point [math]\displaystyle{ a }[/math] belongs to class [math]\displaystyle{ k }[/math] depends on the relative proximity of all other data points that belong to class [math]\displaystyle{ k }[/math], i.e.


[math]\displaystyle{ \mathbf p(c^a=k)=\sum_{b:c^b=k} p_{ab} }[/math]

The NCA goal is to maximize the expected number of correctly classified points on the training data:


[math]\displaystyle{ \mathbf O_{NCA}= \sum_{a=1}^{N} \sum_{b:c^a=c^b} p_{ab} }[/math]

In order to maximize the above-mentioned objective function, we need to compute its derivative with respect to vector [math]\displaystyle{ W }[/math] for the [math]\displaystyle{ a^th }[/math] training case as below


[math]\displaystyle{ \mathbf \frac{\partial O_{NCA}}{\partial W} = \mathbf \frac{\partial O_{NCA}}{\partial f(x^a|W)} \mathbf \frac{\partial f(x^a|W)}{\partial W} }[/math]

where


[math]\displaystyle{ \mathbf \frac{\partial O_{NCA}}{\partial f(x^a|W)}= -2 [\sum_{b:c^a=c^b} p_{ab}d_{ab} - \sum_{b:c^a=c^b} p_{ab} [\sum_{z \neq a} p_{az}d_{az}]] + 2[\sum_{l:c^l=c^a} p_{la}d_{la} - \sum_{l \neq a} p_{la}d_{la}[\sum_{q:c^l=c^q} p_{lq}] ] }[/math]

and [math]\displaystyle{ \mathbf \frac{\partial f(x^a|W)}{\partial W} }[/math] is computed using the standard backpropagation algorithm.

For a more detailed discussion on NCA click on the link below:

Neighbourhood Components Analysis

Pre-training step

Fine-tuning step

Regularized Nonlinear NCA

In many applications, a large supply of unlabeled data is readily available but the amount of labeled data (which typically requires expert knowledge to produce), is very limited. In order to take advantage of the information in the unlabeled data to enhance nonlinear NCA performance, the regularized NCA framework is proposed. Once the pretarining step for individual RBM's is each level of the network is done, one can replace the stochastic activities of the binary features by deterministic, real-valued probabilities. This allows for backpropagating through the entire network to fine-tune the weights for optimal reconstruction of the data.

Performing such training step does not require labeled data and produces low-dimensional codes that are good at reconstructing the input data vectors, and tend to preserve class neighbourhood structure. Accordingly, a new objective function [math]\displaystyle{ C }[/math] is formulated combining NCA and autoencoder objective functions:


[math]\displaystyle{ \mathbf C=\lambda O_{NCA}+(1-\lambda)(-E) }[/math]

Splitting codes into class-relevant and class-irrelevant parts

Experiments

Experimental results presented for the MNIST dataset containing 60000 training and 10000 test images of 28x28 handwritten digits, show the superior performance of the proposed algorithm (i.e. nonlinear NCA). An 28x28-500-500-2000-30 architecture is used for the multi-layer neural network.

Obtained results show that nonlinear NCA achieves minimum error rate of 1.01% using 7 nearest neighbours. This compares favorably to the best reported error rates (without using any domain-specific knowledge) of 1.6% for randomly initialized backpropagation and 1.4% for Support Vector Machines.

Figures are available in the original paper.

References

<references/>