# Introduction

The paper <ref>Salakhutdinov, R., & Hinton, G. E. (2007). Learning a nonlinear embedding by preserving class neighbourhood structure. AI and Statistics.</ref> presented here describes a method to learn a nonlinear transformation from the input space to a low-dimensional feature space in which K-nearest neighbour classification performs well. As the performance of algorithms like K-nearest neighbours (KNN) that are based on computing distances, the main objective of the proposed algorithm is to learn a good similarity measure that can provide insight into how high-dimensional data is organized. The nonlinear transformation is learned by pre-training and fine-tuning a multilayer neural network. The authors also show how to further enhance the performance of non-linear transformation using unlabeled data. Experimental results on a widely used version of the MNIST handwritten digit recognition task show that proposed algorithm achieves a much lower error rate than SVM or standard backpropagation.

# Background and Related Work

Learning a similarity measure (or distance metric) over the input space ${\mathbf X}$ is an important task in machine learning, and is closely related to the feature extraction problem. A distance metric $\mathbf D$ (e. g. Euclidean) measures the similarity between two input vectors ${\mathbf x}^a, {\mathbf x}^b \in {\mathbf X}$ by computing $\mathbf D[{\mathbf f}(x^a|W),{\mathbf f}(x^b|W)]$, where ${\mathbf f}(x|W)$ represents the mapping function from input vector ${\mathbf X}$ to feature space ${\mathbf Y}$ parametrized by ${\mathbf W}$. Previous work studied this problem where $\mathbf D$ is the Euclidean distance and ${\mathbf f}$ is simple linear projection, i.e. ${\mathbf f}(x|W)=Wx$. For example Linear discriminant analysis (LDA) learns the matrix $W$ that minimizes the within-class distances to between-class distances ratio.

Globerson and Roweis <ref> A. Globerson and S. T. Roweis. Metric learning by collapsing classes. In NIPS, 2005 </ref> proposed a method for learning the matrix $W$ such that the input vectors from the same class are mapped to a tight cluster. Also, Weinberger et.al. <ref> K. Q.Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS, 2005. </ref> also learned $W$ with the goals of both making the K-nearest neighbours belong to the same class and making examples from different classes be separated by a large margin. All these methods rely on linear transformation, which has a limited number of parameters and thus cannot model higher-order correlations between the original data dimensions.

# Proposed Method

In this paper, authors show that a nonlinear transformation function, with many more parameters, enables us to discover low-dimensional representations of high-dimensional data that perform much better than existing linear methods provided the dataset is large enough to allow for the parameters to be estimated. Regarding the digit recognition application considered in the paper and adopting a probabilistic approach, one can learn the non-linear transformation by maximizing the log probability of the pairs that occur in the training set. The probability distribution over all possible pairs of images $\mathbf x^a, \mathbf x^b$, is defined using the squared distances between their codes, ${\mathbf f}(x^a),{\mathbf f}(x^b)$:

$\mathbf p(x^a,x^b)= \frac{||f(x^a)-f(x^b)||^2}{\sum_{k\lt l} ||f(x^k)-f(x^l)||^2}$

This formulation is an attempt to model the structure in the pairings, not the structure in the individual images or the mutual information between the code vectors. An alternative approach used here is based on a recent discovery of an effective and unsupervised algorithm for training a multi-layer, non-linear ”encoder” network that transforms the input data vector $\mathbf X$ into a low-dimensional feature representation $\mathbf f(x|W)$ capturing a lot of the structure in the input data <ref> G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006. </ref>.

They proposed algorithm performs two steps: First, the recently discovered unsupervised algorithm is used in a pre-training stage, i.e.to initialize the parameter vector $W$ that defines the mapping from input vectors to their low-dimensional representation. Next, the initial parameters are finetuned by performing gradient descent in the objective function defined by Neighbourhood Component Analysis (NCA) method <ref> J. Goldberger, S. T. Roweis, G. E. Hinton, and Ruslan Salakhutdinov. Neighbourhood components analysis. In NIPS, 2004 </ref>. The resultant learning algorithm is a non-linear transformation of the input space optimized to make KNN perform well in the low-dimensional feature space.

## Neighborhood Component Analysis

Assuming a given set of N labeled training cases $(x^a,c^a),\ a=1, 2, 3, \ldots, N$, where $x^a \in R^d$, and $c^a \in \{1,2, \ldots, C\}$. For each training vector $\mathbf x^a$, the probability that point $\mathbf a$ selects one of its neighbours $\mathbf b$ in the transformed feature space as, is defined as below:

$p_{ab}=\frac{exp(-d_{ab})}{\sum_{z \neq a} exp(-d_{az})}$

Assuming Euclidean distance metric we have:

$\mathbf d_{ab}=||f(x^a|W)-f(x^b|W) ||^2$

If $\mathbf f(x|W)=Wx$ is constrained to be a linear transformation, we get linear NCA. However, here authors define $\mathbf f(x|W)$ using a multi-layer, nonlinear neural network, parametrized by the weight vector $W$. The probability that point $a$ belongs to class $k$ depends on the relative proximity of all other data points that belong to class $k$, i.e.

$\mathbf p(c^a=k)=\sum_{b:c^b=k} p_{ab}$

The NCA goal is to maximize the expected number of correctly classified points on the training data:

$\mathbf O_{NCA}= \sum_{a=1}^{N} \sum_{b:c^a=c^b} p_{ab}$

In order to maximize the above-mentioned objective function, we need to compute its derivative with respect to vector $W$ for the $a^th$ training case as below

$\mathbf \frac{\partial O_{NCA}}{\partial W} = \mathbf \frac{\partial O_{NCA}}{\partial f(x^a|W)} \mathbf \frac{\partial f(x^a|W)}{\partial W}$

where

$\mathbf \frac{\partial O_{NCA}}{\partial f(x^a|W)}= -2 [\sum_{b:c^a=c^b} p_{ab}d_{ab} - \sum_{b:c^a=c^b} p_{ab} [\sum_{z \neq a} p_{az}d_{az}]] + 2[\sum_{l:c^l=c^a} p_{la}d_{la} - \sum_{l \neq a} p_{la}d_{la}[\sum_{q:c^l=c^q} p_{lq}] ]$

and $\mathbf \frac{\partial f(x^a|W)}{\partial W}$ is computed using the standard backpropagation algorithm.

For a more detailed discussion on NCA click on the link below:

## Pre-training step

The purpose of this step if to learn the (initial) weights for an adaptive, multi-layer, non-linear encoder network that transforms the input data vector $x$ into its low-dimensional feature representation $\mathbf f(x|W)$. Indeed, the pretraining step should find a good region from which to start the subsequent fine-tuning step.

### modeling binary data

MNIST dataset contains binary images modeled here using Restricted Boltzmann Machines (RBM) <ref> Y. Freund and D. Haussler. Unsupervised learning of distributions on binary vectors using two layer networks. In Advances in Neural Information Processing Systems 4, pages 912–919, San Mateo, CA., 1992. Morgan Kaufmann </ref>, which are a type of stochastic recurrent neural network.

The visible stochastic binary input vector $\mathbf x$ and hidden stochastic binary feature vector $\mathbf h$ are modeled by products of conditional Bernoulli distributions:

$\mathbf p(h_j=1|x)= \sigma(b_j+\sum_{i}W_{ij}x_i)$

$\mathbf p(x_i=1|h)= \sigma(b_i+\sum_{j}W_{ij}j_j)$

Where $\sigma(z) = \frac{1}{1+exp(-z)}$ is the logistic function and $\mathbf W_{ij}$ is a symmetric interaction term between input $\mathbf i$ and feature $\mathbf j$, and $b_i\ and \ b_j$ are biases.

The marginal distribution over visible vector $\mathbf x$ is then:

$\mathbf p(x)=\sum_{h} \frac{exp(-E(h,x))}{\sum_{u,g} exp(-E(u,g))}$

Where $\mathbf E(x,h)$ is an energy term defined as below:

$\mathbf E(x,h) = - \sum_{i} b_ix_i - \sum_{j} b_jh_j - \sum_{i,j} x_ih_jW_{ij}$

Finally, the parameter update term to perform gradient ascent in the log-likelihood is

$\mathbf \Delta W_{ij}=\epsilon \frac{\partial log p(x)}{\partial W_{ij}} = \epsilon (\lt x_i,h_j\gt _{data} - \lt x_i,h_j\gt _{model})$

With $\mathbf \epsilon$ being the learning rate, $\mathbf \lt .\gt _{data}$ denoting the the frequency with which input $\mathbf i$ and feature $\mathbf j$ are on together when the features are being driven by the observed data from the training set, and $\mathbf \lt .\gt _{model}$ being the expectation (corresponding to $\mathbf \lt .\gt _{data}$) with respect to the distribution defined by the model.

As computations for $\mathbf \lt .\gt _{model}$ are difficult, the 1-step divergence algorithm is used to circumvent this burden <ref> G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1711–1800, 2002</ref>:

$\mathbf \Delta W_{ij}= \epsilon (\lt x_i,h_j\gt _{data} - \lt x_i,h_j\gt _{recon})$

Where $\mathbf \lt .\gt _{recon}$ is the frequency with which input $\mathbf i$ and feature $\mathbf j$ are on together after stochastically activating features and reconstructing binary data.

### Greedy recursive pretraining

This greedy, layer-by-layer training can be repeated several times to learn a deep, hierarchical model in which each layer of features captures strong high-order correlations between the activities of features in the layer below:

1. Learn the parameters $\mathbf W^1$ of a Bernoulli or Gaussian model.

2. Freeze the parameters of the lower-level model and use the activation probabilities of the binary features, when they are being driven by training data, as the data for training the next layer of binary features.

3. Freeze the parameters $\mathbf W^2$ that define the second layer of features and use the activation probabilities of those features as data for training the third layer of features.

4. Proceed recursively for as many layers as desired.

## Fine-tuning step

For fine-tuning model parameters using the NCA objective function (introduced above) the method of conjugate gradients is used. To determine an adequate number of epochs and avoid overfitting, only a fraction of the training data is used for fine-tuning, and then performance is tested on the remaining validation data.

## Regularized Nonlinear NCA

In many applications, a large supply of unlabeled data is readily available but the amount of labeled data (which typically requires expert knowledge to produce), is very limited. In order to take advantage of the information in the unlabeled data to enhance nonlinear NCA performance, the regularized NCA framework is proposed. Once the pretarining step for individual RBM's is each level of the network is done, one can replace the stochastic activities of the binary features by deterministic, real-valued probabilities. This allows for backpropagating through the entire network to fine-tune the weights for optimal reconstruction of the data.

Performing such training step does not require labeled data and produces low-dimensional codes that are good at reconstructing the input data vectors, and tend to preserve class neighbourhood structure. Accordingly, a new objective function $C$ is formulated combining NCA and autoencoder objective functions:

$\mathbf C=\lambda O_{NCA}+(1-\lambda)(-E)$

Where $\mathbf E$ is the reconstruction error and $\mathbf \lambda$ is a trade-off parameter.

For example assuming a semi-supervised learning setting, consider having a set of $\mathbf N_l$ labeled training data $\mathbf (x^l,c^l)$, and a set of $\mathbf N_u$ unlabeled training data $\mathbf x^u$. The overall objective function would then be as below:

$\mathbf O=\lambda \frac{1}{N_l} \sum_{l=1}^{N_l} \sum_{b:c^l=c^b} p_{lb}+ (1-\lambda) \frac{1}{N} \sum_{n=1}^{N} -E^n$

Where $\mathbf N=N_l+N_u$ and $\mathbf E^n$ is the reconstruction error for the input $\mathbf x^n$.

### Splitting codes into class-relevant and class-irrelevant parts

Accurate reconstruction of a digit image requires the code to contain information about aspects of the image such as its orientation, slant, size and stroke thickness. These aspects are not necessarily relevant to digit class may inevitably contribute to the Euclidean distance between codes and thus harm classification. To eliminate such unwanted effect, a 50-dimensional code is used, but only the first 30 dimensions are used in the NCA objective function calculations. The remaining 20 dimensions were free to code all those aspects of an image that do not affect its class label but are important for reconstruction, and are therefore used to compute the reconstruction errors.

# Experiments

Experimental results presented for the MNIST dataset containing 60000 training and 10000 test images of 28x28 handwritten digits, show the superior performance of the proposed algorithm (i.e. nonlinear NCA). An 28x28-500-500-2000-30 architecture is used for the multi-layer neural network.

Obtained results show that nonlinear NCA achieves minimum error rate of 1.01% using 7 nearest neighbours. This compares favorably to the best reported error rates (without using any domain-specific knowledge) of 1.6% for randomly initialized backpropagation and 1.4% for Support Vector Machines.

Figures are available in the original paper.

<references/>