Difference between revisions of "the Manifold Tangent Classifier"
(→Discussion) 
m (Conversion script moved page The Manifold Tangent Classifier to the Manifold Tangent Classifier: Converting page titles to lowercase) 
(No difference)

Latest revision as of 09:46, 30 August 2017
Contents
 1 Introduction
 2 Contractive AutoEncoders (CAE) and Tangent Classification
 3 Related Work
 4 Results
 5 Conclusion
 6 Discussion
 7 Bibliography
Introduction
The goal in many machine learning problems is to extract information from data with minimal prior knowledge<ref name = "main"> Rifai, S., Dauphin, Y. N., Vincent, P., Bengio, Y., & Muller, X. (2011). The manifold tangent classifier. In Advances in Neural Information Processing Systems (pp. 22942302). </ref> These algorithms are designed to work on numerous problems which they may not be specifically tailored towards, thus domainspecific knowledge is generally not incorporated into the models. However, some generic "prior" hypotheses are considered to aid in the general task of learning, and three very common ones are presented below:
 The semisupervised learning hypothesis: This states that knowledge of the input distribution [math]p\left(x\right)[/math] can aid in learning the output distribution [math]p\left(yx\right)[/math] .<ref>Lasserre, J., Bishop, C. M., & Minka, T. P. (2006, June). Principled hybrids of generative and discriminative models. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on (Vol. 1, pp. 8794). IEEE.</ref> This hypothesis lends credence to not only the theory of strict semisupervised learning, but also unsupervised pretraining as a method of feature extraction.<ref> Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural computation, 18(7), 15271554.</ref><ref>Erhan, D., Bengio, Y., Courville, A., Manzagol, P. A., Vincent, P., & Bengio, S. (2010). Why does unsupervised pretraining help deep learning?. The Journal of Machine Learning Research, 11, 625660.</ref>
 The unsupervised manifold hypothesis: This states that realworld data presented in highdimensional spaces is likely to concentrate around a lowdimensional submanifold.<ref>Cayton, L. (2005). Algorithms for manifold learning. Univ. of California at San Diego Tech. Rep, 117.</ref>
 The manifold hypothesis for classification: This states that points of different classes are likely to concentrate along different submanifolds, separated by lowdensity regions of the input space.<ref name = "main"></ref>
The recentlyproposed Contractive AutoEncoder (CAE) algorithm has shown success in the task of unsupervised feature extraction,<ref name = "CAE">Rifai, S., Vincent, P., Muller, X., Glorot, X., & Bengio, Y. (2011). Contractive autoencoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML11) (pp. 833840).</ref> with its successful application in pretraining of Deep Neural Networks (DNN) an illustration of the merits of adopting Hypothesis 1. CAE also yields a mostly contractive mapping that is locally only sensitive to a few input directions, which implies that it models a lowerdimensional manifold (exploiting Hypothesis 2) since the directions of sensitivity are in the tangent space of the manifold.
This paper furthers the previous work by using the information about the tangent spaces by considering Hypothesis 3: it extracts basis vectors for the local tangent space around each training point from the parameters of the CAE. Then, older supervised classification algorithms that exploit tangent directions as domainspecific prior knowledge can be used on the tangent spaces generated by CAE for finetuning the overall classification network. This approach seamlessly integrates all three of the above hypotheses and produces recordbreaking results (for 2011) on image classification.
Contractive AutoEncoders (CAE) and Tangent Classification
The problem is to find a nonlinear feature extractor for a dataset [math]\mathcal{D} = \{x_1, \ldots, x_n\}[/math], where [math]x_i \in \mathbb{R}^d[/math] are i.i.d. samples from an unknown distribution [math] p\left(x\right)[/math].
Traditional AutoEncoders
A traditional autoencoder learns an encoder function [math]h: \mathbb{R}^d \rightarrow \mathbb{R}^{d_h}[/math] along with a decoder function [math]g: \mathbb{R}^{d_h} \rightarrow \mathbb{R}[/math], represented as [math]r = g\left(h\left(x\right)\right) [/math]. [math]h\,[/math] maps input [math]x\,[/math] to the hidden input space, and [math]g\,[/math] reconstructs [math]x\,[/math]. When [math]L\left(x,g\left(h\left(x\right)\right)\right)[/math] denotes the average reconstruction error, the objective function being optimized to learn the parameters [math]\theta\,[/math] of the encoder/decoder is as follows:
 [math] \mathcal{J}_{AE}\left(\theta\right) = \sum_{x\in\mathcal{D}}L\left(x,g\left(h\left(x\right)\right)\right) [/math]
The form of the encoder is [math]h\left(x\right) = s\left(Wx + b_h\right)[/math], where [math]s\left(z\right) = \frac{1}{1 + e^{z}}[/math] is the elementwise logistic sigmoid. [math]W \in \mathbb{R}^{d_h \times d} [/math] and [math]b_h \in \mathbb{R}^{d_h}[/math] are the parameters (weight matrix and bias vector, respectively). The form of the decoder is [math]r = g\left(h\left(x\right)\right) = s_2\left(W^Th\left(x\right)+b_r\right)[/math], where [math]\,s_2 = s[/math] or the identity. The weight matrix [math]W^T\,[/math] is shared with the encoder, with the only new parameter being the bias vector [math]b_r \in \mathbb{R}^d[/math].
The loss function can either be the squared error [math]L\left(x,r\right) = \x  r\^2[/math] or the Bernoulli crossentropy, given by:
 [math] L\left(x, r\right) = \sum_{i=1}^d \left[x_i \mbox{log}\left(r_i\right) + \left(1  x_i\right)\mbox{log}\left(1  r_i\right)\right][/math]
First and HigherOrder Contractive AutoEncoders
Additional Penalty on Jacobian
The Contractive AutoEncoder (CAE), proposed by Rifai et al.<ref name = "CAE"></ref>, encourages robustness of [math]h\left(x\right)[/math] to small variations in [math]x[/math] by penalizing the Frobenius norm of the encoder's Jacobian [math]J\left(x\right) = \frac{\partial h}{\partial x}\left(x\right)[/math]. The new objective function to be minimized is:
 [math] \mathcal{J}_{CAE}\left(\theta\right) = \sum_{x\in\mathcal{D}}L\left(x,g\left(h\left(x\right)\right)\right) + \lambda\J\left(x\right)\_F^2 [/math]
where [math]\lambda[/math] is a nonnegative regularization parameter. We can compute the [math]j^{th}[/math] row of the Jacobian of the sigmoidal encoder quite easily using the [math]j^{th}[/math] row of [math]W[/math]:
 [math] J\left(x\right)_j = \frac{\partial h_j\left(x\right)}{\partial x} = h_j\left(x\right)\left(1  h_j\left(x\right)\right)W_j[/math]
Additional Penalty on Hessian
It is also possible to penalize higherorder derivatives by approximating the Hessian (explicit computation of the Hessian is costly). It is sufficient to penalize the difference between [math]J\left(x\right)[/math] and [math]J\left(x + \varepsilon\right)[/math] where [math]\,\varepsilon [/math] is small, as this represents the rate of change of the Jacobian. This yields the "CAE+H" variant, with objective function as follows:
 [math] \mathcal{J}_{CAE+H}\left(\theta\right) = \mathcal{J}_{CAE}\left(\theta\right) + \gamma\sum_{x \in \mathcal{D}}\mathbb{E}_{\varepsilon\sim\mathcal{N}\left(0,\sigma^2I\right)} \left[\J\left(x\right)  J\left(x + \varepsilon\right)\^2\right] [/math]
The expectation above, in practice, is taken over stochastic samples of the noise variable [math]\varepsilon\,[/math] at each stochastic gradient descent step. [math]\gamma\,[/math] is another regularization parameter. This formulation will be the one used within this paper.
Characterizing the Tangent Bundle Captured by a CAE
Although the regularization term encourages insensitivity of [math]h(x)[/math] in all input space directions, the pressure to form an accurate reconstruction counters this somewhat, and the result is that [math]h(x)[/math] is only sensitive to a few input directions necessary to distinguish closeby training points.<ref name = "CAE"></ref> Geometrically, the interpretation is that these directions span the local tangent space of the underlying manifold the characterizes the input data.
Geometric Terms
 Tangent Bundle: The tangent bundle of a smooth manifold is the manifold along with the set of tangent planes taken at all points in it.
 Chart: A local Euclidean coordinate system equipped to a tangent plane. Each tangent plane has its own chart.
 Atlas: A collection of local charts.
Conditions for Feature Mapping to Define an Atlas on a Manifold
To obtain a proper atlas of charts, [math]h[/math] must be a local diffeomorphism (locally smooth and invertible). Since the sigmoidal mapping is smooth, [math]\,h[/math] is guaranteed to be smooth. To determine injectivity of [math]h\,[/math], consider the following, [math]\forall x_i, x_j \in \mathcal{D}[/math]:
 [math] \begin{align} h(x_i) = h(x_j) &\Leftrightarrow s\left(Wx_i + b_h\right) = s\left(Wx_j + b_h\right) \\ & \Leftrightarrow Wx_i + b_h = Wx_j + b_h \mbox{, since } s \mbox{ is invertible} \\ & \Leftrightarrow W\Delta_{ij} = 0 \mbox{, where } \Delta_{ij} = x_i  x_j \end{align} [/math]
Thus, as long as [math]W\,[/math] forms a basis spanned by its rows [math]W_k\,[/math] such that [math]\forall i,j \,\,\exists \alpha \in \mathbb{R}^{d_h}  \Delta_{ij} = \sum_{k=1}^{d_h}\alpha_k W_k[/math], then the injectivity of [math]h\left(x\right)[/math] will be preserved (as this would imply [math]\Delta_{ij} = 0\,[/math] above). Furthermore, if we limit the domain of [math]\,h[/math] to [math]h\left(\mathcal{D}\right) \subset \left(0,1\right)^{d_h}[/math], containing only the values obtainable by [math]h\,[/math] applied to the training set [math]\mathcal{D}[/math], then [math]\,h[/math] is surjective by definition. Therefore, [math]\,h[/math] will be bijective between [math]h\,[/math] and [math]h\left(\mathcal{D}\right)[/math], meaning that [math]h\,[/math] will be a local diffeomorphism around each point in the training set.
Generating an Atlas from a Learned Feature Mapping
We now need to determine how to generate local charts around each [math]x \in \mathcal{D}[/math]. Since [math]h[/math] must be sensitive to changes between [math]x_i[/math] and one of its neighbours [math]x_j[/math], but insensitive to other changes, we expect this to be encoded in the spectrum of the Jacobian [math]J\left(x\right) = \frac{\partial h}{\partial x}\left(x\right)[/math]. Thus, we define a local chart around [math]x[/math] using the singular value decomposition of [math]\,J^T(x) = U(x)S(x)V^T(x)[/math]. The tangent plane [math]\mathcal{H}_x[/math] at [math]\,x[/math] is then given by the span of the set of principal singular vectors [math]\mathcal{B}_x[/math], as long as the associated singular value is above a given small [math]\varepsilon\,[/math]:
 [math]\mathcal{B}_x = \{U_{:,k}(x)  S_{k,k}(x) \gt \varepsilon\} \mbox{ and } \mathcal{H}_x = \{x + v  v \in \mbox{span}\left(\mathcal{B}_x\right)\} [/math]
where [math]U_{:,k}(x)\,[/math] is the [math]k^{th}[/math] column of [math]U\left(x\right)[/math].
Then, we can define an atlas [math]\mathcal{A}[/math] captured by [math]h\,[/math], based on the local linear approximation around each example:
 [math] \mathcal{A} = \{\left(\mathcal{M}_x, \phi_x\right)  x\in\mathcal{D}, \phi_x\left(\tilde{x}\right) = \mathcal{B}_x\left(x  \tilde{x}\right)\}[/math]
Exploiting Learned Directions for Classification
We would like to use the local charts defined above as additional information for the task of classification. In doing so, we will adopt the manifold hypothesis for classification.
CAEBased Tangent Distance
We start by defining the tangent distance between two points as the difference between their two respective hyperplanes [math]\mathcal{H}_x, \mathcal{H}_y[/math] defined above, where distance is defined as:
 [math] d\left(\mathcal{H}_x,\mathcal{H}_y\right) = \mbox{inf}\{\z  w\^2\,\,  \left(z,w\right) \in \mathcal{H}_x \times \mathcal{H}_y\}[/math]
Finding this distance is a convex problem which is solvable by solving a system of linear equations.<ref>Simard, P., LeCun, Y., & Denker, J. S. (1993). Efficient pattern recognition using a new transformation distance. In Advances in neural information processing systems (pp. 5058).</ref> Minimizing the distance in this way allows [math]x, y \in \mathcal{D}[/math] to move along their associated tangent spaces, and have the distance evaluated where [math]x[/math] and [math]y[/math] are closest. A nearestneighbour classifier could then be used based on this distance.
CAEBased Tangent Propagation
Nearestneighbour techniques work in theory, but are often impractical for largescale datasets. Classifying test points in this way grows linearly with the number of training points. Neural networks, however, can quickly classify test points once they are trained. We would like the output [math]o[/math] of the classifier to be insensitive to variations in the directions of the local chart around [math]x[/math]. To this end, we add the following penalty to the objective function of the (supervised) network:
 [math] \Omega\left(x\right) = \sum_{u \in \mathcal{B}_x} \left\left \frac{\partial o}{\partial x}\left(x\right) u \right\right^2 [/math]
The Manifold Tangent Classifier (MTC)
Finally, we are able to put all of the results together into a full algorithm for training a network. The steps follow below:
 Train (unsupervised) a stack of [math]K\,[/math] CAE+H layers as in section 2.2.2. Each layer is trained on the representation learned by the previous layer.
 For each [math]x_i \in \mathcal{D}[/math], compute the Jacobian of the last layer representation [math]J^{(K)}(x_i) = \frac{\partial h^{(K)}}{\partial x}\left(x_i\right)[/math] and its SVD. Note that [math]J^{(K)}\,[/math] is the product of the Jacobians of each encoder. Store the leading [math]d_M\,[/math] singular vectors in [math]\mathcal{B}_{x_i}[/math].
 After the [math]K\,[/math] CAE+H layers, add a sigmoidal output layer with a node for each class. Train the entire network for supervised classification, adding in the propagation penalty in 2.4.2. Note that for each [math]x_i, \mathcal{B}_{x_i}[/math] contains the set of tangent vectors to use.
Related Work
There are a number of existing nonlinear manifold learning algorithms (e.g. <ref>A Global Geometric Framework for Nonlinear Dimensionality Reduction Tenenbaum et al., Science (2000)</ref> that learn the tangent bundle for a set of training points (i.e. the main directions of variation around each point). One drawback of these existing approaches is that they are typically nonparametric and use local parameters to define the tangent plane around each datapoint. This potentially results in manifold learning algorithms that require training data that grows exponentially with manifold dimension and curvature.
The semisupervised embedding algorithm <ref>Deep learning via semisupervised embedding Weston et al., ICML (2008) </ref> is also related in that it encourages the hidden states of a network to be invariant with respect to changes to neighbouring datapoints in the training. The present work, however, initially aims for representations that are sensitive to such local variations, as explained above.
Results
Datasets Considered
The MTC was tested on the following datasets:
 MNIST: Set of 28 by 28 images of handwritten digits, and the goal is to predict the digit contained in the image.
 Reuters Corpus Volume I: Contains 800,000 realworld news stories. Used the 2000 most frequent words calculated on the whole dataset to create a bagofwords representation.
 CIFAR10: Dataset of 70,000 32 by 32 RGB realworld images.
 Forest Cover Type: Largescale database of cartographic variables for prediction of forest cover types.
Method
To investigate the improvements made by CAElearned tangents, the following method is employed: Optimal hyperparameters (e.g. [math]\gamma, \lambda\,,[/math] etc.) were selected by crossvalidation on a disjoint validation set disjoint from the training set. The quality of the features extracted by the CAE is evaluated by initializing a standard multilayer perceptron network with the same parameters as the trained CAE and finetuning it by backpropagation on the supervised task.
Visualization of Learned Tangents
Figure 1 visualizes the tangents learned by CAE. The example is on the left, and 8 tangents are shown to the right. On the MNIST dataset, the tangents are small geometric transformations. For CIFAR10, the tangents appear to be parts of the image. For Reuters, the tangents correspond to addition/removal of similar words, with the positive terms in green and the negative terms in red. We see that the tangents do not seem to change the class of the example (e.g. the tangents of the above "0" in MNIST all resemble zeroes).
MTC in SemiSupervised Setting
The MTC method was evaluated on the MNIST dataset in a semisupervised setting: the unsupervised feature extractor is trained on the full training set, and the supervised classifier is only trained on a restricted label set. The results with a single layer perceptron initialized with CAE+H pretraining (abbreviated CAE), and the same classifier with tangent propagation added (i.e. MTC) are in table 1. The performance is compared to other methods the do not consider the semisupervised learning hypothesis (Support Vector Machines (SVM), Neural Networks (NN), Convolutional Neural Networks (CNN)), and those methods perform poorly against MTC, especially when labeled data is low.
# Labeled  NN  SVM  CNN  CAE  MTC 
100  25.81  23.44  22.98  13.47  12.03 
600  11.44  8.85  7.68  6.3  5.13 
1000  10.7  7.77  6.45  4.77  3.64 
3000  6.04  4.21  3.35  3.22  2.57 
MTC in Full Classification Problems
We consider using MTC to classify using the full MNIST dataset (i.e. the fully supervised problem), and compare with other methods. The CAE used for tangent discovery is a twolayer deep network with 2000 units perlayer pretrained with the CAE+H objective. The MTC uses the same stack of CAEs trained with tangent propagation, using [math]d_M = 15\,[/math] tangents. The MTC produces stateoftheart results, achieving a 0.81% error on the test set (as opposed to the previous stateoftheart result of 0.95% error, achieved by Deep Boltzmann Machines). Table 2 summarizes this result. Note that MTC also beats out CNN, which utilizes prior knowledge about vision using convolutions and pooling.
KNN  NN  SVM  CAE  DBM  CNN  MTC 
3.09%  1.60%  1.40%  1.04%  0.95%  0.95%  0.81% 
A 4layer MTC was trained on the Forest CoverType dataset. The MTC produces the best performance on this classification task, beating out the previous best method which used a mixture of nonlinear SVMs (denoted as distributed SVM).
SVM  Distributed SVM  MTC 
4.11%  3.46%  3.13% 
Conclusion
This paper unifies three common generic prior hypotheses in a unified manner. It uses a semisupervised manifold approach to examine local charts around training points in the data, and then uses the tangents generated by these local charts to compare different classes. The tangents that are generated seem to be a meaningful decompositions of the training examples. When combining the tangents with the classifier, stateoftheart results are obtained on classification problems in a variety of domains.
Discussion
 I thought about how it could be possible to use an elementwise rectified linear unit [math]R\left(x\right) = \mbox{max}\left(0,x\right)[/math] in place of the sigmoidal function for encoding, as this type of functional form has seen success in other deep learning methods. However, I believe that this type of functional form would preclude [math]h[/math] from being diffeomorphic, as the [math]x[/math]values that are negative could not possibly be reconstructed. Thus, the sigmoidal form should likely be retained, although it would be interesting to see how other invertible nonlinearities would perform (e.g. hyperbolic tangent).
 It would be interesting to investigate applying the method of tangent extraction to other unsupervised methods, and then create a classifier based on these tangents in the same way that it is done in this paper. Further work could be done to apply this approach to clustering algorithms, kernel PCA, EM, etc. This is more of a suggestion than a concrete idea, however.
 It is not exactly clear to me how a [math]h[/math] could ever define a true diffeomorphism, since [math]h: \mathbb{R}^{d} \mapsto \mathbb{R}^{d_h}[/math], where [math]d \ne d_h[/math], in general. Clearly, if [math]d \gt d_h[/math], such a map could not be injective. However, they may be able to "manufacture" the injectivity of [math]h[/math] using the fact that [math]\mathcal{D}[/math] is a discrete set of points. I'm not sure that this approach defines a continuous manifold, but I'm also not sure if that really matters in this case.
Bibliography
<references />