Extracting and Composing Robust Features with Denoising Autoencoders
Introduction
This paper explores a new training principle for unsupervised learning of representations, based on the idea of making the learned representation robust to partial corruption of the input pattern. This approach can be used to train autoencoders, and the resulting denoising autoencoders can be stacked to initialize deep architectures. The algorithm can be motivated from a manifold learning perspective, an information-theoretic perspective, or a generative model perspective.
Motivation
The approach uses an unsupervised training criterion to perform a layer-by-layer initialization. The procedure is as follows: each layer is first trained to produce a higher-level (hidden) representation of the observed patterns, based on the representation it receives as input from the layer below, by optimizing a local unsupervised criterion. Each level produces a representation of the input pattern that is more abstract than the previous level’s, because it is obtained by composing more operations. This initialization yields a starting point from which a global fine-tuning of the model’s parameters is then performed using another training criterion appropriate for the task at hand.
This process gives better solutions than those obtained from random initialization.
The Denoising Autoencoder
A denoising autoencoder reconstructs a clean, “repaired” input from a corrupted, partially destroyed one. The initial input x is first corrupted by a stochastic mapping to give a partially destroyed version x˜. In this paper, the corruption consists of choosing a fixed number νd of the components of x at random (d being the input dimension and ν the desired proportion of destruction) and setting them to zero, leaving the rest untouched. The autoencoder maps x˜ to a hidden representation and from there to a reconstruction, which is compared with the clean input x rather than with x˜. Thus the objective function can be described as File:q1.png
The objective function minimized by stochastic gradient descent becomes: File:q3.png
where the loss function is the reconstruction cross-entropy of the model. The structure of the denoising autoencoder is illustrated in the accompanying figure.
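The corruption step and the reconstruction loss described in this section can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' implementation: the function names, the sigmoid encoder and decoder, and the parameter names `W`, `b`, `W_prime`, `b_prime` are assumptions introduced here for the example.

```python
import math
import random


def corrupt(x, nu, rng):
    """Return a copy of x with a fixed number round(nu * d) of components set to 0."""
    d = len(x)
    n_zero = round(nu * d)
    zeroed = set(rng.sample(range(d), n_zero))  # indices chosen at random
    return [0.0 if i in zeroed else xi for i, xi in enumerate(x)]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def affine_sigmoid(W, b, v):
    """Compute sigmoid(W v + b), with W stored as a list of rows."""
    return [sigmoid(sum(w * vi for w, vi in zip(row, v)) + bi)
            for row, bi in zip(W, b)]


def cross_entropy(x, z, eps=1e-12):
    """Reconstruction cross-entropy between the clean input x and reconstruction z."""
    total = 0.0
    for xi, zi in zip(x, z):
        zi = min(max(zi, eps), 1.0 - eps)  # keep log() finite
        total -= xi * math.log(zi) + (1.0 - xi) * math.log(1.0 - zi)
    return total


def denoising_loss(x, nu, W, b, W_prime, b_prime, rng):
    """Loss for one example: corrupt, encode, decode, compare with the clean x."""
    x_tilde = corrupt(x, nu, rng)
    y = affine_sigmoid(W, b, x_tilde)        # hidden representation (assumed sigmoid)
    z = affine_sigmoid(W_prime, b_prime, y)  # reconstruction (assumed sigmoid)
    return cross_entropy(x, z)               # target is x, not x_tilde
```

For example, `corrupt([1.0] * 8, 0.25, random.Random(0))` returns a list in which exactly two of the eight entries are zero. The key design point is in the last line of `denoising_loss`: the reconstruction is scored against the uncorrupted input.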
Layer-wise Initialization and Fine Tuning
When denoising autoencoders are stacked, the output of the k-th layer is used as input for the (k + 1)-th, and the (k + 1)-th layer is trained only after the k-th has been trained. After a few layers have been trained, their parameters are used as initialization for a network optimized with respect to a supervised training criterion. This greedy layer-wise procedure has been shown to yield significantly better local minima than random initialization of deep networks, achieving better generalization on a number of tasks.
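The greedy layer-wise procedure can be sketched as follows. This is a rough, dependency-free illustration under several assumptions that are not stated in the text above: tied weights (the decoder uses the transpose of the encoder matrix), sigmoid units, plain per-example stochastic gradient descent, and hypothetical names and hyperparameters (`train_layer`, `pretrain_stack`, `epochs`, `lr`). The supervised fine-tuning stage is omitted.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def encode(W, b, v):
    """Hidden representation sigmoid(W v + b), with W stored as a list of rows."""
    return [sigmoid(sum(w * vi for w, vi in zip(row, v)) + bi)
            for row, bi in zip(W, b)]


def train_layer(data, n_hidden, nu, rng, epochs=20, lr=0.1):
    """Train one tied-weight denoising autoencoder layer by SGD; return (W, b)."""
    d = len(data[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(n_hidden)]
    b = [0.0] * n_hidden
    b_prime = [0.0] * d
    n_zero = round(nu * d)
    for _ in range(epochs):
        for x in data:
            # Corrupt: zero a fixed number of randomly chosen components.
            zeroed = set(rng.sample(range(d), n_zero))
            x_t = [0.0 if i in zeroed else xi for i, xi in enumerate(x)]
            # Forward pass with tied weights (decoder matrix is W transposed).
            y = encode(W, b, x_t)
            z = [sigmoid(sum(W[j][i] * y[j] for j in range(n_hidden)) + b_prime[i])
                 for i in range(d)]
            # Cross-entropy gradients; the target is the clean x.
            dz = [zi - xi for zi, xi in zip(z, x)]
            dy = [sum(W[j][i] * dz[i] for i in range(d)) * y[j] * (1.0 - y[j])
                  for j in range(n_hidden)]
            for j in range(n_hidden):
                for i in range(d):
                    W[j][i] -= lr * (dy[j] * x_t[i] + y[j] * dz[i])
                b[j] -= lr * dy[j]
            for i in range(d):
                b_prime[i] -= lr * dz[i]
    return W, b


def pretrain_stack(data, layer_sizes, nu, rng):
    """Train layers one at a time; each layer's output feeds the next layer."""
    params = []
    inputs = data
    for n_hidden in layer_sizes:
        W, b = train_layer(inputs, n_hidden, nu, rng)
        params.append((W, b))
        # The next layer is trained on representations of uncorrupted inputs.
        inputs = [encode(W, b, v) for v in inputs]
    return params
```

A call such as `pretrain_stack(data, [4, 3], 0.25, random.Random(0))` returns one `(W, b)` pair per layer, which would then serve as the starting point for supervised fine-tuning of the whole network.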
Analysis of the Denoising Autoencoder
Manifold Learning Perspective
The process of mapping a corrupted example to an uncorrupted one can be visualized in Figure 2, with a low-dimensional manifold near which the data concentrate. We learn a stochastic operator p(X|X˜) that maps a corrupted X˜ back to a clean X, which generally lies on or near the manifold.
The denoising autoencoder can thus be seen as a way to define and learn a
manifold. The intermediate representation Y = f(X) can be interpreted as a
coordinate system for points on the manifold (this is most clear if we force the
dimension of Y to be smaller than the dimension of X). More generally, one can
think of Y = f(X) as a representation of X which is well suited to capture the
main variations in the data, i.e., on the manifold. When additional criteria (such
as sparsity) are introduced in the learning model, one can no longer directly view
Y = f(X) as an explicit low-dimensional coordinate system for points on the
manifold, but it retains the property of capturing the main factors of variation
in the data.
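One way to relate the stochastic operator above to the cross-entropy criterion used in training is sketched below. It assumes inputs with components in [0, 1] and writes z = g(f(X˜)) for the reconstruction of a corrupted input X˜; the decoder symbol g and the component index i are notation introduced here for illustration.

```latex
p(X \mid \tilde{X}) = \prod_{i=1}^{d} z_i^{X_i} \, (1 - z_i)^{1 - X_i},
\qquad z = g\bigl(f(\tilde{X})\bigr)

-\log p(X \mid \tilde{X}) = -\sum_{i=1}^{d} \bigl[ X_i \log z_i + (1 - X_i) \log (1 - z_i) \bigr]
```

Under this assumption, minimizing the reconstruction cross-entropy amounts to maximizing the log-likelihood of the clean input given its corrupted version.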