Deep Generative Stochastic Networks Trainable by Backprop

From statwiki
= Introduction =


The Deep Learning boom of recent years was initially spurred by research in unsupervised learning techniques. However, most of the major successes over the last few years have been based on supervised techniques. A drawback of the unsupervised methods is that their models involve expensive computations and intractable sums (for inference, learning, sampling, and the partition function). This paper puts forth a network that models a conditional distribution, <math>P(X|\bar{X})</math>, which can be seen as a local (usually unimodal) representation of <math>P(X)</math>, where <math>\bar{X}</math> is a corrupted example of the original data <math>X</math>. The Generative Stochastic Network (GSN) combines arbitrary latent variables <math>H</math> that serve as input for a Markov chain, built up in layers, that eventually produces a representation of the original data. Training the network does not need Gibbs sampling or partition functions; it is trained with backpropagation and all the tools that come with it.
 
In a Deep Boltzmann Machine (DBM), sampling from <math>P(x, h)</math> is estimated through approximate inference and sampling procedures. Obtaining a gradient involves intractable sums that must be calculated; there are ways around this, but the workarounds make strong assumptions. In essence, the sampling methods for these calculations are biased toward certain distribution types (i.e., those with a small number of modes). The GSN attempts to get around this.






= Generative Stochastic Network (GSN) =  
The paper describes the Generative Stochastic Network as a generalization of generative denoising autoencoders <ref>Bengio_2013c</ref>, in that estimates of the data distribution are based on noisy sampling. Rather than directly estimating the data distribution, the model parametrizes the transition operator of a Markov chain, that is, <math>P(x_t | x_{t-1})</math> or <math>P(x_t, h_t|x_{t-1}, h_{t-1})</math>, which contains only a small number of important modes. This is the change that transforms the problem into one more similar to a supervised training problem, avoiding the intractable gradient of a partition function and leveraging the strength of neural networks as function approximators.
 
The estimation of <math>P(X)</math> proceeds as follows: create <math>\bar{X}</math> from a corruption distribution <math>C(\bar{X}|X)</math>, where <math>C</math> adds some type of noise to the original data. The model is then trained to reconstruct <math>X</math> from <math>\bar{X}</math>, thereby learning <math>P(X|\bar{X})</math>. This is easier to model than the whole of <math>P(X)</math>, since <math>P(X|\bar{X})</math> is dominated by fewer modes than <math>P(X)</math>. Bayes' rule then gives <math>P(X|\bar{X}) = \frac{1}{z}C(\bar{X}|X)P(X)</math>, where <math>z</math> is a normalizing constant independent of <math>X</math>. This makes it possible to recover <math>P(X)</math> from the other two distributions. This construction is based on earlier work by Bengio et al. <ref>Bengio_Alain</ref>.
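As a concrete illustration of the corruption step, here is a minimal sketch (not from the paper) using additive Gaussian noise as <math>C(\bar{X}|X)</math> on toy one-dimensional data; the names <code>corrupt</code>, <code>X_bar</code>, and <code>pairs</code> are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, sigma=0.5):
    """Sample x_bar ~ C(x_bar | x); here C adds isotropic Gaussian noise."""
    return x + sigma * rng.normal(size=np.shape(x))

# Toy 1-D data with two well-separated modes: P(X) is multimodal,
# but around any particular x_bar, P(X | x_bar) is close to unimodal.
X = np.concatenate([rng.normal(-2.0, 0.3, 500), rng.normal(2.0, 0.3, 500)])
X_bar = corrupt(X)

# Training pairs for the reconstruction model: predict X from X_bar.
pairs = list(zip(X_bar, X))
```

Each pair <code>(x_bar, x)</code> is a supervised training example for the reconstruction distribution, which is what turns the unsupervised problem into something resembling supervised learning.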
 
Using a parameterized model (i.e., a neural network), the approximation made by the model, <math>P_{\theta}(X|\bar{X})</math>, can be used to approximate <math>P(X)</math>: the Markov chain distribution <math>\pi(X)</math> will eventually converge to <math>P(X)</math>. Figure 2 shows this process.
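To see why the chain converges, consider a toy case (not from the paper) where the reconstruction conditional is available in closed form: if <math>P(X)</math> is a standard normal and the corruption is Gaussian, then <math>P(x|\bar{x})</math> is also Gaussian, and alternating corruption and reconstruction is exactly a Gibbs sampler whose stationary marginal is <math>P(X)</math>. In a real GSN, <code>reconstruct</code> would be the learned network <math>P_{\theta}(X|\bar{X})</math>; the closed-form stand-in below is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 0.25  # corruption noise variance

def corrupt(x):
    """x_bar ~ C(x_bar | x) = N(x, sigma2)."""
    return x + np.sqrt(sigma2) * rng.normal()

def reconstruct(x_bar):
    """Exact posterior P(x | x_bar) when P(X) = N(0, 1):
    N(x_bar / (1 + sigma2), sigma2 / (1 + sigma2))."""
    mean = x_bar / (1.0 + sigma2)
    var = sigma2 / (1.0 + sigma2)
    return mean + np.sqrt(var) * rng.normal()

# Alternate corruption and reconstruction; the chain's stationary
# distribution over x is P(X) = N(0, 1), even starting far from it.
x, samples = 5.0, []
for t in range(20000):
    x = reconstruct(corrupt(x))
    if t > 1000:            # discard burn-in
        samples.append(x)

samples = np.array(samples)
```

After burn-in, the sample mean and variance match the data distribution <math>N(0, 1)</math>, which is the sense in which <math>\pi(X)</math> converges to <math>P(X)</math>.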
 
Where did the complexity go? Instead of representing the full, highly multimodal <math>P(X)</math> directly, the model only ever has to represent the simple, few-mode conditionals <math>P(X|\bar{X})</math>; the global complexity of <math>P(X)</math> is recovered implicitly through the many steps of the Markov chain.
 
Training the GSN involves walking along a Markov chain and using the transition distribution between states to update the weights of the GSN. The transition distribution <math>f(h, h', x)</math> is trained to maximize the reconstruction likelihood.
 
[[File:figure_1_bengio.png |thumb|upright=1|  Figure 1]]


[[File:figure_2_bengio.png |thumb|upright=1|  Figure 2]]




== MNIST ==


The non-linearity for the units in the GSN was <math display="block">h_i = \eta_{out} + \tanh (\eta_{in} + a_i),</math> where <math>a_i</math> is the linear activation of unit <math>i</math> and <math>\eta_{in}</math> and <math>\eta_{out}</math> are both zero-mean Gaussian noise. Unfinished or incomplete data can be sampled in a similar manner to a DBM, with representations propagating both upwards and downwards in the network. This allows for pattern completion similar to that achieved by a DBM. The third image in Figure 3 demonstrates the GSN's ability to start from only half an image (where the rest is noise) and complete the digit, showing that it has an internal representation of the digit that can be sampled from to complete the digit.
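The noisy activation can be sketched directly from the formula above; <code>sigma_in</code> and <code>sigma_out</code> are assumed noise scales, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

def noisy_tanh(a, sigma_in=0.1, sigma_out=0.1):
    """GSN unit: h_i = eta_out + tanh(eta_in + a_i), with zero-mean
    Gaussian noise injected both before (eta_in) and after (eta_out)
    the nonlinearity."""
    eta_in = sigma_in * rng.normal(size=np.shape(a))
    eta_out = sigma_out * rng.normal(size=np.shape(a))
    return eta_out + np.tanh(eta_in + a)

a = np.linspace(-3.0, 3.0, 7)
h = noisy_tanh(a)                    # stochastic activations
h_clean = noisy_tanh(a, 0.0, 0.0)    # with zero noise, reduces to tanh
```

Injecting noise both before and after the squashing function is what makes each layer a stochastic transition rather than a deterministic mapping, so running the network repeatedly yields samples instead of a point estimate.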




[[File:figure_3_bengio.png |thumb|centre|upright=2|  Figure 3]]


== Faces ==


[[File:figure_4_bengio.png |thumb | upright=2|left | Figure 4]]

Revision as of 21:31, 18 November 2015



In summary:
* Unsupervised learning is attractive because the quantity of unlabelled data far exceeds that of labelled data.
* The aim is to avoid the intractable sums or maximizations inherent in many unsupervised techniques.
* GSNs generalize denoising autoencoders.
* GSNs parametrize the transition operator of a Markov chain rather than <math>P(X)</math> itself. This allows unsupervised models to be trained by gradient descent and maximum likelihood with no partition function, just backprop.
* Graphical models require too much computation (inference, sampling, learning); MCMC can be used for estimation only if a few terms dominate the weighted sum being calculated.



= Experimental Results =

Initial experimental results were obtained without extensive parameter tuning. This was done to keep the tests consistent, and likely to show that even without optimization the results approach the performance of more established unsupervised learning networks. The main comparison was made against Deep Boltzmann Machines.



= Comparison =

= Critique =

The critique mentions Sum-Product Networks (SPNs).