statwiki: User contributions (Lruan), MediaWiki feed retrieved 2022-05-17

deep Sparse Rectifier Neural Networks (revision of 2015-12-16 by Lruan)
<hr />
<div>= Introduction =<br />
<br />
Machine learning scientists and computational neuroscientists approach neural networks differently. Machine learning scientists aim for models that are easy to train and generalize well, while neuroscientists aim to produce useful representations of scientific data. In other words, machine learning scientists care more about efficiency, while neuroscientists care more about the interpretability of the model.<br />
<br />
In this paper the authors show that two common gaps between computational neuroscience models and machine learning neural network models can be bridged by the rectifier activation function: one between deep networks learned with and without unsupervised pre-training, and one between the activation function and sparsity in neural networks.<br />
<br />
== Biological Plausibility and Sparsity ==<br />
<br />
In the brain, neurons rarely fire at the same time, a way to balance quality of representation against energy expenditure. This is in stark contrast to sigmoid neurons, which fire at half their maximum rate at zero input. A solution to this problem is the rectifier neuron, which does not fire at its zero value. The rectified linear unit is inspired by a common biological model of the neuron, the leaky integrate-and-fire (LIF) model, described by Dayan and Abbott<ref><br />
Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems<br />
</ref>. Its response function is illustrated in the figure below (middle).<br />
<br />
<gallery><br />
Image:sig_neuron.png|Sigmoid and TANH Neuron<br />
Image:lif_neuron.png|Leaky Integrate Fire Neuron<br />
Image:rect_neuron.png|Rectified Linear Neuron<br />
</gallery><br />
<br />
Given that the rectifier neuron maps a larger range of inputs to zero, its representation will naturally be sparser. In the paper, the two most salient advantages of sparsity are:<br />
<br />
- '''Information Disentangling''' As opposed to a dense representation, where every slight input change results in a considerable output change, the non-zero items of a sparse representation remain almost constant under slight input changes.<br />
<br />
- '''Variable Dimensionality''' A sparse representation can effectively choose how many dimensions to use to represent a variable, since it chooses how many non-zero elements contribute. The precision is therefore variable, allowing a more efficient representation of complex items.<br />
<br />
Further benefits of a sparse representation, and of rectified linear neurons in particular, are better linear separability (because the input is represented in a higher-dimensional space) and lower computational cost (most units are off, and for the on-units only a linear function has to be computed).<br />
<br />
However, it should also be noted that sparsity reduces the capacity of the model because each unit takes part in the representation of fewer values.<br />
<br />
== Advantages of rectified linear units ==<br />
<br />
The rectifier activation function <math>\,max(0, x)</math> allows a network to easily obtain sparse representations since only a subset of hidden units will have a non-zero activation value for some given input and this sparsity can be further increased through regularization methods. Therefore, the rectified linear activation function will utilize the advantages listed in the previous section for sparsity.<br />
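As a toy illustration (not from the paper), the following NumPy sketch passes a random input through a randomly initialized rectifier layer. Since roughly half of the pre-activations are negative, roughly half of the hidden units come out exactly zero, giving a sparse code.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # Rectifier activation: max(0, x), applied elementwise.
    return np.maximum(0.0, x)

# A toy hidden layer: 100 inputs -> 50 hidden units.
W = rng.normal(size=(50, 100))
b = np.zeros(50)
x = rng.normal(size=100)

h = relu(W @ x + b)

# With symmetric random weights, roughly half the pre-activations
# are negative, so roughly half the units are exactly zero.
sparsity = np.mean(h == 0.0)
print(f"fraction of inactive units: {sparsity:.2f}")
```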
<br />
For a given input, only a subset of hidden units in each layer will have non-zero activation values; the rest output zero and are essentially turned off. Due to the linearity of the rectifier, each hidden unit's activation is then a linear combination of the active (non-zero) hidden units in the previous layer. Repeating this through each layer, one can see that the neural network behaves as an exponential number of linear models that share parameters, since later layers reuse the values from earlier layers. Because computation along the active paths is linear, the gradient is easy to compute and travels back through the active nodes without the vanishing-gradient problem caused by the saturating sigmoid or tanh functions. <br />
<br />
The sparsity and the piecewise-linear structure can be seen in the figure from the paper:<br />
<br />
[[File:RLU.PNG]]<br />
<br />
Each layer is a linear combination of the previous layer.<br />
<br />
== Potential problems of rectified linear units ==<br />
<br />
The zero derivative below zero in rectified neurons blocks back-propagation of the gradient during learning. The authors investigated this effect by comparing against a smooth variant of the rectification non-linearity (the softplus activation). Surprisingly, the results suggest that the hard rectification performs better. The authors hypothesize that hard rectification is not a problem as long as the gradient can propagate along some paths through the network, and that the complete shut-off of the hard rectification sharpens the credit assignment to neurons during learning.<br />
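The contrast between the hard and the smooth rectification can be made concrete with a small NumPy sketch (toy values, not from the paper): the hard rectifier's gradient is exactly zero for negative inputs, while the softplus gradient, which is the logistic sigmoid, is positive everywhere.<br />

```python
import numpy as np

def relu_grad(x):
    # Hard rectifier max(0, x): gradient is 0 below zero, 1 above.
    return (x > 0).astype(float)

def softplus_grad(x):
    # Softplus log(1 + exp(x)) is a smooth rectifier; its gradient
    # is the sigmoid, which is nonzero everywhere.
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu_grad(x))      # [0. 0. 1. 1.]
print(softplus_grad(x))  # sigmoid values, all strictly positive
```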
<br />
Furthermore, the unbounded nature of the rectification non-linearity can lead to numerical instabilities if activations grow too large. To circumvent this, an <math>L_1</math> regularizer is used. Also, if a symmetric response is required, it can be obtained by using two rectifier units with shared parameters, though this requires twice as many hidden units as a network with a symmetric activation function.<br />
<br />
Finally, rectifier networks are subject to ill-conditioning of the parametrization: biases and weights can be scaled in different (yet consistent) ways across layers while preserving the same overall network function.<br />
<br />
The paper also addresses several difficulties that arise when using the rectifier activation in a stacked denoising auto-encoder. The authors experimented with several strategies to address these problems:<br />
<br />
1. Use a softplus activation function for the reconstruction layer, along with a quadratic cost: <math> L(x, \theta) = \|x - \log(1 + \exp(f(\tilde{x}, \theta)))\|^2</math><br />
<br />
2. Scale the rectifier activation values to lie between 0 and 1, then use a sigmoid activation function for the reconstruction layer, along with a cross-entropy reconstruction cost: <math> L(x, \theta) = -x \log(\sigma(f(\tilde{x}, \theta))) - (1-x) \log(1-\sigma(f(\tilde{x}, \theta))) </math><br />
<br />
The first strategy yielded better generalization on image data and the second on text data.<br />
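The two reconstruction costs above can be written down directly. The sketch below uses NumPy and toy values, with <code>f_out</code> standing in for the reconstruction-layer pre-activation <math>f(\tilde{x}, \theta)</math>.<br />

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def quadratic_softplus_loss(x, f_out):
    # Strategy 1: softplus reconstruction + quadratic cost.
    return np.sum((x - softplus(f_out)) ** 2)

def cross_entropy_sigmoid_loss(x, f_out):
    # Strategy 2: inputs scaled to [0, 1], sigmoid reconstruction
    # + cross-entropy cost.
    s = sigmoid(f_out)
    return -np.sum(x * np.log(s) + (1 - x) * np.log(1 - s))

x = np.array([0.2, 0.8, 0.5])       # toy (clean) input, in [0, 1]
f_out = np.array([-1.0, 1.0, 0.0])  # toy reconstruction pre-activations

print(quadratic_softplus_loss(x, f_out))
print(cross_entropy_sigmoid_loss(x, f_out))
```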
<br />
= Experiments =<br />
<br />
Networks with rectifier neurons were applied to the domains of image recognition and sentiment analysis. The image recognition datasets included black-and-white (MNIST, NISTP), colour (CIFAR10), and stereo (NORB) images.<br />
<br />
The datasets for sentiment analysis were taken from opentable.com and Amazon. In both, the task was to predict the star rating from the text of the review.<br />
<br />
== Results ==<br />
<br />
'''Results from image classification'''<br />
[[File:rectifier_res_1.png]]<br />
<br />
'''Results from sentiment classification'''<br />
[[File:rectifier_res_2.png]]<br />
<br />
For the image recognition tasks, the authors find almost no improvement from unsupervised pre-training with rectifier activations, contrary to what is observed with tanh or softplus; rectifier networks achieve their best performance even without unsupervised pre-training.<br />
<br />
In the NORB and sentiment analysis cases, the network benefited greatly from pre-training. However, the benefit in NORB diminished as the training set size grew.<br />
<br />
The result from the Amazon dataset was 78.95%, while the state of the art was 73.72%.<br />
<br />
The sparsity achieved with the rectified linear neurons helps to diminish the gap between networks with unsupervised pre-training and no pre-training.<br />
<br />
== Discussion / Criticism ==<br />
<br />
* Rectifier neurons aren't really biologically plausible, for a variety of reasons. Notably, neurons in the cortex do not have tuning curves resembling the rectifier. Additionally, the ideal sparsity of the rectifier networks was 50 to 80%, while the brain is estimated to have a sparsity of around 95 to 99%.<br />
<br />
* The sparsity encouraged by ReLU is a double-edged sword. While sparsity encourages information disentangling, efficient variable-size representation, linear separability, and increased robustness, as suggested by the authors of this paper, another work<ref>Szegedy, Christian, et al. "Going deeper with convolutions." arXiv preprint arXiv:1409.4842 (2014).</ref> argues that computation on sparse, non-uniform data structures is very inefficient: the overhead and cache misses make sparse data structures too computationally expensive to justify using.<br />
<br />
* ReLU does not suffer from the vanishing-gradient problem.<br />
<br />
* ReLU units can be prone to "dying": a unit may output the same value (zero) regardless of its input. This occurs when a large negative bias is learned, driving the ReLU output to zero for every input; the unit then stays stuck at zero because the gradient there is zero. Techniques such as Leaky ReLU and Maxout mitigate this problem.<br />
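A minimal NumPy sketch (toy weights and bias, not from the paper) illustrates the dying-ReLU effect and how a leaky variant avoids it: with a large negative bias, every pre-activation is negative, so the hard ReLU outputs zero everywhere while the leaky ReLU still produces small non-zero outputs through which gradient can flow.<br />

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU keeps a small slope for negative inputs, so the
    # unit still passes gradient even when its pre-activation is
    # negative for every input.
    return np.where(x > 0, x, alpha * x)

# A unit with a large negative bias: every pre-activation is negative.
w, b = np.array([0.5, -0.3]), -10.0
inputs = np.random.default_rng(1).normal(size=(100, 2))
pre = inputs @ w + b

dead = relu(pre)          # all zeros: the unit is "dead"
alive = leaky_relu(pre)   # small negative outputs, gradient survives

print(np.count_nonzero(dead), np.count_nonzero(alive))
```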
<br />
= Bibliography =<br />
<references /></div>

scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers Machines (revision of 2015-12-16 by Lruan)
<hr />
<div>= Introduction =<br />
<br />
This paper<ref><br />
Farabet, Clement, et al. [http://arxiv.org/pdf/1202.2160v2.pdf "Scene parsing with multiscale feature learning, purity trees, and optimal covers."] arXiv preprint arXiv:1202.2160 (2012).<br />
</ref> presents an approach to full scene labelling (FSL). This is the task of giving a label to each pixel in an image corresponding to which category of object it belongs to. FSL involves solving the problems of detection, segmentation, recognition, and contextual integration simultaneously. One of the main obstacles of FSL is that the information required for labelling a particular pixel could come from very distant pixels as well as their labels. This distance often depends on the particular label as well (e.g. the presence of a wheel might mean there is a vehicle nearby, while an object like the sky or water could span the entire image, and figuring out to which class a particular blue pixel belongs could be challenging).<br />
<br />
= Overview =<br />
<br />
The proposed method for FSL works by first computing a tree of segments from a graph of pixel dissimilarities. A set of dense feature vectors is then computed, encoding regions of multiple sizes centered on each pixel. Feature vectors are aggregated and fed to a classifier which estimates the distribution of object categories in a segment. A subset of tree nodes that cover the image are selected to maximize the average "purity" of the class distributions (i.e. maximizing the likelihood that each segment will contain a single object). The convolutional network feature extractor is trained end-to-end from raw pixels, so there is no need for engineered features.<br />
<br />
There are five main ingredients to this new method for FSL:<br />
<br />
# Trainable, dense, multi-scale feature extraction<br />
# Segmentation tree<br />
# Regionwise feature aggregation<br />
# Class histogram estimation<br />
# Optimal purity cover<br />
<br />
The three main contributions of this paper are:<br />
<br />
# Using a multi-scale convolutional net to learn good features for region classification<br />
# Using a class purity criterion to decide if a segment contains a single object, as opposed to several objects, or part of an object<br />
# An efficient procedure to obtain a cover that optimizes the overall class purity of a segmentation<br />
<br />
= Previous Work =<br />
<br />
Most previous methods of FSL rely on MRFs, CRFs, or other types of graphical models to ensure consistency in the labeling and to account for context. This is typically done using a pre-segmentation into super-pixels or other segment candidates. Features and categories are then extracted from individual segments and combinations of neighboring segments.<br />
<br />
Using trees allows the use of fast inference algorithms based on graph cuts or other methods. In this paper, an innovative method based on finding a set of tree nodes that cover the images while minimizing some criterion is used.<br />
<br />
= Model =<br />
<br />
This model relies on two complementary image representations. In the first representation, the image is seen as a point in a high-dimensional space, and we seek to find a transform <math>f: \mathbb{R}^P \rightarrow \mathbb{R}^Q</math> that maps these images into a space in which each pixel can be assigned a label using a simple linear classifier. In the second representation, the image is seen as an edge-weighted graph, on which a hierarchy of segmentations/clusterings can be constructed. This representation yields a natural abstraction of the original pixel grid, and provides a hierarchy of observation levels for all the objects in the image. The full model is shown in the diagram below. It is an end-to-end trainable model for scene parsing.<br />
<br />
[[File:SceneModelDiagram.png]]<br />
<br />
== Pre-processing ==<br />
<br />
Before being fed into the Convolutional Neural Network (CNN), multiple scaled versions of the image are generated. The set of these scaled images is called a ''pyramid''. Three scaled versions of the image were created, in a manner similar to that shown in the picture below.<br />
<br />
[[File:Image_pyramid.png ]]<br />
<br />
The scaling can be done with different transforms; the paper suggests using the Laplacian transform. The Laplacian is the sum of partial second derivatives, <math>\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}</math>. A two-dimensional discrete approximation is given by the matrix <math>\left[\begin{array}{ccc}0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0\end{array}\right]</math>.<br />
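Applying the discrete Laplacian above can be sketched in a few lines of NumPy (a naive valid-mode convolution, for illustration only). On a linear intensity ramp all second derivatives vanish, so the response is zero.<br />

```python
import numpy as np

# 2-D discrete Laplacian kernel from the text.
kernel = np.array([[0, 1, 0],
                   [1, -4, 1],
                   [0, 1, 0]], dtype=float)

def laplacian(img):
    # Naive 'valid' convolution of the (symmetric) kernel over the image.
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i+3, j:j+3] * kernel)
    return out

# On a linear ramp the Laplacian is zero (second derivatives vanish).
ramp = np.arange(25, dtype=float).reshape(5, 5)
print(laplacian(ramp))  # prints a 3x3 array of zeros
```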
<br />
This first step typically suffers from two main problems: (1) because window sizes differ, in some windows an object is not properly centered and scaled, offering a poor observation from which to predict the class of the underlying object; (2) integrating a large context involves increasing the dimensionality P of the input, which makes it necessary to enforce some invariance in the function f itself.<br />
<br />
== Network Architecture ==<br />
<br />
More holistic tasks, such as full-scene understanding (pixel-wise labeling, or any dense feature estimation) require the system to model complex interactions at the scale of complete images, not simply within a patch. In this problem the dimensionality becomes unmanageable: for a typical image of 256×256 pixels, a naive neural network would require millions of parameters, and a naive convolutional network would require filters that are unreasonably large to view enough context. The multiscale convolutional network overcomes these limitations by extending the concept of weight replication to the scale space. The more scales used to jointly train the models, the better the representation becomes for all scales. Using the same function to extract features at each scale is justified because the image content is scale invariant in principle. The authors noted that they observed worse performance when the weight sharing was removed.<br />
<br />
== Post-Processing ==<br />
<br />
In this model the sampling is done using an elastic max-pooling function, which remaps input patterns of arbitrary size into a fixed G×G grid (in this case a 5x5 grid was used). This grid can be seen as a highly invariant representation that encodes spatial relations between an object’s attributes/parts. This representation is denoted O<sub>k</sub> and is shown in the diagram below. With this encoding, elongated or ill-shaped objects are handled nicely. The dominant features are also used to represent the object, and when combined with background subtraction, these features form a good basis for recognizing the underlying object. These features are then associated with the corresponding areas of the tree segmentation of the image (generated by building a minimum spanning tree from the dissimilarity graph of neighboring pixels) for the optimal cover calculation.<br />
<br />
[[File:SceneGridFeatures.png]]<br />
<br />
One of the important features of this model is its method for computing an optimal cover, which is detailed in the diagram below. The leaf nodes represent pixels in the image, and a subset of tree nodes is selected whose aggregate children span the entire image. The nodes are selected to minimize the average "impurity" of the class distribution (i.e., the entropy). The cover attempts to find an overall consistent segmentation, where each selected node corresponds to a particular class labelling for itself and all of its unselected children.<br />
<br />
[[File:SceneOptimalCover.png]]<br />
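The purity criterion can be illustrated with a toy sketch (hypothetical class histograms, and a greedy parent-versus-children comparison rather than the paper's full optimal cover over the tree): a mixed parent segment has higher entropy than its nearly pure children, so the children are preferred in the cover.<br />

```python
import numpy as np

def entropy(p):
    # Impurity of a class histogram: Shannon entropy.
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Hypothetical class histograms for a parent segment and its two
# children in the segmentation tree.
parent = np.array([0.5, 0.5])     # mixed: two classes equally present
child_a = np.array([0.95, 0.05])  # nearly pure
child_b = np.array([0.1, 0.9])    # nearly pure

# Greedy rule: keep the parent only if it is at least as pure as the
# average of its children; otherwise split and cover the image with
# the children instead.
children_impurity = (entropy(child_a) + entropy(child_b)) / 2
if entropy(parent) <= children_impurity:
    cover = ["parent"]
else:
    cover = ["child_a", "child_b"]

print(cover)  # the children are purer, so they enter the cover
```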
<br />
<br />
== Training ==<br />
<br />
Training is done in a two-step process. First, the low-level feature extractor <math>f_s</math> is trained to produce features that are maximally discriminative. Then, the classifier <math>c</math> is trained to predict the distribution of classes in a component. The feature vectors are obtained by concatenating the network outputs for the different scales of the multiscale pyramid. The feature extractor is trained with the loss function<br />
<math>L_{\mathrm{cat}} = - \sum_{i \in \mathrm{pixels}, a \in \mathrm{classes}} c_{i,a} \ln(\hat{c}_{i,a})</math><br />
is used, where <math>c_i</math> is the true (classification) target vector and <math>\hat{c}_i</math> the prediction from a linear classifier (which is only used in this step and will be discarded later).<br />
<br />
After the feature-extraction parameters are trained, the parameters of the actual classifier are trained by minimizing the Kullback-Leibler (KL) divergence between the true distribution of labels in each component and the classifier's prediction. The KL divergence is a measure of the difference between two probability distributions.<br />
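The classifier's training criterion can be sketched directly in NumPy (toy histograms; the small <code>eps</code> is only there to avoid taking the log of zero and is an implementation detail, not from the paper):<br />

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between the true class histogram p of a segment
    # and the classifier's predicted histogram q.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)))

true_hist = np.array([0.7, 0.2, 0.1])  # toy label distribution
pred_hist = np.array([0.6, 0.3, 0.1])  # toy classifier output

print(kl_divergence(true_hist, pred_hist))  # small positive value
print(kl_divergence(true_hist, true_hist))  # zero for a perfect match
```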
<br />
= Experiments =<br />
<br />
For all experiments, a 2-stage convolutional network was used. The input is a 3-channel image, and it is transformed into a 16-dimensional feature map, using a bank of 16 7x7 filters followed by tanh units. This feature map is then pooled using a 2x2 max-pooling layer. The second layer transforms the 16-dimensional feature map into a 64-dimensional feature map, with each component being produced by a combination of 8 7x7 filters (for an effective total of 512 filters), followed by tanh units. This map is also pooled using a 2x2 max-pooling layer. This 64-dimensional feature map is transformed into a 256-dimensional feature map by using a combination of 16 7x7 filters (2048 filters).<br />
<br />
The network is applied to a locally normalized Laplacian pyramid constructed on the input image. The pyramid contains three rescaled versions of the input: 320x240, 160x120, and 80x60. All of the inputs are properly padded and the outputs of each of the three networks are upsampled and concatenated to produce a 768-dimensional feature vector map (256x3). The network is trained on all three scales in parallel.<br />
<br />
A simple grid search was used to find the best learning rate and regularization (weight decay) parameters. A holdout of 10% of the training data was used as a validation set during the parameter search. For both datasets, jitter was used to artificially expand the training data and to help prevent the features from overfitting irrelevant biases present in the data. This jitter included horizontal flipping and rotations between -8 and 8 degrees.<br />
<br />
The hierarchy used to find the optimal cover is constructed on the raw image gradient, based on a standard volume criterion<ref><br />
F. Meyer and L. Najman. [http://onlinelibrary.wiley.com/doi/10.1002/9781118600788.ch9/summary "Segmentation, minimum spanning tree and hierarchies."] In L. Najman and H. Talbot, editors, Mathematical Morphology: from theory to application, chapter 9, pages 229–261. ISTE-Wiley, London, 2010.<br />
</ref><ref><br />
J. Cousty and L. Najman. [http://link.springer.com/chapter/10.1007/978-3-642-21569-8_24 "Incremental algorithm for hierarchical minimum spanning forests and saliency of watershed cuts."] In 10th International Symposium on Mathematical Morphology (ISMM’11), LNCS, 2011.<br />
</ref>, completed by removing non-informative small components (less than 100 pixels). Traditionally, segmentation methods use a partition of segments (i.e. finding an optimal cut in the tree) rather than a cover. A number of graph-cut methods were tried, but the results were systematically worse than those of the optimal cover method.<br />
<br />
Two sampling methods for learning the multiscale features were tried on each dataset: one uses the natural frequencies of each class in the dataset, while the other balances them so that an equal number of examples of each class is shown to the network. The results from each method varied with the dataset and are reported in the tables below. The authors only included results for the frequency-balancing method on the Stanford Background dataset, as it consistently gave better results, though it would still be useful to have the results from the other method to guide future work. Training with balanced frequencies allows better discrimination of small objects; although it tends to give lower overall pixel-wise accuracy, it performs better from a recognition point of view, as can be seen in the tables below. The per-pixel accuracy for frequency balancing on the Barcelona dataset is quite poor, which the authors attribute to the fact that the dataset has a large number of classes with very few training examples, leading to overfitting when trying to model them in this manner.<br />
<br />
= Results =<br />
<br />
[[File:SceneResultTableStanford.png]]<br />
<br />
[[File:SceneResultTableSIFT.png]]<br />
<br />
[[File:SceneResultTableBarcelona.png]]<br />
<br />
[[File:SceneResultPictures.png]]<br />
<br />
=References=<br />
<references /></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=extracting_and_Composing_Robust_Features_with_Denoising_Autoencoders&diff=27311extracting and Composing Robust Features with Denoising Autoencoders2015-12-14T20:34:31Z<p>Lruan: /* Analysis of the Denoising Autoencoder */</p>
<hr />
<div>= Introduction =<br />
This paper explores a new training principle for unsupervised learning<br />
of representations, based on the idea of making the learned representations<br />
robust to partial corruption of the input pattern. This approach can<br />
be used to train autoencoders, and these denoising autoencoders can be<br />
stacked to initialize deep architectures. The algorithm can be motivated<br />
from a manifold learning and information theoretic perspective or from a<br />
generative model perspective.<br />
The proposed system is similar to a standard autoencoder, which is trained to learn a hidden representation that allows it to reconstruct its input. The difference between the two models is that the denoising autoencoder is trained to reconstruct the original input from a corrupted version, generated by adding random noise to the data; this forces it to extract more useful features.<br />
== Motivation ==<br />
<br />
The approach is based on the use of an unsupervised<br />
training criterion to perform a layer-by-layer initialization. The procedure is as follows:<br />
Each layer is at first trained to produce a higher level (hidden) representation of the observed patterns,<br />
based on the representation it receives as input from the layer below, by<br />
optimizing a local unsupervised criterion. Each level produces a representation<br />
of the input pattern that is more abstract than the previous level’s, because it<br />
is obtained by composing more operations. This initialization yields a starting<br />
point, from which a global fine-tuning of the model’s parameters is then performed<br />
using another training criterion appropriate for the task at hand.<br />
<br />
This process gives better solutions than those obtained by random initialization.<br />
<br />
= The Denoising Autoencoder =<br />
<br />
A Denoising Autoencoder reconstructs<br />
a clean “repaired” input from a corrupted, partially destroyed one. This<br />
is done by first corrupting the initial input <math>x</math> to get a partially destroyed version<br />
<math>\tilde{x}</math> by means of a stochastic mapping. This means<br />
that the autoencoder must learn to compute a representation<br />
that is informative of the original input even<br />
when some of its elements are missing. This technique<br />
was inspired by the ability of humans to have an appropriate<br />
understanding of their environment even in<br />
situations where the available information is incomplete<br />
(e.g. when looking at an object that is partly<br />
occluded). In this paper the noise is added by randomly zeroing a fixed number, <math>v_d</math>, of components and leaving the rest untouched.<br />
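The masking corruption described above can be sketched as follows. `corrupt` is a hypothetical helper name; it destroys a fixed number <math>v_d</math> of components, as in the paper.<br />

```python
import numpy as np

def corrupt(x, v_d, rng=None):
    """Return a copy of x with a fixed number v_d of randomly chosen
    components set to zero, leaving the rest untouched (masking noise)."""
    rng = rng or np.random.default_rng()
    x_tilde = np.array(x, dtype=float, copy=True)
    idx = rng.choice(x_tilde.size, size=v_d, replace=False)
    x_tilde[idx] = 0.0
    return x_tilde
```
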
<br />
Thus the objective function can be described as<br />
[[File:W1.png]]<br />
<br />
The objective function minimized by<br />
stochastic gradient descent becomes: <br />
[[File:W2.png]]<br />
<br />
where the loss function is the cross entropy between the input and its reconstruction.<br />
The denoising autoencoder is illustrated in the figure below.<br />
<br />
[[File:W3.png]]<br />
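As an illustrative sketch of this objective, assuming a sigmoid encoder and decoder with tied weights (a simplification chosen for this sketch): the hidden code is computed from the corrupted input, but the cross-entropy loss compares the reconstruction against the clean input.<br />

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_loss(x, x_tilde, W, b, c):
    """Cross-entropy reconstruction loss of a denoising autoencoder.

    Encoder y = sigmoid(W x_tilde + b), decoder z = sigmoid(W^T y + c);
    tied weights are an assumption of this sketch. The loss compares the
    reconstruction z (computed from the corrupted x_tilde) to the clean x.
    """
    y = sigmoid(W @ x_tilde + b)          # hidden representation
    z = sigmoid(W.T @ y + c)              # reconstruction
    eps = 1e-12                           # numerical safety for log
    return -np.sum(x * np.log(z + eps) + (1 - x) * np.log(1 - z + eps))
```
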
<br />
= Layer-wise Initialization and Fine Tuning =<br />
<br />
During training, the output of the k-th denoising autoencoder layer is used as<br />
input for the (k + 1)-th layer, and the (k + 1)-th layer is trained after the k-th has been<br />
trained. After a few layers have been trained, the parameters are used as initialization<br />
for a network optimized with respect to a supervised training criterion.<br />
This greedy layer-wise procedure has been shown to yield significantly better<br />
local minima than random initialization of deep networks,<br />
achieving better generalization on a number of tasks.<br />
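The greedy layer-wise procedure can be sketched as follows. This is a simplified illustration, not the paper's exact setup: it uses a squared-error loss with a linear decoder (rather than cross-entropy) so the gradients stay short, and tied encoder/decoder weights.<br />

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_dae_layer(X, n_hidden, noise=0.3, lr=0.01, epochs=10, seed=0):
    """Train one tied-weight denoising autoencoder layer with plain SGD.

    Simplified for illustration: squared-error loss and a linear decoder
    instead of the paper's cross-entropy. Returns (W, b) so the layer can
    encode inputs for the layer above.
    """
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W = rng.standard_normal((n_hidden, n_in)) * 0.01
    b = np.zeros(n_hidden)
    c = np.zeros(n_in)
    for _ in range(epochs):
        for x in X:
            x_t = x * (rng.random(n_in) > noise)   # masking corruption
            y = sigmoid(W @ x_t + b)               # hidden code
            z = W.T @ y + c                        # reconstruction of clean x
            g_z = 2.0 * (z - x)                    # d loss / d z
            g_a = (W @ g_z) * y * (1.0 - y)        # backprop through encoder
            W -= lr * (np.outer(g_a, x_t) + np.outer(y, g_z))
            b -= lr * g_a
            c -= lr * g_z
    return W, b

def stack_pretrain(X, layer_sizes):
    """Greedy layer-wise pretraining: train layer k, encode, feed layer k+1."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_dae_layer(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H @ W.T + b)   # representation passed to the next layer
    return params
```

The returned parameters would then initialize a supervised network that is fine-tuned end to end.<br />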
<br />
= Analysis of the Denoising Autoencoder =<br />
== Manifold Learning Perspective ==<br />
<br />
<br />
The process of mapping a corrupted example to an uncorrupted one can be<br />
visualized in Figure 2, with a low-dimensional manifold <math>\mathcal{M}</math> near which the data<br />
concentrate. We learn a stochastic operator <math>p(X|\tilde{X})</math> that maps an <math>\tilde{X}</math> to an <math>X\,</math>.<br />
<br />
<br />
[[File:q4.png]]<br />
<br />
Since the corrupted points <math>\tilde{X}</math> will likely not be on <math>\mathcal{M}</math>, the learned map <math>p(X|\tilde{X})</math> is able to determine how to transform points away from <math>\mathcal{M}</math> into points on <math>\mathcal{M}</math>.<br />
<br />
The denoising autoencoder can thus be seen as a way to define and learn a<br />
manifold. The intermediate representation <math>Y = f(X)</math> can be interpreted as a<br />
coordinate system for points on the manifold (this is most clear if we force the<br />
dimension of <math>Y</math> to be smaller than the dimension of <math>X</math>). More generally, one can<br />
think of <math>Y = f(X)</math> as a representation of <math>X</math> which is well suited to capture the<br />
main variations in the data, i.e., on the manifold. When additional criteria (such<br />
as sparsity) are introduced in the learning model, one can no longer directly view<br />
<math>Y = f(X)</math> as an explicit low-dimensional coordinate system for points on the<br />
manifold, but it retains the property of capturing the main factors of variation<br />
in the data.<br />
<br />
== Stochastic Operator Perspective ==<br />
<br />
The denoising autoencoder can also be seen as corresponding to a semi-parametric model that can be sampled from. Define the joint distribution as follows: <br />
<br />
:<math>p(X, \tilde{X}) = p(\tilde{X}) p(X|\tilde{X}) = q^0(\tilde{X}) p(X|\tilde{X}) </math> <br />
<br />
from the stochastic operator <math>p(X | \tilde{X})</math>, with <math>q^0\,</math> being the empirical distribution.<br />
<br />
Using the Kullback-Leibler divergence, defined as:<br />
<br />
:<math>\mathbb{D}_{KL}(p|q) = \mathbb{E}_{p(X)} \left(\log\frac{p(X)}{q(X)}\right) </math><br />
<br />
then minimizing <math>\mathbb{D}_{KL}(q^0(X, \tilde{X}) | p(X, \tilde{X})) </math> yields the originally-formulated denoising criterion. Furthermore, as this objective is minimized, the marginals of <math>\,p</math> approach those of <math>\,q^0</math>, i.e. <math> p(X) \rightarrow q^0(X)</math>. Then, if <math>\,p</math> is expanded in the following way:<br />
<br />
:<math> p(X) = \frac{1}{n}\sum_{i=1}^n \sum_{\tilde{\mathbf{x}}} p(X|\tilde{X} = \tilde{\mathbf{x}}) q_{\mathcal{D}}(\tilde{\mathbf{x}} | \mathbf{x}_i) </math><br />
<br />
it becomes clear that the denoising autoencoder learns a semi-parametric model that can be sampled from (since <math>p(X)</math> above is easy to sample from). <br />
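As a small concrete check of the Kullback-Leibler divergence used above, a direct computation for discrete distributions:<br />

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = E_p[log p/q] for discrete distributions given as
    lists of probabilities; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Note that the divergence is zero only when the distributions coincide, and it is not symmetric in its arguments.<br />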
<br />
== Information Theoretic Perspective ==<br />
<br />
It is also possible to adopt an information theoretic perspective. The representation of the autoencoder should retain as much information as possible while, at the same time, certain properties (like a limited complexity) are imposed on the marginal distribution. This can be expressed as an optimization of <math>\arg\max_{\theta} \{I(X;Y) + \lambda \mathcal{J}(Y)\}</math> where <math>I(X; Y)</math> is the mutual information between an input sample <math>X</math> and the hidden representation <math>Y</math> and <math>\mathcal{J}</math> is a functional expressing the preference over the marginal. The hyper-parameter <math>\lambda</math> controls the trade-off between maximizing the mutual information and keeping the marginal simple.<br />
<br />
Note that this reasoning also applies to the basic autoencoder, but the denoising autoencoder maximizes the mutual information between <math>X</math> and <math>Y</math> while <math>Y</math> can also be a function of corrupted input.<br />
<br />
== Generative Model Perspective ==<br />
<br />
This section recovers the training criterion of the denoising autoencoder from a generative-model view: the objective of the information theoretic perspective is equivalent to maximizing a variational bound on a particular generative model. The resulting training criterion is to maximize <math> \mathbb{E}_{q^0(\tilde{X})}[\mathcal{L}(q^0, \tilde{X})] </math>, where <math> \mathcal{L}(q^0, \tilde{X}) = \mathbb{E}_{q^0(X,Y | \tilde{X})}\left[\log\frac{p(X, \tilde{X}, Y)}{q^0(X, Y | \tilde{X})}\right] </math><br />
<br />
= Experiments =<br />
The input data contains different<br />
variations of the MNIST digit classification problem, with added factors of<br />
variation such as rotation (rot), addition of a background composed of random<br />
pixels (bg-rand) or made from patches extracted from a set of images (bg-img), or<br />
combinations of these factors (rot-bg-img). These variations render the problems particularly challenging for current generic learning algorithms. Each problem<br />
is divided into a training, validation, and test set (10000, 2000, 50000 examples<br />
respectively). A subset of the original MNIST problem is also included with the<br />
same example set sizes (problem basic). The benchmark also contains additional<br />
binary classification problems: discriminating between convex and non-convex<br />
shapes (convex), and between wide and long rectangles (rect, rect-img).<br />
Neural networks with 3 hidden layers initialized by stacking denoising autoencoders<br />
(SdA-3), and fine tuned on the classification tasks, were evaluated<br />
on all the problems in this benchmark. Model selection was conducted following<br />
a similar procedure to Larochelle et al. (2007). Several values of the<br />
hyperparameters (destruction fraction ν, layer sizes, number of unsupervised training<br />
epochs) were tried, combined with early stopping in the fine tuning phase. For<br />
each task, the best model was selected based on its classification performance<br />
on the validation set.<br />
The results are reported in the following table.<br />
[[File:W5.png]]<br />
<br />
The filters obtained by training are shown in the figure below.<br />
<br />
<br />
[[File:Qq3.png]]<br />
<br />
= Conclusion and Future Work =<br />
<br />
The paper shows a denoising Autoencoder which was motivated by the goal of<br />
learning representations of the input that are robust to small irrelevant changes<br />
in input. Several perspectives also help to motivate it from a manifold learning<br />
perspective and from the perspective of a generative model.<br />
This principle can be used to train and stack autoencoders to initialize a<br />
deep neural network. A series of image classification experiments were performed<br />
to evaluate this new training principle. The empirical results support<br />
the following conclusions: unsupervised initialization of layers with an explicit<br />
denoising criterion helps to capture interesting structure in the input distribution.<br />
This in turn leads to intermediate representations much better suited for<br />
subsequent learning tasks such as supervised classification. The experimental<br />
results with Deep Belief Networks (whose layers are initialized as RBMs) suggest<br />
that RBMs may also encapsulate a form of robustness in the representations<br />
they learn, possibly because of their stochastic nature, which introduces noise<br />
in the representation during training.<br />
<br />
= References =<br />
<br />
Bengio, Y. (2007). Learning deep architectures for AI (Technical Report 1312).<br />
Université de Montréal, dept. IRO.<br />
<br />
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layerwise<br />
training of deep networks. Advances in Neural Information Processing<br />
Systems 19 (pp. 153–160). MIT Press.<br />
<br />
Bengio, Y., & Le Cun, Y. (2007). Scaling learning algorithms towards AI. In<br />
L. Bottou, O. Chapelle, D. DeCoste and J. Weston (Eds.), Large scale kernel<br />
machines. MIT Press.<br />
<br />
Doi, E., Balcan, D. C., & Lewicki, M. S. (2006). A theoretical analysis of<br />
robust coding over noisy overcomplete channels. In Y. Weiss, B. Schölkopf<br />
and J. Platt (Eds.), Advances in neural information processing systems 18,<br />
307–314. Cambridge, MA: MIT Press.<br />
<br />
Doi, E., & Lewicki, M. S. (2007). A theory of retinal population coding. NIPS<br />
(pp. 353–360). MIT Press.<br />
<br />
Elad, M., & Aharon, M. (2006). Image denoising via sparse and redundant<br />
representations over learned dictionaries. IEEE Transactions on Image Processing,<br />
15, 3736–3745.<br />
<br />
Gallinari, P., LeCun, Y., Thiria, S., & Fogelman-Soulie, F. (1987). Memoires<br />
associatives distribuees. Proceedings of COGNITIVA 87. Paris, La Villette<br />
<br />
Hammond, D., & Simoncelli, E. (2007). A machine learning framework for adaptive<br />
combination of signal denoising methods. 2007 International Conference<br />
on Image Processing (pp. VI: 29–32).<br />
<br />
Hinton, G. (1989). Connectionist learning procedures. Artificial Intelligence,<br />
40, 185–234.<br />
<br />
Hinton, G., & Salakhutdinov, R. (2006). Reducing the dimensionality of data<br />
with neural networks. Science, 313, 504–507.<br />
<br />
Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for<br />
deep belief nets. Neural Computation, 18, 1527–1554.<br />
<br />
Hopfield, J. (1982). Neural networks and physical systems with emergent collective<br />
computational abilities. Proceedings of the National Academy of Sciences,<br />
USA, 79.<br />
<br />
Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007).<br />
An empirical evaluation of deep architectures on problems with many factors<br />
of variation. Twenty-fourth International Conference on Machine Learning<br />
(ICML’2007).<br />
<br />
LeCun, Y. (1987). Modèles connexionistes de l’apprentissage. Doctoral dissertation,<br />
Université de Paris VI.<br />
<br />
Lee, H., Ekanadham, C., & Ng, A. (2008). Sparse deep belief net model for visual<br />
area V2. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in<br />
neural information processing systems 20. Cambridge, MA: MIT Press.<br />
<br />
McClelland, J., Rumelhart, D., & the PDP Research Group (1986). Parallel<br />
distributed processing: Explorations in the microstructure of cognition, vol. 2.<br />
Cambridge: MIT Press.<br />
<br />
Memisevic, R. (2007). Non-linear latent factor models for revealing structure<br />
in high-dimensional data. Doctoral dissertation, Departement of Computer<br />
Science, University of Toronto, Toronto, Ontario, Canada.<br />
<br />
Ranzato, M., Boureau, Y.-L., & LeCun, Y. (2008). Sparse feature learning for<br />
deep belief networks. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.),<br />
Advances in neural information processing systems 20. Cambridge, MA: MIT<br />
Press.<br />
<br />
Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning<br />
of sparse representations with an energy-based model. Advances in Neural<br />
Information Processing Systems (NIPS 2006). MIT Press.<br />
<br />
Roth, S., & Black, M. (2005). Fields of experts: a framework for learning image<br />
priors. IEEE Conference on Computer Vision and Pattern Recognition (pp.<br />
860–867).</div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=distributed_Representations_of_Words_and_Phrases_and_their_Compositionality&diff=27305distributed Representations of Words and Phrases and their Compositionality2015-12-14T19:24:50Z<p>Lruan: /* Other techniques for sentence representation */</p>
<hr />
<div>= Introduction =<br />
<br />
This paper<ref><br />
Mikolov, Tomas, et al. [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf "Distributed representations of words and phrases and their compositionality."] Advances in neural information processing systems. 2013.<br />
</ref> presents several extensions of the Skip-gram model introduced by Mikolov et al. <ref name=MiT> Mikolov, Tomas, ''et al'' [http://arxiv.org/pdf/1301.3781v3.pdf "Efficient Estimation of Word Representations in Vector Space"] in ICLR Workshop, (2013). </ref>. The Skip-gram model is an efficient method for learning high-quality vector representations of words from large amounts of unstructured text data. The word representations computed using this model are very interesting because the learned vectors explicitly encode many linguistic regularities and patterns. Somewhat surprisingly, many of these patterns can be represented as linear translations. For example, the result of a vector calculation vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector. The authors of this paper show that subsampling of frequent words during training results in a significant speedup and improves accuracy of the representations of less frequent words. In addition, a simplified variant of Noise Contrastive Estimation (NCE) <ref name=GuM><br />
Gutmann, Michael U, ''et al'' [http://www.cs.helsinki.fi/u/ahyvarin/papers/Gutmann12JMLR.pdf "Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics"] in The Journal of Machine Learning Research, (2012).<br />
</ref> for training the Skip-gram model is presented, which results in faster training and better vector representations for frequent words, compared to the more complex hierarchical softmax used in prior work <ref name=MiT></ref>. The paper also shows that a non-obvious degree of language understanding can be obtained by using basic mathematical operations on the word vector representations. For example, vec(“Russia”) + vec(“river”) is close to vec(“Volga River”), and vec(“Germany”) + vec(“capital”) is close to vec(“Berlin”).<br />
<br />
= The Skip-gram Model =<br />
<br />
The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document. More formally, given a sequence of training words <math>w_1, w_2,..., w_T</math> the objective of the Skip-gram model is to maximize the average log probability:<br />
<br />
<math><br />
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c\leq j\leq c} log(p(w_{t+j}|w_t))<br />
</math><br />
<br /><br />
<br /><br />
where <math>c</math> is the size of the training context (which can be a function of the center word <math>w_t</math>) and <math>p(w_{t+j}|w_t)</math> is defined using softmax function:<br />
<br />
<math><br />
p(w_O|w_I) = \frac{exp ({v'_{W_O}}^T v_{W_I})}{\sum_{w=1}^{W} exp ({v'_{W}}^T v_{W_I})}<br />
</math><br />
<br />
Here, <math>v_w</math> and <math>v'_w</math> are the “''input''” and “''output''” vector representations of <math>w</math>, and <math>W</math> is the number of words in the vocabulary.<br />
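A direct (non-approximated) implementation of this softmax can be sketched as follows, where `V_in` and `V_out` are hypothetical matrices holding one input vector <math>v_w</math> and one output vector <math>v'_w</math> per row of the vocabulary:<br />

```python
import numpy as np

def skipgram_softmax(V_in, V_out, w_i, w_o):
    """p(w_O | w_I) under the full softmax of the Skip-gram model.

    V_in, V_out: (W, d) "input" and "output" embedding matrices;
    w_i, w_o: integer word indices. Cost is O(W), which is why the
    paper resorts to hierarchical softmax or negative sampling.
    """
    scores = V_out @ V_in[w_i]        # v'_w^T v_{w_I} for every word w
    scores -= scores.max()            # numerical stability
    exp = np.exp(scores)
    return exp[w_o] / exp.sum()
```
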
<br />
== Hierarchical Softmax ==<br />
<br />
Hierarchical Softmax is a computationally efficient approximation of the full softmax <ref name=MoF><br />
Morin, Frederic, ''et al'' [http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf "Hierarchical probabilistic neural network language model"] in Proceedings of the international workshop on artificial intelligence and statistics, (2005).<br />
</ref>. Hierarchical softmax evaluates only about <math>log_2(W)</math> output nodes, instead of <math>W</math> nodes, in the neural network to obtain the probability distribution.<br />
<br />
The hierarchical softmax uses a binary tree representation of the output layer with the <math>W</math> words as its leaves and, for each node, explicitly represents the relative probabilities of its child nodes. These define a random walk that assigns probabilities to words.<br />
<br />
Let <math>n(w,j)</math> be the <math>j^{th}</math> node on the path from the root to <math>w</math>, and let <math>L(w)</math> be the length of this path, so <math>n(w,1) = root</math> and <math>n(w,L(w)) = w</math>. In addition, for any inner node <math>n</math>, let <math>ch(n)</math> be an arbitrary fixed child of <math>n</math> and let <math>[[x]]</math> be 1 if <math>x</math> is true and -1 otherwise. Then the hierarchical softmax defines <math>p(w_O|w_I )</math> as follows:<br />
<br />
<math><br />
p(w|w_I) = \prod_{j=1}^{L(w)-1} \sigma ([[n(w,j+1)=ch(n(w,j))]]{v'_{n(w,j)}}^T v_{W_I}) <br />
</math><br />
<br />
where<br />
<br />
<math><br />
\sigma (x)=\frac{1}{1+exp(-x)}<br />
</math><br />
<br />
In this paper, a binary Huffman tree is used as the structure for the hierarchical softmax because it assigns short codes to the frequent words which results in fast training. It has been observed before that grouping words together by their frequency works well as a very simple speedup technique for the neural network based language models <ref name=MiT></ref><ref name=MiT2><br />
Mikolov, Tomas, ''et al'' [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5947611 "Extensions of recurrent neural network language model."] in Acoustics, Speech and Signal Processing (ICASSP), (2011).<br />
</ref>.<br />
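The product of sigmoids along a tree path can be sketched as follows. Here `path` encodes the <math>[[\cdot]]</math> indicators as ±1 signs, a representation chosen for this sketch; `V_inner` holds one vector <math>v'_n</math> per inner node:<br />

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hs_probability(path, V_inner, v_in):
    """p(w | w_I) as a product of sigmoids along the tree path to leaf w.

    path: list of (inner_node_index, sign) pairs, sign = +1 if the path
    continues to the designated child ch(n), -1 otherwise.
    """
    p = 1.0
    for node, sign in path:
        p *= sigmoid(sign * (V_inner[node] @ v_in))
    return p
```

Because <math>\sigma(x) + \sigma(-x) = 1</math> at every inner node, the probabilities of all leaves sum to one.<br />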
<br />
== Negative Sampling==<br />
<br />
Noise Contrastive Estimation (NCE) is an alternative to the hierarchical softmax. NCE posits that a good model should be able to differentiate data from noise by means of logistic regression. While NCE can be shown to approximately maximize the log probability of the softmax, the Skip-gram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality. Negative sampling (NEG) is defined by the objective:<br />
<br />
<math><br />
log \sigma ({v'_{W_O}}^T v_{W_I})+\sum_{i=1}^{k} \mathbb{E}_{w_i\sim P_n(w)}[log \sigma ({-v'_{W_i}}^T v_{W_I})]<br />
</math><br />
<br />
The main difference between the Negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples. And while NCE approximately maximizes the log probability of the softmax, this property is not important for our application.<br /><br />
Both NCE and NEG have the noise distribution <math>P_n(w)</math> as a free parameter. The authors investigated a number of choices for <math>P_n(w)</math> and found that the unigram distribution <math>U(w)</math> raised to the 3/4 power (i.e., <math>U(w)^{3/4}/Z</math>) significantly outperformed the unigram and uniform distributions, for both NCE and NEG, on every task they tried, including language modeling.<br />
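Sampling negatives from <math>U(w)^{3/4}/Z</math> can be sketched as follows (`make_noise_sampler` is a hypothetical helper built from raw word counts):<br />

```python
import numpy as np

def make_noise_sampler(counts, power=0.75, seed=0):
    """Build a sampler drawing word indices from U(w)^{3/4} / Z,
    where U is the unigram distribution given by raw word counts."""
    p = np.asarray(counts, dtype=float) ** power
    p /= p.sum()                      # normalize (the Z in the formula)
    rng = np.random.default_rng(seed)
    return lambda k: rng.choice(len(p), size=k, p=p)

# e.g. a 3-word vocabulary with very skewed counts
sample_negatives = make_noise_sampler([100, 10, 1])
negatives = sample_negatives(5)       # k = 5 negative samples per positive
```

Raising the counts to the 3/4 power flattens the distribution: frequent words are still drawn more often, but less overwhelmingly than under the raw unigram distribution.<br />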
<br />
==Subsampling of Frequent Words==<br />
In very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g., “in”, “the”, and “a”). Such words usually provide less information about the surrounding words than rarer words do (e.g., “the” provides little information about the next word because it co-occurs with a huge number of words), and the representation of a frequent word is unlikely to change significantly after many iterations. <br />
<br />
To counter the imbalance between the rare and frequent words, a simple subsampling approach is used. Each word <math>w_i</math> in the training set is discarded with probability computed by the formula:<br />
<br />
<math><br />
P(w_i)=1-\sqrt{\frac{t}{f(w_i)}}<br />
</math><br />
<br />
where <math>f(w_i)</math> is the frequency of word <math>w_i</math> and <math>t</math> is a chosen threshold, typically around <math>10^{−5}</math>.<br />
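With the threshold <math>t</math> defined above, the discard probability <math>P(w_i) = 1 - \sqrt{t/f(w_i)}</math> is a one-liner; clipping at zero (so words with <math>f \le t</math> are always kept) is an implementation convenience of this sketch:<br />

```python
import math

def discard_probability(freq, t=1e-5):
    """P(w_i) = 1 - sqrt(t / f(w_i)): probability of discarding word w_i.

    Very frequent words (f >> t) are dropped aggressively; the formula
    goes negative for f < t, so we clip at 0 (word always kept)."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

# e.g. a word with frequency 1e-2 is discarded about 97% of the time
```
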
<br />
= Empirical Results=<br />
<br />
The Hierarchical Softmax (HS), Noise Contrastive Estimation, Negative Sampling, and subsampling of the training words are evaluated with the help of the analogical reasoning task <ref name=MiT></ref>. The task consists of analogies such as “Germany” : “Berlin” :: “France” : ?, which are solved by finding a vector ''x'' such that vec(''x'') is closest to vec(“Berlin”) - vec(“Germany”) + vec(“France”) according to the cosine distance. This specific example is considered to have been answered correctly if ''x'' is “Paris”. The task has two broad categories: the syntactic analogies (such as “quick” : “quickly” :: “slow” : “slowly”) and the semantic analogies, such as the country to capital city relationship.<br />
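The analogy evaluation can be sketched as a nearest-neighbour search under cosine similarity (toy 2-d vectors here; the real task uses the trained 300-dimensional embeddings, and the three query words are excluded from the candidate answers):<br />

```python
import numpy as np

def analogy(vecs, a, b, c):
    """Solve a : b :: c : ? by the nearest cosine neighbour of
    vec(b) - vec(a) + vec(c), excluding the three query words."""
    target = vecs[b] - vecs[a] + vecs[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for w, v in vecs.items():
        if w in (a, b, c):
            continue
        sim = (v / np.linalg.norm(v)) @ target   # cosine similarity
        if sim > best_sim:
            best, best_sim = w, sim
    return best
```
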
<br />
For training the Skip-gram models, a large dataset consisting of various news articles is used (an internal Google dataset with one billion words). All words that occurred less than 5 times in the training data were discarded, which resulted in a vocabulary of size 692K. The performance of various Skip-gram models on the word analogy test set is reported in Table 1. The table shows that Negative Sampling outperforms the Hierarchical Softmax on the analogical reasoning task, and has even slightly better performance than the Noise Contrastive Estimation. The subsampling of the frequent words improves the training speed several times and makes the word representations significantly more accurate.<br />
<br />
<center><br />
[[File:Tb_1.PNG | frame | center |Table 1. Accuracy of various Skip-gram 300-dimensional models on the analogical reasoning task as defined in <ref name=MiT></ref>. NEG-''k'' stands for Negative Sampling with ''k'' negative samples for each positive sample; NCE stands for Noise Contrastive Estimation and HS-Huffman stands for the Hierarchical Softmax with the frequency-based Huffman codes. ]]<br />
</center><br />
<br />
=Learning Phrases=<br />
<br />
Many phrases have a meaning that is not a simple composition of the meanings of their individual words. To learn vector representations for phrases, we first find words that appear frequently together and infrequently in other contexts. For example, “''New York Times''” and “''Toronto Maple Leafs''” are replaced by unique tokens in the training data, while a bigram “''this is''” remains unchanged. This way, many reasonable phrases can be formed without greatly increasing the size of the vocabulary; in theory, the Skip-gram model could be trained using all n-grams, but that would be too memory intensive. A simple data-driven approach, in which phrases are formed based on unigram and bigram counts, is applied to identify the phrases. In this approach, a ''score'' is calculated as:<br />
<br />
<math><br />
score(w_i,w_j)=\frac{count(w_iw_j)-\delta}{count(w_i)count(w_j)}<br />
</math><br />
<br />
The <math>\delta</math> is used as a discounting coefficient and prevents too many phrases consisting of very infrequent words from being formed. The bigrams with ''scores'' above the chosen threshold are then used as phrases. The quality of the phrase representations is evaluated using a new analogical reasoning task that involves phrases. Table 2 shows examples of the five categories of analogies used in this task.<br />
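The phrase score is a one-line computation from the unigram and bigram counts:<br />

```python
def phrase_score(count_bigram, count_a, count_b, delta):
    """score(w_i, w_j) = (count(w_i w_j) - delta) / (count(w_i) * count(w_j)).

    delta discounts bigrams built from very infrequent words; bigrams
    scoring above a chosen threshold are merged into a single token."""
    return (count_bigram - delta) / (count_a * count_b)
```

A tight collocation (two mid-frequency words that almost always co-occur) scores far higher than a chance co-occurrence of two very frequent words, which is exactly the behaviour the merging heuristic needs.<br />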
<br />
<center><br />
[[File:Tb_2.PNG | frame | center |Table 2. Examples of the analogical reasoning task for phrases (the full test set has 3218 examples). The goal is to compute the fourth phrase using the first three. Our best model achieved an accuracy of 72% on this dataset.]]<br />
</center><br />
<br />
==Phrase Skip-Gram Results==<br />
<br />
First, the phrase-based training corpus is constructed and then Skip-gram models are trained using different hyperparameters. Table 3 shows the results using vector dimensionality 300 and context size 5. This setting already achieves good performance on the phrase dataset, and allowed the authors to quickly compare Negative Sampling and the Hierarchical Softmax, both with and without subsampling of the frequent tokens. The results show that while Negative Sampling achieves a respectable accuracy even with ''k = 5'', using ''k = 15'' achieves considerably better performance. Also, subsampling can result in faster training and can improve accuracy, at least in some cases.<br />
<br />
<center><br />
[[File:Tb_3.PNG | frame | center |Table 3. Accuracies of the Skip-gram models on the phrase analogy dataset. The models were trained on approximately one billion words from the news dataset.]]<br />
</center><br />
<br />
<br />
The amount of training data was increased to 33 billion words in order to maximize the accuracy on the phrase analogy task. Hierarchical softmax, a dimensionality of 1000, and the entire sentence as the context were used. This resulted in a model that reached an accuracy of 72%. Reducing the size of the training dataset to 6 billion words lowered the accuracy (to 66%), which suggests that a large amount of training data is crucial. To gain further insight into how different the representations learned by different models are, nearest neighbors of infrequent phrases were inspected manually using various models. Table 4 shows a sample of such a comparison. Consistent with the previous results, it seems that the best representations of phrases are learned by a model with the hierarchical softmax and subsampling.<br />
<br />
<center><br />
[[File:Tb_4.PNG | frame | center |Table 4. Examples of the closest entities to the given short phrases, using two different models.]]<br />
</center><br />
<br />
=Additive Compositionality=<br />
<br />
The word and phrase representations learned by the Skip-gram model exhibit a linear structure that makes it possible to perform precise analogical reasoning using simple vector arithmetics. Also, the Skip-gram representations exhibit another kind of linear structure that makes it possible to meaningfully combine words by an element-wise addition of their vector representations. This phenomenon is illustrated in Table 5. The additive property of the vectors can be explained by inspecting the training objective. The word vectors are in a linear relationship with the inputs to the softmax nonlinearity. As the word vectors are trained to predict the surrounding words in the sentence, the vectors can be seen as representing the distribution of the context in which a word appears. These values are related logarithmically to the probabilities computed by the output layer, so the sum of two word vectors is related to the product of the two context distributions. The product works here as the AND function: words that are assigned high probabilities by both word vectors will have high probability, and the other words will have low probability.<br />
<br />
<center><br />
[[File:Tb_5.PNG | frame | center |Table 5. Vector compositionality using element-wise addition. Four closest tokens to the sum of two vectors are shown, using the best Skip-gram model.]]<br />
</center><br />
<br />
=Comparison to Published Word Representations=<br />
<br />
Table 6 shows an empirical comparison between different neural-network-based representations of words by listing the nearest neighbors of infrequent words. These examples show that the big Skip-gram model trained on a large corpus visibly outperforms all the other models in the quality of the learned representations. This can be attributed in part to the fact that this model was trained on about 30 billion words, which is about two to three orders of magnitude more data than the typical size used in prior work. Interestingly, although the training set is much larger, the training time of the Skip-gram model is just a fraction of the time required by the previous model architectures.<br />
<br />
<center><br />
[[File:Tb_6.PNG | frame | center |Table 6. Examples of the closest tokens given various well-known models and the Skip-gram model trained on phrases using over 30 billion training words. An empty cell means that the word was not in the vocabulary.]]<br />
</center><br />
<br />
=Conclusion=<br />
<br />
This work has the following key contributions:<br />
<br />
1. This work shows how to train distributed representations of words and phrases with the Skip-gram model and demonstrates that these representations exhibit a linear structure that makes precise analogical reasoning possible.<br />
<br />
2. The Skip-gram model is a computationally efficient architecture, which makes it possible to train models on several orders of magnitude more data than the previously published models.<br />
<br />
3. The paper introduces the Negative Sampling algorithm, an extremely simple training method that learns accurate representations, especially for frequent words.<br />
<br />
4. The choice of the training algorithm and the hyper-parameter selection is a task-specific decision. It is shown that the most crucial decisions affecting performance are the choice of the model architecture, the size of the vectors, the subsampling rate, and the size of the training window.<br />
<br />
5. The word vectors can be meaningfully combined using simple vector addition. Another approach for learning representations of phrases presented in this paper is to represent each phrase with a single token. Combining these two approaches gives a powerful yet simple way to represent longer pieces of text while keeping the computational complexity minimal.<br />
<br />
= Recursive Autoencoder =<br />
<br />
This section is based on the paper 'Semi-supervised recursive autoencoders for predicting sentiment distributions'.<ref> Socher, Richard, ''et al'' [http://www.socher.org/uploads/Main/SocherPenningtonHuangNgManning_EMNLP2011.pdf "Semi-supervised recursive autoencoders for predicting sentiment distributions"] in EMNLP, (2011). </ref><br />
=== Other techniques for sentence representation ===<br />
<br />
The idea of the Recursive Autoencoder is summarized in the figure below. The example illustrates a recursive autoencoder applied to a binary tree.<br />
<center><br />
[[File:Recur-auto.png]]<br />
</center><br />
<br />
Given a list of word vectors <math> x = (x_1, ..., x_m)</math>, the binary tree specifies triplets of parents with their children: <math> (y_1 \rightarrow x_3x_4), (y_2 \rightarrow x_2y_1), (y_3 \rightarrow x_1y_2) </math>.<br />
<br />
The first parent <math> y_1 </math> is computed from the children <math> (c_1, c_2) = (x_3, x_4)</math> as <math> p=f(W^{(1)}[c_1; c_2] + b^{(1)})</math>, where <math>W^{(1)}</math> is a parameter matrix and <math>b^{(1)}</math> is a bias term. <br />
<br />
The autoencoder then reconstructs the children as <math> [c_1'; c_2'] = W^{(2)}p + b^{(2)}</math>. The objective is to minimize the squared reconstruction error between the original and the reconstructed children, <math> E = \left\|[c_1; c_2] - [c_1'; c_2']\right\|^2</math>.<br />
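A single encode/reconstruct step from the description above can be sketched in a few lines. The vector dimensionality, the tanh nonlinearity, and the random initialization are illustrative assumptions; in the referenced paper these parameters are learned by training.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                        # word-vector dimensionality (illustrative)

# Parameters of an (untrained) recursive autoencoder.
W1 = rng.normal(scale=0.1, size=(n, 2 * n))  # encoder: children  -> parent
b1 = np.zeros(n)
W2 = rng.normal(scale=0.1, size=(2 * n, n))  # decoder: parent -> children
b2 = np.zeros(2 * n)

def encode(c1, c2):
    # p = f(W1 [c1; c2] + b1), here with f = tanh as the nonlinearity.
    return np.tanh(W1 @ np.concatenate([c1, c2]) + b1)

def reconstruction_error(c1, c2):
    # Squared error between the original children and their reconstruction.
    p = encode(c1, c2)
    rec = W2 @ p + b2                        # [c1'; c2']
    diff = np.concatenate([c1, c2]) - rec
    return float(diff @ diff)

x3, x4 = rng.normal(size=n), rng.normal(size=n)
y1 = encode(x3, x4)                          # first parent, built from children (x3, x4)
err = reconstruction_error(x3, x4)
print(y1.shape, err)
```

Training would repeat this step up the tree (using parents such as <math>y_1</math> as children of the next node) while minimizing the summed reconstruction error by gradient descent.<br />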
<br />
=Resources=<br />
<br />
The code for training the word and phrase vectors based on this paper is available in the open source project [https://code.google.com/p/word2vec/ word2vec]. This project also contains a set of pre-trained 300-dimensional vectors for 3 million words and phrases.<br />
<br />
=References=<br />
<references /></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=distributed_Representations_of_Words_and_Phrases_and_their_Compositionality&diff=27304distributed Representations of Words and Phrases and their Compositionality2015-12-14T19:24:37Z<p>Lruan: /* Other techniques for sentence representation */</p>
<hr />
<div>= Introduction =<br />
<br />
This paper<ref><br />
Mikolov, Tomas, et al. [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf "Distributed representations of words and phrases and their compositionality."] Advances in neural information processing systems. 2013.<br />
</ref> presents several extensions of the Skip-gram model introduced by Mikolov et al. <ref name=MiT> Mikolov, Tomas, ''et al'' [http://arxiv.org/pdf/1301.3781v3.pdf"Efficient Estimation of Word Representations in Vector Space"] in ICLR Workshop, (2013). </ref>. The Skip-gram model is an efficient method for learning high-quality vector representations of words from large amounts of unstructured text data. The word representations computed using this model are very interesting because the learned vectors explicitly encode many linguistic regularities and patterns. Somewhat surprisingly, many of these patterns can be represented as linear translations. For example, the result of a vector calculation vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector. The authors of this paper show that subsampling of frequent words during training results in a significant speedup and improves accuracy of the representations of less frequent words. In addition, a simplified variant of Noise Contrastive Estimation (NCE) <ref name=GuM><br />
Gutmann, Michael U, ''et al'' [http://www.cs.helsinki.fi/u/ahyvarin/papers/Gutmann12JMLR.pdf"Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics"] in The Journal ofMachine Learning Research, (2012).<br />
</ref>. for training the Skip-gram model is presented that results in faster training and better vector representations for frequent words, compared to more complex hierarchical softmax that was used in the prior work <ref name=MiT></ref>. It also shows that a non-obvious degree of language understanding can be obtained by using basic mathematical operations on the word vector representations. For example, vec(“Russia”) + vec(“river”) is close to vec(“Volga River”), and vec(“Germany”) + vec(“capital”) is close to vec(“Berlin”).<br />
<br />
= The Skip-gram Model =<br />
<br />
The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document. More formally, given a sequence of training words <math>w_1, w_2,..., w_T</math> the objective of the Skip-gram model is to maximize the average log probability:<br />
<br />
<math><br />
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c\leq j\leq c} log(p(w_{t+j}|w_t))<br />
</math><br />
<br /><br />
<br /><br />
where <math>c</math> is the size of the training context (which can be a function of the center word <math>w_t</math>) and <math>p(w_{t+j}|w_t)</math> is defined using softmax function:<br />
<br />
<math><br />
p(w_O|w_I) = \frac{exp ({v'_{W_O}}^T v_{W_I})}{\sum{w=1}^{W} exp ({v'_{W}}^T v_{W_I})}<br />
</math><br />
<br />
Here, <math>v_w</math> and <math>v'_w</math> are the “''input''” and “''output''” vector representations of <math>w</math>, and <math>W</math> is the number of words in the vocabulary.<br />
<br />
== Hierarchical Softmax ==<br />
<br />
Hierarchical Softmax is a computationally efficient approximation of the full softmax <ref name=MoF><br />
Morin, Frederic, ''et al'' [http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf"Hierarchical probabilistic neural network language model"] in Proceedings of the international workshop on artificial intelligence and statistics, (2015).<br />
</ref>. Hierarchical Softmax evaluate only about <math>log_2(W)</math> output nodes instead of evaluating <math>W</math> nodes in the neural network to obtain the probability distribution.<br />
<br />
The hierarchical softmax uses a binary tree representation of the output layer with the <math>W</math> words as its leaves and, for each node, explicitly represents the relative probabilities of its child nodes. These define a random walk that assigns probabilities to words.<br />
<br />
Let <math>n(w,j)</math> be the <math>j^{th}</math> node on the path from the root to <math>w</math>, and let <math>L(w)</math> be the length of this path, so <math>n(w,1) = root</math> and <math>n(w,L(w)) = w</math>. In addition, for any inner node <math>n</math>, let <math>ch(n)</math> be an arbitrary fixed child of <math>n</math> and let <math>[[x]]</math> be 1 if <math>x</math> is true and -1 otherwise. Then the hierarchical softmax defines <math>p(w_O|w_I )</math> as follows:<br />
<br />
<math><br />
p(w|w_I) = \prod_{j=1}^{L(w)-1} \sigma ([[n(w,j+1)=ch(n(w,j))]]{v'_{n(w,j)}}^T v_{W_I}) <br />
</math><br />
<br />
where<br />
<br />
<math><br />
\sigma (x)=\frac{1}{1+exp(-x)}<br />
</math><br />
<br />
In this paper, a binary Huffman tree is used as the structure for the hierarchical softmax because it assigns short codes to the frequent words which results in fast training. It has been observed before that grouping words together by their frequency works well as a very simple speedup technique for the neural network based language models <ref name=MiT></ref><ref name=MiT2><br />
Mikolov, Tomas, ''et al'' [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5947611"Extensions of recurrent neural network language model."] in Acoustics, Speech and Signal Processing (ICASSP), (2011).<br />
</ref>.<br />
<br />
== Negative Sampling==<br />
<br />
Noise Contrastive Estimation (NCE) is an alternative to the hierarchical softmax. NCE indicates that a good model should be able to differentiate data from noise by means of logistic regression. While NCE can be shown to approximately maximize the log probability of the softmax, the Skipgram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality. Negative sampling (NEG) is defined by the objective:<br />
<br />
<math><br />
log \sigma ({v'_{W_O}}^T v_{W_I})+\sum_{i=1}^{k} \mathbb{E}_{w_i\sim P_n(w)}[log \sigma ({-v'_{W_i}}^T v_{W_I})]<br />
</math><br />
<br />
The main difference between the Negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples. And while NCE approximately maximizes the log probability of the softmax, this property is not important for our application.<br /><br />
Both NCE and NEG have the noise distribution <math>P_n(w)</math> as a free parameter. We investigated a number of choices for <math>P_n(w)</math> and found that the unigram distribution <math>U(w)</math> raised to the 3/4rd power (i.e., <math>U(w)^{3/4}/Z)</math> outperformed significantly the unigram and the uniform distributions, for both NCE and NEG on every task we tried including language modeling.<br />
<br />
==Subsampling of Frequent Words==<br />
In very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g., “in”, “the”, and “a”). Such words usually provide less information about the surrounding words that rarer words (i.e., "the" provides little information about the next word because it co-occurs with a huge number of words), and the representation of the frequent word will be unlikely to change significantly after many iterations. <br />
<br />
To counter the imbalance between the rare and frequent words, a simple subsampling approach is used. Each word <math>w_i</math> in the training set is discarded with probability computed by the formula:<br />
<br />
<math><br />
P(w_i)=1-\sqrt{\frac{1}{f(w_i)}}<br />
</math><br />
<br />
where <math>f(w_i)</math> is the frequency of word <math>w_i</math> and <math>t</math> is a chosen threshold, typically around <math>10^{−5}</math>.<br />
<br />
= Empirical Results=<br />
<br />
The Hierarchical Softmax (HS), Noise Contrastive Estimation, Negative Sampling, and subsampling of the training words are evaluated with the help of the analogical reasoning task1 <ref name=MiT></ref>. The task consists of analogies such as “Germany” : “Berlin” :: “France” : ?, which are solved by finding a vector ''x'' such that vec(''x'') is closest to vec(“Berlin”) - vec(“Germany”) + vec(“France”) according to the cosine distance. This specific example is considered to have been answered correctly if ''x'' is “Paris”. The task has two broad categories: the syntactic analogies (such as “quick” : “quickly” :: “slow” : “slowly”) and the semantic analogies, such as the country to capital city relationship.<br />
<br />
For training the Skip-gram models, a large dataset consisting of various news articles is used (an internal Google dataset with one billion words). All words that occurred less than 5 times in the training data were discarded, which resulted in a vocabulary of size 692K. The performance of various Skip-gram models on the word analogy test set is reported in Table 1. The table shows that Negative Sampling outperforms the Hierarchical Softmax on the analogical reasoning task, and has even slightly better performance than the Noise Contrastive Estimation. The subsampling of the frequent words improves the training speed several times and makes the word representations significantly more accurate.<br />
<br />
<center><br />
[[File:Tb_1.PNG | frame | center |Table 1. Accuracy of various Skip-gram 300-dimensional models on the analogical reasoning task as defined in <ref name=MiT></ref>. NEG-''k'' stands for Negative Sampling with ''k'' negative samples for each positive sample; NCE stands for Noise Contrastive Estimation and HS-Huffman stands for the Hierarchical Softmax with the frequency-based Huffman codes. ]]<br />
</center><br />
<br />
=Learning Phrases=<br />
<br />
Many phrases have a meaning that is not a simple composition of the meanings of its individual words. To learn vector representation for phrases, we first find words that appear frequently together, and infrequently in other contexts. For example, “''New York Times''” and “''Toronto Maple Leafs''” are replaced by unique tokens in the training data, while a bigram “''this is''” will remain unchanged. This way, we can form many reasonable phrases without greatly increasing the size of the vocabulary; in theory, we can train the Skip-gram model using all n-grams, but that would be too memory intensive. A simple data-driven approach, where phrases are formed based on the unigram and bigram counts is applied to identify the phrases. In this approach, a ''score'' is calculated as:<br />
<br />
<math><br />
score(w_i,w_j)=\frac{count(w_iw_j)-\delta}{count(w_i)count(w_j)}<br />
</math><br />
<br />
The <math>\delta</math> is used as a discounting coefficient and prevents too many phrases consisting of very infrequent words to be formed. The bigrams with ''scores'' above the chosen threshold are then used as phrases. The quality of the phrase representations is evaluated using a new analogical reasoning task that involves phrases. Table 2 shows examples of the five categories of analogies used in this task.<br />
<br />
<center><br />
[[File:Tb_2.PNG | frame | center |Table 2. Examples of the analogical reasoning task for phrases (the full test set has 3218 examples). The goal is to compute the fourth phrase using the first three. Our best model achieved an accuracy of 72% on this dataset.]]<br />
</center><br />
<br />
==Phrase Skip-Gram Results==<br />
<br />
First, the phrase based training corpus is constructed and then Skip-gram models are trained using different hyperparameters. Table 3 shows the results using vector dimensionality 300 and context size 5. This setting already achieves good performance on the phrase dataset, and allowed us to quickly compare the Negative Sampling and the Hierarchical Softmax, both with and without subsampling of the frequent tokens. The results show that while Negative Sampling achieves a respectable accuracy even with ''k = 5'', using ''k = 15'' achieves considerably better performance. Also, the subsampling can result in faster training and can also improve accuracy, at least in some cases.<br />
<br />
<center><br />
[[File:Tb_3.PNG | frame | center |Table 3. Accuracies of the Skip-gram models on the phrase analogy dataset. The models were trained on approximately one billion words from the news dataset.]]<br />
</center><br />
<br />
<br />
The amount of the training data was increased to 33 billion words in order to maximize the accuracy on the phrase analogy task. Hierarchical softmax, dimensionality of 1000, and the entire sentence for the context were used. This resulted in a model that reached an accuracy of 72%. Reducing the size of the training dataset to 6 billion caused lower accuracy (66%), which suggests that large amount of the training data is crucial. To gain further insight into how different the representations learned by different models are, nearest neighbors of infrequent phrases were inspected manually using various models. In Table 4 shows a sample of such comparison. Consistently with the previous results, it seems that the best representations of phrases are learned by a model with the hierarchical softmax and subsampling.<br />
<br />
<center><br />
[[File:Tb_4.PNG | frame | center |Table 4. Examples of the closest entities to the given short phrases, using two different models.]]<br />
</center><br />
<br />
=Additive Compositionality=<br />
<br />
The word and phrase representations learned by the Skip-gram model exhibit a linear structure that makes it possible to perform precise analogical reasoning using simple vector arithmetics. Also, the Skip-gram representations exhibit another kind of linear structure that makes it possible to meaningfully combine words by an element-wise addition of their vector representations. This phenomenon is illustrated in Table 5. The additive property of the vectors can be explained by inspecting the training objective. The word vectors are in a linear relationship with the inputs to the softmax nonlinearity. As the word vectors are trained to predict the surrounding words in the sentence, the vectors can be seen as representing the distribution of the context in which a word appears. These values are related logarithmically to the probabilities computed by the output layer, so the sum of two word vectors is related to the product of the two context distributions. The product works here as the AND function: words that are assigned high probabilities by both word vectors will have high probability, and the other words will have low probability.<br />
<br />
<center><br />
[[File:Tb_5.PNG | frame | center |Table 5. Vector compositionality using element-wise addition. Four closest tokens to the sum of two vectors are shown, using the best Skip-gram model.]]<br />
</center><br />
<br />
=Comparison to Published Word Representations=<br />
<br />
Table 6 shows the empirical comparison between different neural network-based representations of words by showing the nearest neighbors of infrequent words. These examples show that the big Skip-gram model trained on a large corpus visibly outperforms all the other models in the quality of the learned representations. This can be attributed in part to the fact that this model has been trained on about 30 billion words, which is about two to three orders of magnitude more data than the typical size used in the prior work. Interestingly, although the training set is much larger, the training time of the Skip-gram model is just a fraction of the time complexity required by the previous model architectures.<br />
<br />
<center><br />
[[File:Tb_6.PNG | frame | center |Table 6. Examples of the closest tokens given various well-known models and the Skip-gram model trained on phrases using over 30 billion training words. An empty cell means that the word was not in the vocabulary.]]<br />
</center><br />
<br />
=Conclusion=<br />
<br />
This work has the following key contributions:<br />
<br />
1. This work shows how to train distributed representations of words and phrases with the Skip-gram model and demonstrate that these representations exhibit linear structure that makes precise analogical reasoning possible.<br />
<br />
2. It is a computationally efficient model architecture which results in successfully train models on several orders of magnitude more data than the previously published models.<br />
<br />
3. Introducing Negative sampling algorithm, which is an extremely simple training method that learns accurate representations especially for frequent words.<br />
<br />
4. The choice of the training algorithm and the hyper-parameter selection is a task specific decision. It is shown that the most crucial decisions that affect the performance are the choice of the model architecture, the size of the vectors, the subsampling rate, and the size of the training window.<br />
<br />
5. The word vectors can be meaningfully combined using just simple vector addition. Another approach for learning representations of phrases presented in this paper is to simply represent the phrases with a single token. Combining these two approaches gives a powerful yet simple way how to represent longer pieces of text, while having minimal computational complexity.<br />
<br />
= Recursive Autoencoder =<br />
<br />
This is taken from paper 'Semi-supervised recursive autoencoders for predicting sentiment distributions'.<ref> Socher, et al. [http://www.socher.org/uploads/Main/SocherPenningtonHuangNgManning_EMNLP2011.pdf] </ref><br />
=== Other techniques for sentence representation ===<br />
<br />
The idea of Recursive Autoencoder is summarized in the figure below. The example illustrates the recursive autoencoder to a binary tree.<br />
<center><br />
[[File:Recur-auto.PNG]]<br />
</center><br />
<br />
Assume given a list of word vectors <math> x = (x_1, ..., x_m)</math>, we need to branch triplets of parents with children: <math> (y_1 \rightarrow x_3x_4), (y_2 \rightarrow x_2y_1), (y_3 \rightarrow x_1y_2) </math>.<br />
<br />
The first parent <math> y_1 </math> is computed from the children <math> (c_1, c_2) = (x_3, x_4)</math>: <math> p=f(W^{(1)}[c_1; c_2] + b^{(1)})</math> , where W is the parameter matrix and b is bias term. <br />
<br />
The autoencoder comes in by reconstructing children set <math> [c_1^'; c_2^'] = W^{(2)}p + b^{(2)}</math>. The object of this method is to minimized the MSE of original children set and the reconstructed children set.<br />
<br />
=Resources=<br />
<br />
The code for training the word and phrase vectors based on this paper is available in the open source project [https://code.google.com/p/word2vec/ word2vec]. This project also contains a set of pre-trained 300-dimensional vectors for 3 million words and phrases.<br />
<br />
=References=<br />
<references /></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Recur-auto.png&diff=27303File:Recur-auto.png2015-12-14T19:23:39Z<p>Lruan: uploaded a new version of &quot;File:Recur-auto.png&quot;</p>
<hr />
<div></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=distributed_Representations_of_Words_and_Phrases_and_their_Compositionality&diff=27302distributed Representations of Words and Phrases and their Compositionality2015-12-14T19:22:39Z<p>Lruan: /* Other techniques for sentence representation */</p>
<hr />
<div>= Introduction =<br />
<br />
This paper<ref><br />
Mikolov, Tomas, et al. [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf "Distributed representations of words and phrases and their compositionality."] Advances in neural information processing systems. 2013.<br />
</ref> presents several extensions of the Skip-gram model introduced by Mikolov et al. <ref name=MiT> Mikolov, Tomas, ''et al'' [http://arxiv.org/pdf/1301.3781v3.pdf"Efficient Estimation of Word Representations in Vector Space"] in ICLR Workshop, (2013). </ref>. The Skip-gram model is an efficient method for learning high-quality vector representations of words from large amounts of unstructured text data. The word representations computed using this model are very interesting because the learned vectors explicitly encode many linguistic regularities and patterns. Somewhat surprisingly, many of these patterns can be represented as linear translations. For example, the result of a vector calculation vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector. The authors of this paper show that subsampling of frequent words during training results in a significant speedup and improves accuracy of the representations of less frequent words. In addition, a simplified variant of Noise Contrastive Estimation (NCE) <ref name=GuM><br />
Gutmann, Michael U, ''et al'' [http://www.cs.helsinki.fi/u/ahyvarin/papers/Gutmann12JMLR.pdf"Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics"] in The Journal ofMachine Learning Research, (2012).<br />
</ref>. for training the Skip-gram model is presented that results in faster training and better vector representations for frequent words, compared to more complex hierarchical softmax that was used in the prior work <ref name=MiT></ref>. It also shows that a non-obvious degree of language understanding can be obtained by using basic mathematical operations on the word vector representations. For example, vec(“Russia”) + vec(“river”) is close to vec(“Volga River”), and vec(“Germany”) + vec(“capital”) is close to vec(“Berlin”).<br />
<br />
= The Skip-gram Model =<br />
<br />
The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document. More formally, given a sequence of training words <math>w_1, w_2,..., w_T</math> the objective of the Skip-gram model is to maximize the average log probability:<br />
<br />
<math><br />
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c\leq j\leq c} log(p(w_{t+j}|w_t))<br />
</math><br />
<br /><br />
<br /><br />
where <math>c</math> is the size of the training context (which can be a function of the center word <math>w_t</math>) and <math>p(w_{t+j}|w_t)</math> is defined using softmax function:<br />
<br />
<math><br />
p(w_O|w_I) = \frac{exp ({v'_{W_O}}^T v_{W_I})}{\sum{w=1}^{W} exp ({v'_{W}}^T v_{W_I})}<br />
</math><br />
<br />
Here, <math>v_w</math> and <math>v'_w</math> are the “''input''” and “''output''” vector representations of <math>w</math>, and <math>W</math> is the number of words in the vocabulary.<br />
<br />
== Hierarchical Softmax ==<br />
<br />
Hierarchical Softmax is a computationally efficient approximation of the full softmax <ref name=MoF><br />
Morin, Frederic, ''et al'' [http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf"Hierarchical probabilistic neural network language model"] in Proceedings of the international workshop on artificial intelligence and statistics, (2015).<br />
</ref>. Hierarchical Softmax evaluate only about <math>log_2(W)</math> output nodes instead of evaluating <math>W</math> nodes in the neural network to obtain the probability distribution.<br />
<br />
The hierarchical softmax uses a binary tree representation of the output layer with the <math>W</math> words as its leaves and, for each node, explicitly represents the relative probabilities of its child nodes. These define a random walk that assigns probabilities to words.<br />
<br />
Let <math>n(w,j)</math> be the <math>j^{th}</math> node on the path from the root to <math>w</math>, and let <math>L(w)</math> be the length of this path, so <math>n(w,1) = root</math> and <math>n(w,L(w)) = w</math>. In addition, for any inner node <math>n</math>, let <math>ch(n)</math> be an arbitrary fixed child of <math>n</math> and let <math>[[x]]</math> be 1 if <math>x</math> is true and -1 otherwise. Then the hierarchical softmax defines <math>p(w_O|w_I )</math> as follows:<br />
<br />
<math><br />
p(w|w_I) = \prod_{j=1}^{L(w)-1} \sigma ([[n(w,j+1)=ch(n(w,j))]]{v'_{n(w,j)}}^T v_{W_I}) <br />
</math><br />
<br />
where<br />
<br />
<math><br />
\sigma (x)=\frac{1}{1+exp(-x)}<br />
</math><br />
<br />
In this paper, a binary Huffman tree is used as the structure for the hierarchical softmax because it assigns short codes to the frequent words which results in fast training. It has been observed before that grouping words together by their frequency works well as a very simple speedup technique for the neural network based language models <ref name=MiT></ref><ref name=MiT2><br />
Mikolov, Tomas, ''et al'' [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5947611"Extensions of recurrent neural network language model."] in Acoustics, Speech and Signal Processing (ICASSP), (2011).<br />
</ref>.<br />
<br />
== Negative Sampling==<br />
<br />
Noise Contrastive Estimation (NCE) is an alternative to the hierarchical softmax. NCE indicates that a good model should be able to differentiate data from noise by means of logistic regression. While NCE can be shown to approximately maximize the log probability of the softmax, the Skipgram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality. Negative sampling (NEG) is defined by the objective:<br />
<br />
<math><br />
log \sigma ({v'_{W_O}}^T v_{W_I})+\sum_{i=1}^{k} \mathbb{E}_{w_i\sim P_n(w)}[log \sigma ({-v'_{W_i}}^T v_{W_I})]<br />
</math><br />
<br />
The main difference between the Negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples. And while NCE approximately maximizes the log probability of the softmax, this property is not important for our application.<br /><br />
Both NCE and NEG have the noise distribution <math>P_n(w)</math> as a free parameter. We investigated a number of choices for <math>P_n(w)</math> and found that the unigram distribution <math>U(w)</math> raised to the 3/4rd power (i.e., <math>U(w)^{3/4}/Z)</math> outperformed significantly the unigram and the uniform distributions, for both NCE and NEG on every task we tried including language modeling.<br />
<br />
==Subsampling of Frequent Words==<br />
In very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g., “in”, “the”, and “a”). Such words usually provide less information about the surrounding words that rarer words (i.e., "the" provides little information about the next word because it co-occurs with a huge number of words), and the representation of the frequent word will be unlikely to change significantly after many iterations. <br />
<br />
To counter the imbalance between the rare and frequent words, a simple subsampling approach is used. Each word <math>w_i</math> in the training set is discarded with probability computed by the formula:<br />
<br />
<math><br />
P(w_i)=1-\sqrt{\frac{1}{f(w_i)}}<br />
</math><br />
<br />
where <math>f(w_i)</math> is the frequency of word <math>w_i</math> and <math>t</math> is a chosen threshold, typically around <math>10^{−5}</math>.<br />
<br />
= Empirical Results=<br />
<br />
The Hierarchical Softmax (HS), Noise Contrastive Estimation, Negative Sampling, and subsampling of the training words are evaluated with the help of the analogical reasoning task1 <ref name=MiT></ref>. The task consists of analogies such as “Germany” : “Berlin” :: “France” : ?, which are solved by finding a vector ''x'' such that vec(''x'') is closest to vec(“Berlin”) - vec(“Germany”) + vec(“France”) according to the cosine distance. This specific example is considered to have been answered correctly if ''x'' is “Paris”. The task has two broad categories: the syntactic analogies (such as “quick” : “quickly” :: “slow” : “slowly”) and the semantic analogies, such as the country to capital city relationship.<br />
<br />
For training the Skip-gram models, a large dataset consisting of various news articles is used (an internal Google dataset with one billion words). All words that occurred less than 5 times in the training data were discarded, which resulted in a vocabulary of size 692K. The performance of various Skip-gram models on the word analogy test set is reported in Table 1. The table shows that Negative Sampling outperforms the Hierarchical Softmax on the analogical reasoning task, and has even slightly better performance than the Noise Contrastive Estimation. The subsampling of the frequent words improves the training speed several times and makes the word representations significantly more accurate.<br />
<br />
<center><br />
[[File:Tb_1.PNG | frame | center |Table 1. Accuracy of various Skip-gram 300-dimensional models on the analogical reasoning task as defined in <ref name=MiT></ref>. NEG-''k'' stands for Negative Sampling with ''k'' negative samples for each positive sample; NCE stands for Noise Contrastive Estimation and HS-Huffman stands for the Hierarchical Softmax with the frequency-based Huffman codes. ]]<br />
</center><br />
<br />
=Learning Phrases=<br />
<br />
Many phrases have a meaning that is not a simple composition of the meanings of their individual words. To learn vector representations for phrases, we first find words that appear frequently together but infrequently in other contexts. For example, “''New York Times''” and “''Toronto Maple Leafs''” are replaced by unique tokens in the training data, while a bigram “''this is''” will remain unchanged. This way, we can form many reasonable phrases without greatly increasing the size of the vocabulary; in theory, we can train the Skip-gram model using all n-grams, but that would be too memory intensive. A simple data-driven approach, where phrases are formed based on the unigram and bigram counts, is applied to identify the phrases. In this approach, a ''score'' is calculated as:<br />
<br />
<math><br />
score(w_i,w_j)=\frac{count(w_iw_j)-\delta}{count(w_i)count(w_j)}<br />
</math><br />
<br />
The <math>\delta</math> is used as a discounting coefficient and prevents too many phrases consisting of very infrequent words from being formed. The bigrams with ''scores'' above the chosen threshold are then used as phrases. The quality of the phrase representations is evaluated using a new analogical reasoning task that involves phrases. Table 2 shows examples of the five categories of analogies used in this task.<br />
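A minimal sketch of this scoring pass (the counts, <math>\delta</math>, and threshold below are illustrative values, not the paper's):<br />

```python
def find_phrases(unigram, bigram, delta=5, threshold=1e-4):
    """Return the bigrams whose score exceeds the threshold.

    score(w_i, w_j) = (count(w_i w_j) - delta) / (count(w_i) * count(w_j));
    delta discounts bigrams built from very infrequent words.
    """
    phrases = {}
    for (wi, wj), c in bigram.items():
        score = (c - delta) / (unigram[wi] * unigram[wj])
        if score > threshold:
            phrases[(wi, wj)] = score
    return phrases

unigram = {"new": 1000, "york": 500, "this": 9000, "is": 12000}
bigram = {("new", "york"): 450, ("this", "is"): 800}
phrases = find_phrases(unigram, bigram)
# ("new", "york") nearly always co-occur, so the pair scores well above the
# threshold; ("this", "is") is frequent only because both words are frequent,
# so its score stays tiny and the bigram is rejected.
```

In the paper such passes are run repeatedly over the data with a decreasing threshold, so that phrases longer than two words can be formed.<br />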
<br />
<center><br />
[[File:Tb_2.PNG | frame | center |Table 2. Examples of the analogical reasoning task for phrases (the full test set has 3218 examples). The goal is to compute the fourth phrase using the first three. Our best model achieved an accuracy of 72% on this dataset.]]<br />
</center><br />
<br />
==Phrase Skip-Gram Results==<br />
<br />
First, the phrase-based training corpus is constructed, and then Skip-gram models are trained using different hyperparameters. Table 3 shows the results using vector dimensionality 300 and context size 5. This setting already achieves good performance on the phrase dataset, and allowed us to quickly compare the Negative Sampling and the Hierarchical Softmax, both with and without subsampling of the frequent tokens. The results show that while Negative Sampling achieves a respectable accuracy even with ''k = 5'', using ''k = 15'' achieves considerably better performance. Subsampling can result in faster training and can also improve accuracy, at least in some cases.<br />
<br />
<center><br />
[[File:Tb_3.PNG | frame | center |Table 3. Accuracies of the Skip-gram models on the phrase analogy dataset. The models were trained on approximately one billion words from the news dataset.]]<br />
</center><br />
<br />
<br />
The amount of the training data was increased to 33 billion words in order to maximize the accuracy on the phrase analogy task. Hierarchical softmax, dimensionality of 1000, and the entire sentence for the context were used. This resulted in a model that reached an accuracy of 72%. Reducing the size of the training dataset to 6 billion words caused lower accuracy (66%), which suggests that a large amount of training data is crucial. To gain further insight into how different the representations learned by different models are, nearest neighbors of infrequent phrases were inspected manually using various models. Table 4 shows a sample of such a comparison. Consistent with the previous results, it seems that the best representations of phrases are learned by a model with the hierarchical softmax and subsampling.<br />
<br />
<center><br />
[[File:Tb_4.PNG | frame | center |Table 4. Examples of the closest entities to the given short phrases, using two different models.]]<br />
</center><br />
<br />
=Additive Compositionality=<br />
<br />
The word and phrase representations learned by the Skip-gram model exhibit a linear structure that makes it possible to perform precise analogical reasoning using simple vector arithmetic. Also, the Skip-gram representations exhibit another kind of linear structure that makes it possible to meaningfully combine words by an element-wise addition of their vector representations. This phenomenon is illustrated in Table 5. The additive property of the vectors can be explained by inspecting the training objective. The word vectors are in a linear relationship with the inputs to the softmax nonlinearity. As the word vectors are trained to predict the surrounding words in the sentence, the vectors can be seen as representing the distribution of the context in which a word appears. These values are related logarithmically to the probabilities computed by the output layer, so the sum of two word vectors is related to the product of the two context distributions. The product works here as the AND function: words that are assigned high probabilities by both word vectors will have high probability, and the other words will have low probability.<br />
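The "product of context distributions" argument can be checked numerically. In the sketch below (toy vectors, not trained ones), the context distribution of a summed vector equals the renormalized element-wise product of the individual distributions, because the softmax scores are linear in the word vector:<br />

```python
import math

def softmax(scores):
    m = max(scores)  # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Context ("output") vectors for a toy vocabulary of four tokens.
ctx = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]

def context_dist(word_vec):
    """p(c | w) over the toy vocabulary for a given word vector."""
    return softmax([sum(a * b for a, b in zip(word_vec, c)) for c in ctx])

u, v = [2.0, 0.0], [0.0, 2.0]
p_u = context_dist(u)
p_v = context_dist(v)
p_sum = context_dist([a + b for a, b in zip(u, v)])

# Because score(u + v) = score(u) + score(v), softmax turns the sum of
# scores into a product of (unnormalized) probabilities.
product = [a * b for a, b in zip(p_u, p_v)]
z = sum(product)
normalized_product = [p / z for p in product]  # the "AND" of the two distributions
```

This identity holds for any fixed set of context vectors; it is exactly the reasoning in the paragraph above, made concrete.<br />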
<br />
<center><br />
[[File:Tb_5.PNG | frame | center |Table 5. Vector compositionality using element-wise addition. Four closest tokens to the sum of two vectors are shown, using the best Skip-gram model.]]<br />
</center><br />
<br />
=Comparison to Published Word Representations=<br />
<br />
Table 6 shows the empirical comparison between different neural network-based representations of words by showing the nearest neighbors of infrequent words. These examples show that the big Skip-gram model trained on a large corpus visibly outperforms all the other models in the quality of the learned representations. This can be attributed in part to the fact that this model has been trained on about 30 billion words, which is about two to three orders of magnitude more data than the typical size used in the prior work. Interestingly, although the training set is much larger, the training time of the Skip-gram model is just a fraction of the time required by the previous model architectures.<br />
<br />
<center><br />
[[File:Tb_6.PNG | frame | center |Table 6. Examples of the closest tokens given various well-known models and the Skip-gram model trained on phrases using over 30 billion training words. An empty cell means that the word was not in the vocabulary.]]<br />
</center><br />
<br />
=Conclusion=<br />
<br />
This work has the following key contributions:<br />
<br />
1. This work shows how to train distributed representations of words and phrases with the Skip-gram model and demonstrates that these representations exhibit linear structure that makes precise analogical reasoning possible.<br />
<br />
2. The Skip-gram model is a computationally efficient architecture, which makes it possible to successfully train models on several orders of magnitude more data than the previously published models.<br />
<br />
3. It introduces the Negative Sampling algorithm, an extremely simple training method that learns accurate representations, especially for frequent words.<br />
<br />
4. The choice of the training algorithm and the hyper-parameter selection is a task-specific decision. It is shown that the most crucial decisions affecting the performance are the choice of the model architecture, the size of the vectors, the subsampling rate, and the size of the training window.<br />
<br />
5. The word vectors can be meaningfully combined using just simple vector addition. Another approach for learning representations of phrases presented in this paper is to simply represent the phrases with a single token. Combining these two approaches gives a powerful yet simple way to represent longer pieces of text with minimal computational complexity.<br />
<br />
= Recursive Autoencoder =<br />
<br />
This is taken from the paper 'Semi-supervised recursive autoencoders for predicting sentiment distributions'.<ref> Socher, et al. [http://www.socher.org/uploads/Main/SocherPenningtonHuangNgManning_EMNLP2011.pdf] </ref><br />
== Other techniques for sentence representation ==<br />
<br />
The idea of the Recursive Autoencoder is summarized in the figure below. The example illustrates the recursive autoencoder applied to a binary tree.<br />
<center><br />
[[File:Recur_auto.PNG]]<br />
</center><br />
<br />
Assume we are given a list of word vectors <math> x = (x_1, ..., x_m)</math> together with a binary tree structure specified as triplets of parents and their children: <math> (y_1 \rightarrow x_3x_4), (y_2 \rightarrow x_2y_1), (y_3 \rightarrow x_1y_2) </math>.<br />
<br />
The first parent <math> y_1 </math> is computed from its children <math> (c_1, c_2) = (x_3, x_4)</math> as <math> p=f(W^{(1)}[c_1; c_2] + b^{(1)})</math>, where <math>W^{(1)}</math> is a parameter matrix, <math>b^{(1)}</math> is a bias term, and <math>f</math> is an element-wise activation function such as tanh. <br />
<br />
The autoencoder then reconstructs the children as <math> [c_1'; c_2'] = W^{(2)}p + b^{(2)}</math>. The objective of this method is to minimize the reconstruction error, i.e., the squared distance between the original children <math>[c_1; c_2]</math> and the reconstructed children <math>[c_1'; c_2']</math>.<br />
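One encode/reconstruct step can be sketched as follows (the dimensionality, the random initialization, and the tanh choice of <math>f</math> are illustrative assumptions, not the paper's settings):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # word-vector dimensionality (illustrative)

# Encoder parameters W1, b1 and decoder parameters W2, b2,
# randomly initialized for the sketch.
W1 = rng.standard_normal((d, 2 * d)) * 0.1
b1 = np.zeros(d)
W2 = rng.standard_normal((2 * d, d)) * 0.1
b2 = np.zeros(2 * d)

def encode(c1, c2):
    """Parent p = f(W1 [c1; c2] + b1), with f = tanh."""
    return np.tanh(W1 @ np.concatenate([c1, c2]) + b1)

def reconstruction_error(c1, c2):
    """Squared error between [c1; c2] and its reconstruction from the parent."""
    p = encode(c1, c2)
    children_hat = W2 @ p + b2
    return float(np.sum((np.concatenate([c1, c2]) - children_hat) ** 2))

x3, x4 = rng.standard_normal(d), rng.standard_normal(d)
y1 = encode(x3, x4)                 # the first parent in the example tree
err = reconstruction_error(x3, x4)  # training minimizes this by gradient descent
```

In the full model the same encoder is applied recursively: the parent <math>y_1</math> becomes a child of the next triplet, and the summed reconstruction error over all tree nodes is minimized.<br />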
<br />
=Resources=<br />
<br />
The code for training the word and phrase vectors based on this paper is available in the open source project [https://code.google.com/p/word2vec/ word2vec]. This project also contains a set of pre-trained 300-dimensional vectors for 3 million words and phrases.<br />
<br />
=References=<br />
<references /></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=distributed_Representations_of_Words_and_Phrases_and_their_Compositionality&diff=27301distributed Representations of Words and Phrases and their Compositionality2015-12-14T19:22:10Z<p>Lruan: </p>
<hr />
<div>= Introduction =<br />
<br />
This paper<ref><br />
Mikolov, Tomas, et al. [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf "Distributed representations of words and phrases and their compositionality."] Advances in neural information processing systems. 2013.<br />
</ref> presents several extensions of the Skip-gram model introduced by Mikolov et al. <ref name=MiT> Mikolov, Tomas, ''et al'' [http://arxiv.org/pdf/1301.3781v3.pdf "Efficient Estimation of Word Representations in Vector Space"] in ICLR Workshop, (2013). </ref>. The Skip-gram model is an efficient method for learning high-quality vector representations of words from large amounts of unstructured text data. The word representations computed using this model are very interesting because the learned vectors explicitly encode many linguistic regularities and patterns. Somewhat surprisingly, many of these patterns can be represented as linear translations. For example, the result of the vector calculation vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector. The authors of this paper show that subsampling of frequent words during training results in a significant speedup and improves the accuracy of the representations of less frequent words. In addition, a simplified variant of Noise Contrastive Estimation (NCE) <ref name=GuM><br />
Gutmann, Michael U, ''et al'' [http://www.cs.helsinki.fi/u/ahyvarin/papers/Gutmann12JMLR.pdf "Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics"] in The Journal of Machine Learning Research, (2012).<br />
</ref> for training the Skip-gram model is presented, which results in faster training and better vector representations for frequent words, compared to the more complex hierarchical softmax that was used in the prior work <ref name=MiT></ref>. It also shows that a non-obvious degree of language understanding can be obtained by using basic mathematical operations on the word vector representations. For example, vec(“Russia”) + vec(“river”) is close to vec(“Volga River”), and vec(“Germany”) + vec(“capital”) is close to vec(“Berlin”).<br />
<br />
= The Skip-gram Model =<br />
<br />
The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document. More formally, given a sequence of training words <math>w_1, w_2,..., w_T</math> the objective of the Skip-gram model is to maximize the average log probability:<br />
<br />
<math><br />
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c\leq j\leq c} log(p(w_{t+j}|w_t))<br />
</math><br />
<br /><br />
<br /><br />
where <math>c</math> is the size of the training context (which can be a function of the center word <math>w_t</math>) and <math>p(w_{t+j}|w_t)</math> is defined using the softmax function:<br />
<br />
<math><br />
p(w_O|w_I) = \frac{exp ({v'_{w_O}}^T v_{w_I})}{\sum_{w=1}^{W} exp ({v'_{w}}^T v_{w_I})}<br />
</math><br />
<br />
Here, <math>v_w</math> and <math>v'_w</math> are the “''input''” and “''output''” vector representations of <math>w</math>, and <math>W</math> is the number of words in the vocabulary.<br />
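With a tiny made-up vocabulary, the full softmax above can be computed directly; the sketch also makes clear why it is expensive, since every prediction touches all <math>W</math> output vectors (real vocabularies have <math>10^5</math>–<math>10^7</math> words):<br />

```python
import math

V, d = 5, 3  # toy vocabulary size and embedding dimensionality

# "Input" (v_w) and "output" (v'_w) embeddings, one row per vocabulary word.
v_in = [[0.1 * (i + j) for j in range(d)] for i in range(V)]
v_out = [[0.2 * (i - j) for j in range(d)] for i in range(V)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def full_softmax_prob(w_o, w_i):
    """p(w_O | w_I) = exp(v'_{w_O} . v_{w_I}) / sum_w exp(v'_w . v_{w_I})."""
    scores = [dot(v_out[w], v_in[w_i]) for w in range(V)]
    m = max(scores)  # subtract the max for numerical stability
    z = sum(math.exp(s - m) for s in scores)
    return math.exp(scores[w_o] - m) / z

probs = [full_softmax_prob(w, 2) for w in range(V)]  # a distribution over V words
```

The hierarchical softmax and negative sampling described next both exist to avoid the <math>O(W)</math> normalization term <math>z</math> in this computation.<br />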
<br />
== Hierarchical Softmax ==<br />
<br />
Hierarchical Softmax is a computationally efficient approximation of the full softmax <ref name=MoF><br />
Morin, Frederic, ''et al'' [http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf "Hierarchical probabilistic neural network language model"] in Proceedings of the international workshop on artificial intelligence and statistics, (2005).<br />
</ref>. Hierarchical Softmax evaluates only about <math>log_2(W)</math> output nodes, instead of evaluating all <math>W</math> nodes in the neural network, to obtain the probability distribution.<br />
<br />
The hierarchical softmax uses a binary tree representation of the output layer with the <math>W</math> words as its leaves and, for each node, explicitly represents the relative probabilities of its child nodes. These define a random walk that assigns probabilities to words.<br />
<br />
Let <math>n(w,j)</math> be the <math>j^{th}</math> node on the path from the root to <math>w</math>, and let <math>L(w)</math> be the length of this path, so <math>n(w,1) = root</math> and <math>n(w,L(w)) = w</math>. In addition, for any inner node <math>n</math>, let <math>ch(n)</math> be an arbitrary fixed child of <math>n</math> and let <math>[[x]]</math> be 1 if <math>x</math> is true and -1 otherwise. Then the hierarchical softmax defines <math>p(w_O|w_I )</math> as follows:<br />
<br />
<math><br />
p(w|w_I) = \prod_{j=1}^{L(w)-1} \sigma ([[n(w,j+1)=ch(n(w,j))]]{v'_{n(w,j)}}^T v_{w_I}) <br />
</math><br />
<br />
where<br />
<br />
<math><br />
\sigma (x)=\frac{1}{1+exp(-x)}<br />
</math><br />
<br />
In this paper, a binary Huffman tree is used as the structure for the hierarchical softmax because it assigns short codes to the frequent words which results in fast training. It has been observed before that grouping words together by their frequency works well as a very simple speedup technique for the neural network based language models <ref name=MiT></ref><ref name=MiT2><br />
Mikolov, Tomas, ''et al'' [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5947611"Extensions of recurrent neural network language model."] in Acoustics, Speech and Signal Processing (ICASSP), (2011).<br />
</ref>.<br />
<br />
== Negative Sampling==<br />
<br />
Noise Contrastive Estimation (NCE) is an alternative to the hierarchical softmax. NCE posits that a good model should be able to differentiate data from noise by means of logistic regression. While NCE can be shown to approximately maximize the log probability of the softmax, the Skip-gram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality. Negative sampling (NEG) is defined by the objective:<br />
<br />
<math><br />
log \sigma ({v'_{w_O}}^T v_{w_I})+\sum_{i=1}^{k} \mathbb{E}_{w_i\sim P_n(w)}[log \sigma (-{v'_{w_i}}^T v_{w_I})]<br />
</math><br />
<br />
The main difference between the Negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples. And while NCE approximately maximizes the log probability of the softmax, this property is not important for our application.<br /><br />
Both NCE and NEG have the noise distribution <math>P_n(w)</math> as a free parameter. We investigated a number of choices for <math>P_n(w)</math> and found that the unigram distribution <math>U(w)</math> raised to the 3/4 power (i.e., <math>U(w)^{3/4}/Z</math>) significantly outperformed the unigram and the uniform distributions, for both NCE and NEG, on every task we tried, including language modeling.<br />
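Drawing the <math>k</math> negatives from this noise distribution can be sketched as follows (toy counts; the helper names are my own):<br />

```python
import random

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to the 3/4 power, renormalized to sum to 1."""
    weights = {w: c ** power for w, c in counts.items()}
    z = sum(weights.values())
    return {w: wt / z for w, wt in weights.items()}

def draw_negatives(dist, k, rng=random.Random(0)):
    """Sample k negative words (with replacement) from the noise distribution."""
    words = list(dist)
    return rng.choices(words, weights=[dist[w] for w in words], k=k)

counts = {"the": 100000, "cat": 500, "aardvark": 5}
pn = noise_distribution(counts)
negatives = draw_negatives(pn, k=5)
# The 3/4 power flattens the distribution: "the" remains the likeliest
# negative, but it is far less dominant than under the raw unigram counts.
```

This flattening is why the 3/4 power helps: very frequent words are still sampled often enough to get good vectors, while rare words are no longer almost never chosen as negatives.<br />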
<br />
=References=<br />
<references /></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Recur-auto.png&diff=27300File:Recur-auto.png2015-12-14T19:20:43Z<p>Lruan: </p>
<hr />
<div></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=distributed_Representations_of_Words_and_Phrases_and_their_Compositionality&diff=27299distributed Representations of Words and Phrases and their Compositionality2015-12-14T19:19:54Z<p>Lruan: </p>
<hr />
<div>= Introduction =<br />
<br />
This paper<ref><br />
Mikolov, Tomas, et al. [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf "Distributed representations of words and phrases and their compositionality."] Advances in neural information processing systems. 2013.<br />
</ref> presents several extensions of the Skip-gram model introduced by Mikolov et al. <ref name=MiT> Mikolov, Tomas, ''et al'' [http://arxiv.org/pdf/1301.3781v3.pdf "Efficient Estimation of Word Representations in Vector Space"] in ICLR Workshop, (2013). </ref>. The Skip-gram model is an efficient method for learning high-quality vector representations of words from large amounts of unstructured text data. The word representations computed using this model are very interesting because the learned vectors explicitly encode many linguistic regularities and patterns. Somewhat surprisingly, many of these patterns can be represented as linear translations. For example, the result of the vector calculation vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector. The authors of this paper show that subsampling of frequent words during training results in a significant speedup and improves the accuracy of the representations of less frequent words. In addition, a simplified variant of Noise Contrastive Estimation (NCE) <ref name=GuM><br />
Gutmann, Michael U, ''et al'' [http://www.cs.helsinki.fi/u/ahyvarin/papers/Gutmann12JMLR.pdf "Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics"] in The Journal of Machine Learning Research, (2012).<br />
</ref> for training the Skip-gram model is presented, which results in faster training and better vector representations for frequent words, compared to the more complex hierarchical softmax that was used in the prior work <ref name=MiT></ref>. It also shows that a non-obvious degree of language understanding can be obtained by using basic mathematical operations on the word vector representations. For example, vec(“Russia”) + vec(“river”) is close to vec(“Volga River”), and vec(“Germany”) + vec(“capital”) is close to vec(“Berlin”).<br />
<br />
= The Skip-gram Model =<br />
<br />
The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document. More formally, given a sequence of training words <math>w_1, w_2,..., w_T</math> the objective of the Skip-gram model is to maximize the average log probability:<br />
<br />
<math><br />
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c\leq j\leq c} log(p(w_{t+j}|w_t))<br />
</math><br />
<br /><br />
<br /><br />
where <math>c</math> is the size of the training context (which can be a function of the center word <math>w_t</math>) and <math>p(w_{t+j}|w_t)</math> is defined using softmax function:<br />
<br />
<math><br />
p(w_O|w_I) = \frac{\exp ({v'_{w_O}}^T v_{w_I})}{\sum_{w=1}^{W} \exp ({v'_{w}}^T v_{w_I})}<br />
</math><br />
<br />
Here, <math>v_w</math> and <math>v'_w</math> are the “''input''” and “''output''” vector representations of <math>w</math>, and <math>W</math> is the number of words in the vocabulary.<br />
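The full softmax can be sketched in a few lines. This is a minimal illustration only, using tiny made-up 2-dimensional vectors rather than trained embeddings:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax_prob(v_in, out_vectors, target):
    # Full softmax p(w_O | w_I): normalize exp(v'_w . v_{w_I})
    # over the entire vocabulary of "output" vectors.
    scores = {w: math.exp(dot(v, v_in)) for w, v in out_vectors.items()}
    return scores[target] / sum(scores.values())

# Toy "output" vectors (made-up numbers, illustration only).
out = {"paris": [1.0, 0.2], "berlin": [0.9, 0.1], "dog": [-1.0, 0.5]}
p = softmax_prob([1.0, 0.0], out, "paris")
```

The cost of the denominator grows linearly with the vocabulary size <math>W</math>, which is exactly what the hierarchical softmax and negative sampling below avoid.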
<br />
== Hierarchical Softmax ==<br />
<br />
Hierarchical Softmax is a computationally efficient approximation of the full softmax <ref name=MoF><br />
Morin, Frederic, ''et al'' [http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf "Hierarchical probabilistic neural network language model"] in Proceedings of the International Workshop on Artificial Intelligence and Statistics, (2005).<br />
</ref>. The hierarchical softmax evaluates only about <math>\log_2(W)</math> output nodes, instead of all <math>W</math> nodes, to obtain the probability distribution.<br />
<br />
The hierarchical softmax uses a binary tree representation of the output layer with the <math>W</math> words as its leaves and, for each node, explicitly represents the relative probabilities of its child nodes. These define a random walk that assigns probabilities to words.<br />
<br />
Let <math>n(w,j)</math> be the <math>j^{th}</math> node on the path from the root to <math>w</math>, and let <math>L(w)</math> be the length of this path, so <math>n(w,1) = root</math> and <math>n(w,L(w)) = w</math>. In addition, for any inner node <math>n</math>, let <math>ch(n)</math> be an arbitrary fixed child of <math>n</math> and let <math>[[x]]</math> be 1 if <math>x</math> is true and -1 otherwise. Then the hierarchical softmax defines <math>p(w_O|w_I )</math> as follows:<br />
<br />
<math><br />
p(w|w_I) = \prod_{j=1}^{L(w)-1} \sigma ([[n(w,j+1)=ch(n(w,j))]]\,{v'_{n(w,j)}}^T v_{w_I}) <br />
</math><br />
<br />
where<br />
<br />
<math><br />
\sigma (x)=\frac{1}{1+exp(-x)}<br />
</math><br />
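The path probability above can be sketched directly: each inner node contributes one sigmoid, with the sign chosen by which child the path takes. A toy two-leaf tree (one inner node, invented vectors) is enough to see that the leaf probabilities sum to one, since <math>\sigma(x)+\sigma(-x)=1</math>:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hs_prob(path, v_in):
    # p(w | w_I) as a product of sigmoids along the root-to-leaf path.
    # path: list of (inner_node_vector, sign) pairs, sign = +1 if the
    # path goes to the fixed child ch(n) and -1 otherwise.
    p = 1.0
    for v_node, sign in path:
        p *= sigmoid(sign * dot(v_node, v_in))
    return p

# Two-leaf toy tree: the root is the only inner node.
v_in = [0.5, -0.2]
root = [1.0, 1.0]
p_left = hs_prob([(root, +1)], v_in)
p_right = hs_prob([(root, -1)], v_in)
```

Only the <math>L(w)-1 \approx \log_2(W)</math> nodes on the path are touched, rather than all <math>W</math> output vectors.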
<br />
In this paper, a binary Huffman tree is used as the structure for the hierarchical softmax because it assigns short codes to the frequent words which results in fast training. It has been observed before that grouping words together by their frequency works well as a very simple speedup technique for the neural network based language models <ref name=MiT></ref><ref name=MiT2><br />
Mikolov, Tomas, ''et al'' [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5947611 "Extensions of recurrent neural network language model"] in Acoustics, Speech and Signal Processing (ICASSP), (2011).<br />
</ref>.<br />
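The Huffman construction itself is the classic greedy merge of the two least-frequent groups; a minimal sketch (toy frequencies, illustration only) shows why frequent words end up with short codes and thus short paths:

```python
import heapq

def huffman_codes(freqs):
    # Binary Huffman codes over word frequencies: repeatedly merge the
    # two least-frequent groups, prepending one bit to each member.
    heap = [(f, i, [w]) for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    code = {w: "" for w in freqs}
    counter = len(heap)  # unique tiebreaker so lists are never compared
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        for w in left:
            code[w] = "0" + code[w]
        for w in right:
            code[w] = "1" + code[w]
        heapq.heappush(heap, (f1 + f2, counter, left + right))
        counter += 1
    return code

codes = huffman_codes({"the": 100, "of": 60, "volga": 2, "zymurgy": 1})
```

Here the very frequent "the" receives a one-bit code while the rare "zymurgy" receives a three-bit code, so the expected path length during training is dominated by the short paths of frequent words.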
<br />
== Negative Sampling==<br />
<br />
Noise Contrastive Estimation (NCE) is an alternative to the hierarchical softmax. NCE indicates that a good model should be able to differentiate data from noise by means of logistic regression. While NCE can be shown to approximately maximize the log probability of the softmax, the Skipgram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality. Negative sampling (NEG) is defined by the objective:<br />
<br />
<math><br />
\log \sigma ({v'_{w_O}}^T v_{w_I})+\sum_{i=1}^{k} \mathbb{E}_{w_i\sim P_n(w)}\left[\log \sigma (-{v'_{w_i}}^T v_{w_I})\right]<br />
</math><br />
<br />
The main difference between the Negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples. And while NCE approximately maximizes the log probability of the softmax, this property is not important for our application.<br /><br />
Both NCE and NEG have the noise distribution <math>P_n(w)</math> as a free parameter. We investigated a number of choices for <math>P_n(w)</math> and found that the unigram distribution <math>U(w)</math> raised to the 3/4 power (i.e., <math>U(w)^{3/4}/Z</math>) significantly outperformed both the unigram and the uniform distributions, for both NCE and NEG, on every task we tried, including language modeling.<br />
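A minimal sketch of the NEG objective for a single (input word, context word) pair, with negatives drawn from the unigram distribution raised to the 3/4 power (toy vectors and counts, illustration only):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def neg_objective(v_in, v_out_pos, out_vectors, counts, k, rng):
    # log sigma(v'_O . v_I) + sum over k negatives of log sigma(-v'_i . v_I),
    # with negatives sampled from P_n(w) proportional to count(w)^(3/4).
    words = list(counts)
    weights = [counts[w] ** 0.75 for w in words]
    obj = math.log(sigmoid(dot(v_out_pos, v_in)))
    for w in rng.choices(words, weights=weights, k=k):
        obj += math.log(sigmoid(-dot(out_vectors[w], v_in)))
    return obj

rng = random.Random(0)
out = {"a": [0.1, 0.3], "b": [-0.2, 0.4], "c": [0.0, -0.1]}
val = neg_objective([0.5, 0.5], out["a"], out,
                    {"a": 50, "b": 30, "c": 5}, k=2, rng=rng)
```

Maximizing this objective pushes the positive pair's dot product up and the sampled negatives' dot products down; each update touches only <math>k+1</math> output vectors instead of all <math>W</math>.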
<br />
==Subsampling of Frequent Words==<br />
In very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g., “in”, “the”, and “a”). Such words usually provide less information about the surrounding words than rarer words do (e.g., “the” provides little information about the next word because it co-occurs with a huge number of words), and the representations of frequent words are unlikely to change significantly after many iterations. <br />
<br />
To counter the imbalance between the rare and frequent words, a simple subsampling approach is used. Each word <math>w_i</math> in the training set is discarded with probability computed by the formula:<br />
<br />
<math><br />
P(w_i)=1-\sqrt{\frac{t}{f(w_i)}}<br />
</math><br />
<br />
where <math>f(w_i)</math> is the frequency of word <math>w_i</math> and <math>t</math> is a chosen threshold, typically around <math>10^{−5}</math>.<br />
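This discard rule is a one-liner; the sketch below (hypothetical frequencies, illustration only) clips the probability to <math>[0,1]</math> since words with <math>f(w_i) < t</math> should never be discarded:

```python
import math
import random

def discard_prob(freq, t=1e-5):
    # P(w_i) = 1 - sqrt(t / f(w_i)), clipped to [0, 1];
    # freq is the word's relative frequency in the corpus.
    return max(0.0, 1.0 - math.sqrt(t / freq))

def subsample(tokens, freqs, t=1e-5, rng=random):
    return [w for w in tokens if rng.random() >= discard_prob(freqs[w], t)]

# A word with frequency 0.05 ("the") is dropped ~98.6% of the time;
# a word at the threshold frequency t is never dropped.
p_the = discard_prob(0.05)
p_rare = discard_prob(1e-5)
```

Aggressively thinning the most frequent words both speeds up training and effectively widens the context window for the remaining words.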
<br />
= Empirical Results=<br />
<br />
The Hierarchical Softmax (HS), Noise Contrastive Estimation, Negative Sampling, and subsampling of the training words are evaluated with the help of the analogical reasoning task <ref name=MiT></ref>. The task consists of analogies such as “Germany” : “Berlin” :: “France” : ?, which are solved by finding a vector ''x'' such that vec(''x'') is closest to vec(“Berlin”) - vec(“Germany”) + vec(“France”) according to the cosine distance. This specific example is considered to have been answered correctly if ''x'' is “Paris”. The task has two broad categories: the syntactic analogies (such as “quick” : “quickly” :: “slow” : “slowly”) and the semantic analogies, such as the country to capital city relationship.<br />
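The analogy-solving procedure can be sketched directly. The tiny hand-made vectors below are illustrative only, not trained embeddings; with real embeddings the same cosine-nearest-neighbour search is used:

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u)) *
           math.sqrt(sum(b * b for b in v)))
    return num / den

def solve_analogy(a, b, c, vectors):
    # a : b :: c : ? -- the word nearest to vec(b) - vec(a) + vec(c)
    # by cosine similarity, excluding the three query words.
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

vecs = {
    "germany": [1.0, 0.0], "berlin": [1.0, 1.0],
    "france":  [0.0, 0.2], "paris":  [0.0, 1.2],
    "dog":     [-1.0, -1.0],
}
answer = solve_analogy("germany", "berlin", "france", vecs)
```

Excluding the query words themselves is important in practice, since vec(“France”) is often the nearest vector to the target.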
<br />
For training the Skip-gram models, a large dataset consisting of various news articles is used (an internal Google dataset with one billion words). All words that occurred less than 5 times in the training data were discarded, which resulted in a vocabulary of size 692K. The performance of various Skip-gram models on the word analogy test set is reported in Table 1. The table shows that Negative Sampling outperforms the Hierarchical Softmax on the analogical reasoning task, and has even slightly better performance than the Noise Contrastive Estimation. The subsampling of the frequent words improves the training speed several times and makes the word representations significantly more accurate.<br />
<br />
<center><br />
[[File:Tb_1.PNG | frame | center |Table 1. Accuracy of various Skip-gram 300-dimensional models on the analogical reasoning task as defined in <ref name=MiT></ref>. NEG-''k'' stands for Negative Sampling with ''k'' negative samples for each positive sample; NCE stands for Noise Contrastive Estimation and HS-Huffman stands for the Hierarchical Softmax with the frequency-based Huffman codes. ]]<br />
</center><br />
<br />
=Learning Phrases=<br />
<br />
Many phrases have a meaning that is not a simple composition of the meanings of their individual words. To learn vector representations for phrases, we first find words that appear frequently together, and infrequently in other contexts. For example, “''New York Times''” and “''Toronto Maple Leafs''” are replaced by unique tokens in the training data, while a bigram like “''this is''” remains unchanged. This way, many reasonable phrases can be formed without greatly increasing the size of the vocabulary; in theory, the Skip-gram model could be trained on all n-grams, but that would be too memory intensive. A simple data-driven approach, where phrases are formed based on the unigram and bigram counts, is applied to identify the phrases. In this approach, a ''score'' is calculated as:<br />
<br />
<math><br />
score(w_i,w_j)=\frac{count(w_iw_j)-\delta}{count(w_i)count(w_j)}<br />
</math><br />
<br />
Here <math>\delta</math> is a discounting coefficient that prevents too many phrases consisting of very infrequent words from being formed. The bigrams with ''scores'' above the chosen threshold are then used as phrases. The quality of the phrase representations is evaluated using a new analogical reasoning task that involves phrases. Table 2 shows examples of the five categories of analogies used in this task.<br />
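The scoring pass is a straightforward count-and-divide over consecutive token pairs; a minimal sketch on a toy corpus (invented counts, illustration only):

```python
from collections import Counter

def phrase_scores(tokens, delta):
    # score(w_i, w_j) = (count(w_i w_j) - delta) / (count(w_i) * count(w_j))
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    return {pair: (c - delta) / (uni[pair[0]] * uni[pair[1]])
            for pair, c in bi.items()}

corpus = ("new york " * 8 + "new day old york").split()
scores = phrase_scores(corpus, delta=2.0)
```

"new york" scores well above zero because the pair co-occurs far more often than chance, while "new day" scores negative once the discount <math>\delta</math> is subtracted; only bigrams above the threshold are merged into single tokens. The paper runs this pass repeatedly with decreasing thresholds so that phrases longer than two words can form.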
<br />
<center><br />
[[File:Tb_2.PNG | frame | center |Table 2. Examples of the analogical reasoning task for phrases (the full test set has 3218 examples). The goal is to compute the fourth phrase using the first three. Our best model achieved an accuracy of 72% on this dataset.]]<br />
</center><br />
<br />
==Phrase Skip-Gram Results==<br />
<br />
First, the phrase-based training corpus is constructed, and then Skip-gram models are trained using different hyperparameters. Table 3 shows the results using vector dimensionality 300 and context size 5. This setting already achieves good performance on the phrase dataset, and allowed a quick comparison of Negative Sampling and the Hierarchical Softmax, both with and without subsampling of the frequent tokens. The results show that while Negative Sampling achieves a respectable accuracy even with ''k = 5'', using ''k = 15'' achieves considerably better performance. Also, subsampling can result in faster training and can improve accuracy, at least in some cases.<br />
<br />
<center><br />
[[File:Tb_3.PNG | frame | center |Table 3. Accuracies of the Skip-gram models on the phrase analogy dataset. The models were trained on approximately one billion words from the news dataset.]]<br />
</center><br />
<br />
<br />
The amount of training data was increased to 33 billion words in order to maximize the accuracy on the phrase analogy task. Hierarchical softmax, a dimensionality of 1000, and the entire sentence as the context were used. This resulted in a model that reached an accuracy of 72%. Reducing the size of the training dataset to 6 billion words lowered the accuracy to 66%, which suggests that a large amount of training data is crucial. To gain further insight into how different the representations learned by different models are, nearest neighbors of infrequent phrases were inspected manually using various models. Table 4 shows a sample of such a comparison. Consistent with the previous results, it seems that the best representations of phrases are learned by a model with the hierarchical softmax and subsampling.<br />
<br />
<center><br />
[[File:Tb_4.PNG | frame | center |Table 4. Examples of the closest entities to the given short phrases, using two different models.]]<br />
</center><br />
<br />
=Additive Compositionality=<br />
<br />
The word and phrase representations learned by the Skip-gram model exhibit a linear structure that makes it possible to perform precise analogical reasoning using simple vector arithmetics. Also, the Skip-gram representations exhibit another kind of linear structure that makes it possible to meaningfully combine words by an element-wise addition of their vector representations. This phenomenon is illustrated in Table 5. The additive property of the vectors can be explained by inspecting the training objective. The word vectors are in a linear relationship with the inputs to the softmax nonlinearity. As the word vectors are trained to predict the surrounding words in the sentence, the vectors can be seen as representing the distribution of the context in which a word appears. These values are related logarithmically to the probabilities computed by the output layer, so the sum of two word vectors is related to the product of the two context distributions. The product works here as the AND function: words that are assigned high probabilities by both word vectors will have high probability, and the other words will have low probability.<br />
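The element-wise addition can be sketched in a few lines. The vectors below are tiny hand-made stand-ins (not trained embeddings) chosen so that the sum of "russia" and "river" lands near "volga_river":

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def compose(a, b, vectors):
    # Element-wise sum of two normalized word vectors; its nearest
    # neighbours combine the context distributions of both words.
    va, vb = normalize(vectors[a]), normalize(vectors[b])
    return [x + y for x, y in zip(va, vb)]

vecs = {"russia": [1.0, 0.0], "river": [0.0, 1.0],
        "volga_river": [0.9, 1.1], "berlin": [-1.0, 0.2]}
s = compose("russia", "river", vecs)
nearest = max((w for w in vecs if w not in ("russia", "river")),
              key=lambda w: cosine(vecs[w], s))
```

The sum acts like the AND of the two context distributions, so the nearest neighbour is a token whose contexts overlap with both summands.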
<br />
<center><br />
[[File:Tb_5.PNG | frame | center |Table 5. Vector compositionality using element-wise addition. Four closest tokens to the sum of two vectors are shown, using the best Skip-gram model.]]<br />
</center><br />
<br />
=Comparison to Published Word Representations=<br />
<br />
Table 6 shows the empirical comparison between different neural network-based representations of words by showing the nearest neighbors of infrequent words. These examples show that the big Skip-gram model trained on a large corpus visibly outperforms all the other models in the quality of the learned representations. This can be attributed in part to the fact that this model has been trained on about 30 billion words, which is about two to three orders of magnitude more data than the typical size used in the prior work. Interestingly, although the training set is much larger, the training time of the Skip-gram model is just a fraction of the time required by the previous model architectures.<br />
<br />
<center><br />
[[File:Tb_6.PNG | frame | center |Table 6. Examples of the closest tokens given various well-known models and the Skip-gram model trained on phrases using over 30 billion training words. An empty cell means that the word was not in the vocabulary.]]<br />
</center><br />
<br />
=Conclusion=<br />
<br />
This work has the following key contributions:<br />
<br />
1. It shows how to train distributed representations of words and phrases with the Skip-gram model and demonstrates that these representations exhibit a linear structure that makes precise analogical reasoning possible.<br />
<br />
2. The computationally efficient model architecture makes it possible to successfully train models on several orders of magnitude more data than previously published models.<br />
<br />
3. It introduces the Negative Sampling algorithm, an extremely simple training method that learns accurate representations, especially for frequent words.<br />
<br />
4. The choice of the training algorithm and the hyper-parameter selection is a task-specific decision. It is shown that the most crucial decisions affecting performance are the choice of the model architecture, the size of the vectors, the subsampling rate, and the size of the training window.<br />
<br />
5. The word vectors can be meaningfully combined using just simple vector addition. Another approach for learning representations of phrases presented in this paper is to simply represent the phrases with a single token. Combining these two approaches gives a powerful yet simple way to represent longer pieces of text with minimal computational complexity.<br />
<br />
= Recursive Autoencoder =<br />
<br />
This section is based on the paper “Semi-supervised Recursive Autoencoders for Predicting Sentiment Distributions”.<ref> Socher, Richard, ''et al'' [http://www.socher.org/uploads/Main/SocherPenningtonHuangNgManning_EMNLP2011.pdf "Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions"] in EMNLP, (2011). </ref><br />
=== Other techniques for sentence representation ===<br />
<br />
The idea of the Recursive Autoencoder is summarized in the figure below, which illustrates a recursive autoencoder applied to a binary tree.<br />
<br />
<br />
Given a list of word vectors <math> x = (x_1, ..., x_m)</math>, the binary tree defines triplets of parents with children: <math> (y_1 \rightarrow x_3x_4), (y_2 \rightarrow x_2y_1), (y_3 \rightarrow x_1y_2) </math>.<br />
<br />
The first parent <math> y_1 </math> is computed from the children <math> (c_1, c_2) = (x_3, x_4)</math> as <math> p=f(W^{(1)}[c_1; c_2] + b^{(1)})</math>, where <math>W^{(1)}</math> is a parameter matrix and <math>b^{(1)}</math> is a bias term. <br />
<br />
The autoencoder then reconstructs the children as <math> [c'_1; c'_2] = W^{(2)}p + b^{(2)}</math>. The objective is to minimize the mean squared error between the original children and the reconstructed children.<br />
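A single encode-reconstruct step can be sketched as follows, using tanh for the nonlinearity <math>f</math> and tiny invented weights (the dimensions and numbers are illustrative only, not from the paper):

```python
import math

def tanh_vec(v):
    return [math.tanh(x) for x in v]

def affine(W, x, b):
    # Matrix-vector product plus bias, with W as a list of rows.
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def rae_node(c1, c2, W1, b1, W2, b2):
    # One recursive-autoencoder step: encode the children [c1; c2]
    # into a parent p, reconstruct [c1'; c2'], and measure the
    # squared reconstruction error.
    x = c1 + c2                        # concatenation [c1; c2]
    p = tanh_vec(affine(W1, x, b1))    # parent representation
    recon = affine(W2, p, b2)          # reconstruction [c1'; c2']
    err = sum((a - b) ** 2 for a, b in zip(x, recon))
    return p, err

c1, c2 = [0.1, 0.2], [0.3, -0.1]
W1 = [[0.1] * 4, [0.05] * 4]; b1 = [0.0, 0.0]
W2 = [[0.2, 0.1], [0.1, 0.2], [0.0, 0.1], [0.1, 0.0]]; b2 = [0.0] * 4
p, err = rae_node(c1, c2, W1, b1, W2, b2)
```

Applying this step bottom-up along the tree (<math>y_1</math>, then <math>y_2</math>, then <math>y_3</math>) yields one vector per internal node, and training minimizes the summed reconstruction errors.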
<br />
=Resources=<br />
<br />
The code for training the word and phrase vectors based on this paper is available in the open source project [https://code.google.com/p/word2vec/ word2vec]. This project also contains a set of pre-trained 300-dimensional vectors for 3 million words and phrases.<br />
<br />
=References=<br />
<references /></div>
<hr />
<div>= Introduction =<br />
<br />
This paper<ref><br />
Mikolov, Tomas, et al. [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf "Distributed representations of words and phrases and their compositionality."] Advances in neural information processing systems. 2013.<br />
</ref> presents several extensions of the Skip-gram model introduced by Mikolov et al. <ref name=MiT> Mikolov, Tomas, ''et al'' [http://arxiv.org/pdf/1301.3781v3.pdf"Efficient Estimation of Word Representations in Vector Space"] in ICLR Workshop, (2013). </ref>. The Skip-gram model is an efficient method for learning high-quality vector representations of words from large amounts of unstructured text data. The word representations computed using this model are very interesting because the learned vectors explicitly encode many linguistic regularities and patterns. Somewhat surprisingly, many of these patterns can be represented as linear translations. For example, the result of a vector calculation vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector. The authors of this paper show that subsampling of frequent words during training results in a significant speedup and improves accuracy of the representations of less frequent words. In addition, a simplified variant of Noise Contrastive Estimation (NCE) <ref name=GuM><br />
Gutmann, Michael U, ''et al'' [http://www.cs.helsinki.fi/u/ahyvarin/papers/Gutmann12JMLR.pdf"Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics"] in The Journal ofMachine Learning Research, (2012).<br />
</ref>. for training the Skip-gram model is presented that results in faster training and better vector representations for frequent words, compared to more complex hierarchical softmax that was used in the prior work <ref name=MiT></ref>. It also shows that a non-obvious degree of language understanding can be obtained by using basic mathematical operations on the word vector representations. For example, vec(“Russia”) + vec(“river”) is close to vec(“Volga River”), and vec(“Germany”) + vec(“capital”) is close to vec(“Berlin”).<br />
<br />
= The Skip-gram Model =<br />
<br />
The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document. More formally, given a sequence of training words <math>w_1, w_2,..., w_T</math> the objective of the Skip-gram model is to maximize the average log probability:<br />
<br />
<math><br />
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c\leq j\leq c} log(p(w_{t+j}|w_t))<br />
</math><br />
<br /><br />
<br /><br />
where <math>c</math> is the size of the training context (which can be a function of the center word <math>w_t</math>) and <math>p(w_{t+j}|w_t)</math> is defined using softmax function:<br />
<br />
<math><br />
p(w_O|w_I) = \frac{exp ({v'_{W_O}}^T v_{W_I})}{\sum{w=1}^{W} exp ({v'_{W}}^T v_{W_I})}<br />
</math><br />
<br />
Here, <math>v_w</math> and <math>v'_w</math> are the “''input''” and “''output''” vector representations of <math>w</math>, and <math>W</math> is the number of words in the vocabulary.<br />
<br />
== Hierarchical Softmax ==<br />
<br />
Hierarchical Softmax is a computationally efficient approximation of the full softmax <ref name=MoF><br />
Morin, Frederic, ''et al'' [http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf"Hierarchical probabilistic neural network language model"] in Proceedings of the international workshop on artificial intelligence and statistics, (2015).<br />
</ref>. Hierarchical Softmax evaluate only about <math>log_2(W)</math> output nodes instead of evaluating <math>W</math> nodes in the neural network to obtain the probability distribution.<br />
<br />
The hierarchical softmax uses a binary tree representation of the output layer with the <math>W</math> words as its leaves and, for each node, explicitly represents the relative probabilities of its child nodes. These define a random walk that assigns probabilities to words.<br />
<br />
Let <math>n(w,j)</math> be the <math>j^{th}</math> node on the path from the root to <math>w</math>, and let <math>L(w)</math> be the length of this path, so <math>n(w,1) = root</math> and <math>n(w,L(w)) = w</math>. In addition, for any inner node <math>n</math>, let <math>ch(n)</math> be an arbitrary fixed child of <math>n</math> and let <math>[[x]]</math> be 1 if <math>x</math> is true and -1 otherwise. Then the hierarchical softmax defines <math>p(w_O|w_I )</math> as follows:<br />
<br />
<math><br />
p(w|w_I) = \prod_{j=1}^{L(w)-1} \sigma ([[n(w,j+1)=ch(n(w,j))]]{v'_{n(w,j)}}^T v_{W_I}) <br />
</math><br />
<br />
where<br />
<br />
<math><br />
\sigma (x)=\frac{1}{1+exp(-x)}<br />
</math><br />
<br />
In this paper, a binary Huffman tree is used as the structure for the hierarchical softmax because it assigns short codes to the frequent words which results in fast training. It has been observed before that grouping words together by their frequency works well as a very simple speedup technique for the neural network based language models <ref name=MiT></ref><ref name=MiT2><br />
Mikolov, Tomas, ''et al'' [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5947611"Extensions of recurrent neural network language model."] in Acoustics, Speech and Signal Processing (ICASSP), (2011).<br />
</ref>.<br />
<br />
== Negative Sampling==<br />
<br />
Noise Contrastive Estimation (NCE) is an alternative to the hierarchical softmax. NCE indicates that a good model should be able to differentiate data from noise by means of logistic regression. While NCE can be shown to approximately maximize the log probability of the softmax, the Skipgram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality. Negative sampling (NEG) is defined by the objective:<br />
<br />
<math><br />
log \sigma ({v'_{W_O}}^T v_{W_I})+\sum_{i=1}^{k} \mathbb{E}_{w_i\sim P_n(w)}[log \sigma ({-v'_{W_i}}^T v_{W_I})]<br />
</math><br />
<br />
The main difference between the Negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples. And while NCE approximately maximizes the log probability of the softmax, this property is not important for our application.<br /><br />
Both NCE and NEG have the noise distribution <math>P_n(w)</math> as a free parameter. We investigated a number of choices for <math>P_n(w)</math> and found that the unigram distribution <math>U(w)</math> raised to the 3/4rd power (i.e., <math>U(w)^{3/4}/Z)</math> outperformed significantly the unigram and the uniform distributions, for both NCE and NEG on every task we tried including language modeling.<br />
<br />
==Subsampling of Frequent Words==<br />
In very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g., “in”, “the”, and “a”). Such words usually provide less information about the surrounding words that rarer words (i.e., "the" provides little information about the next word because it co-occurs with a huge number of words), and the representation of the frequent word will be unlikely to change significantly after many iterations. <br />
<br />
To counter the imbalance between the rare and frequent words, a simple subsampling approach is used. Each word <math>w_i</math> in the training set is discarded with probability computed by the formula:<br />
<br />
<math><br />
P(w_i)=1-\sqrt{\frac{1}{f(w_i)}}<br />
</math><br />
<br />
where <math>f(w_i)</math> is the frequency of word <math>w_i</math> and <math>t</math> is a chosen threshold, typically around <math>10^{−5}</math>.<br />
<br />
= Empirical Results=<br />
<br />
The Hierarchical Softmax (HS), Noise Contrastive Estimation, Negative Sampling, and subsampling of the training words are evaluated with the help of the analogical reasoning task1 <ref name=MiT></ref>. The task consists of analogies such as “Germany” : “Berlin” :: “France” : ?, which are solved by finding a vector ''x'' such that vec(''x'') is closest to vec(“Berlin”) - vec(“Germany”) + vec(“France”) according to the cosine distance. This specific example is considered to have been answered correctly if ''x'' is “Paris”. The task has two broad categories: the syntactic analogies (such as “quick” : “quickly” :: “slow” : “slowly”) and the semantic analogies, such as the country to capital city relationship.<br />
<br />
For training the Skip-gram models, a large dataset consisting of various news articles is used (an internal Google dataset with one billion words). All words that occurred less than 5 times in the training data were discarded, which resulted in a vocabulary of size 692K. The performance of various Skip-gram models on the word analogy test set is reported in Table 1. The table shows that Negative Sampling outperforms the Hierarchical Softmax on the analogical reasoning task, and has even slightly better performance than the Noise Contrastive Estimation. The subsampling of the frequent words improves the training speed several times and makes the word representations significantly more accurate.<br />
<br />
<center><br />
[[File:Tb_1.PNG | frame | center |Table 1. Accuracy of various Skip-gram 300-dimensional models on the analogical reasoning task as defined in <ref name=MiT></ref>. NEG-''k'' stands for Negative Sampling with ''k'' negative samples for each positive sample; NCE stands for Noise Contrastive Estimation and HS-Huffman stands for the Hierarchical Softmax with the frequency-based Huffman codes. ]]<br />
</center><br />
<br />
=Learning Phrases=<br />
<br />
Many phrases have a meaning that is not a simple composition of the meanings of its individual words. To learn vector representation for phrases, we first find words that appear frequently together, and infrequently in other contexts. For example, “''New York Times''” and “''Toronto Maple Leafs''” are replaced by unique tokens in the training data, while a bigram “''this is''” will remain unchanged. This way, we can form many reasonable phrases without greatly increasing the size of the vocabulary; in theory, we can train the Skip-gram model using all n-grams, but that would be too memory intensive. A simple data-driven approach, where phrases are formed based on the unigram and bigram counts is applied to identify the phrases. In this approach, a ''score'' is calculated as:<br />
<br />
<math><br />
score(w_i,w_j)=\frac{count(w_iw_j)-\delta}{count(w_i)count(w_j)}<br />
</math><br />
<br />
The <math>\delta</math> is used as a discounting coefficient and prevents too many phrases consisting of very infrequent words to be formed. The bigrams with ''scores'' above the chosen threshold are then used as phrases. The quality of the phrase representations is evaluated using a new analogical reasoning task that involves phrases. Table 2 shows examples of the five categories of analogies used in this task.<br />
<br />
<center><br />
[[File:Tb_2.PNG | frame | center |Table 2. Examples of the analogical reasoning task for phrases (the full test set has 3218 examples). The goal is to compute the fourth phrase using the first three. Our best model achieved an accuracy of 72% on this dataset.]]<br />
</center><br />
<br />
==Phrase Skip-Gram Results==<br />
<br />
First, the phrase based training corpus is constructed and then Skip-gram models are trained using different hyperparameters. Table 3 shows the results using vector dimensionality 300 and context size 5. This setting already achieves good performance on the phrase dataset, and allowed us to quickly compare the Negative Sampling and the Hierarchical Softmax, both with and without subsampling of the frequent tokens. The results show that while Negative Sampling achieves a respectable accuracy even with ''k = 5'', using ''k = 15'' achieves considerably better performance. Also, the subsampling can result in faster training and can also improve accuracy, at least in some cases.<br />
<br />
<center><br />
[[File:Tb_3.PNG | frame | center |Table 3. Accuracies of the Skip-gram models on the phrase analogy dataset. The models were trained on approximately one billion words from the news dataset.]]<br />
</center><br />
<br />
<br />
The amount of the training data was increased to 33 billion words in order to maximize the accuracy on the phrase analogy task. Hierarchical softmax, dimensionality of 1000, and the entire sentence for the context were used. This resulted in a model that reached an accuracy of 72%. Reducing the size of the training dataset to 6 billion caused lower accuracy (66%), which suggests that large amount of the training data is crucial. To gain further insight into how different the representations learned by different models are, nearest neighbors of infrequent phrases were inspected manually using various models. In Table 4 shows a sample of such comparison. Consistently with the previous results, it seems that the best representations of phrases are learned by a model with the hierarchical softmax and subsampling.<br />
<br />
<center><br />
[[File:Tb_4.PNG | frame | center |Table 4. Examples of the closest entities to the given short phrases, using two different models.]]<br />
</center><br />
<br />
=Additive Compositionality=<br />
<br />
The word and phrase representations learned by the Skip-gram model exhibit a linear structure that makes it possible to perform precise analogical reasoning using simple vector arithmetics. Also, the Skip-gram representations exhibit another kind of linear structure that makes it possible to meaningfully combine words by an element-wise addition of their vector representations. This phenomenon is illustrated in Table 5. The additive property of the vectors can be explained by inspecting the training objective. The word vectors are in a linear relationship with the inputs to the softmax nonlinearity. As the word vectors are trained to predict the surrounding words in the sentence, the vectors can be seen as representing the distribution of the context in which a word appears. These values are related logarithmically to the probabilities computed by the output layer, so the sum of two word vectors is related to the product of the two context distributions. The product works here as the AND function: words that are assigned high probabilities by both word vectors will have high probability, and the other words will have low probability.<br />
<br />
<center><br />
[[File:Tb_5.PNG | frame | center |Table 5. Vector compositionality using element-wise addition. Four closest tokens to the sum of two vectors are shown, using the best Skip-gram model.]]<br />
</center><br />
<br />
=Comparison to Published Word Representations=<br />
<br />
Table 6 shows the empirical comparison between different neural network-based representations of words by showing the nearest neighbors of infrequent words. These examples show that the big Skip-gram model trained on a large corpus visibly outperforms all the other models in the quality of the learned representations. This can be attributed in part to the fact that this model has been trained on about 30 billion words, which is about two to three orders of magnitude more data than the typical size used in the prior work. Interestingly, although the training set is much larger, the training time of the Skip-gram model is just a fraction of the time complexity required by the previous model architectures.<br />
<br />
<center><br />
[[File:Tb_6.PNG | frame | center |Table 6. Examples of the closest tokens given various well-known models and the Skip-gram model trained on phrases using over 30 billion training words. An empty cell means that the word was not in the vocabulary.]]<br />
</center><br />
<br />
=Conclusion=<br />
<br />
This work has the following key contributions:<br />
<br />
1. This work shows how to train distributed representations of words and phrases with the Skip-gram model and demonstrates that these representations exhibit a linear structure that makes precise analogical reasoning possible.<br />
<br />
2. The Skip-gram model is computationally efficient, which makes it possible to successfully train models on several orders of magnitude more data than the previously published models.<br />
<br />
3. It introduces the negative sampling algorithm, an extremely simple training method that learns accurate representations, especially for frequent words.<br />
<br />
4. The choice of the training algorithm and the hyper-parameter selection is a task-specific decision. It is shown that the most crucial decisions affecting performance are the choice of the model architecture, the size of the vectors, the subsampling rate, and the size of the training window.<br />
<br />
5. The word vectors can be meaningfully combined using just simple vector addition. Another approach for learning representations of phrases presented in this paper is to simply represent the phrases with a single token. Combining these two approaches gives a powerful yet simple way to represent longer pieces of text while having minimal computational complexity.<br />
<br />
= Recursive Autoencoder = <ref> Socher, et al. [http://www.socher.org/uploads/Main/SocherPenningtonHuangNgManning_EMNLP2011.pdf] </ref><br />
=== Other techniques for sentence representation ===<br />
<br />
The idea of the Recursive Autoencoder is summarized in the figure below. The example illustrates the recursive autoencoder applied to a binary tree.<br />
<br />
<br />
Given a list of word vectors <math> x = (x_1, ..., x_m)</math>, the binary tree is specified by branching triplets of parents with their children: <math> (y_1 \rightarrow x_3x_4), (y_2 \rightarrow x_2y_1), (y_3 \rightarrow x_1y_2) </math>.<br />
<br />
The first parent <math> y_1 </math> is computed from the children <math> (c_1, c_2) = (x_3, x_4)</math> as <math> p=f(W^{(1)}[c_1; c_2] + b^{(1)})</math>, where <math>W^{(1)}</math> is a parameter matrix and <math>b^{(1)}</math> is a bias term. <br />
<br />
The autoencoder then reconstructs the children from the parent representation: <math> [c_1'; c_2'] = W^{(2)}p + b^{(2)}</math>. The objective of this method is to minimize the mean squared error between the original children and the reconstructed children.<br />
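One encoding/reconstruction step can be sketched in numpy as follows; the dimensionality, the tanh choice for <math>f</math>, and the random initialization are illustrative assumptions, not details fixed by the summary above:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                        # word-vector dimensionality (illustrative)

# Encoder/decoder parameters shared across all nodes of the tree.
W1 = rng.standard_normal((n, 2 * n)) * 0.1   # parent from [c1; c2]
b1 = np.zeros(n)
W2 = rng.standard_normal((2 * n, n)) * 0.1   # reconstruction of [c1; c2]
b2 = np.zeros(2 * n)

def rae_step(c1, c2):
    """Compute the parent p = f(W1 [c1; c2] + b1) and the reconstruction loss."""
    children = np.concatenate([c1, c2])
    p = np.tanh(W1 @ children + b1)               # parent representation
    recon = W2 @ p + b2                           # reconstructed [c1'; c2']
    loss = 0.5 * np.sum((recon - children) ** 2)  # MSE objective to minimize
    return p, loss

x3, x4 = rng.standard_normal(n), rng.standard_normal(n)
y1, loss = rae_step(x3, x4)   # first parent in the example tree
print(y1.shape, loss >= 0.0)
```

In a full recursive autoencoder the same `rae_step` is reapplied bottom-up, so <math>y_2</math> would be computed from <math>(x_2, y_1)</math>, and so on.<br />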
<br />
=Resources=<br />
<br />
The code for training the word and phrase vectors based on this paper is available in the open source project [https://code.google.com/p/word2vec/ word2vec]. This project also contains a set of pre-trained 300-dimensional vectors for 3 million words and phrases.<br />
<br />
=References=<br />
<references /></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations&diff=27228learning Phrase Representations2015-12-12T23:15:24Z<p>Lruan: /* Alternative Models */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, . . . , y_{T'} | x_1, . . . , x_T )</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes.<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state <math>h_t</math>. However, as shown in figure 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by, <br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,y_{t-1},\mathbf{c})</math> <br/><br />
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of the model parameters and each <math>(\mathbf{y}_n,\mathbf{x}_n)</math> is an (input sequence, output sequence) pair from the training set. In our case, as the output of the decoder, starting from the input, is differentiable, the model parameters can be estimated by a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, the model can be used in two ways. One way is to use the model to generate a target sequence given an input sequence. On the other hand, the model can be used to score a given pair of input and output sequences.<br />
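The encode-then-decode flow can be sketched as follows, with a plain tanh recurrence standing in for <math>f</math> (the paper's actual hidden unit is the gated one described in the next section, and all weights and sizes below are random stand-ins for illustration):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h = 3, 5   # symbol and hidden-state dimensionalities (illustrative)

Wx = rng.standard_normal((d_h, d_in)) * 0.1   # input weights
Wh = rng.standard_normal((d_h, d_h)) * 0.1    # recurrent weights (shared here for brevity)
Wy = rng.standard_normal((d_h, d_in)) * 0.1   # feeds y_{t-1} back into the decoder
Wc = rng.standard_normal((d_h, d_h)) * 0.1    # conditions the decoder on the summary c

def encode(xs):
    """h_t = f(h_{t-1}, x_t); the final hidden state is the summary c."""
    h = np.zeros(d_h)
    for x in xs:
        h = np.tanh(Wh @ h + Wx @ x)
    return h

def decode_step(h, y_prev, c):
    """h_t = f(h_{t-1}, y_{t-1}, c) on the decoder side."""
    return np.tanh(Wh @ h + Wy @ y_prev + Wc @ c)

xs = [rng.standard_normal(d_in) for _ in range(4)]   # a variable-length input
c = encode(xs)                                       # fixed-length summary
h1 = decode_step(np.zeros(d_h), np.zeros(d_in), c)   # first decoder state
print(c.shape, h1.shape)
```

Note that <math>\mathbf{c}</math> has the same fixed size regardless of the input length, which is exactly what lets the decoder condition every step on the whole source sequence.<br />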
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
This paper also proposes a new type of hidden unit that is inspired by the LSTM but is much simpler to compute and implement. Fig. 2 shows a graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h˜. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically, it can be written as follows (<math>\sigma</math> is the logistic sigmoid function, <math>[.]_j</math> denotes the j-th element of a vector, and <math>\odot</math> denotes element-wise multiplication):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus allowing a more compact representation. On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref><br />
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks[C]//Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013: 8624-8628.<br />
</ref><br />
<br />
Because each hidden unit has separate gates, it is possible for each hidden unit to learn to capture dependencies over different lengths of time (determined by the frequency at which its reset and update gates are active).<br />
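The three equations above translate directly into code; the sizes and random weights below are illustrative:<br />

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_step(h_prev, x, Wr, Ur, Wz, Uz, W, U):
    """One step of the proposed gated hidden unit."""
    r = sigmoid(Wr @ x + Ur @ h_prev)             # reset gate
    z = sigmoid(Wz @ x + Uz @ h_prev)             # update gate
    h_tilde = np.tanh(W @ x + U @ (r * h_prev))   # candidate state (r gates the past)
    return z * h_prev + (1.0 - z) * h_tilde       # interpolate old state and candidate

rng = np.random.default_rng(2)
d_in, d_h = 3, 4
Wr, Wz, W = (rng.standard_normal((d_h, d_in)) * 0.1 for _ in range(3))
Ur, Uz, U = (rng.standard_normal((d_h, d_h)) * 0.1 for _ in range(3))
h = gated_step(np.zeros(d_h), rng.standard_normal(d_in), Wr, Ur, Wz, Uz, W, U)
print(h.shape)   # (4,)
```

When <math>z</math> saturates near 1 the unit simply copies its previous state forward, which is the mechanism that carries long-term information.<br />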
<br />
=Scoring Phrase Pairs with RNN Encoder–Decoder =<br />
<br />
In a commonly used statistical machine translation system (SMT), the goal of the system (decoder, specifically) is to find a translation f given a source sentence e, which maximizes<br />
<br />
<math>p(\mathbf{f} | \mathbf{e})\propto p(\mathbf{e} | \mathbf{f})p(\mathbf{f})</math><br />
<br />
where the first term on the right-hand side is called the translation model and the latter the language model (see, e.g., (Koehn, 2005)). In practice, however, most SMT systems model <math>\log p(\mathbf{f} | \mathbf{e})</math> as a log-linear model with additional features and corresponding weights:<br />
<br />
<math>\log p(\mathbf{f} | \mathbf{e})=\sum_{n=1}^Nw_nf_n(\mathbf{f},\mathbf{e})+\log Z(\mathbf{e})</math><br />
<br />
where <math>f_n</math> and <math>w_n</math> are the n-th feature and weight, respectively. <math>Z(\mathbf{e})</math> is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.<br />
<br />
Cho et al. propose to train the RNN Encoder–Decoder on a table of phrase pairs and use its scores as additional features in the log-linear model shown above when tuning the SMT decoder. For training the RNN Encoder–Decoder, phrase frequency is ignored for several reasons: to reduce computation time, to ensure the model does not simply rank phrases by frequency, and because frequency information is already encoded in the features for the SMT (so it is better not to use the capacity of the RNN Encoder–Decoder redundantly).<br />
<br />
=Alternative Models=<br />
The researchers noted a number of other potential translation models and their usability.<br />
<br />
The first model, by Schwenk, is an application of a variant of the continuous space language model to the task of machine translation. The model is essentially a feedforward neural network with a common projection for input words encoded as bag-of-words vectors. Schwenk fixed the input and output sentence lengths; for a fixed length, the neural network estimates the probability of the output sequence of words and scores potential translations. A major disadvantage, however, is that the input and output lengths are fixed, so the model cannot handle variable-length inputs or outputs.<br />
<br />
The model figure<ref><br />
[Schwenk2012] Holger Schwenk. 2012. Continuous<br />
space translation models for phrase-based statistical<br />
machine translation. In Martin Kay and Christian<br />
Boitet, editors, Proceedings of the 24th International<br />
Conference on Computational Linguistics<br />
(COLIN), pages 1071–1080.<br />
</ref>:<br />
<br />
[[File:CONTINUOUS.PNG]]<br />
<br />
Another model, similar to Schwenk's, is by Devlin and also uses a feedforward neural network. Rather than estimating the probability of the entire output sequence of words as in Schwenk's model, Devlin only estimates the probability of the next word, using both a portion of the input sentence and a portion of the output sentence. It reported impressive improvements but, like Schwenk's model, it fixes the length of the input prior to training.<br />
<br />
Chandar et al. trained a feedforward neural network to learn a mapping from a bag-of-words representation of an input phrase to an output phrase.<ref><br />
Lauly, Stanislas, et al. "An autoencoder approach to learning bilingual word representations." Advances in Neural Information Processing Systems. 2014.<br />
</ref> This is closely related to both the proposed RNN Encoder–Decoder and the model<br />
proposed by Schwenk, except that their input representation of a phrase is a bag-of-words. A similar approach of using bag-of-words representations was proposed by Gao<ref><br />
Gao, Jianfeng, et al. "Learning semantic representations for the phrase translation model." arXiv preprint arXiv:1312.0482 (2013).<br />
</ref> as well. One important difference between the proposed RNN Encoder–Decoder and the above approaches is that the order of the words in source and target phrases is taken into account. The RNN Encoder–Decoder naturally distinguishes between sequences that have the same words but in a different order, whereas the aforementioned approaches effectively ignore order information.<br />
<br />
=Experiments =<br />
<br />
The model is evaluated on the English/French translation task of the WMT’14 workshop. In building the model Cho et al. used baseline phrase-based SMT system and a Neural Language Model(CSLM)<ref><br />
Schwenk H, Costa-Jussa M R, Fonollosa J A R. Continuous space language models for the IWSLT 2006 task[C]//IWSLT. 2006: 166-173.<br />
</ref><br />
<br />
They tried the following combinations:<br />
# Baseline configuration<br />
# Baseline + RNN<br />
# Baseline + CSLM + RNN<br />
# Baseline + CSLM + RNN + Word penalty<br />
<br />
Results:<br />
<br />
<center><br />
[[File:encdec3.png |frame | center |Fig 3. BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, which penalizes the number of words that are unknown to the neural networks. ]]<br />
</center><br />
<br />
The best performance was achieved when both the CSLM and the phrase scores from the RNN Encoder–Decoder were used. This suggests that the contributions of the CSLM and the RNN Encoder–Decoder are not strongly correlated and that one can expect better results by improving each method independently.<br />
<br />
<br />
== Word and Phrase Representations ==<br />
<br />
As the presented model maps sentences into a continuous space vector and prior continuous space language models have been known to learn semantically meaningful embeddings, one could expect this to happen for the presented model, too. This is indeed the case. When projecting to a 2D space (with Barnes-Hut-SNE), semantically similar words are clearly clustered.<br />
<br />
[[File:Fig4.png]]<br />
<br />
Phrases are also clustered capturing both semantic and syntactic structures.<br />
<br />
[[File:Fig5.png]]<br />
<br />
= References=<br />
<references /></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_the_difficulty_of_training_recurrent_neural_networks&diff=27024on the difficulty of training recurrent neural networks2015-11-30T22:13:55Z<p>Lruan: /* The Mechanics */</p>
<hr />
<div>= Introduction =<br />
<br />
Training Recurrent Neural Networks (RNNs) is difficult; one of the most prominent problems in training RNNs has been the vanishing and exploding gradient problem <ref name="yoshua1993">Yoshua Bengio, Paolo Frasconi, and Patrice Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pages<br />
1183–1188. IEEE, 1993.</ref>, which prevents neural networks from learning and fitting the data. In this paper the authors propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradient problem.<br />
<br />
= Background =<br />
<br />
[[Image:rnn_2.png|frame|center|400px|alt=| Recurrent Neural Network Unrolled in time <ref name="pascanu">Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.></ref>]]<br />
A generic recurrent neural network has the form:<br />
<br />
<math>x_{t} = F(x_{t -1}, u_{t}, \theta)</math><br />
<br />
Where:<br />
<br />
* <span><math>x_{t}</math>: is the state at time <math>t</math></span><br />
* <span><math>u_{t}</math>: is the input at time <math>t</math></span><br />
* <span><math>\theta</math>: is the parameters</span><br />
* <span><math>F()</math>: is the function that represents a neuron</span><br />
<br />
In the theoretical sections the authors made use of a specific parameterization:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
Where:<br />
<br />
* <span><math>W_{rec}</math>: is the RNN weights matrix</span><br />
* <span><math>\sigma</math>: is an element-wise function</span><br />
* <span><math>b</math>: is the bias</span><br />
* <span><math>W_{in}</math>: is the input weights matrix</span><br />
<br />
The following are the gradient equations for the Back-Propagation Through Time (BPTT) algorithm; the authors rewrote the equations in order to highlight the exploding gradients problem:<br />
<br />
<math>\frac{\delta \varepsilon}{\delta \theta} = <br />
\sum_{1 \leq t \leq T} \frac{\delta \varepsilon_{t}}{\delta \theta}</math><br />
<br />
<math>\frac{\delta \varepsilon_{t}}{\delta \theta} = <br />
\sum_{1 \leq k \leq t} <br />
\left(<br />
\frac{\delta \varepsilon_{t}}{\delta x_{t}}<br />
\frac{\delta x_{t}}{\delta x_{k}}<br />
\frac{\delta^{+} x_{k}}{\delta \theta}<br />
\right)</math><br />
<br />
<math>\frac{\delta x_{t}}{\delta x_{k}} =<br />
\prod_{k < i \leq t} \frac{\delta x_{i}}{\delta x_{i - 1}} =<br />
\prod_{k < i \leq t} <br />
W^{T}_{rec} \textit{diag}(\sigma^{\prime}(x_{i - 1}))</math><br />
<br />
Where:<br />
<br />
* <span><math>\varepsilon_{t}</math>: is the error obtained from output at time <math>t</math></span><br />
* <span><math>\frac{\delta^{+} x_{k}}{\delta \theta}</math>: is the immediate partial derivative of state <math>x_{k}</math></span><br />
<br />
== Exploding and Vanishing Gradients ==<br />
=== The Mechanics ===<br />
<br />
It is known that <math> |\sigma'(x)| </math> is bounded; let <math>\gamma \in \mathbb{R}</math> be such that <math>||diag(\sigma'(x_k))|| \leq \gamma</math>.<br />
<br />
The paper first proves that <math> \lambda_1 < \frac{1}{\gamma} </math>, where <math> \lambda_1 </math> is the largest singular value of <math> \bold{W}_{rec} </math>, is a sufficient condition for the vanishing gradient problem to occur. The Jacobian matrix <math> \frac{\partial x_{k+1}}{\partial x_k} </math> is given by <math> \bold{W}_{rec}^{T}diag(\sigma'(x_k)) </math>. The 2-norm of this Jacobian is bounded by the product of the norms of the two matrices, which leads to <math> \forall k, ||\frac{\partial{x_{k+1}}}{\partial x_k}|| \leq ||\bold{W}_{rec}^T|| \, ||diag(\sigma'(x_k))|| < \frac{1}{\gamma}\gamma < 1</math><br />
<br />
Let <math>\eta \in R</math> be such that <math>\forall k, ||\frac{\partial {x_{k+1}}}{\partial x_k}|| \leq \eta < 1</math>. By induction over <math>i</math>, we can show that <math>||\frac{\partial \varepsilon_t}{\partial x_t}(\prod_{i=k}^{t-1}{\frac{\partial x_{i+1}}{\partial x_i}})|| \leq \eta^{t-k}||\frac{\partial \varepsilon_t}{\partial x_t}||</math>. Since <math> \eta < 1 </math>, as <math> t-k </math> goes larger, the gradient goes to 0.<br />
<br />
By reversing this argument, the paper also shows that <math>\lambda_1 > \frac{1}{\gamma}</math> is a necessary condition for gradients to explode (otherwise the long-term components would vanish instead of exploding).<br />
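This bound is easy to check numerically. For <math>\sigma = \tanh</math> we have <math>\gamma = 1</math>, and evaluating at <math>x = 0</math> (where <math>\tanh'(0) = 1</math>) makes the bound tight. The sketch below is illustrative, not the paper's code:<br />

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 6, 50

def jacobian_product_norm(W_rec, xs):
    """2-norm of prod_i W_rec^T diag(sigma'(x_i)) over T steps, for sigma = tanh."""
    J = np.eye(n)
    for x in xs:
        J = W_rec.T @ np.diag(1.0 - np.tanh(x) ** 2) @ J
    return np.linalg.norm(J, 2)

W = rng.standard_normal((n, n))
W_small = 0.5 * W / np.linalg.svd(W, compute_uv=False)[0]   # lambda_1 = 0.5 < 1/gamma
W_big = 3.0 * W / np.max(np.abs(np.linalg.eigvals(W)))      # spectral radius = 3
xs = [np.zeros(n)] * T     # at x = 0, tanh'(0) = 1, so each diag factor is the identity

print(jacobian_product_norm(W_small, xs) < 1e-10)   # True: long-term gradient vanishes
print(jacobian_product_norm(W_big, xs) > 1e10)      # True: it explodes
```

The first norm is bounded above by <math>\lambda_1^T = 0.5^{50}</math>, while the second is bounded below by the spectral radius raised to the power <math>T</math>, matching the analysis above.<br />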
<br />
=== From a dynamical systems perspective ===<br />
<br />
Drawing from a dynamical systems perspective similar to <ref name="yoshua1993"></ref><ref name="doya1993">Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on neural networks, 1:75–80, 1993.</ref>, dynamical systems theory states that as “<math>\theta</math> changes the asymptotic behaviour changes smoothly almost everywhere except for certain crucial points where drastic changes occur” <ref name="pascanu"></ref>; crossing these bifurcation points has the potential to cause gradients to explode <ref name="doya1993"></ref>.<br />
<br />
The authors of this paper argue, however, that crossing these bifurcations does not guarantee a sudden change in the gradients if the model state is not in the basin of an attractor. On the other hand, if the model is in the basin of an attractor, crossing boundaries between basins will cause the gradients to explode.<br />
<br />
[[Image:dynamic_perspective.png|frame|center|400px|alt=|Bifurcation diagram of Single Hidden Unit RNN <ref name="pascanu"></ref>]]<br />
<br />
The figure above depicts this argument: the x-axis is the parameter <math>b</math> (bias), the y-axis is the asymptotic state <math>x_{\infty}</math>, and the plot line traces the movement of the final point attractor <math>x_{\infty}</math> as <math>b</math> changes. The figure shows two attractors, one emerging at <math>b_1</math> and another disappearing at <math>b_2</math>. The boundary between the two attractors is denoted by the dashed line. The blue filled circles mark Doya’s (1993) original hypothesis of exploding gradients, where a small change in <math>\theta</math> '''could''' (50% chance) cause <math>x</math> to change suddenly. The unfilled green circles represent Pascanu’s (2013) extension of Doya’s hypothesis: if the model is in the boundary region at time <math>0</math>, a small change in <math>\theta</math> results in a sudden large change in <math>x_{t}</math>.<br />
<br />
=== From a geometric perspective ===<br />
<br />
Aside from the dynamical systems perspective on exploding and vanishing gradients, the authors also considered a geometric perspective, using a simple one-hidden-unit RNN.<br />
<br />
[[Image:geometric_perspective.png|frame|center|400px|alt=|Error Loss surface of single hidden unit RNN <ref name="pascanu"></ref>]]<br />
<br />
Reusing the RNN state equation the authors defined:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
By assuming no input, with <math>b = 0</math> and initial state <math>x_{0}</math>, the equation simplifies to:<br />
<br />
<math>x_{t} = W_{rec}^{t} x_{0}</math><br />
<br />
Differentiating the above equation to the first and second order would give:<br />
<br />
<math>\frac{\delta x_{t}}{\delta \omega} = t W_{rec}^{t - 1} x_{0}</math><br />
<br />
<math>\frac{\delta^{2} x_{t}}{\delta \omega^{2}} = t (t - 1) W_{rec}^{t - 2} x_{0}</math><br />
<br />
This implies that if the first-order derivative explodes, so will the second-order derivative. This means that when Stochastic Gradient Descent (SGD) approaches a steep wall in the error surface and attempts to step into it, it will be deflected away, possibly hindering the learning process (see the figure above).<br />
<br />
= Dealing with Exploding and Vanishing Gradient =<br />
<br />
== Related Work ==<br />
<br />
* <span>'''L1/L2 Regularization''': Helps with exploding gradients, but limits the RNN to a single point attractor at the origin, and prevents the model to learn generator models or exhibit long term memory traces.</span><br />
* <span>'''Teacher Forcing''': Proposed by <ref name="doya1993"></ref>, this approach pushes the model across the bifurcation boundary if it does not exhibit asymptotic behaviour towards a desired target. It assumes the user knows what the target behaviour might look like or how to initialize the model to reduce exploding gradients.</span><br />
* <span>'''LSTM''': The Long Short-Term Memory architecture of <ref name="graves2009">Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jurgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009</ref><ref name="Hochreiter">Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.</ref> is an attempt to deal with the vanishing gradient problem by using a linear unit that feeds back to itself with a weight of <math>1</math>. This solution, however, does not deal with the exploding gradient problem.</span><br />
* <span>'''Hessian-Free optimizer with structural damping''': Proposed by <ref>Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011</ref>, this addresses the vanishing and exploding gradient problem. <ref name="pascanu"></ref> reasons that this approach solves the vanishing gradient problem because the high dimensionality of the space gives rise to a high probability that the long-term components are orthogonal to the short-term components. Additionally, for exploding gradients the curvature of the gradient is taken into account.</span><br />
* <span>'''Echo State Networks''': <ref>Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.</ref> avoid the exploding and vanishing gradient problem by not learning the input and recurrent weights; these are instead drawn from hand-crafted distributions that prevent information from being lost, since the spectral radius of the recurrent weight matrix is usually kept smaller than 1.</span><br />
<br />
== Scaling Down the Gradients ==<br />
<br />
[[Image:gradient_clipping.png|frame|center|400px]]<br />
<br />
The intuition behind this gradient clipping algorithm is simple: compute the norm of the gradients; if it is larger than a set threshold, scale the gradients by the constant given by the threshold divided by the norm of the gradients. <ref name="pascanu"></ref> suggests using a threshold value from half to ten times the norm.<br />
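A minimal implementation of this clipping rule (the threshold value is the user's choice, as noted above):<br />

```python
import numpy as np

def clip_gradients(grads, threshold):
    """Rescale a list of gradient arrays if their joint norm exceeds threshold."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        scale = threshold / norm
        grads = [g * scale for g in grads]
    return grads

g = [np.array([3.0, 4.0])]          # joint norm 5
clipped = clip_gradients(g, 1.0)
print(np.linalg.norm(clipped[0]))   # capped at the threshold; direction is preserved
```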
<br />
== Vanishing Gradient Regularization ==<br />
<br />
The vanishing gradient regularizer is as follows:<br />
<br />
<math>\Omega <br />
= \sum_{k} \Omega_{k} <br />
= \sum_{k} <br />
\left( <br />
\frac{<br />
\| <br />
\frac{\delta \varepsilon}{\delta x_{k + 1}} <br />
\frac{\delta x_{k + 1}}{\delta x_{k}}<br />
\|<br />
}<br />
{<br />
\|<br />
\frac{\delta \varepsilon}{\delta x_{k + 1}}<br />
\| <br />
} - 1<br />
\right)^{2}</math><br />
<br />
The vanishing gradient problem occurs when, at time <math>t</math>, the inputs <math>u</math> may be irrelevant and noisy and the network starts to learn to ignore them. However, this is not desirable, as the model will end up not learning anything. The authors found that the sensitivity to all inputs <math>u_{t} \dots u_{k}</math> could be increased by increasing the norm of <math>\frac{\delta x_t}{\delta x_{k}}</math>. This implies that in order to increase the <math>\frac{\delta x_t}{\delta x_{k}}</math> norm the error must remain large; this, however, would prevent the model from converging, so the authors argue that a regularizer is a more natural choice. The regularizer is a soft constraint that forces the Jacobian matrices <math>\frac{\delta x_{k + 1}}{\delta x_{k}}</math> to preserve norm in the direction of the error <math>\frac{\delta \varepsilon}{\delta x_{k + 1}}</math>.<br />
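A single term <math>\Omega_k</math> of the regularizer can be computed directly from the two quantities it involves; the error vector and Jacobians below are placeholders for the actual backpropagated error and <math>\frac{\delta x_{k+1}}{\delta x_k}</math>:<br />

```python
import numpy as np

def omega_k(dEdx_next, J_k):
    """One term of the regularizer: penalize the Jacobian J_k = dx_{k+1}/dx_k
    for shrinking (or growing) the error signal dE/dx_{k+1}."""
    num = np.linalg.norm(dEdx_next @ J_k)
    den = np.linalg.norm(dEdx_next)
    return (num / den - 1.0) ** 2

n = 4
err = np.ones(n)
print(omega_k(err, np.eye(n)))        # 0.0: the identity preserves the norm exactly
print(omega_k(err, 0.5 * np.eye(n)))  # 0.25: a norm-halving Jacobian is penalized
```

The full regularizer <math>\Omega</math> is the sum of these terms over <math>k</math>, added to the training loss with a regularization weight.<br />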
<br />
= Experimental Results =<br />
<br />
== The Temporal Order Problem ==<br />
<br />
The authors used the temporal order problem as the prototypical pathological problem for validating the clipping and regularization methods devised. The temporal order problem involves generating a long sequence of discrete symbols; an <math>A</math> or a <math>B</math> symbol is placed at the beginning of the sequence and another at the middle. The task is to correctly classify the order of the two symbols at the end of the sequence.<br />
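The task's data can be generated in a few lines; the distractor alphabet and the exact marker positions below are illustrative choices consistent with the description above:<br />

```python
import random

def temporal_order_example(length, rng):
    """A sequence of distractor symbols with A/B markers at the start and middle;
    the label is the ordered pair of markers."""
    seq = [rng.choice("cdef") for _ in range(length)]   # distractors (illustrative)
    first, second = rng.choice("AB"), rng.choice("AB")
    seq[0] = first
    seq[length // 2] = second
    return "".join(seq), first + second   # label in {AA, AB, BA, BB}

rng = random.Random(0)
seq, label = temporal_order_example(20, rng)
print(len(seq), label in {"AA", "AB", "BA", "BB"})
```

A model solving the task must carry the identity of the first marker across the whole distractor span, which is exactly the long memory trace that makes the problem pathological.<br />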
<br />
Three different RNN intializations were performed for the experiment:<br />
<br />
* <span>'''sigmoid unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
* <span>'''basic tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.1)</math></span><br />
* <span>'''smart tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
<br />
Of the three RNN networks three different optimizer configurations were used:<br />
<br />
* <span>'''MSGD''': Mini-batch Stochastic Gradient Decent</span><br />
* <span>'''MSGD-C''': MSGD with Gradient Clipping</span><br />
* <span>'''MSGD-CR''': MSGD-C with Regularization</span><br />
<br />
Additional model parameters include:<br />
<br />
* <span>Spectral Radius of 0.95</span><br />
* <span><math>b = b_{out} = 0</math></span><br />
* <span>50 hidden neurons</span><br />
* <span>constant learning rate of 0.01</span><br />
* <span>clipping threshold of 1 (only for MSGD-C and MSGD-CR)</span><br />
* <span>regularization weight of 4 (MSGD-CR)</span><br />
<br />
The experiment was performed 5 times. From the figure below we can observe the importance of gradient clipping and the regularizer: in all cases the combination of the two methods yielded the best results, regardless of which unit network was used. Furthermore, this experiment provided empirical evidence that exploding gradients correlate with tasks that require long memory traces: as the sequence length of the problem increases, clipping and regularization become more important.<br />
<br />
[[Image:experimental_results.png|frame|center|400px]]<br />
<br />
== Other Pathological Problems ==<br />
<br />
The authors repeated other pathological problems from <ref name="Hochreiter"></ref>. The results are listed below:<br />
<br />
[[Image:experimental_results_2.png|image]]<br />
<br />
The authors did not discuss the experimental results in detail here.<br />
<br />
= Summary =<br />
<br />
The paper explores two different perspectives for explaining the exploding and vanishing gradient problems in training RNNs: a dynamical systems view and a geometric view. The authors devised methods to mitigate the corresponding problems by introducing gradient clipping and a vanishing-gradient regularizer; their experimental results show that, in all cases except for the Penn Treebank dataset, clipping and the regularizer bested the state of the art for RNNs in the respective experiments.</div>Lruan
<hr />
<div>= Introduction =<br />
<br />
Training Recurrent Neural Networks (RNN) is difficult, one of the most prominent problem in training RNN has been the vanishing and exploding gradient problem <ref name="yoshua1993">Yoshua Bengio, Paolo Frasconi, and Patrice Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pages<br />
1183–1188. IEEE, 1993.</ref> which prevents nerual networks from learning and fitting the data. In this paper the authors propose a gradient norm cliping stragtegy to deal with the exploding gradients and a soft constraint for the vanishing gradients problem.<br />
<br />
= Background =<br />
<br />
[[Image:rnn_2.png|frame|center|400px|alt=| Recurrent Neural Network Unrolled in time <ref name="pascanu">Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.></ref>]]<br />
A generic recurrent neural network has the form:<br />
<br />
<math>x_{t} = F(x_{t -1}, u_{t}, \theta)</math><br />
<br />
Where:<br />
<br />
* <span><math>x_{t}</math>: is the state at time <math>t</math></span><br />
* <span><math>u_{t}</math>: is the input at time <math>t</math></span><br />
* <span><math>\theta</math>: are the parameters</span><br />
* <span><math>F()</math>: is the function that represents a neuron</span><br />
<br />
In the theoretical sections, the authors make use of the specific parameterization:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
Where:<br />
<br />
* <span><math>W_{rec}</math>: is the RNN weights matrix</span><br />
* <span><math>\sigma</math>: is an element wise function</span><br />
* <span><math>b</math>: is the bias</span><br />
* <span><math>W_{in}</math>: is the input weights matrix</span><br />
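As a concrete illustration, the parameterized recursion above can be sketched in a few lines of NumPy; the dimensions, the tanh nonlinearity, and the initialization scales here are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# One step of x_t = W_rec sigma(x_{t-1}) + W_in u_t + b (sigma = tanh here).
def rnn_step(x_prev, u_t, W_rec, W_in, b, sigma=np.tanh):
    return W_rec @ sigma(x_prev) + W_in @ u_t + b

rng = np.random.default_rng(0)
n_hidden, n_in = 4, 3                       # illustrative sizes
W_rec = rng.normal(0.0, 0.1, size=(n_hidden, n_hidden))
W_in = rng.normal(0.0, 0.1, size=(n_hidden, n_in))
b = np.zeros(n_hidden)

x = np.zeros(n_hidden)                      # initial state x_0
for t in range(5):                          # unroll five steps in time
    u_t = rng.normal(size=n_in)             # random input at time t
    x = rnn_step(x, u_t, W_rec, W_in, b)
print(x.shape)
```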
<br />
The following are the gradient equations for the Back-Propagation Through Time (BPTT) algorithm; the authors rewrote them in order to highlight the exploding gradients problem:<br />
<br />
<math>\frac{\delta \varepsilon}{\delta \theta} = <br />
\sum_{1 \leq t \leq T} \frac{\delta \varepsilon_{t}}{\delta \theta}</math><br />
<br />
<math>\frac{\delta \varepsilon_{t}}{\delta \theta} = <br />
\sum_{1 \leq k \leq t} <br />
\left(<br />
\frac{\delta \varepsilon_{t}}{\delta x_{t}}<br />
\frac{\delta x_{t}}{\delta x_{k}}<br />
\frac{\delta^{+} x_{k}}{\delta \theta}<br />
\right)</math><br />
<br />
<math>\frac{\delta x_{t}}{\delta x_{k}} =<br />
\prod_{t \geq i > k} \frac{\delta x_{i}}{\delta x_{i - 1}} =<br />
\prod_{t \geq i > k} <br />
W^{T}_{rec} \textit{diag}(\sigma^{\prime}(x_{i - 1}))</math><br />
<br />
Where:<br />
<br />
* <span><math>\varepsilon_{t}</math>: is the error obtained from output at time <math>t</math></span><br />
* <span><math>\frac{\delta^{+} x_{k}}{\delta \theta}</math>: is the immediate partial derivative of state <math>x_{k}</math></span><br />
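To see how the product of Jacobians <math>\frac{\delta x_{t}}{\delta x_{k}}</math> drives vanishing and exploding gradients, here is a hedged numerical sketch; the tanh nonlinearity and the weight scales are arbitrary choices for illustration:

```python
import numpy as np

# Product of per-step Jacobians W_rec^T diag(sigma'(x_{i-1})),
# with sigma = tanh so sigma'(x) = 1 - tanh(x)^2.
def jacobian_product(xs, W_rec):
    J = np.eye(W_rec.shape[0])
    for x in xs:
        # W_rec.T * v broadcasts over columns, i.e. W_rec^T @ diag(v)
        J = J @ (W_rec.T * (1.0 - np.tanh(x) ** 2))
    return J

rng = np.random.default_rng(1)
n = 4
xs = [rng.normal(size=n) for _ in range(20)]            # 20 time steps
small = jacobian_product(xs, 0.05 * rng.normal(size=(n, n)))  # tiny weights
large = jacobian_product(xs, 10.0 * np.eye(n))                # huge weights
print(np.linalg.norm(small), np.linalg.norm(large))     # shrinks vs. blows up
```

With small recurrent weights the product's norm collapses toward zero (vanishing gradients); with large weights it grows by orders of magnitude (exploding gradients).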
<br />
== Exploding and Vanishing Gradients ==<br />
=== The Mechanics ===<br />
<br />
It is known that <math>|\sigma'(x)|</math> is bounded. Let <math>\|\textit{diag}(\sigma'(x_k))\| \leq \gamma \in \mathbb{R}</math>.<br />
<br />
The paper first proves that <math>\lambda_1 < \frac{1}{\gamma}</math>, where <math>\lambda_1</math> is the largest singular value of <math>W_{rec}</math>, is a sufficient condition for the vanishing gradient problem to occur. The Jacobian matrix <math>\frac{\partial x_{k+1}}{\partial x_k}</math> is given by <math>W_{rec}^{T} \textit{diag}(\sigma'(x_k))</math>, and its 2-norm is bounded by the product of the norms of the two matrices. This leads to <math>\forall k, \left\|\frac{\partial x_{k+1}}{\partial x_k}\right\| \leq \|W_{rec}^{T}\| \, \|\textit{diag}(\sigma'(x_k))\| < \frac{1}{\gamma} \gamma = 1</math>. The norm of a product of <math>t - k</math> such Jacobians is therefore at most <math>\eta^{t - k}</math> for some <math>\eta < 1</math>, so long-term contributions to the gradient vanish exponentially in <math>t - k</math>.<br />
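A quick numerical check of this sufficient condition, taking <math>\gamma = 1</math> for tanh since <math>|\sigma'(x)| \leq 1</math>; the matrix size and scaling are illustrative assumptions:

```python
import numpy as np

# If lambda_1 (largest singular value of W_rec) < 1/gamma, each per-step
# Jacobian W_rec^T diag(sigma'(x)) has 2-norm strictly below 1.
rng = np.random.default_rng(2)
n = 6
W = rng.normal(size=(n, n))
W_rec = 0.9 * W / np.linalg.svd(W, compute_uv=False)[0]  # force lambda_1 = 0.9

x = rng.normal(size=n)
J = W_rec.T * (1.0 - np.tanh(x) ** 2)   # equals W_rec^T @ diag(sigma'(x))
lam1 = np.linalg.svd(W_rec, compute_uv=False)[0]
print(lam1 < 1.0, np.linalg.norm(J, 2) < 1.0)  # True True
```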
<br />
=== From a dynamical systems perspective ===<br />
<br />
Drawing from a dynamical systems perspective similar to <ref name="yoshua1993"></ref><ref name="doya1993">Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on neural networks, 1:75–80, 1993.</ref>, dynamical systems theory states that as “<math>\theta</math> changes the asymptotic behaviour changes smoothly almost everywhere except for certain crucial points where drastic changes occur” <ref name="pascanu"></ref>; crossing these bifurcation points has the potential to cause gradients to explode <ref name="doya1993"></ref>.<br />
<br />
The authors of this paper argue, however, that crossing these bifurcations does not guarantee a sudden change in gradients if the model state is not in the basin of an attractor. On the other hand, if the model is in the basin of an attractor, crossing boundaries between basins will cause the gradients to explode.<br />
<br />
[[Image:dynamic_perspective.png|frame|center|400px|alt=|Bifurcation diagram of Single Hidden Unit RNN <ref name="pascanu"></ref>]]<br />
<br />
The figure above depicts this argument: the x-axis is the parameter <math>b</math> (bias) and the y-axis is the asymptotic state <math>x_{\infty}</math>; the plot line traces the movement of the final point attractor <math>x_{\infty}</math> as <math>b</math> changes. The figure shows the presence of two attractors, one emerging at <math>b_1</math> and another disappearing at <math>b_2</math>. The boundary between the two attractors is denoted with the dashed line. The blue filled circles represent Doya’s (1993) original hypothesis of exploding gradients, where a small change in <math>\theta</math> '''could''' cause <math>x</math> to change suddenly, whereas the unfilled green circles represent Pascanu’s (2013) extension of Doya’s hypothesis: if the model is in the boundary region at time <math>0</math>, a small change in <math>\theta</math> results in a sudden large change in <math>x_{t}</math>.<br />
<br />
=== From a geometric perspective ===<br />
<br />
Aside from the dynamical systems perspective on exploding and vanishing gradients, the authors also considered a geometric perspective, examining a simple single-hidden-unit RNN.<br />
<br />
[[Image:geometric_perspective.png|frame|center|400px|alt=|Error Loss surface of single hidden unit RNN <ref name="pascanu"></ref>]]<br />
<br />
Reusing the RNN state equation the authors defined:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
By assuming no input, a linear activation <math>\sigma(x) = x</math>, <math>b = 0</math>, and initial state <math>x_{0}</math>, the equation simplifies to:<br />
<br />
<math>x_{t} = W_{rec}^{t} x_{0}</math><br />
<br />
Taking the first- and second-order derivatives of the above equation with respect to the (scalar) recurrent weight <math>\omega</math> gives:<br />
<br />
<math>\frac{\delta x_{t}}{\delta \omega} = t W_{rec}^{t - 1} x_{0}</math><br />
<br />
<math>\frac{\delta^{2} x_{t}}{\delta \omega^{2}} = t (t - 1) W_{rec}^{t - 2} x_{0}</math><br />
<br />
This implies that if the first-order derivative explodes, so will the second. Consequently, when Stochastic Gradient Descent (SGD) approaches a steep wall in the error surface and attempts to step into it, it will be deflected away, possibly hindering the learning process (see figure above).<br />
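The scalar explosion can be checked directly; the values of <math>\omega</math>, <math>x_{0}</math>, and <math>t</math> below are arbitrary illustrative choices:

```python
# Scalar case of the equations above: x_t = w^t x_0, with
#   dx_t/dw = t w^(t-1) x_0  and  d^2 x_t/dw^2 = t (t-1) w^(t-2) x_0.
def derivatives(w, x0, t):
    d1 = t * w ** (t - 1) * x0
    d2 = t * (t - 1) * w ** (t - 2) * x0
    return d1, d2

# With w slightly above 1, both derivatives blow up together as t grows.
d1, d2 = derivatives(w=1.2, x0=0.5, t=50)
print(d1, d2)
```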
<br />
= Dealing with Exploding and Vanishing Gradient =<br />
<br />
== Related Work ==<br />
<br />
* <span>'''L1/L2 Regularization''': Helps with exploding gradients, but limits the RNN to a single point attractor at the origin and prevents the model from learning generator models or exhibiting long-term memory traces.</span><br />
* <span>'''Teacher Forcing''': Proposed by <ref name="doya1993"></ref>, this pushes the model across a bifurcation boundary whenever it does not exhibit the desired asymptotic behaviour. It assumes the user knows what that behaviour might look like, or how to initialize the model so as to reduce exploding gradients.</span><br />
* <span>'''LSTM''': The Long Short-Term Memory architecture <ref name="graves2009">Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009.</ref><ref name="Hochreiter">Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.</ref> is an attempt to deal with the vanishing gradient problem by using a linear unit that feeds back to itself with a weight of <math>1</math>. This solution, however, does not deal with the exploding gradient problem.</span><br />
* <span>'''Hessian-Free optimizer with structural damping''': Proposed by <ref>Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.</ref>, this addresses both the vanishing and exploding gradient problems. <ref name="pascanu"></ref> reasons that this approach solves the vanishing gradient problem because the high dimensionality of the space gives rise to a high probability that long-term components are orthogonal to short-term components. Additionally, for exploding gradients the curvature of the gradient is taken into account.</span><br />
* <span>'''Echo State Networks''': <ref>Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.</ref> avoid the exploding and vanishing gradient problems by not learning the input and recurrent weights; these are instead drawn from hand-crafted distributions that prevent information from getting lost, since the spectral radius of the recurrent weight matrix is usually kept smaller than 1.</span><br />
<br />
== Scaling Down the Gradients ==<br />
<br />
[[Image:gradient_clipping.png|frame|center|400px]]<br />
<br />
The intuition of this gradient clipping algorithm is simple: obtain the norm of the gradients; if it is larger than a set threshold, scale the gradients by a constant defined as the threshold divided by the norm of the gradients. <ref name="pascanu"></ref> suggests using a threshold value from half to ten times the norm.<br />
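A minimal sketch of this clipping rule; the threshold value and the example gradients are illustrative choices:

```python
import numpy as np

# Rescale the gradient when its norm exceeds the threshold;
# the scaled gradient keeps its direction, only its length changes.
def clip_gradient(g, threshold):
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

g = np.array([3.0, 4.0])                                        # norm 5
clipped = clip_gradient(g, threshold=1.0)                       # rescaled to norm 1
untouched = clip_gradient(np.array([0.3, 0.4]), threshold=1.0)  # norm 0.5, kept as-is
print(clipped, untouched)
```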
<br />
== Vanishing Gradient Regularization ==<br />
<br />
The vanishing gradient regularizer is as follows:<br />
<br />
<math>\Omega <br />
= \sum_{k} \Omega_{k} <br />
= \sum_{k} <br />
\left( <br />
\frac{<br />
\| <br />
\frac{\delta \varepsilon}{\delta x_{k + 1}} <br />
\frac{\delta x_{k + 1}}{\delta x_{k}}<br />
\|<br />
}<br />
{<br />
\|<br />
\frac{\delta \varepsilon}{\delta x_{k + 1}}<br />
\| <br />
} - 1<br />
\right)^{2}</math><br />
<br />
The vanishing gradient problem occurs when, at time <math>t</math>, the inputs <math>u</math> may be irrelevant and noisy and the network learns to ignore them; this is not desirable, as the model may end up not learning anything. The authors note that the sensitivity to all inputs <math>u_{t} \dots u_{k}</math> could be increased by increasing the norm of <math>\frac{\delta x_{t}}{\delta x_{k}}</math>. Simply forcing this norm to stay large, however, would require the error to remain large, which would prevent the model from converging; thus the authors argue a regularizer is a more natural choice. The regularizer is a soft constraint that forces the Jacobian matrices <math>\frac{\delta x_{k + 1}}{\delta x_{k}}</math> to preserve norm in the direction of the error <math>\frac{\delta \varepsilon}{\delta x_{k + 1}}</math>.<br />
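A sketch of one term <math>\Omega_{k}</math> of the regularizer above, evaluated on hand-picked illustrative inputs (the error vector and Jacobians are assumptions, not values from the paper):

```python
import numpy as np

# Omega_k = ( ||(de/dx_{k+1}) J_{k+1}|| / ||de/dx_{k+1}|| - 1 )^2,
# where J_{k+1} = d x_{k+1} / d x_k and de/dx_{k+1} is a row vector.
def omega_k(de_dx_next, J):
    num = np.linalg.norm(de_dx_next @ J)    # error pushed through the Jacobian
    den = np.linalg.norm(de_dx_next)
    return (num / den - 1.0) ** 2

de = np.array([1.0, 0.0])
preserved = omega_k(de, np.eye(2))          # Jacobian preserves norm: no penalty
shrunk = omega_k(de, 0.5 * np.eye(2))       # Jacobian halves the norm: penalized
print(preserved, shrunk)  # 0.0 0.25
```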
<br />
= Experimental Results =<br />
<br />
== The Temporal Order Problem ==<br />
<br />
The authors used the temporal order problem as the prototypical pathological problem for validating the clipping and regularizer devised. The temporal order problem involves generating a long sequence of discrete symbols in which an <math>A</math> or a <math>B</math> symbol is placed at the beginning and at the middle of the sequence. The task is to correctly classify the order of the <math>A</math> and <math>B</math> symbols at the end of the sequence.<br />
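A hedged sketch of how one such sample might be generated; the sequence length, distractor alphabet, and exact symbol positions are assumptions for illustration, not the paper's exact generator:

```python
import random

# One temporal-order sample: distractor symbols everywhere, with an 'A' or
# 'B' near the beginning and near the middle; the label is the ordered pair.
def temporal_order_sample(length=20, rng=None):
    rng = rng or random.Random(0)
    seq = [rng.choice("cdef") for _ in range(length)]   # distractors
    first, second = rng.choice("AB"), rng.choice("AB")
    seq[1] = first               # relevant symbol near the beginning
    seq[length // 2] = second    # relevant symbol near the middle
    return "".join(seq), first + second

seq, label = temporal_order_sample()
print(seq, label)
```

A classifier reads the whole sequence and must still recall both placed symbols, in order, at the end; this is what makes long sequences pathological for plain gradient descent.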
<br />
Three different RNN initializations were used for the experiment:<br />
<br />
* <span>'''sigmoid unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
* <span>'''basic tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.1)</math></span><br />
* <span>'''smart tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
<br />
Of the three RNN networks three different optimizer configurations were used:<br />
<br />
* <span>'''MSGD''': Mini-batch Stochastic Gradient Descent</span><br />
* <span>'''MSGD-C''': MSGD with Gradient Clipping</span><br />
* <span>'''MSGD-CR''': MSGD-C with Regularization</span><br />
<br />
Additional model parameters include:<br />
<br />
* <span>Spectral Radius of 0.95</span><br />
* <span><math>b = b_{out} = 0</math></span><br />
* <span>50 hidden neurons</span><br />
* <span>constant learning rate of 0.01</span><br />
* <span>clipping threshold of 1 (only for MSGD-C and MSGD-CR)</span><br />
* <span>regularization weight of 4 (MSGD-CR)</span><br />
<br />
The experiment was performed 5 times. From the figure below we can observe the importance of gradient clipping and the regularizer: in all cases the combination of the two methods yielded the best results, regardless of which unit network was used. Furthermore, the experiment provides empirical evidence that exploding gradients correlate with tasks that require long memory traces: as the sequence length of the problem increases, clipping and regularization become more important.<br />
<br />
[[Image:experimental_results.png|frame|center|400px]]<br />
<br />
== Other Pathological Problems ==<br />
<br />
The authors repeated other pathological problems from <ref name="Hochreiter"></ref>. The results are listed below:<br />
<br />
[[Image:experimental_results_2.png|image]]<br />
<br />
The authors did not discuss the experimental results in detail here.<br />
<br />
= Summary =<br />
<br />
The paper explores two different perspectives, dynamical systems and geometric, in explaining the exploding and vanishing gradient problems in training RNNs. The authors devised methods to mitigate the corresponding problems by introducing gradient clipping and a vanishing gradient regularizer; their experimental results show that, in all cases except for the Penn Treebank dataset, clipping and the regularizer bested the state of the art for RNNs in the respective experiments.</div>Lruan
<hr />
<div>= Introduction =<br />
<br />
Training Recurrent Neural Networks (RNN) is difficult, one of the most prominent problem in training RNN has been the vanishing and exploding gradient problem <ref name="yoshua1993">Yoshua Bengio, Paolo Frasconi, and Patrice Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pages<br />
1183–1188. IEEE, 1993.</ref> which prevents nerual networks from learning and fitting the data. In this paper the authors propose a gradient norm cliping stragtegy to deal with the exploding gradients and a soft constraint for the vanishing gradients problem.<br />
<br />
= Background =<br />
<br />
[[Image:rnn_2.png|frame|center|400px|alt=| Recurrent Neural Network Unrolled in time <ref name="pascanu">Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.></ref>]]<br />
A generic recurrent neural network has the form:<br />
<br />
<math>x_{t} = F(x_{t -1}, u_{t}, \theta)</math><br />
<br />
Where:<br />
<br />
* <span><math>x_{t}</math>: is the state at time <math>t</math></span><br />
* <span><math>u_{t}</math>: is the input at time <math>t</math></span><br />
* <span><math>\theta</math>: is the parameters</span><br />
* <span><math>F()</math>: is the function that represents a neuron</span><br />
<br />
In the theoreical sections the authors made use of specific parameterization:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
Where:<br />
<br />
* <span><math>W_{rec}</math>: is the RNN weights matrix</span><br />
* <span><math>\sigma</math>: is an element wise function</span><br />
* <span><math>b</math>: is the bias</span><br />
* <span><math>W_{in}</math>: is the input weights matrix</span><br />
<br />
The following are gradients equations for using the Back Propagation Through Time (BPTT) algorithm, the authors rewrote the equations in order to highlight the exploding gradents problem:<br />
<br />
<math>\frac{\delta \varepsilon}{\delta \theta} = <br />
\sum_{1 \leq t \leq T} \frac{\delta \varepsilon}{\delta \theta}</math><br />
<br />
<math>\frac{\delta \varepsilon_{t}}{\delta \theta} = <br />
\sum_{1 \leq k \leq T} <br />
\left(<br />
\frac{\delta \varepsilon_{t}}{\delta x_{t}}<br />
\frac{\delta x_{t}}{\delta x_{k}}<br />
\frac{\delta^{+} x_{k}}{\delta \theta}<br />
\right)</math><br />
<br />
<math>\frac{\delta x_{t}}{\delta x_{k}} =<br />
\prod_{t \leq i \leq k} \frac{\delta x_{i}}{\delta x_{i - 1}} =<br />
\prod_{t \leq i \leq k} <br />
W^{T}_{rec} \textit{diag}(\sigma^{\prime}(x_{i - 1}))</math><br />
<br />
Where:<br />
<br />
* <span><math>\varepsilon_{t}</math>: is the error obtained from output at time <math>t</math></span><br />
* <span><math>\frac{\delta^{+} x_{k}}{\delta \theta}</math>: is the immediate partial derivative of state <math>x_{k}</math></span><br />
<br />
== Exploding and Vanishing Gradients ==<br />
=== The Mechanics ===<br />
<br />
The paper first proves that it is sufficient for <math> \lambda_1 < \frac{1}{\gamma} </math>, where<br />
<br />
=== From a dynamical systems perspective ===<br />
<br />
Drawing from a dynamical systems perspective similiar to <ref name="yoshua1993"></ref><ref name="doya1993">Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on neural networks, 1:75–80, 1993.</ref>, dynamical systems theory states that as “<math>\theta</math> changes the asymptotic behaviour changes smoothly almost every where except for cetian crucial points where drasic changes occur” <ref name="pascanu"></ref>, this is because crossing these bifurcation has the potential to cause gradients to explode <ref name="doya1993"></ref>.<br />
<br />
The authors of this paper argues, however, that crossing these bifurcation does not guarantee a sudden chage in gradients if the model state is not in the basin of an attractor. On the other hand if the model is in the basin of an attractor, crossing boundaries between basins will cause the gradients to explode.<br />
<br />
[[Image:dynamic_perspective.png|frame|center|400px|alt=|Bifurcation diagram of Single Hidden Unit RNN <ref name="pascanu"></ref>]]<br />
<br />
The figure above depicts argument, where the x-axis is the parameter <math>b</math> (bias) and the y-axis is the asymptotoc state <math>x_{\infty}</math>, the plot line is the momvent of the final point attractor <math>x_{\infty}</math> as <math>b</math> changes. What this figure represents is the presence of two attractors, one emerging from <math>b_1</math> and another disappearing at <math>b_2</math>. The boundary between the two attractors is denoted with the dashed line, where the blue filled circles is Doya’s (1993) original hypothesis of exploding gradients, where a small change in <math>\theta</math> '''could''' (50% chance) cause <math>x</math> to change suddenly. Where as the unfilled green circles represents Pascanu’s (2013) extension of Doya’s hypothesis, where if the model is in the boundary range at time <math>0</math>, a small change in <math>\theta</math> would result in a sudden large change in <math>x_{t}</math>.<br />
<br />
=== From a geometric perspective ===<br />
<br />
Aside from a dynamical systems prospective to exploding and vanishing graidents, the authors also considered a geometric perspective, where a simple one hidden unit RNN was cosidered.<br />
<br />
[[Image:geometric_perspective.png|frame|center|400px|alt=|Error Loss surface of single hidden unit RNN <ref name="pascanu"></ref>]]<br />
<br />
Reusing the RNN state equation the authors defined:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
By assuming no input, with <math>b = 0</math> and initial state <math>x_{0}</math>, the equation simplifies to:<br />
<br />
<math>x_{t} = W_{rec}^{t} x_{0}</math><br />
<br />
Differentiating the above equation to the first and second order would give:<br />
<br />
<math>\frac{\delta x_{t}}{\delta \omega} = t W_{rec}^{t - 1} x_{0}</math><br />
<br />
<math>\frac{\delta^{2} x_{t}}{\delta \omega^{2}} = t (t - 1) W_{rec}^{t - 2} x_{0}</math><br />
<br />
Which implies if the first order derivative explodes so will the second derivative. This means when the Stochastic Gradient decent (SGD) approaches the loss error surface and attempts to step into it, it will be deflected away, possibly hindering the learning process. (See figure above).<br />
<br />
= Dealing with Exploding and Vanishing Gradient =<br />
<br />
== Related Work ==<br />
<br />
* <span>'''L1/L2 Regularization''': Helps with exploding gradients, but limits the RNN to a single point attractor at the origin, and prevents the model to learn generator models or exhibit long term memory traces.</span><br />
* <span>'''Teacher Forcing''': Proposed by <ref name="doya1993"></ref>, performs a cross bifurcation boundary if the model does not exhibit asymptotic behavior towards a desired target. This assumes the user knows what the behaviour might look like or how to intialize the model to reduce exploding gradients.</span><br />
* <span>'''LTSM''': The Long-Short Term Memory architecture by <ref name="graves2009">Alex Graves, Marcus Liwicki, Santiago Fern ́andez, Roman Bertolami, Horst Bunke, and Jurgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009</ref><ref name="Hochreiter">Sepp Hochreiter and Jurgen Schmidhuber. 9(8):1735–1780, 1997. Long short-term memory.Neural computation,</ref> is an attempt to deal with the vanishing gradient problem by using a linear unit that feedbacks to itself with a weight of <math>1</math>. This solution however does not deal with the exploding gradient</span><br />
* <span>'''Hessian-Free optimizer with structural damping''': Proposed by <ref>Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011</ref> addresses the vanishing and exploding gradient problem. <ref name="pascanu"></ref> reasons that this approach solves the vanishing gradient problem because of the high dimensionality of the spaces gives rise to a high probability for the long term components to be orthogonal to short term components. Additionally for exploding gradient the curvature of the gradient is taken into account.</span><br />
* <span>'''Echo State Networks''': <ref>Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.</ref> avoids the exploding and vanishing gradient problem by not learning the input and recurrent weights, they are instead hand crafted distributions that prevent information from getting loss, since a spectral radius for the recurrent weights matrix is usually smaller than 1.</span><br />
<br />
== Scaling Down the Gradients ==<br />
<br />
[[Image:gradient_clipping.png|frame|center|400px]]<br />
<br />
The intuition of this gradient clipping algorithm is simple, obtain the norm of the gradients, if it is larger than the set threshold then scale the gradients by a constant defined as the treshold divided by the norm of gradients. <ref name="pascanu"></ref> suggests using a threshold value from half to ten times the norm.<br />
<br />
== Vanishing Gradient Regularization ==<br />
<br />
The vanishing gradient regularizer is as follows:<br />
<br />
<math>\Omega <br />
= \sum_{k} \Omega_{k} <br />
= \sum_{k} <br />
\left( <br />
\frac{<br />
\| <br />
\frac{\delta \varepsilon}{\delta x_{k + 1}} <br />
\frac{\delta x_{k + 1}}{\delta x_{k}}<br />
\|<br />
}<br />
{<br />
\|<br />
\frac{\delta \varepsilon}{\delta x_{k + 1}}<br />
\| <br />
} - 1<br />
\right)^{2}</math><br />
<br />
The vanishing gradient problem occurs when at time <math>t</math> the inputs <math>u</math> may be irrelevant and noisy and the network starts to learn to ignore them. However this is not desirable as the model will end up not learning anything. The authors found that the sensitivity to all inputs <math>u_{t} \dots u_{k}</math> could be increased by increasing the norm of <math>\frac{\delta x_t}{\delta x_{t}}</math>. This imples that in order to increses the <math>\frac{\delta x_t}{\delta x_{t}}</math> norm the error must remain large, this however would prevent the model from converging, thus the authors argue a regularizer is a more natural choice. The regularizer is a soft constraint that forces the Jacobian matrices <math>\frac{\delta x_{k + 1}}{\delta x_{k}}</math> to preserve norm in the direction of the error <math>\frac{\delta \varepsilon}{\delta x_{k + 1}}</math>.<br />
<br />
= Experimental Results =<br />
<br />
== The Temporal Order Problem ==<br />
<br />
The authors repeated the temporal order problem as the prototypical pathological problem for validating the cliping and regularizer devised. The temporal order problem involves generating a long sequence of discrete symbols, and at the beginning an <math>A</math> or a <math>B</math> symbol is placed at the beginning and middle of the sequence. The task is to correctly classify the order of <math>A</math> and <math>B</math> at the end of the sequence.<br />
<br />
Three different RNN intializations were performed for the experiment:<br />
<br />
* <span>'''sigmoid unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
* <span>'''basic tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.1)</math></span><br />
* <span>'''smart tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
<br />
Of the three RNN networks three different optimizer configurations were used:<br />
<br />
* <span>'''MSGD''': Mini-batch Stochastic Gradient Decent</span><br />
* <span>'''MSGD-C''': MSGD with Gradient Clipping</span><br />
* <span>'''MSGD-CR''': MSGD-C with Regularization</span><br />
<br />
Additional model parameters include:<br />
<br />
* <span>Spectral Radius of 0.95</span><br />
* <span><math>b = b_{out} = 0</math></span><br />
* <span>50 hidden neurons</span><br />
* <span>constant learning rate of 0.01</span><br />
* <span>clipping threshold of 1 (only for MSGD-C and MSGD-CR)</span><br />
* <span>regularization weight of 4 (MSGD-CR)</span><br />
<br />
The experiment was performed 5 times, from the figure below we can observe the importance of gradient cliping and the regularizer, in all cases the combination of the two methods yielded the best results regardless of which unit network was used. Furthermore this experiment provided empirical evidence that exploding graidents correlates to tasks that require long memory traces, as can be seen as the sequence length of the problem increases clipping and regularization becomes more important.<br />
<br />
[[Image:experimental_results.png|frame|center|400px]]<br />
<br />
== Other Pathological Problems ==<br />
<br />
The author repeated other pathological problems from <ref name="Hochreiter"></ref>. The results are listed below:<br />
<br />
[[Image:experimental_results_2.png|image]]<br />
<br />
The authors did not discuss the experimental results in detail here.<br />
<br />
= Summary =<br />
<br />
The paper explores two different perspectives in explaining the exploding and vanishing gradient problems in training RNNs via the dynamical systems and geometric approach. The authors devised methods to mitigate the corresponding problems by introducing a gradient clipping and a gradient vanishing regularizer, their experimental results show that in all cases except for the Penn Treebank dataset, that cliping and regularizer has bested the state of the art for RNN in their respective experiment performances.</div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_the_difficulty_of_training_recurrent_neural_networks&diff=27021on the difficulty of training recurrent neural networks2015-11-30T21:34:04Z<p>Lruan: </p>
<hr />
<div>= Introduction =<br />
<br />
Training Recurrent Neural Networks (RNN) is difficult, one of the most prominent problem in training RNN has been the vanishing and exploding gradient problem <ref name="yoshua1993">Yoshua Bengio, Paolo Frasconi, and Patrice Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pages<br />
1183–1188. IEEE, 1993.</ref> which prevents nerual networks from learning and fitting the data. In this paper the authors propose a gradient norm cliping stragtegy to deal with the exploding gradients and a soft constraint for the vanishing gradients problem.<br />
<br />
= Background =<br />
<br />
[[Image:rnn_2.png|frame|center|400px|alt=| Recurrent Neural Network Unrolled in time <ref name="pascanu">Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.></ref>]]<br />
A generic recurrent neural network has the form:<br />
<br />
<math>x_{t} = F(x_{t -1}, u_{t}, \theta)</math><br />
<br />
Where:<br />
<br />
* <span><math>x_{t}</math>: is the state at time <math>t</math></span><br />
* <span><math>u_{t}</math>: is the input at time <math>t</math></span><br />
* <span><math>\theta</math>: is the parameters</span><br />
* <span><math>F()</math>: is the function that represents a neuron</span><br />
<br />
In the theoreical sections the authors made use of specific parameterization:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
Where:<br />
<br />
* <span><math>W_{rec}</math>: is the RNN weights matrix</span><br />
* <span><math>\sigma</math>: is an element wise function</span><br />
* <span><math>b</math>: is the bias</span><br />
* <span><math>W_{in}</math>: is the input weights matrix</span><br />
<br />
The following are gradients equations for using the Back Propagation Through Time (BPTT) algorithm, the authors rewrote the equations in order to highlight the exploding gradents problem:<br />
<br />
<math>\frac{\delta \varepsilon}{\delta \theta} = <br />
\sum_{1 \leq t \leq T} \frac{\delta \varepsilon}{\delta \theta}</math><br />
<br />
<math>\frac{\delta \varepsilon_{t}}{\delta \theta} = <br />
\sum_{1 \leq k \leq T} <br />
\left(<br />
\frac{\delta \varepsilon_{t}}{\delta x_{t}}<br />
\frac{\delta x_{t}}{\delta x_{k}}<br />
\frac{\delta^{+} x_{k}}{\delta \theta}<br />
\right)</math><br />
<br />
<math>\frac{\delta x_{t}}{\delta x_{k}} =<br />
\prod_{t \leq i \leq k} \frac{\delta x_{i}}{\delta x_{i - 1}} =<br />
\prod_{t \leq i \leq k} <br />
W^{T}_{rec} \textit{diag}(\sigma^{\prime}(x_{i - 1}))</math><br />
<br />
Where:<br />
<br />
* <span><math>\varepsilon_{t}</math>: is the error obtained from output at time <math>t</math></span><br />
* <span><math>\frac{\delta^{+} x_{k}}{\delta \theta}</math>: is the immediate partial derivative of state <math>x_{k}</math></span><br />
<br />
== Exploding and Vanishing Gradients ==<br />
=== The Mechanics ===<br />
<br />
=== From a dynamical systems perspective ===<br />
<br />
Drawing from a dynamical systems perspective similiar to <ref name="yoshua1993"></ref><ref name="doya1993">Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on neural networks, 1:75–80, 1993.</ref>, dynamical systems theory states that as “<math>\theta</math> changes the asymptotic behaviour changes smoothly almost every where except for cetian crucial points where drasic changes occur” <ref name="pascanu"></ref>, this is because crossing these bifurcation has the potential to cause gradients to explode <ref name="doya1993"></ref>.<br />
<br />
The authors of this paper argues, however, that crossing these bifurcation does not guarantee a sudden chage in gradients if the model state is not in the basin of an attractor. On the other hand if the model is in the basin of an attractor, crossing boundaries between basins will cause the gradients to explode.<br />
<br />
[[Image:dynamic_perspective.png|frame|center|400px|alt=|Bifurcation diagram of Single Hidden Unit RNN <ref name="pascanu"></ref>]]<br />
<br />
The figure above depicts this argument: the x-axis is the parameter <math>b</math> (bias), the y-axis is the asymptotic state <math>x_{\infty}</math>, and the plot line traces the movement of the final point attractor <math>x_{\infty}</math> as <math>b</math> changes. The figure shows two attractors, one emerging at <math>b_1</math> and another disappearing at <math>b_2</math>; the boundary between the two attractors is denoted by the dashed line. The blue filled circles mark Doya’s (1993) original hypothesis of exploding gradients, where a small change in <math>\theta</math> '''could''' (50% chance) cause <math>x</math> to change suddenly. The unfilled green circles mark Pascanu’s (2013) extension of Doya’s hypothesis: if the model state is within the boundary region at time <math>0</math>, a small change in <math>\theta</math> results in a sudden large change in <math>x_{t}</math>.<br />
<br />
=== From a geometric perspective ===<br />
<br />
Aside from the dynamical systems perspective on exploding and vanishing gradients, the authors also considered a geometric perspective, using a simple one-hidden-unit RNN.<br />
<br />
[[Image:geometric_perspective.png|frame|center|400px|alt=|Error Loss surface of single hidden unit RNN <ref name="pascanu"></ref>]]<br />
<br />
Reusing the RNN state equation the authors defined:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
By assuming no input, a linear activation, <math>b = 0</math>, and initial state <math>x_{0}</math>, the equation simplifies to:<br />
<br />
<math>x_{t} = W_{rec}^{t} x_{0}</math><br />
<br />
Differentiating the above equation to the first and second order would give:<br />
<br />
<math>\frac{\delta x_{t}}{\delta \omega} = t W_{rec}^{t - 1} x_{0}</math><br />
<br />
<math>\frac{\delta^{2} x_{t}}{\delta \omega^{2}} = t (t - 1) W_{rec}^{t - 2} x_{0}</math><br />
<br />
This implies that if the first-order derivative explodes, so will the second. Consequently, when Stochastic Gradient Descent (SGD) approaches the steep wall of the loss error surface and attempts to step into it, the exploding gradient deflects the update away from the wall, possibly hindering the learning process (see figure above).<br />
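These formulas are easy to check numerically. The following sketch uses the scalar (single-hidden-unit) case, with arbitrary illustrative values for <math>w</math>, <math>x_0</math> and <math>t</math>:<br />

```python
def state_and_grads(w, x0, t):
    """x_t = w**t * x0 and its first two derivatives with respect to w,
    matching the formulas above for the scalar (single-unit) case."""
    x_t = w ** t * x0
    dx_dw = t * w ** (t - 1) * x0
    d2x_dw2 = t * (t - 1) * w ** (t - 2) * x0
    return x_t, dx_dw, d2x_dw2

# With |w| > 1, both derivatives blow up as t grows:
_, g1, g2 = state_and_grads(w=1.2, x0=1.0, t=100)
```

For <math>|w| > 1</math> both derivatives grow without bound as <math>t</math> increases, producing the steep wall in the loss surface described above.<br />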
<br />
= Dealing with Exploding and Vanishing Gradient =<br />
<br />
== Related Work ==<br />
<br />
* <span>'''L1/L2 Regularization''': Helps with exploding gradients, but limits the RNN to a single point attractor at the origin, and prevents the model from learning generator models or exhibiting long term memory traces.</span><br />
* <span>'''Teacher Forcing''': Proposed by <ref name="doya1993"></ref>, forces a crossing of the bifurcation boundary if the model does not exhibit asymptotic behaviour towards a desired target. This assumes the user knows what the target behaviour looks like, or how to initialize the model to reduce exploding gradients.</span><br />
* <span>'''LSTM''': The Long Short-Term Memory architecture by <ref name="graves2009">Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009</ref><ref name="Hochreiter">Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.</ref> is an attempt to deal with the vanishing gradient problem by using a linear unit that feeds back to itself with a weight of <math>1</math>. This solution, however, does not deal with the exploding gradient.</span><br />
* <span>'''Hessian-Free optimizer with structural damping''': Proposed by <ref>Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011</ref>, this addresses both the vanishing and exploding gradient problems. <ref name="pascanu"></ref> reasons that this approach solves the vanishing gradient problem because the high dimensionality of the space gives rise to a high probability that the long-term components are orthogonal to the short-term components, and addresses the exploding gradient problem because the curvature of the error surface is taken into account.</span><br />
* <span>'''Echo State Networks''': <ref>Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.</ref> avoid the exploding and vanishing gradient problems by not learning the input and recurrent weights; these are instead drawn from hand-crafted distributions that prevent information from getting lost, since the spectral radius of the recurrent weight matrix is usually kept smaller than 1.</span><br />
<br />
== Scaling Down the Gradients ==<br />
<br />
[[Image:gradient_clipping.png|frame|center|400px]]<br />
<br />
The intuition behind this gradient clipping algorithm is simple: compute the norm of the gradients, and if it exceeds a set threshold, scale the gradients by a constant equal to the threshold divided by the norm of the gradients. <ref name="pascanu"></ref> suggests using a threshold value from half to ten times the norm.<br />
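The clipping rule amounts to a few lines of code. This NumPy sketch is a generic illustration of the algorithm (the threshold in the example is an arbitrary choice):<br />

```python
import numpy as np

def clip_gradients(grads, threshold):
    """Rescale the gradient vector whenever its norm exceeds the
    threshold, so the returned norm is at most the threshold."""
    norm = np.linalg.norm(grads)
    if norm > threshold:
        grads = grads * (threshold / norm)
    return grads

# A gradient of norm 5 gets rescaled to norm 1:
g = clip_gradients(np.array([3.0, 4.0]), threshold=1.0)
```

Gradients whose norm is already below the threshold pass through unchanged, so the rule only intervenes during exploding-gradient steps.<br />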
<br />
== Vanishing Gradient Regularization ==<br />
<br />
The vanishing gradient regularizer is as follows:<br />
<br />
<math>\Omega <br />
= \sum_{k} \Omega_{k} <br />
= \sum_{k} <br />
\left( <br />
\frac{<br />
\| <br />
\frac{\delta \varepsilon}{\delta x_{k + 1}} <br />
\frac{\delta x_{k + 1}}{\delta x_{k}}<br />
\|<br />
}<br />
{<br />
\|<br />
\frac{\delta \varepsilon}{\delta x_{k + 1}}<br />
\| <br />
} - 1<br />
\right)^{2}</math><br />
<br />
The vanishing gradient problem occurs when, at time <math>t</math>, the inputs <math>u</math> may be irrelevant and noisy and the network starts to learn to ignore them. However, this is not desirable, as the model will end up not learning anything. The authors found that the sensitivity to all inputs <math>u_{t} \dots u_{k}</math> could be increased by increasing the norm of <math>\frac{\delta x_t}{\delta x_{k}}</math>. This implies that in order to increase the norm of <math>\frac{\delta x_t}{\delta x_{k}}</math> the error must remain large; this, however, would prevent the model from converging, so the authors argue that a regularizer is the more natural choice. The regularizer is a soft constraint that forces the Jacobian matrices <math>\frac{\delta x_{k + 1}}{\delta x_{k}}</math> to preserve norm in the direction of the error <math>\frac{\delta \varepsilon}{\delta x_{k + 1}}</math>.<br />
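A single term <math>\Omega_k</math> of this regularizer can be computed directly from the error vector and the Jacobian. The sketch below uses made-up two-dimensional vectors to show that a norm-preserving Jacobian incurs no penalty while a contracting one does:<br />

```python
import numpy as np

def omega_k(dE_dx_next, J_k):
    """One term of the vanishing-gradient regularizer: penalizes the
    Jacobian J_k = dx_{k+1}/dx_k for not preserving the norm of the
    error signal dE/dx_{k+1} propagated through it."""
    propagated = dE_dx_next @ J_k  # (dE/dx_{k+1}) * (dx_{k+1}/dx_k)
    ratio = np.linalg.norm(propagated) / np.linalg.norm(dE_dx_next)
    return (ratio - 1.0) ** 2

# A norm-preserving Jacobian (a rotation) incurs no penalty...
rot = np.array([[0.0, -1.0], [1.0, 0.0]])
no_penalty = omega_k(np.array([1.0, 2.0]), rot)
# ...while a contracting Jacobian (0.5 * identity) is penalized.
penalty = omega_k(np.array([1.0, 2.0]), 0.5 * np.eye(2))
```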
<br />
= Experimental Results =<br />
<br />
== The Temporal Order Problem ==<br />
<br />
The authors used the temporal order problem as the prototypical pathological problem for validating the clipping and regularization methods devised. The temporal order problem involves generating a long sequence of discrete symbols in which an <math>A</math> or a <math>B</math> symbol is placed at the beginning and at the middle of the sequence. The task is to correctly classify the order of the two symbols at the end of the sequence.<br />
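A data generator for this task can be sketched as follows (the sequence length, distractor alphabet, and exact placement windows are illustrative assumptions, since the summary does not specify them):<br />

```python
import random

def temporal_order_example(length=50, distractors="cd"):
    """Generate one (sequence, label) pair for the temporal order task:
    an 'A' or 'B' is placed near the beginning and near the middle of a
    sequence of distractor symbols; the label is the ordered pair."""
    seq = [random.choice(distractors) for _ in range(length)]
    first, second = random.choice("AB"), random.choice("AB")
    seq[random.randrange(0, length // 10)] = first
    seq[random.randrange(length // 2, length // 2 + length // 10)] = second
    return "".join(seq), first + second

seq, label = temporal_order_example()
```

The label is one of the four ordered pairs (AA, AB, BA, BB), and the model must remember the first symbol across the long run of distractors, which is what makes the task pathological for gradient-based training.<br />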
<br />
Three different RNN initializations were used for the experiment:<br />
<br />
* <span>'''sigmoid unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
* <span>'''basic tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.1)</math></span><br />
* <span>'''smart tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
<br />
For each of the three RNN networks, three different optimizer configurations were used:<br />
<br />
* <span>'''MSGD''': Mini-batch Stochastic Gradient Descent</span><br />
* <span>'''MSGD-C''': MSGD with Gradient Clipping</span><br />
* <span>'''MSGD-CR''': MSGD-C with Regularization</span><br />
<br />
Additional model parameters include:<br />
<br />
* <span>Spectral Radius of 0.95</span><br />
* <span><math>b = b_{out} = 0</math></span><br />
* <span>50 hidden neurons</span><br />
* <span>constant learning rate of 0.01</span><br />
* <span>clipping threshold of 1 (only for MSGD-C and MSGD-CR)</span><br />
* <span>regularization weight of 4 (MSGD-CR)</span><br />
<br />
The experiment was performed 5 times. From the figure below we can observe the importance of gradient clipping and the regularizer: in all cases the combination of the two methods yielded the best results, regardless of which unit network was used. Furthermore, this experiment provides empirical evidence that exploding gradients correlate with tasks requiring long memory traces: as the sequence length of the problem increases, clipping and regularization become more important.<br />
<br />
[[Image:experimental_results.png|frame|center|400px]]<br />
<br />
== Other Pathological Problems ==<br />
<br />
The authors repeated other pathological problems from <ref name="Hochreiter"></ref>. The results are listed below:<br />
<br />
[[Image:experimental_results_2.png|image]]<br />
<br />
The authors did not discuss the experimental results in detail here.<br />
<br />
= Summary =<br />
<br />
The paper explores two perspectives, dynamical systems and geometric, for explaining the exploding and vanishing gradient problems in training RNNs. The authors devised methods to mitigate these problems by introducing gradient clipping and a vanishing-gradient regularizer; their experimental results show that, in every experiment except on the Penn Treebank dataset, clipping and the regularizer bested the state of the art for RNNs.</div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Learning_of_the_tissue-regulated_splicing_code&diff=26351deep Learning of the tissue-regulated splicing code2015-11-17T02:19:49Z<p>Lruan: /* Model */</p>
<hr />
<div>= Introduction =<br />
<br />
Alternative splicing (AS) is a regulated process during gene expression that enables the same gene to give rise to splicing isoforms containing different combinations of exons, which lead to different protein products. Furthermore, AS is often tissue dependent. This paper focuses on applying a Deep Neural Network (DNN) to predict the outcome of splicing, and compares its performance to the previously trained Bayesian Neural Network<ref>https://www.cs.cmu.edu/afs/cs/academic/class/15782-f06/slides/bayesian.pdf</ref> (BNN) and to Multinomial Logistic Regression<ref>https://en.wikipedia.org/wiki/Multinomial_logistic_regression</ref> (MLR). <br />
<br />
A major difference the authors introduced in the DNN is that each tissue type is treated as an input, whereas in the previous BNN each tissue type was a different output of the neural network. Moreover, in previous work the splicing code inferred only the direction of change of the percentage of transcripts with an exon spliced in (PSI). This paper instead predicts absolute PSI for each tissue individually, without averaging across tissues, and also predicts the difference in PSI (<math>\Delta</math>PSI) between pairs of tissues. Unlike a regular deep neural network, this model trains these two prediction tasks simultaneously.<br />
<br />
= Model =<br />
<br />
The dataset consists of 11019 mouse alternative exons profiled from RNA-Seq<ref>https://en.wikipedia.org/wiki/RNA-Seq</ref> Data. Five tissue types are available, including brain, heart, kidney, liver and testis. <br />
<br />
The DNN is fully connected, with multiple layers of non-linearity consisting of hidden units. The mathematical expression of model is below:<br />
<br />
::::::: <math>{a_v}^l = f(\sum_{m=1}^{M^{l-1}}{\theta_{v,m}^{l}a_m^{l-1}})</math> <br />
:::::::where <math>{a_v}^l</math> is the activation of unit <math>v</math> in layer <math>l</math>, <math>f</math> is applied to the weighted sum of outputs from the previous layer, and <math>\theta_{v,m}^{l}</math> are the weights between layers. <br />
<br />
::::::: <math>f_{RELU}(z)=max(0,z)</math><br />
::::::: The RELU unit was used for all hidden units except for the first hidden layer, which uses TANH units.<br />
<br />
::::::: <math>h_k=\frac{exp(\sum_m{\theta_{k,m}^{last}a_m^{last}})}{\sum_{k'}{exp(\sum_{m}{\theta_{k',m}^{last}a_m^{last}})}}</math><br />
::::::: this is the softmax function of the last layer. <br />
<br />
The cost function minimized during training is <math>E=-\sum_{n}\sum_{k=1}^{C}{y_{n,k}\log(h_{n,k})}</math>, where <math>n</math> denotes the training example, and <math>k</math> indexes the <math>C</math> classes. <br />
<br />
The identities of the two tissues are then appended to the vector of outputs of the first hidden layer, together forming the input to the second hidden layer. Each identity is a 1-of-5 binary variable in this case (demonstrated in Fig. 1). The first training target contains three classes, labeled ''low'', ''medium'', ''high'' (LMH code). The second task describes the <math>\Delta PSI</math> between two tissues for a particular exon; the three classes corresponding to this task are ''decreased inclusion'', ''no change'' and ''increased inclusion'' (DNI code). Both the LMH and DNI codes are trained jointly, reusing the same hidden representations learned by the model. The DNN was trained with backpropagation and dropout, using different learning rates for the two tasks. <br />
<br />
[[File: Modell.png]]<br />
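The architecture described above, shared hidden layers with the two tissue identities appended to the first hidden layer's output and two softmax heads for the LMH and DNI codes, can be sketched as a forward pass. All layer sizes and weights below are made up for illustration and are much smaller than the real model's:<br />

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def forward(features, tissue_a, tissue_b):
    """Hypothetical miniature version of the splicing-code DNN."""
    W1 = rng.standard_normal((8, features.size))
    h1 = np.tanh(W1 @ features)              # first hidden layer: TANH units
    onehot = np.zeros(10)                    # two 1-of-5 tissue identities
    onehot[tissue_a] = onehot[5 + tissue_b] = 1.0
    h1_aug = np.concatenate([h1, onehot])    # append tissue identities
    W2 = rng.standard_normal((6, h1_aug.size))
    h2 = np.maximum(0.0, W2 @ h1_aug)        # later hidden layers: RELU units
    W_lmh = rng.standard_normal((3, 6))      # two jointly trained heads
    W_dni = rng.standard_normal((3, 6))
    return softmax(W_lmh @ h2), softmax(W_dni @ h2)

lmh, dni = forward(rng.standard_normal(20), tissue_a=0, tissue_b=3)
```

Each head outputs a distribution over its three classes (''low''/''medium''/''high'' and ''decreased''/''no change''/''increased''), sharing the hidden representation as described above.<br />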
<br />
= Performance comparison =<br />
<br />
The performance of the model was assessed using the area under the Receiver-Operating Characteristic curve (AUC) metric. This paper compared three methods through the same baseline, DNN, BNN and MLR. <br />
<br />
The results (LMH code) are shown in the table below. Table 1a reports AUC for PSI predictions from the LMH code on all tissues, while Table 1b reports AUC evaluated on the subset of events that exhibit large tissue variability. From 1a, the performance of the DNN in the ''low'' and ''high'' categories is comparable with the BNN, but the DNN outperformed it at the ''medium'' level. From 1b, the DNN significantly outperformed both BNN and MLR. In both comparisons, MLR performed poorly. <br />
<br />
[[File: LMH.png]]<br />
<br />
Next, we look at how well the different methods predict <math>\Delta PSI</math> (DNI code). The DNN predicts the LMH and DNI codes at the same time, while the BNN can only predict the LMH code. Thus, for a fair comparison, the authors trained an MLR on the predicted outputs for each tissue pair from the BNN, and similarly trained an MLR on the LMH outputs of the DNN. Table 2 shows that both DNN and DNN+MLR outperformed BNN+MLR and MLR alone. <br />
<br />
[[File: DNI.png]]<br />
<br />
<br />
'''Why did DNN outperform?'''<br />
<br />
1. The use of tissue types as an input feature, which stringently requires the model's hidden representations to be in a form that can be well modulated by information specifying the different tissue types for splicing pattern prediction. <br />
<br />
2. The model is described by thousands of hidden units and multiple layers of non-linearity. In contrast, BNN only has 30 hidden units, which may not be sufficient. <br />
<br />
3. A hyperparameter search is performed to optimize the DNN.<br />
<br />
4. The use of dropout, which contributed ~1-6% improvement in the LMH code for different tissues, and ~2-7% in the DNI code, compared with without dropout.<br />
<br />
5. Training was biased toward the tissue-specific events (by construction of minibatches).<br />
<br />
= Conclusion =<br />
<br />
This work shows that DNNs can also be used on a sparse biological dataset. Furthermore, the input features can be analyzed in terms of the predictions of the model to gain some insight into the inferred tissue-regulated splicing code. This architecture can easily be extended to the case of more data from different sources.<br />
<br />
= References =<br />
<br />
<references /></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Learning_of_the_tissue-regulated_splicing_code&diff=26350deep Learning of the tissue-regulated splicing code2015-11-17T02:18:58Z<p>Lruan: /* Model */</p>
<hr />
<div>= Introduction =<br />
<br />
Alternative splicing(AS) is a regulated process during gene expression that enables the same gene to give rise to splicing isoforms containing different combinations of exons, which leads to different protein products. Furthermore, AS is often tissue dependent. This paper mainly focus on performing Deep Neural Network (DNN) in predicting outcome of splicing, and compare the performance to formerly trained model Bayesian Neural Network<ref>https://www.cs.cmu.edu/afs/cs/academic/class/15782-f06/slides/bayesian.pdf</ref> (BNN), and Multinomial Logistic Regression<ref>https://en.wikipedia.org/wiki/Multinomial_logistic_regression</ref> (MLR). <br />
<br />
A huge difference that the author imposed in DNN is that each tissue type are treated as an input; while in previous BNN, each tissue type was considered as a different output of the neural network. Moreover, in previous work, the splicing code infers the direction of change of the percentage of transcripts with an exon spliced in (PSI). Now, this paper perform absolute PSI prediction for each tissue individually without averaging across tissues, and also predict the difference PSI (<math>\Delta</math>PSI) between pairs of tissues. Apart from regular deep neural network, this model will train these two prediction tasks simultaneously.<br />
<br />
= Model =<br />
<br />
The dataset consists of 11019 mouse alternative exons profiled from RNA-Seq<ref>https://en.wikipedia.org/wiki/RNA-Seq</ref> Data. Five tissue types are available, including brain, heart, kidney, liver and testis. <br />
<br />
The DNN is fully connected, with multiple layers of non-linearity consisting of hidden units. The mathematical expression of model is below:<br />
<br />
::::::: <math>{a_v}^l = f(\sum_{m}^{M^{l-1}}{\theta_{v,m}^{l}a_m^{l-1}})</math>, where a is the weighted sum of outputs from the previous layer. <math>\theta_{v,m}^{l}</math> is the weights between layers. <br />
<br />
::::::: <math>f_{RELU}(z)=max(0,z)</math>, The RELU unit was used for all hidden units except for the first hidden layer, which uses TANH units.<br />
<br />
::::::: <math>h_k=\frac{exp(\sum_m{\theta_{k,m}^{last}a_m^{last}})}{\sum_{k'}{exp(\sum_{m}{\theta_{k',m}^{last}a_m^{last}})}}</math>, this is the softmax function of the last layer. <br />
<br />
The cost function we want to minimize here during training is <math>E=-\sum_a\sum_{k=1}^{C}{y_{n,k}log(h{n,k})}</math>, where <math>n</math> denotes the training example, and <math>k</math> indexes <math>C</math> classes. <br />
<br />
The identity of two tissues are then appended to the vector of outputs of the first hidden layer, together forming the input into the second hidden layer. The identity is a 1-of-5 binary variables in this case. (Demonstrated in Fig.1) The first targets for training contains three classes, which labeled as ''low'', ''medium'', ''high'' (LMH code). The second task describes the <math>\Delta PSI</math> between two tissues for a particular exon. The three classes corresponds to this task is ''decreased inclusion'', ''no change'' and ''increased inclusion'' (DNI code).Both the LMH and DNI codes are trained jointly, reusing the same hidden representations learned by the model. The DNN used backpropagation with dropout to train the data, and used different learning rates for two tasks. <br />
<br />
[[File: Modell.png]]<br />
<br />
= Performance comparison =<br />
<br />
The performance of the model was assessed using the area under the Receiver-Operating Characteristic curve (AUC) metric. This paper compared three methods through the same baseline, DNN, BNN and MLR. <br />
<br />
The result (LMH code) shows in the table below. Table 1a reports AUC for PSI predictions from the LMH code on all tissues; while 1b reports AUC evaluated on the subset of events that exhibit large tissue variability. From 1a, the performance of DNN in ''low'' and ''high'' categories are comparable with the BNN, but outperformed at the ''medium'' level. From 1b, DNN significantly outperformed BNN and MLR. In both comparison, MLR performed poorly. <br />
<br />
[[File: LMH.png]]<br />
<br />
Next, we look at how well the different methods can predict <math>\Delta PSI</math> (DNI code). DNN predicts LMH code and DNI code at the same time; while in BNN, the model can only predict LMH code. Thus, for a fair comparison. author used a MLR on the predicted outputs for each tissue pair from BNN and similarly trained MLR on the LMH outputs of the DNN. Table 2 shows that both DNN and DNN+MLR outperformed the BNN+MLR or MLR. <br />
<br />
[[File: DNI.png]]<br />
<br />
<br />
'''Why did DNN outperform?'''<br />
<br />
1. The use of tissue types as an input freature, which stringently required the model's hidden representations be in a form that can be well-modulated by information specifying the different tissue types for splicing pattern prediction. <br />
<br />
2. The model is described by thousands of hidden units and multiple layers of non-linearity. In contrast, BNN only has 30 hidden units, which may not be sufficient. <br />
<br />
3. A hyperparameter search is performed to optimize the DNN.<br />
<br />
4. The use of dropout, which contributed ~1-6% improvement in the LMH code for different tissues, and ~2-7% in the DNI code, compared with without dropout.<br />
<br />
5. Training was biased toward the tissue-specific events (by construction of minibatches).<br />
<br />
= Conclusion =<br />
<br />
This work shows that DNN can also be used in a sparse biological dataset. Furthermore, the input features can be analyzed in terms of the predictions of the model to gain some insights into the inferred tissue-regulated splicing code. This architecture can easily be extended to the case of more data from different sources.<br />
<br />
= reference =<br />
<br />
<references /></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Learning_of_the_tissue-regulated_splicing_code&diff=26349deep Learning of the tissue-regulated splicing code2015-11-17T02:18:20Z<p>Lruan: </p>
<hr />
<div>= Introduction =<br />
<br />
Alternative splicing(AS) is a regulated process during gene expression that enables the same gene to give rise to splicing isoforms containing different combinations of exons, which leads to different protein products. Furthermore, AS is often tissue dependent. This paper mainly focus on performing Deep Neural Network (DNN) in predicting outcome of splicing, and compare the performance to formerly trained model Bayesian Neural Network<ref>https://www.cs.cmu.edu/afs/cs/academic/class/15782-f06/slides/bayesian.pdf</ref> (BNN), and Multinomial Logistic Regression<ref>https://en.wikipedia.org/wiki/Multinomial_logistic_regression</ref> (MLR). <br />
<br />
A huge difference that the author imposed in DNN is that each tissue type are treated as an input; while in previous BNN, each tissue type was considered as a different output of the neural network. Moreover, in previous work, the splicing code infers the direction of change of the percentage of transcripts with an exon spliced in (PSI). Now, this paper perform absolute PSI prediction for each tissue individually without averaging across tissues, and also predict the difference PSI (<math>\Delta</math>PSI) between pairs of tissues. Apart from regular deep neural network, this model will train these two prediction tasks simultaneously.<br />
<br />
= Model =<br />
<br />
The dataset consists of 11019 mouse alternative exons profiled from RNA-Seq<ref>https://en.wikipedia.org/wiki/RNA-Seq</ref> Data. Five tissue types are available, including brain, heart, kidney, liver and testis. <br />
<br />
The DNN is fully connected, with multiple layers of non-linearity consisting of hidden units. The mathematical expression of model is below:<br />
<br />
::::::: <math>{a_v}^l = f(\sum_{m}^{M^{l-1}}{\theta_{v,m}^{l}a_m^{l-1}})</math>, where a is the weighted sum of outputs from the previous layer. <math>\theta_{v,m}^{l}</math> is the weights between layers. <br />
<br />
::::::: <math>f_{RELU}(z)=max(0,z)</math>, The RELU unit was used for all hidden units except for the first hidden layer, which uses TANH units.<br />
<br />
::::::: <math>h_k=\frac{exp(\sum_m{\theta_{k,m}^{last}a_m^{last}})}{\sum_{k'}{exp(\sum_{m}{\theta_{k',m}^{last}a_m^{last}})}}</math>, this is the softmax function of the last layer. <br />
<br />
The cost function we want to minimize here during training is <math>E=-\sum_a\sum_{k=1}^{C}{y_{n,k}log(h{n,k})}</math>, where <math>n</math> denotes the training example, and <math>k</math> indexes <math>C</math> classes. <br />
<br />
The identity of two tissues are then appended to the vector of outputs of the first hidden layer, together forming the input into the second hidden layer. The identity is a 1-of-5 binary variables in this case. (Demonstrated in Fig.1) The first targets for training contains three classes, which labeled as ''low'', ''medium'', ''high'' (LMH code). The second task describes the <math>\Delta PSI</math> between two tissues for a particular exon. The three classes corresponds to this task is ''decreased inclusion'', ''no change'' and ''increased inclusion'' (DNI code).Both the LMH and DNI codes are trained jointly, reusing the same hidden representations learned by the model. The DNN used backpropagation with dropout to train the data, and used different learning rates for two tasks. <br />
<br />
[[File: Modell.png]]<br />
<br />
= Performance comparison =<br />
<br />
The performance of the model was assessed using the area under the Receiver-Operating Characteristic curve (AUC) metric. This paper compared three methods through the same baseline, DNN, BNN and MLR. <br />
<br />
The result (LMH code) shows in the table below. Table 1a reports AUC for PSI predictions from the LMH code on all tissues; while 1b reports AUC evaluated on the subset of events that exhibit large tissue variability. From 1a, the performance of DNN in ''low'' and ''high'' categories are comparable with the BNN, but outperformed at the ''medium'' level. From 1b, DNN significantly outperformed BNN and MLR. In both comparison, MLR performed poorly. <br />
<br />
[[File: LMH.png]]<br />
<br />
Next, we look at how well the different methods can predict <math>\Delta PSI</math> (DNI code). DNN predicts LMH code and DNI code at the same time; while in BNN, the model can only predict LMH code. Thus, for a fair comparison. author used a MLR on the predicted outputs for each tissue pair from BNN and similarly trained MLR on the LMH outputs of the DNN. Table 2 shows that both DNN and DNN+MLR outperformed the BNN+MLR or MLR. <br />
<br />
[[File: DNI.png]]<br />
<br />
<br />
'''Why did DNN outperform?'''<br />
<br />
1. The use of tissue types as an input freature, which stringently required the model's hidden representations be in a form that can be well-modulated by information specifying the different tissue types for splicing pattern prediction. <br />
<br />
2. The model is described by thousands of hidden units and multiple layers of non-linearity. In contrast, BNN only has 30 hidden units, which may not be sufficient. <br />
<br />
3. A hyperparameter search is performed to optimize the DNN.<br />
<br />
4. The use of dropout, which contributed ~1-6% improvement in the LMH code for different tissues, and ~2-7% in the DNI code, compared with without dropout.<br />
<br />
5. Training was biased toward the tissue-specific events (by construction of minibatches).<br />
<br />
= Conclusion =<br />
<br />
This work shows that DNN can also be used in a sparse biological dataset. Furthermore, the input features can be analyzed in terms of the predictions of the model to gain some insights into the inferred tissue-regulated splicing code. This architecture can easily be extended to the case of more data from different sources.<br />
<br />
= reference =<br />
<br />
<references /></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Modell.png&diff=26344File:Modell.png2015-11-17T02:11:08Z<p>Lruan: </p>
<hr />
<div></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:LMH.png&diff=26339File:LMH.png2015-11-17T02:09:51Z<p>Lruan: </p>
<hr />
<div></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:DNI.png&diff=26338File:DNI.png2015-11-17T02:09:38Z<p>Lruan: </p>
<hr />
<div></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Learning_of_the_tissue-regulated_splicing_code&diff=26335deep Learning of the tissue-regulated splicing code2015-11-17T02:09:05Z<p>Lruan: </p>
<hr />
<div>= Introduction =<br />
<br />
Alternative splicing(AS) is a regulated process during gene expression that enables the same gene to give rise to splicing isoforms containing different combinations of exons, which leads to different protein products. Furthermore, AS is often tissue dependent. This paper mainly focus on performing Deep Neural Network (DNN) in predicting outcome of splicing, and compare the performance to formerly trained model Bayesian Neural Network(ref) (BNN), and Multinomial Logistic Regression (MLR). <br />
<br />
A huge difference that the author imposed in DNN is that each tissue type are treated as an input; while in previous BNN, each tissue type was considered as a different output of the neural network. Moreover, in previous work, the splicing code infers the direction of change of the percentage of transcripts with an exon spliced in (PSI). Now, this paper perform absolute PSI prediction for each tissue individually without averaging across tissues, and also predict the difference PSI (<math>\Delta</math>PSI) between pairs of tissues. Apart from regular deep neural network, this model will train these two prediction tasks simultaneously.<br />
<br />
= Model =<br />
<br />
The dataset consists of 11019 mouse alternative exons profiled from RNA-Seq(ref) Data. Five tissue types are available, including brain, heart, kidney, liver and testis. <br />
<br />
The DNN is fully connected, with multiple layers of non-linearity consisting of hidden units. The mathematical expression of model is below:<br />
<br />
<math>{a_v}^l = f(\sum_{m}^{M^{l-1}}{\theta_{v,m}^{l}a_m^{l-1}})</math>, where a is the weighted sum of outputs from the previous layer. <math>\theta_{v,m}^{l}</math> is the weights between layers. <br />
<br />
<math>f_{RELU}(z)=max(0,z)</math>, The RELU unit was used for all hidden units except for the first hidden layer, which uses TANH units.<br />
<br />
<math>h_k=\frac{exp(\sum_m{\theta_{k,m}^{last}a_m^{last}})}{\sum_{k'}{exp(\sum_{m}{\theta_{k',m}^{last}a_m^{last}})}}</math>, this is the softmax function of the last layer. <br />
<br />
The cost function we want to minimize here during training is <math>E=-\sum_a\sum_{k=1}^{C}{y_{n,k}log(h{n,k})}</math>, where <math>n</math> denotes the training example, and <math>k</math> indexes <math>C</math> classes. <br />
<br />
The identity of two tissues are then appended to the vector of outputs of the first hidden layer, together forming the input into the second hidden layer. The identity is a 1-of-5 binary variables in this case. (Demonstrated in Fig.1) The first targets for training contains three classes, which labeled as ''low'', ''medium'', ''high'' (LMH code). The second task describes the <math>\Delta PSI</math> between two tissues for a particular exon. The three classes corresponds to this task is ''decreased inclusion'', ''no change'' and ''increased inclusion'' (DNI code).Both the LMH and DNI codes are trained jointly, reusing the same hidden representations learned by the model. The DNN used backpropagation with dropout to train the data, and used different learning rates for two tasks. <br />
<br />
[File: 'Model.png']<br />
<br />
= Performance comparison =<br />
<br />
The performance of the model was assessed using the area under the Receiver-Operating Characteristic curve (AUC) metric. This paper compared three methods through the same baseline, DNN, BNN and MLR. <br />
<br />
The results for the LMH code are shown in the table below. Table 1a reports AUC for PSI predictions from the LMH code on all tissues, while 1b reports AUC evaluated on the subset of events that exhibit large tissue variability. From 1a, the performance of the DNN in the ''low'' and ''high'' categories is comparable with the BNN, but better at the ''medium'' level. From 1b, the DNN significantly outperformed the BNN and MLR. In both comparisons, MLR performed poorly. <br />
<br />
[[File:LMH.png]]<br />
<br />
Next, we look at how well the different methods can predict <math>\Delta PSI</math> (the DNI code). The DNN predicts the LMH and DNI codes at the same time, while the BNN can only predict the LMH code. Thus, for a fair comparison, the authors trained an MLR on the BNN's predicted outputs for each tissue pair, and similarly trained an MLR on the LMH outputs of the DNN. Table 2 shows that both DNN and DNN+MLR outperformed BNN+MLR and MLR. <br />
<br />
[[File:DNI.png]]<br />
<br />
<br />
'''Why did the DNN outperform?'''<br />
<br />
1. The use of tissue types as an input feature, which stringently required the model's hidden representations to be in a form that can be well-modulated by information specifying the different tissue types for splicing pattern prediction. <br />
<br />
2. The model is described by thousands of hidden units and multiple layers of non-linearity. In contrast, the BNN has only 30 hidden units, which may not be sufficient. <br />
<br />
3. A hyperparameter search is performed to optimize the DNN.<br />
<br />
4. The use of dropout, which contributed a ~1-6% improvement in the LMH code for different tissues, and ~2-7% in the DNI code, compared with training without dropout.<br />
<br />
5. Training was biased toward the tissue-specific events (by construction of minibatches).<br />
<br />
= Conclusion =<br />
<br />
This work shows that DNN can also be used in a sparse biological dataset. Furthermore, the input features can be analyzed in terms of the predictions of the model to gain some insights into the inferred tissue-regulated splicing code. This architecture can easily be extended to the case of more data from different sources.</div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26182dropout2015-11-13T01:53:32Z<p>Lruan: </p>
<hr />
<div>= Introduction =<br />
Dropout<ref>https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf</ref> is a technique for preventing overfitting in deep neural networks, which contain a large number of parameters. The key idea is to randomly drop units from the neural network during training; in effect, dropout samples from an exponential number of different “thinned” networks. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability p, independent of the other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layers. Let <math>\bold{z^{(l)}} </math> denote the vector of inputs into layer <math> l </math> and <math>\bold{y}^{(l)} </math> the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^l+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables, each of which has probability <math>p </math> of being 1. <math>\tilde {\bold y}^{(l)} </math> is the thinned output after some hidden units are dropped, which serves as the input to the next layer. The rest of the model remains the same as a regular feed-forward neural network.<br />
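The feed-forward operations above can be sketched for a single layer as follows (a minimal NumPy sketch with illustrative names and sizes):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(y_prev, W, b, p=0.5, f=np.tanh):
    # r^(l) ~ Bernoulli(p): each output of the previous layer is kept
    # with probability p; the thinned vector then feeds forward as usual.
    r = rng.binomial(1, p, size=y_prev.shape)
    y_thin = r * y_prev     # element-wise product r^(l) * y^(l)
    z = W @ y_thin + b      # z^(l+1) = W^(l+1) y~^(l) + b^(l+1)
    return f(z), r          # keep the mask for backpropagation

y = rng.random(8)
W, b = rng.normal(size=(4, 8)), np.zeros(4)
out, mask = dropout_layer(y, W, b, p=0.5)
```

The returned mask is what makes backpropagation operate on the thinned network: units with a 0 in the mask contribute no gradient for that training case.<br />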
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
A dropout neural network can be trained using stochastic gradient descent in a manner similar to a standard neural network. The only difference is that we backpropagate only through each thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization provides a significant boost over using dropout alone. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be upper-bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, we impose the constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constraint is that it makes it possible to use a large learning rate without the weights blowing up. <br />
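The constraint can be enforced after each gradient step by rescaling any incoming weight vector whose norm exceeds <math>c </math>; a sketch (assuming rows of W hold each hidden unit's incoming weights):<br />

```python
import numpy as np

def max_norm(W, c=3.0):
    # Constrain each hidden unit's incoming weight vector (a row of W)
    # to have L2 norm at most c, rescaling only rows that exceed it.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 4.0],    # norm 5.0 -> rescaled down to norm 3.0
              [0.1, 0.2]])   # norm ~0.22 -> left unchanged
Wc = max_norm(W, c=3.0)
```

Projecting back onto the norm ball after each update, rather than adding a penalty term, is what allows aggressive learning rates to remain stable.<br />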
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has n units; then there are <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average their predictions. Thus, at test time, the idea is to use a single neural net without dropout, whose weights are scaled-down versions of the trained weights: if a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. The figure below shows the intuition.<br />
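A small numerical check of this weight-scaling rule: averaging pre-activations over many sampled thinned networks approximately matches a single deterministic pass with the weights scaled by p (a sketch with illustrative sizes, not the paper's code):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
y = rng.random(6)            # outputs of a trained hidden layer
W = rng.normal(size=(4, 6))  # outgoing weights of those units

# Average the pre-activation over many sampled thinned networks...
avg = np.mean([W @ (rng.binomial(1, p, 6) * y) for _ in range(20000)],
              axis=0)

# ...and compare with one pass using the weights scaled by p,
# since E[r * y] = p * y for r ~ Bernoulli(p).
scaled = (p * W) @ y
```

This equality holds exactly only for the pre-activations; after a non-linearity the scaled network is an approximation to the model average.<br />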
<br />
[[File:test.png]]<br />
<br />
''' Multiplicative Gaussian Noise '''<br />
<br />
Dropout uses Bernoulli-distributed random variables which take the value 1 with probability <math>p </math> and 0 otherwise. This idea can be generalized by multiplying the activations by a random variable drawn from <math>\mathcal{N}(1, 1) </math>, which works just as well as, or perhaps better than, Bernoulli noise. That is, each hidden activation <math>h_i </math> is perturbed to <math>h_i+h_ir </math> where <math>r \sim \mathcal{N}(0,1) </math>, which equals <math>h_ir' </math> where <math>r' \sim \mathcal{N}(1, 1) </math>. We can generalize this to <math>r' \sim \mathcal{N}(1, \sigma^2) </math>, where <math>\sigma^2</math> is a hyperparameter to tune.<br />
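Because the multiplicative noise has mean 1, the expected activation is unchanged, so no test-time rescaling is needed; a minimal sketch of the perturbation:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_dropout(h, sigma=1.0):
    # Perturb each activation h_i to h_i * r' with r' ~ N(1, sigma^2),
    # which is the same as h_i + h_i * r with r ~ N(0, sigma^2).
    return h * rng.normal(1.0, sigma, size=h.shape)

h = np.ones(100000)
noisy = gaussian_dropout(h, sigma=1.0)  # mean stays close to h
```

Here sigma^2 = 1 mirrors the <math>\mathcal{N}(1,1)</math> case in the text; smaller sigma gives milder regularization.<br />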
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
Dropout breaks the co-adaptations between hidden units. Figure 7a shows that each hidden unit has co-adapted in order to produce good reconstructions, and no hidden unit on its own detects a meaningful feature. In Figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps prevent overfitting. In a good sparse model, there should only be a few highly activated units for any data case, and the average activation of any unit across data cases should be low. Comparing the histograms of activations, we can see that fewer hidden units have high activations in Figure 8b compared to Figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper ran experiments to determine the tunable hyperparameter <math>p </math>. The comparison is done in two settings:<br />
1. The number of hidden units is held constant (fixed <math>n </math>).<br />
2. The expected number of hidden units that will be retained after dropout is held constant (fixed <math>pn </math>).<br />
The optimal <math>p </math> in case 1 is between 0.4 and 0.8, while in case 2 it is 0.6. The usual default value in practice is 0.5, which is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing the data set size when dropout is used with feed-forward networks. From Figure 10, dropout does not give any improvement on small data sets (100, 500 examples). As the size of the data set increases, the gain from dropout increases up to a point and then declines. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + max-norm outperforms all the other methods considered. The results are below:<br />
<br />
[[File:Comparison.png]]<br />
<br />
= Result =<br />
<br />
The authors applied dropout to the MNIST data set and compared it with other methods. The MNIST data set consists of 28 x 28 pixel handwritten digit images, and the task is to classify the images into 10 digit classes. From the result table, the Deep Boltzmann Machine + dropout finetuning performs best, with an error rate of only 0.79%. <br />
<br />
[[File:Result.png]]<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique to prevent overfitting in deep neural networks with a large number of parameters. It can also be extended to Restricted Boltzmann Machines and other graphical models (e.g. convolutional networks). One drawback of dropout is that it increases training time, creating a trade-off between overfitting and training time.<br />
<br />
=Reference=<br />
<references /></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26180dropout2015-11-13T01:51:31Z<p>Lruan: /* Introduction */</p>
<hr />
<div>= Introduction =<br />
Dropout<ref>https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf</ref> is one of the techniques for preventing overfitting in deep neural network which contains a large number of parameters. The key idea is to randomly drop units from the neural network during training. During training, dropout samples from an exponential number of different “thinned” network. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retrained with probability p independent of other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layer. Let <math>\bold{z^{(l)}} </math> denote the vector inputs into layer <math> l </math>, <math>\bold{y}^{(l)} </math> denote the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^l+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables each of which has probability of <math>p </math> of being 1. <math>\tilde {\bold y} </math> is the input after we drop some hidden units. The rest of the model remains the same as the regular feed-forward neural network.<br />
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
Dropout neural network can be trained using stochastic gradient descent in a manner similar to standard neural network. The only difference here is that we only back propagate on each thinned network. The gradient for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization provides a significant boost over using dropout alone. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we impose the constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constraint is that it makes it possible to use a large learning rate without the weights blowing up. <br />
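As an illustrative sketch (not code from the paper), the constraint can be enforced by projecting the weights back onto the ball of radius <math>c </math> after each gradient update. Here each row of <math>W </math> is assumed to hold the incoming weights of one hidden unit.<br />

```python
import numpy as np

def max_norm_project(W, c):
    """Rescale any row of W whose L2 norm exceeds c down to norm c;
    rows already inside the ball are left unchanged."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 4.0],    # norm 5.0 -> rescaled down to norm 1
              [0.3, 0.4]])   # norm 0.5 -> left untouched
W_proj = max_norm_project(W, c=1.0)
```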
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has <math>n </math> units; there are then <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average their predictions. Thus, at test time, the idea is to use a single neural net without dropout. The weights of this network are scaled-down versions of the trained weights: if a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. The figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
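For a single linear unit, the weight-scaling rule matches the average over sampled thinned networks exactly in expectation. The small check below (an illustration constructed here, not from the paper) compares a Monte-Carlo average over random dropout masks against the single <math>p </math>-scaled network.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 0.5, 6
w = rng.standard_normal(n)    # weights incoming to one linear unit
y = rng.standard_normal(n)    # activations subject to dropout

# Monte-Carlo average of w . (r * y) over many sampled dropout masks r
R = rng.binomial(1, p, size=(200_000, n))
mc_mean = ((R * y) @ w).mean()

# Single network with the weights scaled down by p (the test-time rule)
scaled = (p * w) @ y          # agrees with mc_mean up to Monte-Carlo error
```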
<br />
''' Multiplicative Gaussian Noise '''<br />
<br />
Dropout multiplies activations by Bernoulli-distributed random variables which take the value 1 with probability <math>p </math> and 0 otherwise. This idea can be generalized by multiplying the activations by a random variable drawn from <math>\mathcal{N}(1, 1) </math>, which works just as well as, or perhaps better than, Bernoulli noise. That is, each hidden activation <math>h_i </math> is perturbed to <math>h_i+h_ir </math> where <math>r \sim \mathcal{N}(0,1) </math>, which equals <math>h_ir' </math> where <math>r' \sim \mathcal{N}(1, 1) </math>. We can generalize this to <math>r' \sim \mathcal{N}(1, \sigma^2) </math>, where <math>\sigma^2 </math> is a hyperparameter to tune.<br />
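A minimal sketch of the Gaussian variant (the function name and the test-time behaviour are choices made here for illustration): the multiplicative noise has mean 1, so at test time the activations can simply pass through unchanged.<br />

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian_dropout(h, sigma, train=True):
    """Multiply each activation h_i by r' ~ N(1, sigma^2) during
    training; at test time the mean-1 noise is dropped entirely."""
    if not train:
        return h
    return h * rng.normal(1.0, sigma, size=h.shape)

h = np.ones(5)
noisy = gaussian_dropout(h, sigma=1.0)               # perturbed activations
clean = gaussian_dropout(h, sigma=1.0, train=False)  # identical to h
```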
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
Dropout breaks the co-adaptations between hidden units. Figure 7a shows that, without dropout, the hidden units have co-adapted in order to produce good reconstructions; no hidden unit on its own detects a meaningful feature. In Figure 7b, with dropout, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps prevent overfitting. In a good sparse model, there should be only a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations, we can see that fewer hidden units have high activations in Figure 8b than in Figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper ran experiments to determine the effect of the tunable hyperparameter <math>p </math>. The comparison is done in two situations:<br />
1. The number of hidden units is held constant. (fixed <math>n </math>)<br />
2. The expected number of hidden units that will be retained after dropout is held constant. (fixed <math>pn </math>)<br />
The optimal <math>p </math> in case 1 is between 0.4 and 0.8, while in case 2 it is 0.6. The usual default value in practice, 0.5, is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing the data set size when dropout is used with feed-forward networks. From Figure 10, dropout does not give any improvement on small data sets (100, 500 examples). As the size of the data set increases, the gain from dropout increases up to a point and then declines. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizers. Dropout + max-norm outperforms all the other methods compared. The results are below:<br />
<br />
[[File:Comparison.png]]<br />
<br />
= Result =<br />
<br />
The authors applied dropout to the MNIST data set and compared it with other methods. The MNIST data set consists of 28 × 28 pixel handwritten digit images; the task is to classify the images into 10 digit classes. From the result table, a Deep Boltzmann Machine with dropout finetuning performs best, with an error rate of only 0.79%. <br />
<br />
[[File:Result.png]]<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique to prevent overfitting in deep neural networks, which have a large number of parameters. It can also be extended to Restricted Boltzmann Machines and other models, e.g. convolutional networks. One drawback of dropout is that it increases training time; this creates a trade-off between overfitting and training time.</div>
<hr />
<div>= Introduction =<br />
Dropout is one of the techniques for preventing overfitting in deep neural network which contains a large number of parameters. The key idea is to randomly drop units from the neural network during training. During training, dropout samples from an exponential number of different “thinned” network. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retrained with probability p independent of other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layer. Let <math>\bold{z^{(l)}} </math> denote the vector inputs into layer <math> l </math>, <math>\bold{y}^{(l)} </math> denote the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^l+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables each of which has probability of <math>p </math> of being 1. <math>\tilde {\bold y} </math> is the input after we drop some hidden units. The rest of the model remains the same as the regular feed-forward neural network.<br />
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
Dropout neural network can be trained using stochastic gradient descent in a manner similar to standard neural network. The only difference here is that we only back propagate on each thinned network. The gradient for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization provides a significant boost over just using dropout. Max-norm constrain the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we put constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constrain is that it makes the model possible to use huge learning rate without the possibility of weights blowing up. <br />
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has n units, there will be <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weight of this network are scaled-down versions of the trained weights. If a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. Figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
<br />
''' Multiplicative Gaussian Noise '''<br />
<br />
Dropout takes Bernoulli distributed random variables which take the value 1 with probability <math>p </math> and 0 otherwise. This idea can be generalized by multiplying the activations by a random variable drawn from <math>\mathcal{N}(1, 1) </math>. It works just as well, or perhaps better than using Bernoulli noise. That is, each hidden activation <math>h_i </math> is perturbed to <math>h_i+h_ir </math> where <math>r \sim \mathcal{N}(0,1) </math>, which equals to <math>h_ir' </math> where <math>r' \sim \mathcal{N}(1, 1) </math>. We can generalize this to <math>r' \sim \mathcal{N}(1, \sigma^2) </math> which <math>\sigma^2</math> is a hyperparameter to tune.<br />
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
Dropout breaks the co-adaptations between hidden units. Firgure 7a shows that each hidden unit has co-adapted in order to produce good reconstructions. Each hidden unit its own does not detect meaningful feature. In figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps preventing overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations we can see that fewer hidden units have high activations in Figure 8b compared to figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper tested to determine the tunable hyperparameter <math>p </math>. The comparison is down in two situations:<br />
1. The number of hidden units is held constant. (fixed n)<br />
2. The expected number of hidden units that will be retained after dropout is held constant. (fixed <math>pn </math> )<br />
The optimal <math>p </math> in case 1 is between (0.4, 0.8 ); while the one in case 2 is 0.6. The usual default value in practice is 0.5 which is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing data set size when dropout is used with feed-forward networks. From Figure 10, apparently, dropout does not give any improvement in small data sets(100, 500). As the size of the data set is increasing, then gain from doing dropout increases up to a point and then decline. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all other chosen methods. The result is below:<br />
<br />
[[File:Comparison.png]]<br />
<br />
= Result =<br />
<br />
The author performed dropout on MNIST data and did comparison among different methods. The MNIST data set consists of 28 X 28 pixel handwritten digit images. The task is to classify the images into 10 digit classes. From the result table, Deep Botlzman Machine + dropout finetuning outperforms with only 0.79% Error rate. <br />
<br />
[[File:Result.png]]<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique to prevent overfitting in deep neural network which has large number of parameters. It can also be extended to Restricted Boltzmann Machine and other graphical models, eg(Convolutional network). One drawback of dropout is that it increases training time. This creates a trade-off between overfitting and training time.</div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26176dropout2015-11-13T01:34:03Z<p>Lruan: </p>
<hr />
<div>= Introduction =<br />
Dropout is one of the techniques for preventing overfitting in deep neural network which contains a large number of parameters. The key idea is to randomly drop units from the neural network during training. During training, dropout samples from an exponential number of different “thinned” network. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retrained with probability p independent of other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layer. Let <math>\bold{z^{(l)}} </math> denote the vector inputs into layer <math> l </math>, <math>\bold{y}^{(l)} </math> denote the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^l+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables each of which has probability of <math>p </math> of being 1. <math>\tilde {\bold y} </math> is the input after we drop some hidden units. The rest of the model remains the same as the regular feed-forward neural network.<br />
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
Dropout neural network can be trained using stochastic gradient descent in a manner similar to standard neural network. The only difference here is that we only back propagate on each thinned network. The gradient for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization provides a significant boost over just using dropout. Max-norm constrain the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we put constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constrain is that it makes the model possible to use huge learning rate without the possibility of weights blowing up. <br />
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has n units, there will be <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weight of this network are scaled-down versions of the trained weights. If a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. Figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
Dropout breaks the co-adaptations between hidden units. Firgure 7a shows that each hidden unit has co-adapted in order to produce good reconstructions. Each hidden unit its own does not detect meaningful feature. In figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps preventing overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations we can see that fewer hidden units have high activations in Figure 8b compared to figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper tested to determine the tunable hyperparameter <math>p </math>. The comparison is down in two situations:<br />
1. The number of hidden units is held constant. (fixed n)<br />
2. The expected number of hidden units that will be retained after dropout is held constant. (fixed <math>pn </math> )<br />
The optimal <math>p </math> in case 1 is between (0.4, 0.8 ); while the one in case 2 is 0.6. The usual default value in practice is 0.5 which is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing data set size when dropout is used with feed-forward networks. From Figure 10, apparently, dropout does not give any improvement in small data sets(100, 500). As the size of the data set is increasing, then gain from doing dropout increases up to a point and then decline. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all other chosen methods. The result is below:<br />
<br />
[[File:Comparison.png]]<br />
<br />
= Result =<br />
<br />
The author performed dropout on MNIST data and did comparison among different methods. The MNIST data set consists of 28 X 28 pixel handwritten digit images. The task is to classify the images into 10 digit classes. From the result table, Deep Botlzman Machine + dropout finetuning outperforms with only 0.79% Error rate. <br />
<br />
[[File:Result.png]]<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique to prevent overfitting in deep neural network which has large number of parameters. It can also be extended to Restricted Boltzmann Machine and other graphical models, eg(Convolutional network). One drawback of dropout is that it increases training time. This creates a trade-off between overfitting and training time.</div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26175dropout2015-11-13T01:33:35Z<p>Lruan: /* Result */</p>
<hr />
<div>= Introduction =<br />
Dropout is one of the techniques for preventing overfitting in deep neural network which contains a large number of parameters. The key idea is to randomly drop units from the neural network during training. During training, dropout samples from an exponential number of different “thinned” network. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retrained with probability p independent of other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layer. Let <math>\bold{z^{(l)}} </math> denote the vector inputs into layer <math> l </math>, <math>\bold{y}^{(l)} </math> denote the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^l+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables each of which has probability of <math>p </math> of being 1. <math>\tilde {\bold y} </math> is the input after we drop some hidden units. The rest of the model remains the same as the regular feed-forward neural network.<br />
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
Dropout neural network can be trained using stochastic gradient descent in a manner similar to standard neural network. The only difference here is that we only back propagate on each thinned network. The gradient for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization provides a significant boost over just using dropout. Max-norm constrain the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we put constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constrain is that it makes the model possible to use huge learning rate without the possibility of weights blowing up. <br />
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has n units, there will be <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weight of this network are scaled-down versions of the trained weights. If a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. Figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
Dropout breaks the co-adaptations between hidden units. Firgure 7a shows that each hidden unit has co-adapted in order to produce good reconstructions. Each hidden unit its own does not detect meaningful feature. In figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps preventing overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations we can see that fewer hidden units have high activations in Figure 8b compared to figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper tested to determine the tunable hyperparameter <math>p </math>. The comparison is down in two situations:<br />
1. The number of hidden units is held constant. (fixed n)<br />
2. The expected number of hidden units that will be retained after dropout is held constant. (fixed <math>pn </math> )<br />
The optimal <math>p </math> in case 1 is between (0.4, 0.8 ); while the one in case 2 is 0.6. The usual default value in practice is 0.5 which is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing data set size when dropout is used with feed-forward networks. From Figure 10, apparently, dropout does not give any improvement in small data sets(100, 500). As the size of the data set is increasing, then gain from doing dropout increases up to a point and then decline. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all other chosen methods. The result is below:<br />
<br />
[[Figure:Comparison.png]]<br />
<br />
= Result =<br />
<br />
The author performed dropout on MNIST data and did comparison among different methods. The MNIST data set consists of 28 X 28 pixel handwritten digit images. The task is to classify the images into 10 digit classes. From the result table, Deep Botlzman Machine + dropout finetuning outperforms with only 0.79% Error rate. <br />
<br />
[[Figure:Result.png]]<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique to prevent overfitting in deep neural network which has large number of parameters. It can also be extended to Restricted Boltzmann Machine and other graphical models, eg(Convolutional network). One drawback of dropout is that it increases training time. This creates a trade-off between overfitting and training time.</div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26174dropout2015-11-13T01:33:09Z<p>Lruan: /* Comparison */</p>
<hr />
<div>= Introduction =<br />
Dropout is one of the techniques for preventing overfitting in deep neural network which contains a large number of parameters. The key idea is to randomly drop units from the neural network during training. During training, dropout samples from an exponential number of different “thinned” network. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retrained with probability p independent of other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layer. Let <math>\bold{z^{(l)}} </math> denote the vector inputs into layer <math> l </math>, <math>\bold{y}^{(l)} </math> denote the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^l+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables each of which has probability of <math>p </math> of being 1. <math>\tilde {\bold y} </math> is the input after we drop some hidden units. The rest of the model remains the same as the regular feed-forward neural network.<br />
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
Dropout neural network can be trained using stochastic gradient descent in a manner similar to standard neural network. The only difference here is that we only back propagate on each thinned network. The gradient for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization provides a significant boost over just using dropout. Max-norm constrain the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we put constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constrain is that it makes the model possible to use huge learning rate without the possibility of weights blowing up. <br />
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has n units, there will be <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weight of this network are scaled-down versions of the trained weights. If a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. Figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
Dropout breaks the co-adaptations between hidden units. Firgure 7a shows that each hidden unit has co-adapted in order to produce good reconstructions. Each hidden unit its own does not detect meaningful feature. In figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps preventing overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations we can see that fewer hidden units have high activations in Figure 8b compared to figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper tested to determine the tunable hyperparameter <math>p </math>. The comparison is down in two situations:<br />
1. The number of hidden units is held constant. (fixed n)<br />
2. The expected number of hidden units that will be retained after dropout is held constant. (fixed <math>pn </math> )<br />
The optimal <math>p </math> in case 1 is between (0.4, 0.8 ); while the one in case 2 is 0.6. The usual default value in practice is 0.5 which is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing data set size when dropout is used with feed-forward networks. From Figure 10, apparently, dropout does not give any improvement in small data sets(100, 500). As the size of the data set is increasing, then gain from doing dropout increases up to a point and then decline. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all other chosen methods. The result is below:<br />
<br />
[[Figure:Comparison.png]]<br />
<br />
= Result =<br />
<br />
The author performed dropout on MNIST data and did comparison among different methods. The MNIST data set consists of 28 X 28 pixel handwritten digit images. The task is to classify the images into 10 digit classes. From the result table, Deep Botlzman Machine + dropout finetuning outperforms with only 0.79% Error rate. <br />
<br />
'''pic'''<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique to prevent overfitting in deep neural network which has large number of parameters. It can also be extended to Restricted Boltzmann Machine and other graphical models, eg(Convolutional network). One drawback of dropout is that it increases training time. This creates a trade-off between overfitting and training time.</div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Result.png&diff=26173File:Result.png2015-11-13T01:32:34Z<p>Lruan: </p>
<hr />
<div></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26172dropout2015-11-13T01:30:56Z<p>Lruan: /* Comparison */</p>
<hr />
<div>= Introduction =<br />
Dropout is a technique for preventing overfitting in deep neural networks, which contain a large number of parameters. The key idea is to randomly drop units from the network during training. During training, dropout samples from an exponential number of different “thinned” networks. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability p independently of the other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layers. Let <math>\bold{z^{(l)}} </math> denote the vector of inputs into layer <math> l </math> and <math>\bold{y}^{(l)} </math> the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^{(l)}+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables, each of which has probability <math>p </math> of being 1. <math>\tilde {\bold y}^{(l)} </math> is the input after we drop some hidden units. The rest of the model remains the same as a regular feed-forward neural network.<br />
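The four equations above can be sketched in NumPy (a minimal illustration, not the authors' code; the layer sizes, random seed, and choice of activation <math>f </math> are arbitrary):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(y, W, b, p, f=np.tanh):
    """One dropout feed-forward step: y^(l) -> y^(l+1)."""
    r = rng.binomial(1, p, size=y.shape)  # r_j^(l) ~ Bernoulli(p), 1 = keep
    y_tilde = r * y                       # element-wise product drops units
    z = W @ y_tilde + b                   # z^(l+1) = w^(l+1) y~^(l) + b^(l+1)
    return f(z)                           # y^(l+1) = f(z^(l+1))

# Toy layer: 4 inputs -> 3 outputs, keep probability p = 0.5.
y = np.ones(4)
W = 0.1 * np.ones((3, 4))
b = np.zeros(3)
out = dropout_forward(y, W, b, p=0.5)
print(out.shape)  # (3,)
```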
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
A dropout neural network can be trained using stochastic gradient descent in a manner similar to a standard neural network. The only difference is that we back-propagate only through each thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization provides a significant boost over using dropout alone. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we impose the constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constraint is that it makes it possible to use a large learning rate without the weights blowing up. <br />
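A common way to enforce the constraint is to project each incoming weight vector back onto the ball of radius <math>c </math> after a gradient update. A minimal NumPy sketch, under the assumption that each row of W holds the incoming weights of one hidden unit:<br />

```python
import numpy as np

def max_norm_project(W, c):
    # Row i of W = incoming weight vector of hidden unit i.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Scale down only the rows whose norm exceeds c; leave the rest unchanged.
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 4.0],    # norm 5.0 -> rescaled down to norm c
              [0.3, 0.4]])   # norm 0.5 -> already satisfies the constraint
W_proj = max_norm_project(W, c=2.0)
print(np.linalg.norm(W_proj, axis=1))  # row norms are now 2.0 and 0.5
```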
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has <math>n </math> units; then there are <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weights of this network are scaled-down versions of the trained weights: if a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. The figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
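In code, the test-time approximation is a single rescaling of the trained weights (a sketch with made-up numbers; the scaled product matches the expectation of the masked product over the Bernoulli mask):<br />

```python
import numpy as np

p = 0.5  # retention probability used during training

W_train = np.array([[2.0, -4.0],
                    [6.0,  0.0]])
y = np.array([1.0, 1.0])

# At test time the outgoing weights of dropped-out units are multiplied by p,
# so W_test @ y equals E[W_train @ (r * y)] over the Bernoulli(p) mask r.
W_test = p * W_train

print(W_test @ y)  # same as p * (W_train @ y)
```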
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
Dropout breaks the co-adaptations between hidden units. Figure 7a shows that the hidden units have co-adapted in order to produce good reconstructions; each hidden unit on its own does not detect a meaningful feature. In Figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps prevent overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations, we can see that fewer hidden units have high activations in Figure 8b compared to Figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper ran experiments to determine the effect of the tunable hyperparameter <math>p </math>. The comparison is done in two situations:<br />
1. The number of hidden units is held constant. (fixed n)<br />
2. The expected number of hidden units that will be retained after dropout is held constant. (fixed <math>pn </math> )<br />
The optimal <math>p </math> in case 1 lies in the range (0.4, 0.8), while the optimal value in case 2 is 0.6. The usual default value in practice, 0.5, is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing the data set size when dropout is used with feed-forward networks. Figure 10 shows that dropout does not give any improvement on small data sets (100, 500 examples). As the size of the data set increases, the gain from dropout increases up to a point and then declines. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all other chosen methods. The result is below:<br />
<br />
[[File:Comparison.png]]<br />
<br />
= Result =<br />
<br />
The authors performed dropout on the MNIST data and compared different methods. The MNIST data set consists of 28 × 28 pixel handwritten digit images. The task is to classify the images into 10 digit classes. From the result table, Deep Boltzmann Machine + dropout finetuning performs best, with an error rate of only 0.79%. <br />
<br />
'''pic'''<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique to prevent overfitting in deep neural networks, which have a large number of parameters. It can also be extended to Restricted Boltzmann Machines and other models, e.g. convolutional networks. One drawback of dropout is that it increases training time. This creates a trade-off between overfitting and training time.</div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26171dropout2015-11-13T01:28:39Z<p>Lruan: </p>
<hr />
<div>= Introduction =<br />
Dropout is a technique for preventing overfitting in deep neural networks, which contain a large number of parameters. The key idea is to randomly drop units from the network during training. During training, dropout samples from an exponential number of different “thinned” networks. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability p independently of the other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layers. Let <math>\bold{z^{(l)}} </math> denote the vector of inputs into layer <math> l </math> and <math>\bold{y}^{(l)} </math> the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^{(l)}+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables, each of which has probability <math>p </math> of being 1. <math>\tilde {\bold y}^{(l)} </math> is the input after we drop some hidden units. The rest of the model remains the same as a regular feed-forward neural network.<br />
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
A dropout neural network can be trained using stochastic gradient descent in a manner similar to a standard neural network. The only difference is that we back-propagate only through each thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization provides a significant boost over using dropout alone. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we impose the constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constraint is that it makes it possible to use a large learning rate without the weights blowing up. <br />
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has <math>n </math> units; then there are <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weights of this network are scaled-down versions of the trained weights: if a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. The figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
Dropout breaks the co-adaptations between hidden units. Figure 7a shows that the hidden units have co-adapted in order to produce good reconstructions; each hidden unit on its own does not detect a meaningful feature. In Figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps prevent overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations, we can see that fewer hidden units have high activations in Figure 8b compared to Figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper ran experiments to determine the effect of the tunable hyperparameter <math>p </math>. The comparison is done in two situations:<br />
1. The number of hidden units is held constant. (fixed n)<br />
2. The expected number of hidden units that will be retained after dropout is held constant. (fixed <math>pn </math> )<br />
The optimal <math>p </math> in case 1 lies in the range (0.4, 0.8), while the optimal value in case 2 is 0.6. The usual default value in practice, 0.5, is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing the data set size when dropout is used with feed-forward networks. Figure 10 shows that dropout does not give any improvement on small data sets (100, 500 examples). As the size of the data set increases, the gain from dropout increases up to a point and then declines. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all other chosen methods. The result is below:<br />
<br />
'''pic'''<br />
<br />
<br />
= Result =<br />
<br />
The authors performed dropout on the MNIST data and compared different methods. The MNIST data set consists of 28 × 28 pixel handwritten digit images. The task is to classify the images into 10 digit classes. From the result table, Deep Boltzmann Machine + dropout finetuning performs best, with an error rate of only 0.79%. <br />
<br />
'''pic'''<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique to prevent overfitting in deep neural networks, which have a large number of parameters. It can also be extended to Restricted Boltzmann Machines and other models, e.g. convolutional networks. One drawback of dropout is that it increases training time. This creates a trade-off between overfitting and training time.</div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26170dropout2015-11-13T01:14:52Z<p>Lruan: </p>
<hr />
<div>= Introduction =<br />
Dropout is a technique for preventing overfitting in deep neural networks, which contain a large number of parameters. The key idea is to randomly drop units from the network during training. During training, dropout samples from an exponential number of different “thinned” networks. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability p independently of the other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layers. Let <math>\bold{z^{(l)}} </math> denote the vector of inputs into layer <math> l </math> and <math>\bold{y}^{(l)} </math> the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^{(l)}+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables, each of which has probability <math>p </math> of being 1. <math>\tilde {\bold y}^{(l)} </math> is the input after we drop some hidden units. The rest of the model remains the same as a regular feed-forward neural network.<br />
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
A dropout neural network can be trained using stochastic gradient descent in a manner similar to a standard neural network. The only difference is that we back-propagate only through each thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization provides a significant boost over using dropout alone. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we impose the constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constraint is that it makes it possible to use a large learning rate without the weights blowing up. <br />
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has <math>n </math> units; then there are <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weights of this network are scaled-down versions of the trained weights: if a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. The figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
Dropout breaks the co-adaptations between hidden units. Figure 7a shows that the hidden units have co-adapted in order to produce good reconstructions; each hidden unit on its own does not detect a meaningful feature. In Figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps prevent overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations, we can see that fewer hidden units have high activations in Figure 8b compared to Figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper ran experiments to determine the effect of the tunable hyperparameter <math>p </math>. The comparison is done in two situations:<br />
1. The number of hidden units is held constant. (fixed n)<br />
2. The expected number of hidden units that will be retained after dropout is held constant. (fixed <math>pn </math> )<br />
The optimal <math>p </math> in case 1 lies in the range (0.4, 0.8), while the optimal value in case 2 is 0.6. The usual default value in practice, 0.5, is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing the data set size when dropout is used with feed-forward networks. Figure 10 shows that dropout does not give any improvement on small data sets (100, 500 examples). As the size of the data set increases, the gain from dropout increases up to a point and then declines. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all other chosen methods. The result is below:<br />
<br />
'''pic'''<br />
<br />
<br />
= Result =</div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26169dropout2015-11-13T01:10:58Z<p>Lruan: /* Effects of Dropout */</p>
<hr />
<div>= Introduction =<br />
Dropout is a technique for preventing overfitting in deep neural networks, which contain a large number of parameters. The key idea is to randomly drop units from the network during training. During training, dropout samples from an exponential number of different “thinned” networks. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability p independently of the other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layers. Let <math>\bold{z^{(l)}} </math> denote the vector of inputs into layer <math> l </math> and <math>\bold{y}^{(l)} </math> the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^{(l)}+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables, each of which has probability <math>p </math> of being 1. <math>\tilde {\bold y}^{(l)} </math> is the input after we drop some hidden units. The rest of the model remains the same as a regular feed-forward neural network.<br />
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
A dropout neural network can be trained using stochastic gradient descent in a manner similar to a standard neural network. The only difference is that we back-propagate only through each thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization provides a significant boost over using dropout alone. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we impose the constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constraint is that it makes it possible to use a large learning rate without the weights blowing up. <br />
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has <math>n </math> units; then there are <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weights of this network are scaled-down versions of the trained weights: if a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. The figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
Dropout breaks the co-adaptations between hidden units. Figure 7a shows that the hidden units have co-adapted in order to produce good reconstructions; each hidden unit on its own does not detect a meaningful feature. In Figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps prevent overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations, we can see that fewer hidden units have high activations in Figure 8b compared to Figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper ran experiments to determine the effect of the tunable hyperparameter <math>p </math>. The comparison is done in two situations:<br />
1. The number of hidden units is held constant. (fixed n)<br />
2. The expected number of hidden units that will be retained after dropout is held constant. (fixed <math>pn </math> )<br />
The optimal <math>p </math> in case 1 lies in the range (0.4, 0.8), while the optimal value in case 2 is 0.6. The usual default value in practice, 0.5, is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing the data set size when dropout is used with feed-forward networks. Figure 10 shows that dropout does not give any improvement on small data sets (100, 500 examples). As the size of the data set increases, the gain from dropout increases up to a point and then declines. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= choice of p=<br />
= data size =<br />
= dropout RBF =</div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26168dropout2015-11-13T01:09:10Z<p>Lruan: /* Model */</p>
<hr />
<div>= Introduction =<br />
Dropout is a technique for preventing overfitting in deep neural networks, which contain a large number of parameters. The key idea is to randomly drop units from the network during training. During training, dropout samples from an exponential number of different “thinned” networks. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability p independently of the other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layers. Let <math>\bold{z^{(l)}} </math> denote the vector of inputs into layer <math> l </math> and <math>\bold{y}^{(l)} </math> the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^{(l)}+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables, each of which has probability <math>p </math> of being 1. <math>\tilde {\bold y}^{(l)} </math> is the input after we drop some hidden units. The rest of the model remains the same as a regular feed-forward neural network.<br />
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
A dropout neural network can be trained using stochastic gradient descent in a manner similar to a standard neural network. The only difference is that we back-propagate only through each thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization provides a significant boost over using dropout alone. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we impose the constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constraint is that it makes it possible to use a large learning rate without the weights blowing up. <br />
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has <math>n </math> units; then there are <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weights of this network are scaled-down versions of the trained weights: if a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. The figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
Dropout breaks the co-adaptations between hidden units. Figure 7a shows that the hidden units have co-adapted in order to produce good reconstructions; each hidden unit on its own does not detect a meaningful feature. In Figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
'''picture'''<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps prevent overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations, we can see that fewer hidden units have high activations in Figure 8b compared to Figure 8a.<br />
'''picture'''<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper ran experiments to determine the effect of the tunable hyperparameter <math>p </math>. The comparison is done in two situations:<br />
1. The number of hidden units is held constant. (fixed n)<br />
2. The expected number of hidden units that will be retained after dropout is held constant. (fixed <math>pn </math> )<br />
The optimal <math>p </math> in case 1 lies in the range (0.4, 0.8), while the optimal value in case 2 is 0.6. The usual default value in practice, 0.5, is close to optimal. <br />
'''picture'''<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing the data set size when dropout is used with feed-forward networks. Figure 10 shows that dropout does not give any improvement on small data sets (100, 500 examples). As the size of the data set increases, the gain from dropout increases up to a point and then declines. <br />
<br />
'''picture'''<br />
<br />
<br />
= choice of p=<br />
= data size =<br />
= dropout RBF =</div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Test.png&diff=26167File:Test.png2015-11-13T01:08:23Z<p>Lruan: </p>
<hr />
<div></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Sparsity.png&diff=26166File:Sparsity.png2015-11-13T01:08:12Z<p>Lruan: </p>
<hr />
<div></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Pvalue.png&diff=26165File:Pvalue.png2015-11-13T01:07:56Z<p>Lruan: </p>
<hr />
<div></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Comparison.png&diff=26164File:Comparison.png2015-11-13T01:07:43Z<p>Lruan: </p>
<hr />
<div></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Feature.png&diff=26163File:Feature.png2015-11-13T01:07:31Z<p>Lruan: </p>
<hr />
<div></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Datasize.png&diff=26162File:Datasize.png2015-11-13T01:06:55Z<p>Lruan: </p>
<hr />
<div></div>Lruanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26161dropout2015-11-13T01:06:23Z<p>Lruan: </p>
<hr />
<div>= Introduction =<br />
Dropout is a technique for preventing overfitting in deep neural networks, which contain a large number of parameters. The key idea is to randomly drop units from the network during training. During training, dropout samples from an exponential number of different “thinned” networks. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability <math>p </math>, independent of the other units (<math>p </math> is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layers. Let <math>\bold{z^{(l)}} </math> denote the vector of inputs into layer <math> l </math> and <math>\bold{y}^{(l)} </math> the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^{(l)}+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math><br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables, each of which is 1 with probability <math>p </math>. <math>\tilde {\bold y}^{(l)} </math> is the output of layer <math>l </math> after some units have been dropped. The rest of the model remains the same as a regular feed-forward neural network.<br />
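As a concrete illustration, the feed-forward operation above can be sketched in plain Python (a minimal sketch, not the paper's implementation; the tanh activation and the list-of-lists weight layout are arbitrary choices):<br />

```python
import math
import random

def dropout_forward(y_prev, W, b, p, rng):
    """One dropout layer, following the equations above:
    r_j ~ Bernoulli(p); y_tilde = r * y_prev (element-wise);
    z_i = w_i . y_tilde + b_i; y_i = f(z_i), with f = tanh here."""
    r = [1.0 if rng.random() < p else 0.0 for _ in y_prev]  # retain mask
    y_tilde = [rj * yj for rj, yj in zip(r, y_prev)]        # drop some units
    z = [sum(wij * yj for wij, yj in zip(row, y_tilde)) + bi
         for row, bi in zip(W, b)]
    y = [math.tanh(zi) for zi in z]
    # r is returned so backpropagation can flow only through the thinned network
    return y, r
```

With <math>p = 1 </math> every unit is retained and the layer reduces to an ordinary feed-forward layer; with <math>p = 0 </math> every unit is dropped.<br />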
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
A dropout neural network can be trained using stochastic gradient descent in a manner similar to a standard neural network. The only difference is that we back-propagate only through each thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch; any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization provides a significant boost over just using dropout. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we impose the constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constraint is that it makes it possible to use a large learning rate without the weights blowing up. <br />
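The constraint can be enforced by projection after each gradient step, as in this minimal sketch (a hypothetical helper, not the paper's code; <math>c </math> would be tuned on a validation set):<br />

```python
import math

def max_norm_project(w, c):
    """Project an incoming weight vector w back onto {w : ||w||_2 <= c}.
    Called on each hidden unit's incoming weights after a gradient update."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm <= c:
        return list(w)                      # constraint already satisfied
    return [wi * c / norm for wi in w]      # rescale onto the boundary
```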
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has <math>n </math> units; then there are <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average their predictions. Thus, at test time, the idea is to use a single neural net without dropout whose weights are scaled-down versions of the trained weights: if a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. The figure below shows the intuition. <br />
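The rescaling itself is a one-liner (an illustrative helper, not the paper's code): multiplying a unit's outgoing weights by <math>p </math> makes its expected contribution at test time match training, since <math>E[r_j y_j] = p\, y_j </math>.<br />

```python
def scale_outgoing_weights(W, p):
    """Scale every trained weight by the retention probability p, so the
    single test-time network approximates averaging all thinned networks."""
    return [[p * wij for wij in row] for row in W]
```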
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
Dropout breaks up co-adaptations between hidden units. Figure 7a shows that, without dropout, the hidden units co-adapt in order to produce good reconstructions; no hidden unit on its own detects a meaningful feature. In Figure 7b, with dropout, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:Feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps prevent overfitting. In a good sparse model, there should be only a few highly activated units for any data case, and the average activation of any unit across data cases should be low. Comparing the histograms of activations, we can see that fewer hidden units have high activations in Figure 8b than in Figure 8a.<br />
[[File:Sparsity.png]]<br />
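The two sparsity criteria above can be made concrete with a small helper (an illustrative sketch; the 0.5 activation threshold is an arbitrary choice, not from the paper):<br />

```python
def sparsity_stats(activations, threshold=0.5):
    """For one data case, return the mean activation and the fraction of
    units whose activation exceeds the threshold; in a sparse model both
    quantities are low."""
    n = len(activations)
    mean_act = sum(activations) / n
    frac_high = sum(1 for a in activations if a > threshold) / n
    return mean_act, frac_high
```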
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper ran experiments to determine the effect of the tunable hyperparameter <math>p </math>. The comparison is done in two situations:<br />
1. The number of hidden units is held constant (fixed <math>n </math>).<br />
2. The expected number of hidden units retained after dropout is held constant (fixed <math>pn </math>).<br />
The optimal <math>p </math> in case 1 lies in the interval (0.4, 0.8), while in case 2 it is close to 0.6. The usual default value in practice is 0.5, which is close to optimal. <br />
[[File:Pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of varying the data set size when dropout is used with feed-forward networks. From Figure 10, dropout does not give any improvement on small data sets (100 and 500 examples). As the size of the data set increases, the gain from dropout first increases up to a point and then declines. <br />
<br />
[[File:Datasize.png]]<br />
<br />
<br />
= Choice of p =<br />
= Data Size =<br />
= Dropout RBF =</div>Lruan