http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=Rtwang&feedformat=atomstatwiki - User contributions [US]2023-02-02T11:27:47ZUser contributionsMediaWiki 1.28.3http://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_using_very_large_target_vocabulary_for_neural_machine_translation&diff=27352on using very large target vocabulary for neural machine translation2015-12-18T02:09:13Z<p>Rtwang: </p>
<hr />
<div>==Overview==<br />
<br />
This is a summary of the paper by S. Jean, K. Cho, R. Memisevic, and Y. Bengio entitled "On Using Very Large Target Vocabulary for Neural Machine Translation".<br />
<ref>S. Jean, K. Cho, R. Memisevic, and Y. Bengio. [http://arxiv.org/pdf/1412.2007v2.pdf "On Using Very Large Target Vocabulary for Neural Machine Translation"], 2015.</ref><br />
The paper applies importance sampling to neural machine translation (NMT) with a very large target vocabulary. Despite the advantages of neural networks over statistical machine translation systems such as phrase-based systems, they suffer from some technical problems. Most importantly, they are limited to a small vocabulary, because both the computational complexity and the number of parameters to be trained grow with the vocabulary size. The output of an RNN used for machine translation has as many dimensions as there are words in the vocabulary. If the vocabulary consists of hundreds of thousands of words, the RNN must compute a very expensive softmax over the output vector at each time step to estimate the probability of each word being the next word in the sequence. The number of parameters also grows very large, since the number of weights between the hidden layer and the output layer equals the product of the number of units in the two layers. For a hidden layer of non-trivial size, a large vocabulary can result in tens of millions of model parameters associated purely with the hidden-to-output mapping. In practice, researchers who apply RNNs to machine translation have avoided this problem by restricting the model vocabulary to a shortlist of words in the target language. Words not in this shortlist are treated as unknown by the model and are assigned a special 'UNK' token. This technique understandably impairs translation performance when the target sentence includes many words that are not in the vocabulary, such as names. <br />
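<br />
The growth of the hidden-to-output mapping can be checked with a few lines of arithmetic. The sketch below assumes a hypothetical hidden layer of 1,000 units (a value chosen only for illustration) and compares a 30,000-word shortlist with a 500,000-word vocabulary.<br />

```python
def output_layer_parameters(hidden_units: int, vocabulary_size: int) -> int:
    """Weights plus biases of a dense hidden-to-output (softmax) layer."""
    return hidden_units * vocabulary_size + vocabulary_size


# hidden_units=1000 is an illustrative assumption, not a value from the paper.
for vocabulary_size in (30_000, 500_000):
    count = output_layer_parameters(hidden_units=1000, vocabulary_size=vocabulary_size)
    print(f"{vocabulary_size:>7} target words -> {count:,} output-layer parameters")
```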
<br />
In this paper Jean and his colleagues aim to solve this problem by proposing a training method based on importance sampling which uses a large target vocabulary without increasing training complexity. The proposed algorithm demonstrates better performance without losing efficiency in time or speed. The algorithm is tested on two machine translation tasks (English <math>\rightarrow</math> German, and English <math>\rightarrow</math> French), and it achieved the best performance out of any previous single neural machine translation (NMT) system on the English <math>\rightarrow</math> French translation task.<br />
<br />
==Methods==<br />
<br />
Recall that the classic neural machine translation system works through an encoder-decoder network. The encoder reads the source sentence <math>x</math> and encodes it into a sequence of hidden states <math>h=(h_1,...,h_T)</math>, where <math>h_t=f(x_t,h_{t-1})</math>. The decoder, another neural network, then generates the translation <math>y</math> from the encoded hidden states: <math>p(y_t\,|\,y_{<t},x)\propto \exp\{q(y_{t-1}, z_t, c_t)\}</math>, where <math>\, z_t=g(y_{t-1}, z_{t-1}, c_t)</math> and <math>\, c_t=r(z_{t-1}, h_1,..., h_T)</math>.<br />
<br />
The parameters <math>\theta</math> are estimated by maximizing the conditional log-likelihood of the training data:<br />
<math>\theta^*=\arg\max_{\theta}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\log p(y_t^n\,|\,y_{<t}^n, x^n)</math><br />
<br />
where <math>(x^n, y^n)</math> is the n-th training pair of sentences, and <math>T_n</math> is the length of the n-th target sentence <math>y^n</math>.<br />
The proposed model is based on a specific implementation of neural machine translation that uses an attention mechanism, as proposed in <ref><br />
Bahdanau et al., [http://arxiv.org/pdf/1409.0473v6.pdf "Neural Machine Translation by Jointly Learning to Align and Translate"], 2014.<br />
</ref>.<br />
In this model, the encoder is implemented by a bi-directional recurrent neural network, <math>h_t=[h_t^\leftarrow; h_t^\rightarrow]</math>. At each time step, the decoder computes the context<br />
vector <math>c_t</math> as a convex sum of the hidden states <math>(h_1,...,h_T)</math> with the coefficients <math>(\alpha_1,...,\alpha_T)</math> computed by<br />
<br />
<math>\alpha_t=\frac{\exp\{a(h_t, z_t)\}}{\sum_{k}\exp\{a(h_k, z_t)\}}</math><br />
where a is a feedforward neural network with a single hidden layer. <br />
Then the probability of the next target word is <br />
<br />
<math>p(y_t\,|\,y_{<t}, x)=\frac{1}{Z} \exp\{w_t^T\phi(y_{t-1}, z_t, c_t)+b_t\}</math>, where <math>\phi</math> is an affine transformation followed by a nonlinear activation, and <math>w_t</math> and <math>b_t</math> are the target word vector and the target word bias, respectively. <math>Z</math> is the normalization constant computed by<br />
<br />
<br />
<math> Z=\sum_{k:y_k\in V}\exp\left(w_k^T\phi(y_{t-1}, z_t, c_t)+b_k\right)</math>, where <math>V</math> is the set of all target words. <br />
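<br />
To make the cost of the normalization constant concrete, the following sketch computes the next-word distribution over a full vocabulary with NumPy. The sizes and the random parameters are illustrative assumptions; only the formula itself follows the text above.<br />

```python
import numpy as np


def next_word_probabilities(phi, W, b):
    """Softmax over the whole target vocabulary.

    phi: feature vector phi(y_{t-1}, z_t, c_t), shape (d,)
    W:   target word vectors, one row w_k per word, shape (|V|, d)
    b:   target word biases b_k, shape (|V|,)
    """
    energies = W @ phi + b           # one dot product per target word
    energies = energies - energies.max()  # shift for numerical stability
    unnormalized = np.exp(energies)
    Z = unnormalized.sum()           # normalizer of the shifted energies, summed over all of V
    return unnormalized / Z


rng = np.random.default_rng(0)
vocab_size, feature_dim = 10_000, 64   # illustrative sizes only
phi = rng.standard_normal(feature_dim)
W = rng.standard_normal((vocab_size, feature_dim))
b = np.zeros(vocab_size)
p = next_word_probabilities(phi, W, b)
print(p.shape, p.sum())
```

Both the matrix-vector product and the sum defining Z touch every row of W, so the work per decoding step grows linearly with the vocabulary size.<br />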
<br />
<br />
The dot product between the feature <math>\phi(y_{t-1}, z_t, c_t)</math> and the word vector <math>w_k</math> must be computed for every word in the target vocabulary, which is computationally expensive and time consuming. Furthermore, the memory requirements grow linearly with the number of target words. This has been a major hurdle for neural machine translation. Recent approaches use a shortlist of the 30,000 to 80,000 most frequent words. This makes training more feasible but has problems of its own: for example, the model degrades heavily if the translation of the source sentence requires many words that are not included in the shortlist. At decoding time, the approach of this paper evaluates the next-word probability (Eq. (6) in the paper) over only a small subset of candidate target words instead of the full vocabulary. The most naïve way to select such a subset is to take the K most frequent words. However, discarding all other words at decoding time defeats the purpose of training with a large vocabulary, because in practice it removes those words from the target dictionary. Jean et al. instead propose using an existing word alignment model to align the source and target words in the training corpus and build a dictionary. With this dictionary, a target word set is constructed for each source sentence, consisting of the K most frequent words (according to the estimated unigram probability) and, using the dictionary, at most <math>K\prime</math> likely target words for each source word. K and <math>K\prime</math> may be chosen either to meet the computational requirement or to maximize the translation performance on the development set. <br />
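<br />
The candidate list construction described above can be sketched as follows. The function name, the toy unigram counts and the toy dictionary are all hypothetical; the dictionary is assumed to map each source word to target words already ranked by likelihood.<br />

```python
from collections import Counter


def build_candidate_list(source_sentence, unigram_counts, dictionary, K, K_prime):
    """Union of the K most frequent target words and, for each source word,
    at most K_prime likely target words taken from the dictionary."""
    candidates = {word for word, _ in unigram_counts.most_common(K)}
    for source_word in source_sentence:
        # dictionary maps a source word to target words ranked by likelihood
        candidates.update(dictionary.get(source_word, [])[:K_prime])
    return candidates


# Toy data, invented for illustration.
unigram_counts = Counter({"le": 50, "la": 40, "de": 35, "chat": 3, "tapis": 1})
dictionary = {"cat": ["chat", "chatte", "matou"], "mat": ["tapis", "natte"]}
candidates = build_candidate_list(
    ["the", "cat", "mat"], unigram_counts, dictionary, K=3, K_prime=2
)
print(sorted(candidates))
```

The resulting set holds at most K plus K′ times the source length words, so the softmax at decoding time is taken over a much smaller set than the full vocabulary.<br />
<br />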
To avoid the growing cost of computing the normalization constant during training, the authors propose to use only a small subset <math>V\prime</math> of the target vocabulary at each update<ref><br />
Bengio and Senécal, [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4443871.pdf "Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model"], IEEE Transactions on Neural Networks, 2008.<br />
</ref>. <br />
Consider the gradient of the log-probability of the target word <math>y_t</math>. It is composed of a positive and a negative part:<br />
<br />
<br />
<math>\nabla\log p(y_t|y_{<t}, x)=\nabla \varepsilon(y_t)-\sum_{k:y_k\in V} p(y_k|y_{<t}, x) \nabla \varepsilon(y_k) </math><br />
where the energy <math>\varepsilon</math> is defined as <math>\varepsilon(y_j)=w_j^T\phi(y_{j-1}, z_j, c_j)+b_j</math>. The second term of the gradient is in essence the expected gradient of the energy, <math>\mathbb E_P[\nabla \varepsilon(y)]</math>, where P denotes <math>p(y|y_{<t}, x)</math>. <br />
The idea of the proposed approach is to approximate this expectation by importance sampling with a small number of samples. Given a predefined proposal distribution Q and a set <math>V\prime</math> of samples from Q, the expectation is approximated by <br />
<br />
<math>\mathbb E_P[\nabla \varepsilon(y)]\approx \sum_{k:y_k\in V\prime} \frac{\omega_k}{\sum_{k\prime:y_{k\prime}\in V\prime}\omega_{k\prime}}\nabla\varepsilon(y_k)</math>, where <math>\,\omega_k=\exp\{\varepsilon(y_k)-\log Q(y_k)\}</math>.<br />
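<br />
A small numerical sketch of this approximation is given below. It assumes a uniform proposal Q and uses random vectors as stand-ins for the energy gradients; both choices are illustrative and not taken from the paper.<br />

```python
import numpy as np


def exact_expectation(energies, grads):
    """E_P[grad] with P = softmax(energies) over the full vocabulary."""
    p = np.exp(energies - energies.max())
    p /= p.sum()
    return p @ grads


def importance_sampled_expectation(energies, grads, sample_ids, log_q):
    """Self-normalized importance-sampling estimate using the sampled words only."""
    log_w = energies[sample_ids] - log_q[sample_ids]   # log omega_k
    w = np.exp(log_w - log_w.max())                    # common factor cancels below
    w /= w.sum()
    return w @ grads[sample_ids]


rng = np.random.default_rng(0)
vocab_size, dim = 1000, 8                           # illustrative sizes
energies = rng.standard_normal(vocab_size)
grads = rng.standard_normal((vocab_size, dim))      # stand-in for the energy gradients
log_q = np.full(vocab_size, -np.log(vocab_size))    # uniform proposal Q (assumption)

sample_ids = rng.choice(vocab_size, size=200, replace=True)
estimate = importance_sampled_expectation(energies, grads, sample_ids, log_q)
exact = exact_expectation(energies, grads)
print("max abs error with 200 samples:", np.abs(estimate - exact).max())
```

With a uniform Q the term log Q is the same for every word, so the normalized weights reduce to a softmax over the sampled subset; if the subset is the whole vocabulary, the estimate coincides with the exact expectation.<br />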
<br />
In practice, the training corpus is partitioned and a subset <math>V\prime</math> of the target vocabulary is defined for each partition prior to training. Before training begins, the target sentences in the training corpus are examined sequentially and their unique target words are accumulated until the number of unique target<br />
words reaches the predefined threshold τ. The accumulated vocabulary is then used for this partition of the corpus during training. This process is repeated until the end of the training set is reached. <br />
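<br />
The partitioning step can be sketched as below. How a sentence that pushes the vocabulary past τ is handled is not specified in this summary; the sketch assumes that such a sentence still belongs to the current partition, so a partition vocabulary may slightly exceed τ.<br />

```python
def partition_corpus(target_sentences, tau):
    """Split the corpus sequentially into partitions, each paired with the set of
    unique target words it contains; a partition is closed once that set has
    reached tau words."""
    partitions = []
    current_sentences, current_vocab = [], set()
    for sentence in target_sentences:
        current_sentences.append(sentence)
        current_vocab.update(sentence)
        if len(current_vocab) >= tau:
            partitions.append((current_sentences, current_vocab))
            current_sentences, current_vocab = [], set()
    if current_sentences:  # last, possibly smaller, partition
        partitions.append((current_sentences, current_vocab))
    return partitions


# Toy corpus of tokenized target sentences, invented for illustration.
corpus = [["a", "b"], ["b", "c"], ["d", "e"], ["a", "f"], ["g"]]
for sentences, vocab in partition_corpus(corpus, tau=3):
    print(len(sentences), sorted(vocab))
```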
<br />
With this approach, the alignments between target words and source positions are obtained from the alignment model. This is useful when the model generates an UNK token. Once a translation has been generated for a source sentence, each UNK may be replaced using a translation-specific technique based on the aligned source word. In their experiments, the authors replaced each ''UNK'' token with the aligned source word or with its most likely translation as determined by another word alignment model.<br />
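<br />
A minimal sketch of this post-processing step follows. It assumes that one aligned source position is available for every target token and that a hypothetical bilingual dictionary maps a source word to its most likely translation; words missing from the dictionary are copied from the source.<br />

```python
def replace_unknown_words(target_tokens, aligned_source_positions, source_tokens, dictionary):
    """Replace each UNK with the dictionary translation of its aligned source
    word, falling back to copying the source word itself."""
    output = []
    for token, source_position in zip(target_tokens, aligned_source_positions):
        if token == "UNK":
            source_word = source_tokens[source_position]
            token = dictionary.get(source_word, source_word)
        output.append(token)
    return output


# Toy example, invented for illustration.
source = ["Mr", "Smith", "likes", "cheese"]
target = ["M.", "UNK", "aime", "le", "UNK"]
alignment = [0, 1, 2, 3, 3]            # aligned source position per target token
print(replace_unknown_words(target, alignment, source, {"cheese": "fromage"}))
```
<br />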
The proposed approach was evaluated on English→French and English→German translation. The neural machine translation models were trained on the bilingual parallel corpora made available as part of WMT’14. The data sets used for English→French were Europarl v7, Common Crawl, UN, News Commentary, and Gigaword. The data sets used for English→German were Europarl v7, Common Crawl, and News Commentary. <br />
<br />
The models were evaluated on the WMT’14 test set (news-test-2014), while the concatenation of news-test-2012 and news-test-2013 was used for model selection (development set). Table 1 of the paper presents the data coverage with respect to the vocabulary size on the target side.<br />
<br />
==Setting==<br />
<br />
As a baseline for English→French translation, the authors used the RNNsearch model proposed by Bahdanau et al. (2014) with 30,000 source and target words; another RNNsearch model was trained for English→German translation with 50,000 source and target words. Using the proposed approach, another set of RNNsearch models with much larger vocabularies of 500,000 source and target words was trained for each language pair. Different shortlist sizes were used during training: 15,000 and 30,000 for English→French, and 15,000 and 50,000 for English→German. The models were evaluated on the development set every twelve hours, and the best performance is reported. For both language pairs, new models were also trained with shortlist sizes of 15,000 and 50,000 by reshuffling the data set at the beginning of each epoch. While this causes a non-negligible amount of overhead, it allows words to be contrasted with a different set of words in each epoch. Beam search was used to generate a translation given a source sentence. The authors keep a set of 12 hypotheses and normalize probabilities by the length of the candidate sentences. The candidate list was chosen to maximize the performance on the development set, for K ∈ {15k, 30k, 50k} and K′ ∈ {10, 20}. They used a bilingual dictionary to accelerate decoding and to replace unknown words in translations.<br />
<br />
==Results==<br />
<br />
The BLEU scores for English→French translation obtained by the models trained with very large target vocabularies are compared with the results of previous models in the table below.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Method<br />
! RNNsearch<br />
! RNNsearch-LV<br />
! Google<br />
! Phrase-based SMT (Cho et al.)<br />
! Phrase-based SMT (Durrani et al.)<br />
|-<br />
| Basic NMT<br />
| 29.97 (26.58)<br />
| 32.68 (28.76)<br />
| 30.6<br />
| 33.3<br />
| 37.03<br />
|-<br />
| + Candidate List<br />
| -<br />
| 33.36 (29.32)<br />
| -<br />
| 33.3<br />
| 37.03<br />
|-<br />
| + UNK Replace<br />
| 33.08 (29.08)<br />
| 34.11 (29.98)<br />
| 33.1<br />
| 33.3<br />
| 37.03<br />
|- <br />
| + Reshuffle (τ = 50k)<br />
| -<br />
| 34.6 (30.53)<br />
| -<br />
| 33.3<br />
| 37.03<br />
|-<br />
| + Ensemble<br />
| -<br />
| 37.19 (31.98)<br />
| 37.5 <br />
| 33.3<br />
| 37.03<br />
|-<br />
|}<br />
<br />
<br />
The results for English→German translation are shown in the table below.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Method<br />
! RNNsearch<br />
! RNNsearch-LV<br />
! Phrase-based SMT <br />
|-<br />
| Basic NMT<br />
| 16.46 (17.13)<br />
| 16.95 (17.85)<br />
| 20.67<br />
|-<br />
| + Candidate List<br />
| -<br />
| 17.46 (18.00)<br />
| 20.67<br />
|-<br />
| + UNK Replace<br />
| 18.97 (19.16)<br />
| 18.89 (19.03)<br />
| 20.67<br />
|- <br />
| + Reshuffle (τ = 50k)<br />
| -<br />
| 19.4<br />
| 20.67<br />
|-<br />
| + Ensemble<br />
| -<br />
| 21.59<br />
| 20.67 <br />
|-<br />
|}<br />
<br />
The RNNsearch-LV clearly outperforms the baseline RNNsearch. On the English→French task, RNNsearch-LV approached the performance of the previous best single neural machine translation (NMT) model even without any translation-specific techniques, and outperformed it once these techniques were applied. The performance of the RNNsearch-LV is also better than that of a standard phrase-based translation system. <br />
For English→German, the RNNsearch-LV outperformed the baseline before unknown word replacement, but after replacement the two systems performed similarly. Reshuffling the data set yields a higher single-model performance for the large-vocabulary model. In this case, the authors were able to surpass the previously reported best translation result on this task by building an ensemble of 8 models. With τ = 15,000, the RNNsearch-LV performance worsened a little, with best BLEU scores, without reshuffling, of 33.76 and 18.59 for English→French and English→German, respectively.<br />
<br />
The decoding times of the different models are presented in the table below. While decoding from RNNsearch-LV with the full target vocabulary is slowest, the speed improves substantially if a candidate list is used for decoding each translation. <br />
{| class="wikitable"<br />
|-<br />
! Method <br />
! CPU (i7-4820K)<br />
! GPU (GTX TITAN Black)<br />
|-<br />
| RNNsearch<br />
| 0.09 s<br />
| 0.02 s<br />
|-<br />
| RNNsearch-LV <br />
| 0.80 s<br />
| 0.25 s<br />
|-<br />
| RNNsearch-LV + Candidate list<br />
| 0.12 s<br />
| 0.05 s<br />
|}<br />
<br />
The influence of the target vocabulary used at decoding time was evaluated for English→French with a shortlist size of 30,000, by translating the test sentences with the union of a fixed set of 30,000 common words and (at most) K′ likely candidates for each source word. The performance of the system is comparable to the baseline when UNK tokens are not replaced, but there is not as much improvement when they are.<br />
The authors found that K is inversely correlated with τ. <br />
<br />
<br />
==Conclusion==<br />
<br />
Using importance sampling, the authors proposed an approach to neural machine translation with a large target vocabulary that does not substantially increase computational complexity. The BLEU scores of the proposed model showed translation performance comparable to state-of-the-art translation systems on both the English→French and the English→German task.<br />
On both tasks, the neural machine translation models trained using the proposed method performed as well as, or better than, those using only limited sets of target words, even when replacing unknown words.<br />
<br />
<br />
== Bibliography ==<br />
<references /></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=joint_training_of_a_convolutional_network_and_a_graphical_model_for_human_pose_estimation&diff=27266joint training of a convolutional network and a graphical model for human pose estimation2015-12-13T01:16:50Z<p>Rtwang: /* Convolutional Network Part-Detector */</p>
<hr />
<div>== Introduction ==<br />
<br />
Human body pose estimation, or specifically the localization of human joints in monocular RGB images, remains a very challenging task in computer vision. Recent approaches to this problem fall into two broad categories: traditional deformable part models and deep-learning based discriminative models. Traditional models rely on the aggregation of hand-crafted low-level features and then use a standard classifier or a higher-level generative model to detect the pose, which requires the features to be sensitive enough and invariant to deformations. Deep learning approaches learn an empirical set of low- and high-level features which are more tolerant to variations. However, it is difficult to incorporate prior knowledge about the structure of the human body into such models.<br />
<br />
This paper proposes a new hybrid architecture that consists of a deep Convolutional Network Part-Detector and a part-based Spatial-Model. This combination and joint training significantly outperforms existing state-of-the-art models on the task of human body pose recognition.<br />
<br />
== Model ==<br />
=== Convolutional Network Part-Detector ===<br />
<br />
They combine an efficient ConvNet architecture with multi-resolution and overlapping receptive fields, which is shown in the figure below.<br />
<br />
[[File:architecture1.PNG | center]]<br />
<br />
Traditionally, in image processing tasks such as these, a Laplacian Pyramid<ref><br />
[https://en.wikipedia.org/wiki/Pyramid_(image_processing)#Gaussian_pyramid "Pyramid (image processing)"]<br />
</ref> of three resolution banks is used to provide each bank with non-overlapping spectral content. Then the Local Contrast Normalization (LCN<ref><br />
Collobert R, Kavukcuoglu K, Farabet C.[http://infoscience.epfl.ch/record/192376/files/Collobert_NIPSWORKSHOP_2011.pdf Torch7: A matlab-like environment for machine learning] BigLearn, NIPS Workshop. 2011 (EPFL-CONF-192376).<br />
</ref>) is applied to those input images. However, in this model, only a full-resolution stage and a half-resolution stage were used, allowing for a simpler architecture and faster training.<br />
<br />
Although a sliding-window architecture is usually used for this type of task, it has the downside of computing redundant convolutions. Instead, for each resolution bank, this network uses a ConvNet architecture with overlapping receptive fields to produce a heat-map as output, which gives a per-pixel likelihood for key joint locations on the human skeleton.<br />
<br />
The following figure shows an efficient sliding-window model with overlapping receptive fields:<br />
<br />
[[File:Qq1.png | center]]<br />
<br />
The convolution results (feature maps) of the low-resolution bank are upscaled and interleaved with those of the high-resolution bank. These dense feature maps are then processed through convolution stages at each pixel, which is equivalent to a fully-connected network model but more efficient.<br />
<br />
Supervised training of the network is performed using batched Stochastic Gradient Descent (SGD) with Nesterov Momentum. <br />
<br />
Nesterov momentum can be written as<ref><br />
Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance<br />
of initialization and momentum in deep learning. In Proceedings of the 30th International<br />
Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28<br />
of JMLR Proceedings, pages 1139–1147. JMLR.org, 2013.<br />
</ref>:<br />
<br />
[[File:Nmomentum.PNG]]<br />
<br />
Rather than applying each stochastic mini-batch gradient separately, a velocity vector is accumulated at some rate <math>\,\mu</math>, so that if the gradient descent process continues to travel in the same general direction, the velocity grows with each successive step and the parameters move faster in that direction than with conventional gradient descent. This should increase the convergence rate and decrease the number of epochs needed to converge to a local minimum. Nesterov momentum makes one modification: the correction to the velocity vector, <math>\,\epsilon\nabla f(\theta_t+\mu v_t)</math>, is evaluated not at the current position but at the predicted future position. The difference can be seen in the figure<ref><br />
Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance<br />
of initialization and momentum in deep learning. In Proceedings of the 30th International<br />
Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28<br />
of JMLR Proceedings, pages 1139–1147. JMLR.org, 2013.<br />
</ref>:<br />
<br />
[[File:Moment.PNG]]<br />
<br />
This correction makes the descent direction more responsive to changes in direction and increases stability. It can be seen as looking at the future gradient to evaluate the suitability of the current direction. This is evident in the figure, where the first method changes direction based purely on the current position and the second corrects the direction based on the next gradient.<br />
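<br />
The update described above can be sketched in a few lines of NumPy, using the formulation of Sutskever et al. in which the gradient is evaluated at the look-ahead point <math>\theta_t+\mu v_t</math>. The quadratic objective, the learning rate and the step count are illustrative assumptions and have nothing to do with the network in the paper.<br />

```python
import numpy as np


def nesterov_momentum_descent(grad_f, theta0, learning_rate=0.1, mu=0.9, steps=100):
    """Gradient descent with Nesterov momentum.

    v_{t+1}     = mu * v_t - learning_rate * grad_f(theta_t + mu * v_t)
    theta_{t+1} = theta_t + v_{t+1}
    """
    theta = np.asarray(theta0, dtype=float)
    velocity = np.zeros_like(theta)
    for _ in range(steps):
        lookahead = theta + mu * velocity            # predicted future position
        velocity = mu * velocity - learning_rate * grad_f(lookahead)
        theta = theta + velocity
    return theta


# Toy objective f(theta) = 0.5 * theta^T A theta, whose minimum is at the origin.
A = np.diag([1.0, 10.0])
theta = nesterov_momentum_descent(
    lambda th: A @ th, theta0=[5.0, 5.0], learning_rate=0.05, steps=300
)
print(theta)
```

Classical momentum would differ only in the line computing the gradient, which would be evaluated at theta instead of at the look-ahead point.<br />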
<br />
They use a Mean Squared Error (MSE) criterion to minimize the distance between the predicted output and a target heat-map. At training time they also perform random perturbations of the input images (randomly flipping and scaling the images) to increase generalization performance.<br />
<br />
=== Higher-Level Spatial-Model ===<br />
<br />
They use a higher-level Spatial-Model to get rid of false positive outliers and anatomically incorrect poses predicted by the Part-Detector, constraining joint inter-connectivity and enforcing global pose consistency.<br />
<br />
They formulate the Spatial-Model as an MRF-like model over the distribution of spatial locations for each body part. After the unary potentials for each body part location are provided by the Part-Detector, the pair-wise potentials in the graph are computed using convolutional priors, which model the conditional distribution of the location of one body part to another. For instance, the final marginal likelihood for a body part A can be calculated as:<br />
<br />
<math>\bar{p}_{A}=\frac{1}{Z}\prod_{v\in V}^{ }\left ( p_{A|v}*p_{v}+b_{v\rightarrow A} \right )</math><br />
<br />
where <math>v</math> is the joint location, <math>p_{A|v}</math> is the conditional prior, i.e. the likelihood of body part A occurring at pixel location (i, j) when joint <math>v</math> is located at the center pixel, <math>b_{v\rightarrow A}</math> is a bias term describing the background probability for the message from joint <math>v</math> to A, and Z is the partition function. If a pairwise edge should be removed from the graph structure, the corresponding learned pair-wise distribution is purely uniform. The above equation is analogous to a single round of sum-product belief propagation. Convergence to a global optimum is not guaranteed, since this spatial model is not tree-structured. However, the inferred solution is sufficiently accurate for all poses in the datasets used in this research.<br />
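<br />
The marginal above can be illustrated with a toy example in which a confident face detection disambiguates two shoulder candidates. All names, map sizes and kernels are invented for illustration; in particular, the set of incoming messages (here one from the face and one from the shoulder map itself, through an identity kernel) is an assumption of this sketch.<br />

```python
import numpy as np


def conv2d_same(image, kernel):
    """Zero-padded 2-D convolution with the output shape of `image`
    (odd kernel sizes assumed)."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    flipped = kernel[::-1, ::-1]              # flip for a true convolution
    out = np.empty(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * flipped)
    return out


def spatial_model_marginal(unary, incoming):
    """p_bar_A = (1/Z) * prod_v (p_{A|v} * p_v + b_{v->A}).

    unary:    {v: heat-map p_v}
    incoming: {v: (kernel p_{A|v}, bias b_{v->A})}
    """
    product = None
    for v, (kernel, bias) in incoming.items():
        message = conv2d_same(unary[v], kernel) + bias
        product = message if product is None else product * message
    return product / product.sum()            # division by Z


face = np.zeros((9, 9)); face[2, 4] = 1.0                       # confident face detection
shoulder = np.zeros((9, 9)); shoulder[4, 4] = shoulder[7, 1] = 0.5   # two candidates
prior = np.zeros((5, 5)); prior[4, 2] = 1.0                     # shoulder lies 2 rows below the face
identity = np.zeros((3, 3)); identity[1, 1] = 1.0
marginal = spatial_model_marginal(
    {"face": face, "shoulder": shoulder},
    {"face": (prior, 0.01), "shoulder": (identity, 0.01)},
)
print(np.unravel_index(marginal.argmax(), marginal.shape))
```

The candidate that is consistent with the face position receives almost all of the probability mass, while the bias keeps the inconsistent candidate from being zeroed out completely.<br />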
<br />
For their practical implementation, they treat the distributions above as energies to avoid the evaluation of Z in the equation before. Their final model is<br />
<br />
<math>\bar{e}_{A}=\mathrm{exp}\left ( \sum_{v\in V}^{ }\left [ \mathrm{log}\left ( \mathrm{SoftPlus}\left ( e_{A|v} \right )*\mathrm{ReLU}\left ( e_{v} \right )+\mathrm{SoftPlus}\left ( b_{v\rightarrow A} \right ) \right ) \right ] \right )</math><br />
<br />
where <math>\mathrm{SoftPlus}\left ( x \right )=\frac{1}{\beta }\mathrm{log}\left ( 1+\mathrm{exp}\left ( \beta x \right ) \right ), \quad 0.5\leq \beta \leq 2</math><br />
<br />
<br />
and <math>\mathrm{ReLU}\left ( x \right )=\mathrm{max}\left ( x,\epsilon \right ), \quad 0< \epsilon \leq 0.01</math><br />
<br />
This model replaces the outer multiplication of the final marginal likelihood with a log-space addition to improve numerical stability and to prevent coupling of the convolution output gradients (the addition in log space means that the partial derivative of the loss function with respect to the convolution output does not depend on the output of any other stage).<br />With this modified formulation, the model can be trained using back-propagation and SGD. The network-based implementation of the equation is shown below.<br />
<br />
[[File:architecture2.PNG | center]]<br />
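<br />
The energy-based formulation can also be sketched numerically. The helper names, the random inputs and the values of β and ε are illustrative assumptions (chosen inside the ranges stated above); the sketch only shows the forward computation, not training.<br />

```python
import numpy as np


def softplus(x, beta=1.0):
    """SoftPlus(x) = (1/beta) * log(1 + exp(beta * x)); always positive."""
    return np.logaddexp(0.0, beta * x) / beta


def clamped_relu(x, eps=0.01):
    """ReLU(x) = max(x, eps); the small positive floor keeps log() finite."""
    return np.maximum(x, eps)


def conv2d_same(image, kernel):
    """Zero-padded 2-D convolution with the output shape of `image` (odd kernels)."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    flipped = kernel[::-1, ::-1]
    out = np.empty(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * flipped)
    return out


def message(e_v, e_prior, e_bias):
    """SoftPlus(e_{A|v}) * ReLU(e_v) + SoftPlus(b_{v->A}) for one joint v."""
    return conv2d_same(clamped_relu(e_v), softplus(e_prior)) + softplus(e_bias)


def spatial_energy(part_energies, incoming):
    """e_bar_A = exp( sum_v log(message_v) ): the product is taken in log space."""
    log_sum = 0.0
    for v, (e_prior, e_bias) in incoming.items():
        log_sum = log_sum + np.log(message(part_energies[v], e_prior, e_bias))
    return np.exp(log_sum)


rng = np.random.default_rng(0)
part_energies = {"face": rng.standard_normal((6, 6)), "shoulder": rng.standard_normal((6, 6))}
incoming = {
    "face": (rng.standard_normal((3, 3)), -1.0),
    "shoulder": (rng.standard_normal((3, 3)), -1.0),
}
energy = spatial_energy(part_energies, incoming)
print(energy.shape, energy.min() > 0)
```

Because every message is strictly positive, the logarithm is always defined, and the log-space sum gives the same result as multiplying the messages directly.<br />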
<br />
The convolution sizes are adjusted so that the largest joint displacement is covered within the convolution<br />
window. For the 90x60 pixel heat-map output, this results in large 128x128 convolution<br />
kernels to account for a joint displacement radius of 64 pixels (padding is added on the<br />
heat-map input to prevent pixel loss).<br />
The convolution kernels used in this step are quite large, so the authors apply GPU-based FFT convolutions, as introduced by Mathieu et al.<ref><br />
Mathieu M, Henaff M, LeCun Y. [http://arxiv.org/pdf/1312.5851.pdf Fast training of convolutional networks through FFTs] arXiv preprint arXiv:1312.5851, 2013.<br />
</ref>. The convolution weights are initialized using the empirical histogram of joint displacements created from the training examples. Moreover, during training the heat-map inputs are randomly flipped and scaled to improve generalization performance.<br />
<br />
=== Unified Model ===<br />
<br />
They first train the Part-Detector separately and store the heat-map outputs, then use these heat-maps to train a Spatial-Model. Finally, they combine the trained Part-Detector and Spatial-Models and back-propagate through the entire network, which further improves performance.<br />
Because the Spatial-Model is able to effectively reduce the output dimension of possible heat-map activations, the Part-Detector can use its available learning capacity to better localize the precise target activation.<br />
<br />
== Results ==<br />
<br />
They evaluated their architecture on the FLIC and extended-LSP datasets. The FLIC dataset is comprised of 5003 images from Hollywood movies with actors in predominantly front-facing standing up poses, while the extended-LSP dataset contains a wider variety of poses of athletes playing sport. They also proposed a new dataset called FLIC-plus<ref><br />
[http://cims.nyu.edu/~tompson/flic_plus.htm "FLIC-plus Dataset"]<br />
</ref> which is fairer than FLIC-full dataset.<br />
<br />
Their model’s performance on the FLIC test-set for the elbow and wrist joints is shown below. It’s trained by using both the FLIC and FLIC-plus training sets.<br />
<br />
[[File:result1.PNG | center]]<br />
<br />
Performance on the LSP dataset is shown here.<br />
<br />
[[File:result2.PNG | center]]<br />
<br />
Since the LSP dataset covers a larger range of possible poses, their Spatial-Model is less effective, and the accuracy on this dataset is lower than on FLIC. They believe that increasing the size of the training set will improve performance for these difficult cases.<br />
<br />
The following figure shows the predicted joint locations for a variety of inputs in the FLIC and LSP test-sets. The<br />
network produces convincing results on the FLIC dataset (with low joint position error). However,<br />
because the simple Spatial-Model is less effective for a number of the highly articulated poses in<br />
the LSP dataset, the detector produces incorrect joint predictions for some images.<br />
<br />
[[File:M2.png | center]]<br />
<br />
== Conclusion ==<br />
<br />
This paper shows that the unification of a novel ConvNet Part-Detector and an MRF-inspired Spatial-Model<br />
into a single learning framework significantly outperforms existing architectures on the task<br />
of human body pose recognition. Training and inference of the architecture uses commodity level<br />
hardware and runs at close to real-time frame rates, making this technique tractable for a wide variety<br />
of application areas.<br />
<br />
== Bibliography ==<br />
<references /></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Moment.PNG&diff=27265File:Moment.PNG2015-12-13T01:10:06Z<p>Rtwang: </p>
<hr />
<div></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Nmomentum.PNG&diff=27264File:Nmomentum.PNG2015-12-13T00:58:33Z<p>Rtwang: </p>
<hr />
<div></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=very_Deep_Convoloutional_Networks_for_Large-Scale_Image_Recognition&diff=27248very Deep Convoloutional Networks for Large-Scale Image Recognition2015-12-13T00:21:46Z<p>Rtwang: /* References */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper<ref><br />
Simonyan, Karen, and Andrew Zisserman. [http://arxiv.org/pdf/1409.1556.pdf "Very deep convolutional networks for large-scale image recognition."] arXiv preprint arXiv:1409.1556 (2014).</ref> the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting is investigated. It was demonstrated that the representation depth is beneficial for the<br />
classification accuracy, and the main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters. The authors fix the other parameters of the architecture and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3 × 3) convolution filters in all layers. As a result, they come up with significantly more accurate ConvNet architectures.<br />
<br />
= Conv.Net Configurations =<br />
<br />
Architecture:<br />
<br />
During training, the image is passed through a stack of convolutional (conv.) layers with filters with a very small receptive field: 3 × 3. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers. Max-pooling is performed over a 2 × 2 pixel window, with stride 2. The stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers. The final layer is the soft-max layer, and all hidden layers are equipped with the rectification non-linearity.<br />
<br />
They don't implement Local Response Normalization (LRN) as they found such normalization does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.<br />
<br />
Configuration:<br />
<br />
The ConvNet configurations, evaluated in this paper, are outlined in the following table:<br />
<br />
<br />
[[File:4.PNG | center]]<br />
<br />
<br />
All configurations follow the aforementioned architecture and differ only in the depth from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers) (the added layers are shown in bold). Besides, the width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.<br />
<br />
As shown in the table, multiple convolutional layers with small filters are used without any max-pooling layer between them. It is easy to show that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5, and a stack of three has an effective receptive field of 7×7. Using a stack of two or three 3×3 conv. layers instead of a single layer with a larger filter has 2 main advantages:<br />
1) Two or three non-linear rectification layers are incorporated instead of a single one, which makes the decision function more discriminative.<br />
2) The number of parameters is decreased.<br />
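<br />
The arithmetic behind both points can be checked with a short sketch. It assumes that every layer in the stack has the same number of input and output channels and ignores biases; the channel width used below is a hypothetical value chosen only for illustration.<br />

```python
def conv_stack_weights(kernel_size, num_layers, channels):
    """Weights (biases ignored) in a stack of square conv. layers,
    each mapping `channels` input channels to `channels` output channels."""
    return num_layers * kernel_size * kernel_size * channels * channels


def stacked_receptive_field(kernel_size, num_layers):
    """Effective receptive field of stride-1 conv. layers stacked without pooling."""
    return 1 + num_layers * (kernel_size - 1)


C = 256  # hypothetical channel width, for illustration only

# Two 3x3 layers cover a 5x5 window; three cover the same 7x7 window as one 7x7 layer.
assert stacked_receptive_field(3, 2) == 5
assert stacked_receptive_field(3, 3) == stacked_receptive_field(7, 1) == 7

three_small = conv_stack_weights(3, 3, C)  # 27 * C^2 weights
one_large = conv_stack_weights(7, 1, C)    # 49 * C^2 weights
print(three_small, one_large, one_large / three_small)
```

Under these assumptions the three-layer 3×3 stack needs 27C² weights against 49C² for a single 7×7 layer, while applying three rectifications instead of one.<br />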
<br />
Although a 1×1 convolution is essentially a linear projection onto a space of the same dimensionality, the rectification function that follows it introduces additional non-linearity. Incorporating 1 × 1 conv. layers (configuration C) is therefore a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers.<br />
<br />
= Classification Framework =<br />
<br />
In this section, the details of classification ConvNet training and evaluation are described.<br />
<br />
===Training===<br />
<br />
Training is carried out by optimizing the multinomial logistic regression objective using mini-batch gradient descent with momentum. Initial weights for some layers were obtained from configuration “A” which is shallow enough to be trained with random initialization. The intermediate layers in deep models were initialized randomly.<br />
In spite of the larger number of parameters and the greater depth of the introduced nets, these nets required relatively fewer epochs to converge due to the following reasons:<br />
(a) implicit regularization imposed by greater depth and smaller conv. filter sizes.<br />
(b) using pre-initialization of certain layers.<br />
<br />
With respect to (b) above, the shallowest configuration (A in the previous table) was trained using random initialization. For all the other configurations, the first four convolutional layers and the last 3 fully connected layers were initialized with the corresponding parameters from A, to avoid getting stuck during training due to a bad initialization. All other layers were randomly initialized by sampling from a normal distribution with 0 mean.<br />
<br />
During training, the input to the ConvNets is a fixed-size 224 × 224 RGB image. To obtain this fixed-size input, a 224 × 224 patch is cropped from the rescaled training image (one crop per image per SGD iteration). The rescaling requires a training scale, from which the ConvNet input is cropped, to be determined.<br />
Two approaches for setting the training scale S (where S is the smallest side of an isotropically-rescaled training image) are considered:<br />
1) single-scale training, that requires a fixed S. <br />
2) multi-scale training, where each training image is individually rescaled by randomly sampling S from a certain range [Smin, Smax] .<br />
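<br />
The two approaches can be sketched in a few lines of Python. This is only an illustration of the sampling geometry, not the authors' pipeline: the function name, the default range [256, 512] and the uniformly random crop position are assumptions of this sketch, and no pixel data is processed. Single-scale training corresponds to setting <code>s_min == s_max</code>.<br />

```python
import random


def sample_training_crop(width, height, s_min=256, s_max=512, crop=224, rng=random):
    """Pick a training scale S, rescale isotropically, and choose a crop window.

    Returns (S, new_width, new_height, left, top). Default values are
    illustrative assumptions; s_min must be at least `crop`.
    """
    s = rng.randint(s_min, s_max)        # multi-scale: S sampled from [s_min, s_max]
    scale = s / min(width, height)       # isotropic: one factor for both sides
    new_w = max(s, round(width * scale))
    new_h = max(s, round(height * scale))
    left = rng.randint(0, new_w - crop)  # crop position (assumed uniformly random)
    top = rng.randint(0, new_h - crop)
    return s, new_w, new_h, left, top


# Example: a 500 x 375 image; the smallest side becomes S after rescaling.
print(sample_training_crop(500, 375, rng=random.Random(0)))
```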
<br />
===Implementation===<br />
<br />
To improve the overall training speed of each model, the researchers parallelized the mini-batch gradient descent process. Since the model is very deep, training on a single GPU would take months to finish. To speed up the process, the researchers processed a separate batch of images on each GPU in parallel to calculate the gradients. For example, with 4 GPUs, the model takes 4 batches of images, calculates their gradients separately, and then averages the four sets of gradients to obtain the update. Krizhevsky (2014) introduced more sophisticated ways to parallelize the training of convolutional neural networks, but the researchers found that this simple configuration sped up training by a factor of 3.75 with 4 GPUs; with a possible maximum of 4, the simple configuration worked well enough.<br />
Finally, it took 2–3 weeks to train a single net by using four NVIDIA Titan Black GPUs.<br />
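<br />
The reason averaging works can be seen on a toy problem. The sketch below uses a one-parameter least-squares model rather than a ConvNet (an assumption made purely to keep the example short) and shows that averaging the gradients of equally sized shards reproduces the gradient of the full batch.<br />

```python
def mean_gradient(w, batch):
    """Gradient with respect to w of the mean squared error of y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(w, batch, num_workers=4):
    """Split the batch into equal shards (one per GPU in the real setup),
    compute one gradient per shard, then average the shard gradients.
    Assumes len(batch) is divisible by num_workers."""
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    shard_grads = [mean_gradient(w, shard) for shard in shards]
    return sum(shard_grads) / num_workers


batch = [(float(x), 3.0 * x + 1.0) for x in range(8)]  # toy data, 8 examples
full = mean_gradient(0.5, batch)
parallel = data_parallel_gradient(0.5, batch, num_workers=4)
assert abs(full - parallel) < 1e-9
```

Because the result matches the full-batch gradient, the scheme changes only how the work is distributed, which is consistent with the reported speed-up of 3.75 out of a possible 4.<br />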
<br />
===Testing===<br />
<br />
At test time, in order to classify the input image:<br />
First, it is isotropically rescaled to a pre-defined smallest image side, denoted as Q. <br />
Then, the network is applied densely over the rescaled test image in a way that the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers).<br />
The resulting fully-convolutional net is then applied to the whole (uncropped) image.<br />
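<br />
A small sketch shows why the converted network accepts images of any size. It assumes that the conv. layers preserve the spatial size of their input (an assumption of this sketch, not stated above), that each of the five 2 × 2, stride-2 max-pooling layers halves it, and that the converted first FC layer is an unpadded 7 × 7 convolution.<br />

```python
def class_score_map_side(q, num_pools=5, first_fc_kernel=7):
    """Side length of the class score map for a square Q x Q test image.

    Assumes size-preserving conv. layers, max-pools that halve the side
    (with flooring), and an unpadded 7x7 convolution for the first FC layer.
    """
    side = q
    for _ in range(num_pools):
        side //= 2
    return side - first_fc_kernel + 1


for q in (224, 256, 384, 512):
    print(q, class_score_map_side(q))
```

With Q = 224 the map is a single position, which matches the fixed-size training input; for larger Q the net produces one score vector per position, and these would still have to be combined into a single vector of class scores for the image.<br />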
<br />
= Classification Experiments =<br />
In this section, the image classification results on the ILSVRC-2012 dataset are described:<br />
<br />
== Single-Scale Evaluation ==<br />
<br />
In the first part of the experiment, the test image size was set as Q = S for fixed S, and Q = 0.5(Smin + Smax) for jittered S. One important result of this evaluation was that the classification error decreases with increased ConvNet depth.<br />
Moreover, the worse performance of the configuration with 1x1 filters (C) in comparison with the one with 3x3 filters (D) indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C).<br />
Finally, scale jittering at training time leads to significantly better results than training on images with fixed smallest side. This confirms that training set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.<br />
<br />
[[File:ConvNet1.PNG | center]]<br />
<br />
== Multi-Scale Evaluation ==<br />
<br />
In addition to single scale evaluation stated in the previous section, in this paper, the effect of scale jittering at test time is assessed by running a model over several rescaled versions of a test image (corresponding to different values of Q), followed by averaging the resulting class posteriors. The results indicate that scale jittering at test time leads to better performance (as compared to evaluating the same model at a single scale).<br />
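<br />
The averaging step can be sketched as follows. The posterior values are made up for illustration, and the helper name is hypothetical; in the real setting each vector would come from running the same network on the test image rescaled to a different Q.<br />

```python
def average_posteriors(posteriors_per_scale):
    """Average class-posterior vectors obtained at different test scales Q."""
    num_scales = len(posteriors_per_scale)
    num_classes = len(posteriors_per_scale[0])
    return [sum(p[c] for p in posteriors_per_scale) / num_scales
            for c in range(num_classes)]


# Made-up posteriors for 3 classes at three values of Q.
per_scale = [[0.6, 0.3, 0.1], [0.4, 0.5, 0.1], [0.5, 0.3, 0.2]]
avg = average_posteriors(per_scale)
predicted = max(range(len(avg)), key=avg.__getitem__)  # index of the most probable class
print(avg, predicted)
```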
<br />
Their best single-network performance on the validation set is 24.8%/7.5% top-1/top-5 error. On the test set, the configuration E achieves 7.3% top-5 error.<br />
<br />
[[File:ConvNet2.PNG | center]]<br />
<br />
== Comparison With The State Of The Art ==<br />
<br />
Their very deep ConvNets significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions.<br />
<br />
[[File:ConvNet3.PNG | center]]<br />
<br />
= Appendix A: Localization =<br />
<br />
In addition to classification, the introduced architectures have been used for localization. To perform object localisation, a very deep ConvNet is used in which the last fully connected layer predicts the bounding box location instead of the class scores. Apart from the last bounding box prediction layer, the ConvNet architecture D, which was found to be the best-performing in the classification task, is used, and the training of localisation ConvNets is similar to that of the classification ConvNets. The main difference is that the logistic regression objective is replaced with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground-truth.<br />
Two testing protocols are considered:<br />
The first is used for comparing different network modifications on the validation set, and considers only the bounding box prediction for the ground truth class. (The bounding box is obtained by applying the network only to the central crop of the image.)<br />
The second, fully-fledged, testing procedure is based on the dense application of the localization ConvNet to the whole image, similarly to the classification task.<br />
<br />
The localization experiments indicate that the very deep ConvNets achieve considerably better results with a simpler localization method, owing to their more powerful representation.<br />
<br />
= Conclusion =<br />
<br />
Very deep ConvNets are introduced in this paper. The results show that these configurations perform well on classification and localization and significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions. Details and more results on these competitions can be found here.<ref><br />
Russakovsky, Olga, et al. [http://arxiv.org/pdf/1409.0575v3.pdf "Imagenet large scale visual recognition challenge."] International Journal of Computer Vision (2014): 1-42.<br />
</ref> They also showed that their configuration is applicable to some other datasets.<br />
<br />
= Resources =<br />
<br />
The Oxford Visual Geometry Group (VGG) has released code for their 16-layer and 19-layer models. The code is available on their [http://www.robots.ox.ac.uk/~vgg/research/very_deep/ website] in the format used by the [http://caffe.berkeleyvision.org/ Caffe] toolbox and includes the weights of the pretrained networks.<br />
<br />
=References=<br />
<references /><br />
<br />
Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014.</div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=very_Deep_Convoloutional_Networks_for_Large-Scale_Image_Recognition&diff=27247very Deep Convoloutional Networks for Large-Scale Image Recognition2015-12-13T00:21:01Z<p>Rtwang: /* Classification Framework */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper<ref><br />
Simonyan, Karen, and Andrew Zisserman. [http://arxiv.org/pdf/1409.1556.pdf "Very deep convolutional networks for large-scale image recognition."] arXiv preprint arXiv:1409.1556 (2014).</ref> the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting is investigated. It was demonstrated that the representation depth is beneficial for the<br />
classification accuracy and the main contribution is a thorough evaluation of networks of increasing depth using a certain architecture with very small (3×3) convolution filters. Basically, they fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3 × 3) convolution filters in all layers. As a result, they come up with significantly more accurate ConvNet architectures.<br />
<br />
= Conv.Net Configurations =<br />
<br />
Architecture:<br />
<br />
During training, the image is passed through a stack of convolutional (conv.) layers whose filters have a very small receptive field: 3 × 3. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers. Max-pooling is performed over a 2 × 2 pixel window, with stride 2. A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers. The final layer is the soft-max layer and all hidden layers are equipped with the rectification non-linearity.<br />
<br />
They don't implement Local Response Normalization (LRN) as they found such normalization does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.<br />
<br />
Configuration:<br />
<br />
The ConvNet configurations, evaluated in this paper, are outlined in the following table:<br />
<br />
<br />
[[File:4.PNG | center]]<br />
<br />
<br />
All configurations follow the aforementioned architecture and differ only in the depth from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers) (the added layers are shown in bold). Besides, the width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.<br />
<br />
As shown in the table, multiple convolutional layers with small filters are used without any max-pooling layer between them. It is easy to show that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5, and a stack of three has an effective receptive field of 7×7. Using a stack of two or three 3×3 conv. layers instead of a single layer with a larger filter has 2 main advantages:<br />
1) Two or three non-linear rectification layers are incorporated instead of a single one, which makes the decision function more discriminative.<br />
2) The number of parameters is decreased.<br />
<br />
Although a 1×1 convolution is essentially a linear projection onto a space of the same dimensionality, the rectification function that follows it introduces additional non-linearity. Incorporating 1 × 1 conv. layers (configuration C) is therefore a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers.<br />
<br />
= Classification Framework =<br />
<br />
In this section, the details of classification ConvNet training and evaluation are described.<br />
<br />
===Training===<br />
<br />
Training is carried out by optimizing the multinomial logistic regression objective using mini-batch gradient descent with momentum. Initial weights for some layers were obtained from configuration “A” which is shallow enough to be trained with random initialization. The intermediate layers in deep models were initialized randomly.<br />
In spite of the larger number of parameters and the greater depth of the introduced nets, these nets required relatively fewer epochs to converge due to the following reasons:<br />
(a) implicit regularization imposed by greater depth and smaller conv. filter sizes.<br />
(b) using pre-initialization of certain layers.<br />
<br />
With respect to (b) above, the shallowest configuration (A in the previous table) was trained using random initialization. For all the other configurations, the first four convolutional layers and the last 3 fully connected layers were initialized with the corresponding parameters from A, to avoid getting stuck during training due to a bad initialization. All other layers were randomly initialized by sampling from a normal distribution with 0 mean.<br />
<br />
During training, the input to the ConvNets is a fixed-size 224 × 224 RGB image. To obtain this fixed-size input, a 224 × 224 patch is cropped from the rescaled training image (one crop per image per SGD iteration). The rescaling requires a training scale, from which the ConvNet input is cropped, to be determined.<br />
Two approaches for setting the training scale S (where S is the smallest side of an isotropically-rescaled training image) are considered:<br />
1) single-scale training, that requires a fixed S. <br />
2) multi-scale training, where each training image is individually rescaled by randomly sampling S from a certain range [Smin, Smax] .<br />
<br />
===Implementation===<br />
<br />
To improve the overall training speed of each model, the researchers parallelized the mini-batch gradient descent process. Since the model is very deep, training on a single GPU would take months to finish. To speed up the process, the researchers processed a separate batch of images on each GPU in parallel to calculate the gradients. For example, with 4 GPUs, the model takes 4 batches of images, calculates their gradients separately, and then averages the four sets of gradients to obtain the update. Krizhevsky (2014) introduced more sophisticated ways to parallelize the training of convolutional neural networks, but the researchers found that this simple configuration sped up training by a factor of 3.75 with 4 GPUs; with a possible maximum of 4, the simple configuration worked well enough.<br />
Finally, it took 2–3 weeks to train a single net by using four NVIDIA Titan Black GPUs.<br />
<br />
===Testing===<br />
<br />
At test time, in order to classify the input image:<br />
First, it is isotropically rescaled to a pre-defined smallest image side, denoted as Q. <br />
Then, the network is applied densely over the rescaled test image in a way that the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers).<br />
The resulting fully-convolutional net is then applied to the whole (uncropped) image.<br />
<br />
= Classification Experiments =<br />
In this section, the image classification results on the ILSVRC-2012 dataset are described:<br />
<br />
== Single-Scale Evaluation ==<br />
<br />
In the first part of the experiment, the test image size was set as Q = S for fixed S, and Q = 0.5(Smin + Smax) for jittered S. One important result of this evaluation was that the classification error decreases with increased ConvNet depth.<br />
Moreover, the worse performance of the configuration with 1x1 filters (C) in comparison with the one with 3x3 filters (D) indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C).<br />
Finally, scale jittering at training time leads to significantly better results than training on images with fixed smallest side. This confirms that training set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.<br />
<br />
[[File:ConvNet1.PNG | center]]<br />
<br />
== Multi-Scale Evaluation ==<br />
<br />
In addition to single scale evaluation stated in the previous section, in this paper, the effect of scale jittering at test time is assessed by running a model over several rescaled versions of a test image (corresponding to different values of Q), followed by averaging the resulting class posteriors. The results indicate that scale jittering at test time leads to better performance (as compared to evaluating the same model at a single scale).<br />
<br />
Their best single-network performance on the validation set is 24.8%/7.5% top-1/top-5 error. On the test set, the configuration E achieves 7.3% top-5 error.<br />
<br />
[[File:ConvNet2.PNG | center]]<br />
<br />
== Comparison With The State Of The Art ==<br />
<br />
Their very deep ConvNets significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions.<br />
<br />
[[File:ConvNet3.PNG | center]]<br />
<br />
= Appendix A: Localization =<br />
<br />
In addition to classification, the introduced architectures have been used for localization. To perform object localisation, a very deep ConvNet is used in which the last fully connected layer predicts the bounding box location instead of the class scores. Apart from the last bounding box prediction layer, the ConvNet architecture D, which was found to be the best-performing in the classification task, is used, and the training of localisation ConvNets is similar to that of the classification ConvNets. The main difference is that the logistic regression objective is replaced with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground-truth.<br />
Two testing protocols are considered:<br />
The first is used for comparing different network modifications on the validation set, and considers only the bounding box prediction for the ground truth class. (The bounding box is obtained by applying the network only to the central crop of the image.)<br />
The second, fully-fledged, testing procedure is based on the dense application of the localization ConvNet to the whole image, similarly to the classification task.<br />
<br />
The localization experiments indicate that the very deep ConvNets achieve considerably better results with a simpler localization method, owing to their more powerful representation.<br />
<br />
= Conclusion =<br />
<br />
Very deep ConvNets are introduced in this paper. The results show that these configurations perform well on classification and localization and significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions. Details and more results on these competitions can be found here.<ref><br />
Russakovsky, Olga, et al. [http://arxiv.org/pdf/1409.0575v3.pdf "Imagenet large scale visual recognition challenge."] International Journal of Computer Vision (2014): 1-42.<br />
</ref> They also showed that their configuration is applicable to some other datasets.<br />
<br />
= Resources =<br />
<br />
The Oxford Visual Geometry Group (VGG) has released code for their 16-layer and 19-layer models. The code is available on their [http://www.robots.ox.ac.uk/~vgg/research/very_deep/ website] in the format used by the [http://caffe.berkeleyvision.org/ Caffe] toolbox and includes the weights of the pretrained networks.<br />
<br />
=References=<br />
<references /></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations&diff=27179learning Phrase Representations2015-12-12T17:52:53Z<p>Rtwang: /* Alternative Models */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, . . . , y_{T'} | x_1, . . . , x_T )</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes.<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state <math>h_t</math>. However, as shown in Fig. 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by<br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,y_{t-1},\mathbf{c})</math> <br/><br />
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of the model parameters and each <math>(\mathbf{x}_n,\mathbf{y}_n)</math> is an (input sequence, output sequence) pair from the training set. Since the output of the decoder, starting from the input, is differentiable, the model parameters can be estimated by a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, the model can be used in two ways: to generate a target sequence given an input sequence, or to score a given pair of input and output sequences.<br />
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
This paper also proposes a new type of hidden unit that is inspired by the LSTM but is much simpler to compute and implement. Fig. 2 shows the graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h˜. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically, it can be written as follows (<math>\sigma</math> is the logistic sigmoid function, <math>[.]_j</math> denotes the j-th element of a vector and <math>\odot</math> denotes elementwise multiplication):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is later found to be irrelevant, thus allowing a more compact representation. On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref><br />
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks[C]//Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013: 8624-8628.<br />
</ref><br />
<br />
Because each hidden unit has separate gates, each hidden unit can learn to capture dependencies over different lengths of time (determined by how often its reset and update gates are active).<br />
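<br />
The update equations of this section can be written out as a minimal pure-Python sketch of a single time step. The helper names are hypothetical, the choice of <code>tanh</code> for <math>\phi</math> is an assumption of this sketch (the summary above does not specify <math>\phi</math>), and biases are omitted as in the equations.<br />

```python
import math


def matvec(m, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gated_step(x, h_prev, W, U, W_r, U_r, W_z, U_z):
    """One step of the gated hidden unit, following the equations above."""
    r = [sigmoid(a + b) for a, b in zip(matvec(W_r, x), matvec(U_r, h_prev))]  # reset gate
    z = [sigmoid(a + b) for a, b in zip(matvec(W_z, x), matvec(U_z, h_prev))]  # update gate
    reset_h = [rj * hj for rj, hj in zip(r, h_prev)]  # r (elementwise) h_{t-1}
    h_tilde = [math.tanh(a + b)                        # phi assumed to be tanh
               for a, b in zip(matvec(W, x), matvec(U, reset_h))]
    # z keeps the old state, (1 - z) lets in the candidate state
    return [zj * hj + (1.0 - zj) * cj for zj, hj, cj in zip(z, h_prev, h_tilde)]


# With all-zero weights both gates equal 0.5 and the candidate state is 0,
# so the new state is half of the previous one.
zeros = [[0.0, 0.0], [0.0, 0.0]]
h_new = gated_step([1.0, 2.0], [0.4, -0.2], zeros, zeros, zeros, zeros, zeros, zeros)
print(h_new)
```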
<br />
=Scoring Phrase Pairs with RNN Encoder–Decoder =<br />
<br />
In a commonly used statistical machine translation system (SMT), the goal of the system (decoder, specifically) is to find a translation f given a source sentence e, which maximizes<br />
<br />
<math>p(\mathbf{f} | \mathbf{e})\propto p(\mathbf{e} | \mathbf{f})p(\mathbf{f})</math><br />
<br />
where the first term at the right hand side is called translation model and the latter language model (see, e.g., (Koehn, 2005)). In practice, however, most SMT systems model <math>\log p(\mathbf{f} | \mathbf{e})</math> as a loglinear model with additional features and corresponding weights:<br />
<br />
<math>\log p(\mathbf{f} | \mathbf{e})=\sum_{n=1}^Nw_nf_n(\mathbf{f},\mathbf{e})+\log Z(\mathbf{e})</math><br />
<br />
where <math>f_n</math> and <math>w_n</math> are the n-th feature and weight, respectively. <math>Z(\mathbf{e})</math> is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.<br />
<br />
Cho et al. propose to train the RNN Encoder–Decoder on a table of phrase pairs and use its scores as additional features in the loglinear model shown above when tuning the SMT decoder. For training the RNN Encoder–Decoder, phrase frequency is ignored for several reasons: to reduce computation time, to ensure the model does not simply rank phrases by frequency, and because frequency information is already encoded in the features for the SMT (so it is better not to use the capacity of the RNN Encoder–Decoder redundantly).<br />
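<br />
How an extra feature enters the loglinear model can be illustrated with a toy ranking example. All weights and feature values below are hypothetical; the third feature merely stands in for a phrase score such as the one produced by the RNN Encoder–Decoder. Since the normalization term depends only on the source sentence, it can be dropped when comparing candidates for the same source.<br />

```python
def loglinear_score(weights, feature_values):
    """Unnormalised log-linear score: sum over n of w_n * f_n(f, e)."""
    return sum(w * f for w, f in zip(weights, feature_values))


# Hypothetical weights and log-domain feature values for two candidate
# translations of the same source sentence.
weights = [1.0, 0.5, 0.8]
candidates = {
    "candidate A": [-2.0, -1.0, -0.5],
    "candidate B": [-1.5, -2.0, -1.5],
}
best = max(candidates, key=lambda name: loglinear_score(weights, candidates[name]))
print(best)
```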
<br />
=Alternative Models=<br />
The researchers noted a number of other potential translation models and their usability.<br />
<br />
The first model is by Schwenk and it is an application of a variant of the continuous space language model to the task of machine translation. The model is essentially a feedforward neural network with a common projection for input words encoded as bag of words vectors. Schwenk fixed the input and output sentence length and for a fixed length, the neural network attempts to estimate the probability of the output sequence of words and score potential translations. However, a major disadvantage is that the input and output length are fixed and cannot handle variable length inputs or outputs.<br />
<br />
The model figure<ref><br />
[Schwenk2012] Holger Schwenk. 2012. Continuous<br />
space translation models for phrase-based statistical<br />
machine translation. In Martin Kay and Christian<br />
Boitet, editors, Proceedings of the 24th International<br />
Conference on Computational Linguistics<br />
(COLING), pages 1071–1080.<br />
</ref>:<br />
<br />
[[File:CONTINUOUS.PNG]]<br />
<br />
Another model, similar to Schwenk's, is by Devlin and also uses a feedforward neural network. Rather than estimating the probability of the entire output sequence of words as in Schwenk's model, Devlin's model estimates only the probability of the next word, using both a portion of the input sentence and a portion of the output sentence. It reported impressive improvements but, like Schwenk's model, it fixes the length of the input prior to training.<br />
<br />
=Experiments =<br />
<br />
The model is evaluated on the English/French translation task of the WMT’14 workshop. In building the model, Cho et al. used a baseline phrase-based SMT system and a neural language model (CSLM)<ref><br />
Schwenk H, Costa-Jussa M R, Fonollosa J A R. Continuous space language models for the IWSLT 2006 task[C]//IWSLT. 2006: 166-173.<br />
</ref><br />
<br />
They tried the following combinations:<br />
1. Baseline configuration<br />
2. Baseline + RNN<br />
3. Baseline + CSLM + RNN<br />
4. Baseline + CSLM + RNN + Word penalty<br />
<br />
Results:<br />
<br />
<center><br />
[[File:encdec3.png |frame | center |Fig 3. BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, which penalizes the number of words unknown to the neural networks. ]]<br />
</center><br />
<br />
The best performance was achieved when both the CSLM and the phrase scores from the RNN Encoder–Decoder were used. This suggests that the contributions of the CSLM and the RNN Encoder–Decoder are not too correlated and that one can expect better results by improving each method independently.<br />
<br />
<br />
== Word and Phrase Representations ==<br />
<br />
As the presented model maps sentences into a continuous space vector and prior continuous space language models have been known to learn semantically meaningful embeddings, one could expect this to happen for the presented model, too. This is indeed the case. When projecting to a 2D space (with Barnes-Hut-SNE), semantically similar words are clearly clustered.<br />
<br />
[[File:Fig4.png]]<br />
<br />
Phrases are also clustered capturing both semantic and syntactic structures.<br />
<br />
[[File:Fig5.png]]<br />
<br />
= References=<br />
<references /></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations&diff=27178learning Phrase Representations2015-12-12T17:51:31Z<p>Rtwang: /* Alternative Models */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, . . . , y_{T'} | x_1, . . . , x_T )</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes.<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state <math>h_t</math>. However, as shown in Fig. 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by<br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,y_{t-1},\mathbf{c})</math> <br/>
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of the model parameters and each <math>(\mathbf{x}_n,\mathbf{y}_n)</math> is an (input sequence, output sequence) pair from the training set. Since the output of the decoder, computed starting from the input, is differentiable, the model parameters can be estimated by a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, the model can be used in two ways: to generate a target sequence given an input sequence, or to score a given pair of input and output sequences.<br />
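<br />
The scoring use can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the paper's implementation: the hidden states are scalars, the vocabulary is {0, 1}, the functions <code>f_enc</code>, <code>f_dec</code> and <code>g</code> and all constants are hypothetical, the decoder state is simply started from <code>h0</code>, and end-of-sequence handling is omitted.<br />

```python
import math
from itertools import product

def encode(xs, f_enc, h0):
    """Run the encoder recurrence h_t = f(h_{t-1}, x_t); the final state is the summary c."""
    h = h0
    for x in xs:
        h = f_enc(h, x)
    return h

def log_prob(xs, ys, f_enc, f_dec, g, h0, y_start):
    """Return log p(ys | xs) as a sum of per-step conditional log-probabilities."""
    c = encode(xs, f_enc, h0)
    h, y_prev, total = h0, y_start, 0.0
    for y in ys:
        h = f_dec(h, y_prev, c)      # h_t = f(h_{t-1}, y_{t-1}, c)
        probs = g(h, y_prev, c)      # distribution over the next symbol
        total += math.log(probs[y])
        y_prev = y
    return total

# Toy instantiation (all hypothetical): scalar states, symbols 0 and 1.
def f_enc(h, x):
    return math.tanh(0.5 * h + x)

def f_dec(h, y_prev, c):
    return math.tanh(0.5 * h + 0.3 * y_prev + c)

def g(h, y_prev, c):
    # Two-way softmax over the symbols 0 and 1.
    logits = [0.0, h + 0.2 * y_prev + c]
    norm = sum(math.exp(v) for v in logits)
    return [math.exp(v) / norm for v in logits]

source = [1, 0, 1]
# Score every length-2 output sequence for the same source sequence.
scores = {ys: log_prob(source, ys, f_enc, f_dec, g, 0.0, 0)
          for ys in product([0, 1], repeat=2)}
```

Because each step normalizes over the vocabulary, the probabilities of all output sequences of a fixed length sum to one, which is a quick sanity check for a scorer of this shape.<br />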
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
This paper also proposes a new type of hidden unit that is inspired by the LSTM but is much simpler to compute and implement. Fig. 2 shows a graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state <math>\tilde{h}</math>. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically, the unit is defined as follows (<math>\sigma</math> is the logistic sigmoid function, <math>[.]_j</math> denotes the j-th element of a vector, and <math>\odot</math> denotes element-wise multiplication):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is later found to be irrelevant, allowing a more compact representation. On the other hand, the update gate controls how much information from the previous hidden state carries over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref>
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks[C]//Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013: 8624-8628.<br />
</ref><br />
<br />
Because each hidden unit has separate gates, each hidden unit can learn to capture dependencies over different lengths of time (determined by the frequency at which its reset and update gates are active).<br />
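<br />
The update equations for the gated hidden unit can be sketched as a single step function. This is a minimal sketch, assuming <math>\phi</math> is tanh (the equations above leave <math>\phi</math> unspecified), omitting bias terms as the equations do, and using plain Python lists; the names <code>gated_unit_step</code>, <code>matvec</code> and <code>sigmoid</code> are hypothetical helpers, not from the paper.<br />

```python
import math

def sigmoid(v):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-v))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def gated_unit_step(x, h_prev, W, U, W_r, U_r, W_z, U_z):
    """One step of the gated hidden unit, following the equations above."""
    # Reset gate r and update gate z, each element lying in (0, 1).
    r = [sigmoid(a + b) for a, b in zip(matvec(W_r, x), matvec(U_r, h_prev))]
    z = [sigmoid(a + b) for a, b in zip(matvec(W_z, x), matvec(U_z, h_prev))]
    # Candidate state: the reset gate masks the previous state before U is applied.
    masked = [r_j * h_j for r_j, h_j in zip(r, h_prev)]
    h_tilde = [math.tanh(a + b) for a, b in zip(matvec(W, x), matvec(U, masked))]
    # z near 1 keeps the previous state; z near 0 takes the candidate state.
    return [z_j * h_j + (1.0 - z_j) * ht_j
            for z_j, h_j, ht_j in zip(z, h_prev, h_tilde)]
```

With all weights set to zero, both gates equal 0.5 and the candidate state is 0, so the step simply halves the previous state; a large positive entry in <code>W_z</code> drives the update gate towards 1 and the previous state is carried over almost unchanged.<br />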
<br />
=Scoring Phrase Pairs with RNN Encoder–Decoder =<br />
<br />
In a commonly used statistical machine translation (SMT) system, the goal of the system (the decoder, specifically) is to find a translation <math>\mathbf{f}</math> given a source sentence <math>\mathbf{e}</math> that maximizes<br />
<br />
<math>p(\mathbf{f} | \mathbf{e})\propto p(\mathbf{e} | \mathbf{f})p(\mathbf{f})</math><br />
<br />
where the first term on the right-hand side is called the translation model and the second the language model (see, e.g., Koehn, 2005). In practice, however, most SMT systems model <math>\log p(\mathbf{f} | \mathbf{e})</math> as a log-linear model with additional features and corresponding weights:<br />
<br />
<math>\log p(\mathbf{f} | \mathbf{e})=\sum_{n=1}^Nw_nf_n(\mathbf{f},\mathbf{e})+\log Z(\mathbf{e})</math><br />
<br />
where <math>f_n</math> and <math>w_n</math> are the n-th feature and weight, respectively. <math>Z(\mathbf{e})</math> is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.<br />
<br />
Cho et al. propose to train the RNN Encoder–Decoder on a table of phrase pairs and use its scores as additional features in the log-linear model shown above when tuning the SMT decoder. When training the RNN Encoder–Decoder, phrase frequency is ignored for several reasons: to reduce computation time, to ensure the model does not simply rank phrases by frequency, and because frequency information is already encoded in the SMT features (so it is better not to spend the capacity of the RNN Encoder–Decoder on it redundantly).<br />
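<br />
To make the role of the extra feature concrete, here is a small sketch of log-linear scoring with an added RNN Encoder–Decoder score. The feature names, weights and feature values are made-up illustrations, not numbers from the paper; the normalization term <math>\log Z(\mathbf{e})</math> is left out because it is the same for every candidate translation of a given source and so does not affect the ranking.<br />

```python
def loglinear_score(features, weights):
    """Unnormalized log-linear score: sum over n of w_n * f_n(f, e)."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical weights; in practice they are tuned on a development set.
weights = {"translation_model": 1.0, "language_model": 0.5, "rnn_encdec": 0.8}

# Hypothetical log-domain feature values for two candidate translations.
candidates = {
    "candidate_a": {"translation_model": -2.0, "language_model": -4.0, "rnn_encdec": -1.0},
    "candidate_b": {"translation_model": -1.5, "language_model": -4.0, "rnn_encdec": -3.0},
}

scores = {name: loglinear_score(feats, weights) for name, feats in candidates.items()}
best = max(scores, key=scores.get)
```

In this toy setting the second candidate would win on the translation-model and language-model features alone, but the additional RNN feature reverses the ranking, which is the kind of effect an added feature can have on the decoder's choice.<br />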
<br />
=Alternative Models=<br />
The researchers noted a number of other potential translation models and their usability.<br />
<br />
The first model is by Schwenk and applies a variant of the continuous space language model to machine translation. The model is essentially a feedforward neural network with a common projection for input words encoded as bag-of-words vectors. Schwenk fixed the input and output sentence lengths; for a given fixed length, the neural network estimates the probability of the output sequence of words and scores potential translations. However, a major disadvantage is that, because the input and output lengths are fixed, the model cannot handle variable-length inputs or outputs.<br />
<br />
The model figure<ref>
[Schwenk2012] Holger Schwenk. 2012. Continuous space translation models for phrase-based statistical machine translation. In Martin Kay and Christian Boitet, editors, Proceedings of the 24th International Conference on Computational Linguistics (COLING), pages 1071–1080.<br />
</ref>:<br />
<br />
[[File:CONTINUOUS.PNG]]<br />
<br />
Another model, similar to Schwenk's, is by Devlin and also uses a feedforward neural network. Rather than estimating the probability of the entire output sequence of words as in Schwenk's model, Devlin's model estimates only the probability of the next word, using both a portion of the input sentence and a portion of the output sentence. However, like Schwenk's model, it fixes the length of the input prior to training.<br />
<br />
=Experiments =<br />
<br />
The model is evaluated on the English/French translation task of the WMT’14 workshop. In building the model, Cho et al. used a baseline phrase-based SMT system and a neural language model (CSLM)<ref>
Schwenk H, Costa-Jussa M R, Fonollosa J A R. Continuous space language models for the IWSLT 2006 task[C]//IWSLT. 2006: 166-173.<br />
</ref><br />
<br />
They tried the following combinations:<br />
1. Baseline configuration<br />
2. Baseline + RNN<br />
3. Baseline + CSLM + RNN<br />
4. Baseline + CSLM + RNN + Word penalty<br />
<br />
Results:<br />
<br />
<center><br />
[[File:encdec3.png |frame | center |Fig 3. BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, which penalizes the number of words unknown to the neural networks. ]]<br />
</center><br />
<br />
The best performance was achieved when both the CSLM and the phrase scores from the RNN Encoder–Decoder were used. This suggests that the contributions of the CSLM and the RNN Encoder–Decoder are not too correlated and that one can expect better results by improving each method independently.
<br />
<br />
== Word and Phrase Representations ==<br />
<br />
As the presented model maps sentences into a continuous space vector and prior continuous space language models have been known to learn semantically meaningful embeddings, one could expect this to happen for the presented model, too. This is indeed the case. When projecting to a 2D space (with Barnes-Hut-SNE), semantically similar words are clearly clustered.<br />
<br />
[[File:Fig4.png]]<br />
<br />
Phrases are also clustered capturing both semantic and syntactic structures.<br />
<br />
[[File:Fig5.png]]<br />
<br />
= References=<br />
<references /></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations&diff=27177learning Phrase Representations2015-12-12T17:40:15Z<p>Rtwang: /* Alternative Models */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, . . . , y_{T'} | x_1, . . . , x_T )</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes.<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state <math>h_t</math>. However, as shown in Fig. 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by<br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,y_{t-1},\mathbf{c})</math> <br/>
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of the model parameters and each <math>(\mathbf{x}_n,\mathbf{y}_n)</math> is an (input sequence, output sequence) pair from the training set. Since the output of the decoder, computed starting from the input, is differentiable, the model parameters can be estimated by a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, the model can be used in two ways: to generate a target sequence given an input sequence, or to score a given pair of input and output sequences.<br />
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
This paper also proposes a new type of hidden unit that is inspired by the LSTM but is much simpler to compute and implement. Fig. 2 shows a graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state <math>\tilde{h}</math>. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically, the unit is defined as follows (<math>\sigma</math> is the logistic sigmoid function, <math>[.]_j</math> denotes the j-th element of a vector, and <math>\odot</math> denotes element-wise multiplication):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is later found to be irrelevant, allowing a more compact representation. On the other hand, the update gate controls how much information from the previous hidden state carries over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref>
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks[C]//Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013: 8624-8628.<br />
</ref><br />
<br />
Because each hidden unit has separate gates, each hidden unit can learn to capture dependencies over different lengths of time (determined by the frequency at which its reset and update gates are active).<br />
<br />
=Scoring Phrase Pairs with RNN Encoder–Decoder =<br />
<br />
In a commonly used statistical machine translation (SMT) system, the goal of the system (the decoder, specifically) is to find a translation <math>\mathbf{f}</math> given a source sentence <math>\mathbf{e}</math> that maximizes<br />
<br />
<math>p(\mathbf{f} | \mathbf{e})\propto p(\mathbf{e} | \mathbf{f})p(\mathbf{f})</math><br />
<br />
where the first term on the right-hand side is called the translation model and the second the language model (see, e.g., Koehn, 2005). In practice, however, most SMT systems model <math>\log p(\mathbf{f} | \mathbf{e})</math> as a log-linear model with additional features and corresponding weights:<br />
<br />
<math>\log p(\mathbf{f} | \mathbf{e})=\sum_{n=1}^Nw_nf_n(\mathbf{f},\mathbf{e})+\log Z(\mathbf{e})</math><br />
<br />
where <math>f_n</math> and <math>w_n</math> are the n-th feature and weight, respectively. <math>Z(\mathbf{e})</math> is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.<br />
<br />
Cho et al. propose to train the RNN Encoder–Decoder on a table of phrase pairs and use its scores as additional features in the log-linear model shown above when tuning the SMT decoder. When training the RNN Encoder–Decoder, phrase frequency is ignored for several reasons: to reduce computation time, to ensure the model does not simply rank phrases by frequency, and because frequency information is already encoded in the SMT features (so it is better not to spend the capacity of the RNN Encoder–Decoder on it redundantly).<br />
<br />
=Alternative Models=<br />
The researchers noted a number of other potential translation models and their usability.<br />
<br />
The first model is by Schwenk and applies a variant of the continuous space language model to machine translation. The model is essentially a feedforward neural network with a common projection for input words encoded as bag-of-words vectors. Schwenk fixed the input and output sentence lengths; for a given fixed length, the neural network estimates the probability of the output sequence of words and scores potential translations. However, a major disadvantage is that, because the input and output lengths are fixed, the model cannot handle variable-length inputs or outputs.<br />
<br />
The model figure<ref>
[Schwenk2012] Holger Schwenk. 2012. Continuous space translation models for phrase-based statistical machine translation. In Martin Kay and Christian Boitet, editors, Proceedings of the 24th International Conference on Computational Linguistics (COLING), pages 1071–1080.<br />
</ref>:<br />
<br />
[[File:CONTINUOUS.PNG]]<br />
<br />
=Experiments =<br />
<br />
The model is evaluated on the English/French translation task of the WMT’14 workshop. In building the model, Cho et al. used a baseline phrase-based SMT system and a neural language model (CSLM)<ref>
Schwenk H, Costa-Jussa M R, Fonollosa J A R. Continuous space language models for the IWSLT 2006 task[C]//IWSLT. 2006: 166-173.<br />
</ref><br />
<br />
They tried the following combinations:<br />
1. Baseline configuration<br />
2. Baseline + RNN<br />
3. Baseline + CSLM + RNN<br />
4. Baseline + CSLM + RNN + Word penalty<br />
<br />
Results:<br />
<br />
<center><br />
[[File:encdec3.png |frame | center |Fig 3. BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, which penalizes the number of words unknown to the neural networks. ]]<br />
</center><br />
<br />
The best performance was achieved when both the CSLM and the phrase scores from the RNN Encoder–Decoder were used. This suggests that the contributions of the CSLM and the RNN Encoder–Decoder are not too correlated and that one can expect better results by improving each method independently.
<br />
<br />
== Word and Phrase Representations ==<br />
<br />
As the presented model maps sentences into a continuous space vector and prior continuous space language models have been known to learn semantically meaningful embeddings, one could expect this to happen for the presented model, too. This is indeed the case. When projecting to a 2D space (with Barnes-Hut-SNE), semantically similar words are clearly clustered.<br />
<br />
[[File:Fig4.png]]<br />
<br />
Phrases are also clustered capturing both semantic and syntactic structures.<br />
<br />
[[File:Fig5.png]]<br />
<br />
= References=<br />
<references /></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations&diff=27176learning Phrase Representations2015-12-12T17:39:57Z<p>Rtwang: /* Alternative Models */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, . . . , y_{T'} | x_1, . . . , x_T )</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes.<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state <math>h_t</math>. However, as shown in Fig. 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by<br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,y_{t-1},\mathbf{c})</math> <br/>
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of the model parameters and each <math>(\mathbf{x}_n,\mathbf{y}_n)</math> is an (input sequence, output sequence) pair from the training set. Since the output of the decoder, computed starting from the input, is differentiable, the model parameters can be estimated by a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, the model can be used in two ways: to generate a target sequence given an input sequence, or to score a given pair of input and output sequences.<br />
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
This paper also proposes a new type of hidden unit that is inspired by the LSTM but is much simpler to compute and implement. Fig. 2 shows a graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state <math>\tilde{h}</math>. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically, the unit is defined as follows (<math>\sigma</math> is the logistic sigmoid function, <math>[.]_j</math> denotes the j-th element of a vector, and <math>\odot</math> denotes element-wise multiplication):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is later found to be irrelevant, allowing a more compact representation. On the other hand, the update gate controls how much information from the previous hidden state carries over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref>
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks[C]//Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013: 8624-8628.<br />
</ref><br />
<br />
Because each hidden unit has separate gates, each hidden unit can learn to capture dependencies over different lengths of time (determined by the frequency at which its reset and update gates are active).<br />
<br />
=Scoring Phrase Pairs with RNN Encoder–Decoder =<br />
<br />
In a commonly used statistical machine translation (SMT) system, the goal of the system (the decoder, specifically) is to find a translation <math>\mathbf{f}</math> given a source sentence <math>\mathbf{e}</math> that maximizes<br />
<br />
<math>p(\mathbf{f} | \mathbf{e})\propto p(\mathbf{e} | \mathbf{f})p(\mathbf{f})</math><br />
<br />
where the first term on the right-hand side is called the translation model and the second the language model (see, e.g., Koehn, 2005). In practice, however, most SMT systems model <math>\log p(\mathbf{f} | \mathbf{e})</math> as a log-linear model with additional features and corresponding weights:<br />
<br />
<math>\log p(\mathbf{f} | \mathbf{e})=\sum_{n=1}^Nw_nf_n(\mathbf{f},\mathbf{e})+\log Z(\mathbf{e})</math><br />
<br />
where <math>f_n</math> and <math>w_n</math> are the n-th feature and weight, respectively. <math>Z(\mathbf{e})</math> is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.<br />
<br />
Cho et al. propose to train the RNN Encoder–Decoder on a table of phrase pairs and use its scores as additional features in the log-linear model shown above when tuning the SMT decoder. When training the RNN Encoder–Decoder, phrase frequency is ignored for several reasons: to reduce computation time, to ensure the model does not simply rank phrases by frequency, and because frequency information is already encoded in the SMT features (so it is better not to spend the capacity of the RNN Encoder–Decoder on it redundantly).<br />
<br />
=Alternative Models=<br />
The researchers noted a number of other potential translation models and their usability.<br />
<br />
The first model is by Schwenk and applies a variant of the continuous space language model to machine translation. The model is essentially a feedforward neural network with a common projection for input words encoded as bag-of-words vectors. Schwenk fixed the input and output sentence lengths; for a given fixed length, the neural network estimates the probability of the output sequence of words and scores potential translations. However, a major disadvantage is that, because the input and output lengths are fixed, the model cannot handle variable-length inputs or outputs.<br />
<br />
The model figure<ref>
[Schwenk2012] Holger Schwenk. 2012. Continuous space translation models for phrase-based statistical machine translation. In Martin Kay and Christian Boitet, editors, Proceedings of the 24th International Conference on Computational Linguistics (COLING), pages 1071–1080.<br />
</ref>:<br />
[[File:CONTINUOUS.PNG]]<br />
<br />
=Experiments =<br />
<br />
The model is evaluated on the English/French translation task of the WMT’14 workshop. In building the model, Cho et al. used a baseline phrase-based SMT system and a neural language model (CSLM)<ref>
Schwenk H, Costa-Jussa M R, Fonollosa J A R. Continuous space language models for the IWSLT 2006 task[C]//IWSLT. 2006: 166-173.<br />
</ref><br />
<br />
They tried the following combinations:<br />
1. Baseline configuration<br />
2. Baseline + RNN<br />
3. Baseline + CSLM + RNN<br />
4. Baseline + CSLM + RNN + Word penalty<br />
<br />
Results:<br />
<br />
<center><br />
[[File:encdec3.png |frame | center |Fig 3. BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, which penalizes the number of words unknown to the neural networks. ]]<br />
</center><br />
<br />
The best performance was achieved when both the CSLM and the phrase scores from the RNN Encoder–Decoder were used. This suggests that the contributions of the CSLM and the RNN Encoder–Decoder are not too correlated and that one can expect better results by improving each method independently.
<br />
<br />
== Word and Phrase Representations ==<br />
<br />
As the presented model maps sentences into a continuous space vector and prior continuous space language models have been known to learn semantically meaningful embeddings, one could expect this to happen for the presented model, too. This is indeed the case. When projecting to a 2D space (with Barnes-Hut-SNE), semantically similar words are clearly clustered.<br />
<br />
[[File:Fig4.png]]<br />
<br />
Phrases are also clustered capturing both semantic and syntactic structures.<br />
<br />
[[File:Fig5.png]]<br />
<br />
= References=<br />
<references /></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations&diff=27175learning Phrase Representations2015-12-12T17:39:35Z<p>Rtwang: /* References */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, . . . , y_{T'} | x_1, . . . , x_T )</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes.<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN, trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state <math>h_t</math>. However, as shown in Fig. 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by<br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,y_{t-1},\mathbf{c})</math> <br/><br />
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of model parameters and each <math>(\mathbf{x}_n,\mathbf{y}_n)</math> is an (input sequence, output sequence) pair from the training set. Since the output of the decoder, computed starting from the input, is differentiable, the model parameters can be estimated with a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, it can be used in two ways: to generate a target sequence given an input sequence, or to score a given pair of input and output sequences, where the score is the probability <math>p_\mathbf{\theta}(\mathbf{y}|\mathbf{x})</math>.<br />
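As a rough illustration of the scoring use case, the sketch below sums the per-step log-probabilities of the target symbols, following the factorization of the conditional distribution given above. The function name <code>sequence_log_prob</code> and the toy probability vectors are assumptions made for this example; in a real system the per-step distributions would come from the trained decoder.<br />

```python
import numpy as np

def sequence_log_prob(step_probs, target_ids):
    """Sum log P(y_t | y_<t, c) over the target sequence.

    step_probs: one probability vector over the vocabulary per time step
                (hypothetical decoder outputs, assumed already normalized).
    target_ids: index of the reference symbol at each time step.
    """
    return float(sum(np.log(p[y]) for p, y in zip(step_probs, target_ids)))

# Toy example: a 3-symbol vocabulary and a 2-symbol target sequence.
step_probs = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
target_ids = [0, 1]
score = sequence_log_prob(step_probs, target_ids)  # log(0.7) + log(0.8)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.<br />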
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
The paper also proposes a new type of hidden unit, inspired by the LSTM unit but much simpler to compute and implement. Fig. 2 shows a graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h˜. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically, the unit is defined as follows (<math>\sigma</math> is the logistic sigmoid function, <math>[.]_j</math> denotes the j-th element of a vector, and <math>\odot</math> denotes elementwise multiplication):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
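The four equations above can be sketched in vectorized form as follows. This is a minimal NumPy illustration and not the authors' implementation: the function name <code>gated_unit_step</code>, the choice of <code>tanh</code> for <math>\phi</math>, the random initialization, and the dimensions are all assumptions made for the example.<br />

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_unit_step(x, h_prev, params, phi=np.tanh):
    """One time step of the gated hidden unit (illustrative names)."""
    W_r, U_r, W_z, U_z, W, U = params
    r = sigmoid(W_r @ x + U_r @ h_prev)        # reset gate
    z = sigmoid(W_z @ x + U_z @ h_prev)        # update gate
    h_tilde = phi(W @ x + U @ (r * h_prev))    # candidate hidden state
    return z * h_prev + (1.0 - z) * h_tilde    # interpolate old and new state

# Toy run: 4-dimensional inputs, 3 hidden units, a sequence of 5 symbols.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
shapes = [(d_h, d_in), (d_h, d_h)] * 3         # W_r, U_r, W_z, U_z, W, U
params = tuple(rng.normal(scale=0.1, size=s) for s in shapes)
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):
    h = gated_unit_step(x, h, params)
```

Because the new state is a convex combination of the previous state and a <code>tanh</code> output, every component of <code>h</code> stays strictly inside (-1, 1) when the initial state is zero.<br />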
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This allows the hidden state to drop any information that is found to be irrelevant later on, allowing a more compact representation. The update gate, on the other hand, controls how much information from the previous hidden state carries over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref>
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks[C]//Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013: 8624-8628.<br />
</ref><br />
<br />
Because each hidden unit has separate gates, each hidden unit can learn to capture dependencies over different time scales (determined by how frequently its reset and update gates are active).<br />
<br />
=Scoring Phrase Pairs with RNN Encoder–Decoder =<br />
<br />
In a commonly used statistical machine translation system (SMT), the goal of the system (decoder, specifically) is to find a translation f given a source sentence e, which maximizes<br />
<br />
<math>p(\mathbf{f} | \mathbf{e})\propto p(\mathbf{e} | \mathbf{f})p(\mathbf{f})</math><br />
<br />
where the first term on the right-hand side is called the translation model and the second the language model (see, e.g., Koehn, 2005). In practice, however, most SMT systems model <math>\log p(\mathbf{f} | \mathbf{e})</math> as a log-linear model with additional features and corresponding weights:<br />
<br />
<math>\log p(\mathbf{f} | \mathbf{e})=\sum_{n=1}^Nw_nf_n(\mathbf{f},\mathbf{e})+\log Z(\mathbf{e})</math><br />
<br />
where <math>f_n</math> and <math>w_n</math> are the n-th feature and weight, respectively. <math>Z(\mathbf{e})</math> is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.<br />
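As a small worked example of the log-linear model, the sketch below ranks two candidate translations of the same source sentence by their weighted feature sums. Since <math>Z(\mathbf{e})</math> depends only on the source sentence, it is the same for every candidate and can be left out when ranking. The weights, the feature values, and the candidate names are made up for illustration.<br />

```python
import numpy as np

def loglinear_score(weights, features):
    """Unnormalized log-linear score: sum_n w_n * f_n(f, e)."""
    return float(np.dot(weights, features))

# Hypothetical weights and feature values (e.g. log-probabilities from
# a translation model, a language model, and one additional feature).
weights = np.array([1.0, 0.5, 0.2])
candidates = {
    "candidate A": np.array([-2.0, -1.0, -4.0]),
    "candidate B": np.array([-1.5, -3.0, -2.0]),
}
scores = {name: loglinear_score(weights, f) for name, f in candidates.items()}
best = max(scores, key=scores.get)  # candidate with the highest score
```

Adding the RNN Encoder–Decoder score as a feature amounts to appending one more entry to each feature vector and one more weight to be tuned.<br />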
<br />
Cho et al. propose to train the RNN Encoder–Decoder on a table of phrase pairs and use its scores as additional features in the log-linear model shown above when tuning the SMT decoder. When training the RNN Encoder–Decoder, phrase frequency is ignored for several reasons: to reduce computation time, to ensure the model does not simply rank phrases by frequency, and because frequency information is already encoded in the existing SMT features (so it is better not to spend the capacity of the RNN Encoder–Decoder on it redundantly).<br />
<br />
=Alternative Models=<br />
The researchers noted a number of other potential translation models and their usability.<br />
<br />
The first model, by Schwenk, applies a variant of the continuous space language model to machine translation. The model is essentially a feedforward neural network with a common projection for input words encoded as bag-of-words vectors. Schwenk fixed the input and output lengths; for these fixed lengths, the network estimates the probability of the output word sequence and scores potential translations. A major disadvantage is that the model cannot handle variable-length inputs or outputs.<br />
<br />
The model figure:<br />
[[File:CONTINUOUS.PNG]]<br />
<br />
=Experiments =<br />
<br />
The model is evaluated on the English/French translation task of the WMT’14 workshop. In building the model, Cho et al. used a baseline phrase-based SMT system and a neural language model (CSLM).<ref>
Schwenk H, Costa-Jussa M R, Fonollosa J A R. Continuous space language models for the IWSLT 2006 task[C]//IWSLT. 2006: 166-173.<br />
</ref><br />
<br />
They tried the following combinations:<br />
1. Baseline configuration<br />
2. Baseline + RNN<br />
3. Baseline + CSLM + RNN<br />
4. Baseline + CSLM + RNN + Word penalty<br />
<br />
Results:<br />
<br />
<center><br />
[[File:encdec3.png |frame | center |Fig 3. BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, which penalizes the number of words unknown to the neural networks. ]]<br />
</center><br />
<br />
The best performance was achieved when both the CSLM and the phrase scores from the RNN Encoder–Decoder were used. This suggests that the contributions of the CSLM and the RNN Encoder–Decoder are not too correlated and that one can expect better results by improving each method independently.<br />
<br />
<br />
== Word and Phrase Representations ==<br />
<br />
As the presented model maps sentences into a continuous space vector and prior continuous space language models have been known to learn semantically meaningful embeddings, one could expect this to happen for the presented model, too. This is indeed the case. When projecting to a 2D space (with Barnes-Hut-SNE), semantically similar words are clearly clustered.<br />
<br />
[[File:Fig4.png]]<br />
<br />
Phrases are also clustered capturing both semantic and syntactic structures.<br />
<br />
[[File:Fig5.png]]<br />
<br />
= References=<br />
<references /></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations&diff=27174learning Phrase Representations2015-12-12T17:39:21Z<p>Rtwang: /* References */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, . . . , y_{T'} | x_1, . . . , x_T )</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes.<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN, trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state <math>h_t</math>. However, as shown in Fig. 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by<br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,y_{t-1},\mathbf{c})</math> <br/><br />
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of model parameters and each <math>(\mathbf{x}_n,\mathbf{y}_n)</math> is an (input sequence, output sequence) pair from the training set. Since the output of the decoder, computed starting from the input, is differentiable, the model parameters can be estimated with a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, it can be used in two ways: to generate a target sequence given an input sequence, or to score a given pair of input and output sequences, where the score is the probability <math>p_\mathbf{\theta}(\mathbf{y}|\mathbf{x})</math>.<br />
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
The paper also proposes a new type of hidden unit, inspired by the LSTM unit but much simpler to compute and implement. Fig. 2 shows a graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h˜. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically, the unit is defined as follows (<math>\sigma</math> is the logistic sigmoid function, <math>[.]_j</math> denotes the j-th element of a vector, and <math>\odot</math> denotes elementwise multiplication):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This allows the hidden state to drop any information that is found to be irrelevant later on, allowing a more compact representation. The update gate, on the other hand, controls how much information from the previous hidden state carries over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref>
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks[C]//Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013: 8624-8628.<br />
</ref><br />
<br />
Because each hidden unit has separate gates, each hidden unit can learn to capture dependencies over different time scales (determined by how frequently its reset and update gates are active).<br />
<br />
=Scoring Phrase Pairs with RNN Encoder–Decoder =<br />
<br />
In a commonly used statistical machine translation system (SMT), the goal of the system (decoder, specifically) is to find a translation f given a source sentence e, which maximizes<br />
<br />
<math>p(\mathbf{f} | \mathbf{e})\propto p(\mathbf{e} | \mathbf{f})p(\mathbf{f})</math><br />
<br />
where the first term on the right-hand side is called the translation model and the second the language model (see, e.g., Koehn, 2005). In practice, however, most SMT systems model <math>\log p(\mathbf{f} | \mathbf{e})</math> as a log-linear model with additional features and corresponding weights:<br />
<br />
<math>\log p(\mathbf{f} | \mathbf{e})=\sum_{n=1}^Nw_nf_n(\mathbf{f},\mathbf{e})+\log Z(\mathbf{e})</math><br />
<br />
where <math>f_n</math> and <math>w_n</math> are the n-th feature and weight, respectively. <math>Z(\mathbf{e})</math> is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.<br />
<br />
Cho et al. propose to train the RNN Encoder–Decoder on a table of phrase pairs and use its scores as additional features in the log-linear model shown above when tuning the SMT decoder. When training the RNN Encoder–Decoder, phrase frequency is ignored for several reasons: to reduce computation time, to ensure the model does not simply rank phrases by frequency, and because frequency information is already encoded in the existing SMT features (so it is better not to spend the capacity of the RNN Encoder–Decoder on it redundantly).<br />
<br />
=Alternative Models=<br />
The researchers noted a number of other potential translation models and their usability.<br />
<br />
The first model, by Schwenk, applies a variant of the continuous space language model to machine translation. The model is essentially a feedforward neural network with a common projection for input words encoded as bag-of-words vectors. Schwenk fixed the input and output lengths; for these fixed lengths, the network estimates the probability of the output word sequence and scores potential translations. A major disadvantage is that the model cannot handle variable-length inputs or outputs.<br />
<br />
The model figure:<br />
[[File:CONTINUOUS.PNG]]<br />
<br />
=Experiments =<br />
<br />
The model is evaluated on the English/French translation task of the WMT’14 workshop. In building the model, Cho et al. used a baseline phrase-based SMT system and a neural language model (CSLM).<ref>
Schwenk H, Costa-Jussa M R, Fonollosa J A R. Continuous space language models for the IWSLT 2006 task[C]//IWSLT. 2006: 166-173.<br />
</ref><br />
<br />
They tried the following combinations:<br />
1. Baseline configuration<br />
2. Baseline + RNN<br />
3. Baseline + CSLM + RNN<br />
4. Baseline + CSLM + RNN + Word penalty<br />
<br />
Results:<br />
<br />
<center><br />
[[File:encdec3.png |frame | center |Fig 3. BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, which penalizes the number of words unknown to the neural networks. ]]<br />
</center><br />
<br />
The best performance was achieved when both the CSLM and the phrase scores from the RNN Encoder–Decoder were used. This suggests that the contributions of the CSLM and the RNN Encoder–Decoder are not too correlated and that one can expect better results by improving each method independently.<br />
<br />
<br />
== Word and Phrase Representations ==<br />
<br />
As the presented model maps sentences into a continuous space vector and prior continuous space language models have been known to learn semantically meaningful embeddings, one could expect this to happen for the presented model, too. This is indeed the case. When projecting to a 2D space (with Barnes-Hut-SNE), semantically similar words are clearly clustered.<br />
<br />
[[File:Fig4.png]]<br />
<br />
Phrases are also clustered capturing both semantic and syntactic structures.<br />
<br />
[[File:Fig5.png]]<br />
<br />
= References=<br />
<references /><br />
[Schwenk2012] Holger Schwenk. 2012. Continuous space translation models for phrase-based statistical machine translation. In Martin Kay and Christian Boitet, editors, Proceedings of the 24th International Conference on Computational Linguistics (COLING), pages 1071–1080.</div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations&diff=27173learning Phrase Representations2015-12-12T17:39:11Z<p>Rtwang: /* Alternative Models */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, . . . , y_{T'} | x_1, . . . , x_T )</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes.<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN, trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state <math>h_t</math>. However, as shown in Fig. 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by<br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,y_{t-1},\mathbf{c})</math> <br/><br />
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of model parameters and each <math>(\mathbf{x}_n,\mathbf{y}_n)</math> is an (input sequence, output sequence) pair from the training set. Since the output of the decoder, computed starting from the input, is differentiable, the model parameters can be estimated with a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, it can be used in two ways: to generate a target sequence given an input sequence, or to score a given pair of input and output sequences, where the score is the probability <math>p_\mathbf{\theta}(\mathbf{y}|\mathbf{x})</math>.<br />
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
The paper also proposes a new type of hidden unit, inspired by the LSTM unit but much simpler to compute and implement. Fig. 2 shows a graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h˜. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically, the unit is defined as follows (<math>\sigma</math> is the logistic sigmoid function, <math>[.]_j</math> denotes the j-th element of a vector, and <math>\odot</math> denotes elementwise multiplication):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This allows the hidden state to drop any information that is found to be irrelevant later on, allowing a more compact representation. The update gate, on the other hand, controls how much information from the previous hidden state carries over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref>
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks[C]//Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013: 8624-8628.<br />
</ref><br />
<br />
Because each hidden unit has separate gates, each hidden unit can learn to capture dependencies over different time scales (determined by how frequently its reset and update gates are active).<br />
<br />
=Scoring Phrase Pairs with RNN Encoder–Decoder =<br />
<br />
In a commonly used statistical machine translation system (SMT), the goal of the system (decoder, specifically) is to find a translation f given a source sentence e, which maximizes<br />
<br />
<math>p(\mathbf{f} | \mathbf{e})\propto p(\mathbf{e} | \mathbf{f})p(\mathbf{f})</math><br />
<br />
where the first term on the right-hand side is called the translation model and the second the language model (see, e.g., Koehn, 2005). In practice, however, most SMT systems model <math>\log p(\mathbf{f} | \mathbf{e})</math> as a log-linear model with additional features and corresponding weights:<br />
<br />
<math>\log p(\mathbf{f} | \mathbf{e})=\sum_{n=1}^Nw_nf_n(\mathbf{f},\mathbf{e})+\log Z(\mathbf{e})</math><br />
<br />
where <math>f_n</math> and <math>w_n</math> are the n-th feature and weight, respectively. <math>Z(\mathbf{e})</math> is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.<br />
<br />
Cho et al. propose to train the RNN Encoder–Decoder on a table of phrase pairs and use its scores as additional features in the log-linear model shown above when tuning the SMT decoder. When training the RNN Encoder–Decoder, phrase frequency is ignored for several reasons: to reduce computation time, to ensure the model does not simply rank phrases by frequency, and because frequency information is already encoded in the existing SMT features (so it is better not to spend the capacity of the RNN Encoder–Decoder on it redundantly).<br />
<br />
=Alternative Models=<br />
The researchers noted a number of other potential translation models and their usability.<br />
<br />
The first model, by Schwenk, applies a variant of the continuous space language model to machine translation. The model is essentially a feedforward neural network with a common projection for input words encoded as bag-of-words vectors. Schwenk fixed the input and output lengths; for these fixed lengths, the network estimates the probability of the output word sequence and scores potential translations. A major disadvantage is that the model cannot handle variable-length inputs or outputs.<br />
<br />
The model figure:<br />
[[File:CONTINUOUS.PNG]]<br />
<br />
=Experiments =<br />
<br />
The model is evaluated on the English/French translation task of the WMT’14 workshop. In building the model, Cho et al. used a baseline phrase-based SMT system and a neural language model (CSLM).<ref>
Schwenk H, Costa-Jussa M R, Fonollosa J A R. Continuous space language models for the IWSLT 2006 task[C]//IWSLT. 2006: 166-173.<br />
</ref><br />
<br />
They tried the following combinations:<br />
1. Baseline configuration<br />
2. Baseline + RNN<br />
3. Baseline + CSLM + RNN<br />
4. Baseline + CSLM + RNN + Word penalty<br />
<br />
Results:<br />
<br />
<center><br />
[[File:encdec3.png |frame | center |Fig 3. BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, which penalizes the number of words unknown to the neural networks. ]]<br />
</center><br />
<br />
The best performance was achieved when both the CSLM and the phrase scores from the RNN Encoder–Decoder were used. This suggests that the contributions of the CSLM and the RNN Encoder–Decoder are not too correlated and that one can expect better results by improving each method independently.<br />
<br />
<br />
== Word and Phrase Representations ==<br />
<br />
As the presented model maps sentences into a continuous space vector and prior continuous space language models have been known to learn semantically meaningful embeddings, one could expect this to happen for the presented model, too. This is indeed the case. When projecting to a 2D space (with Barnes-Hut-SNE), semantically similar words are clearly clustered.<br />
<br />
[[File:Fig4.png]]<br />
<br />
Phrases are also clustered capturing both semantic and syntactic structures.<br />
<br />
[[File:Fig5.png]]<br />
<br />
= References=<br />
<references /></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:CONTINUOUS.PNG&diff=27172File:CONTINUOUS.PNG2015-12-12T17:38:31Z<p>Rtwang: </p>
<hr />
<div></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations&diff=27171learning Phrase Representations2015-12-12T17:27:49Z<p>Rtwang: </p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, . . . , y_{T'} | x_1, . . . , x_T )</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes.<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state <math>h_t</math>. However, as shown in Figure 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by<br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,y_{t-1},\mathbf{c})</math> <br/><br />
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of the model parameters and each <math>(\mathbf{x}_n,\mathbf{y}_n)</math> is an (input sequence, output sequence) pair from the training set. Since the output of the decoder, starting from the input, is differentiable, the model parameters can be estimated by a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, the model can be used in two ways. One way is to use the model to generate a target sequence given an input sequence. The other is to use the model to score a given pair of input and output sequences.<br />
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
This paper also proposes a new type of hidden unit that is inspired by the LSTM but is much simpler to compute and implement. Fig. 2 shows a graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h˜. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically, the unit is defined as follows (<math>\sigma</math> is the logistic sigmoid function, <math>[.]_j</math> denotes the j-th element of a vector and <math>\odot</math> denotes element-wise multiplication):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is later found to be irrelevant, thus allowing a more compact representation. On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref><br />
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8624–8628. IEEE, 2013.<br />
</ref><br />
<br />
Because each hidden unit has separate gates, it is possible for each hidden unit to learn to capture dependencies over different lengths of time (determined by the frequency at which its reset and update gates are active).<br />
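<br />
The equations above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the choice <math>\phi=\tanh</math>, the helper names <code>gated_unit_step</code> and <code>init_params</code>, the weight scale and the toy dimensions are all assumptions made for the example.<br />

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_unit_step(x, h_prev, params):
    """One step of the gated hidden unit, following the equations above."""
    W_r, U_r, W_z, U_z, W, U = params
    r = sigmoid(W_r @ x + U_r @ h_prev)          # reset gate
    z = sigmoid(W_z @ x + U_z @ h_prev)          # update gate
    # Candidate state; phi = tanh is an assumption of this sketch.
    h_tilde = np.tanh(W @ x + U @ (r * h_prev))
    # z close to 1 keeps the old state, z close to 0 takes the candidate.
    return z * h_prev + (1.0 - z) * h_tilde

def init_params(n_in, n_hid, seed=0):
    """Hypothetical small random initialisation, for illustration only."""
    rng = np.random.default_rng(seed)
    shapes = [(n_hid, n_in), (n_hid, n_hid)] * 3   # W_r, U_r, W_z, U_z, W, U
    return [0.1 * rng.standard_normal(s) for s in shapes]

params = init_params(n_in=3, n_hid=4)
h = np.zeros(4)
for x in np.eye(3):        # three one-hot toy inputs
    h = gated_unit_step(x, h, params)
```

Because the new state is a convex combination of the previous state and a tanh output, every component of <code>h</code> stays inside (-1, 1) when the initial state is zero.<br />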
<br />
=Scoring Phrase Pairs with RNN Encoder–Decoder =<br />
<br />
In a commonly used statistical machine translation system (SMT), the goal of the system (decoder, specifically) is to find a translation f given a source sentence e, which maximizes<br />
<br />
<math>p(\mathbf{f} | \mathbf{e})\propto p(\mathbf{e} | \mathbf{f})p(\mathbf{f})</math><br />
<br />
where the first term on the right-hand side is called the translation model and the latter the language model (see, e.g., (Koehn, 2005)). In practice, however, most SMT systems model <math>\log p(\mathbf{f} | \mathbf{e})</math> as a log-linear model with additional features and corresponding weights:<br />
<br />
<math>\log p(\mathbf{f} | \mathbf{e})=\sum_{n=1}^Nw_nf_n(\mathbf{f},\mathbf{e})+\log Z(\mathbf{e})</math><br />
<br />
where <math>f_n</math> and <math>w_n</math> are the n-th feature and weight, respectively. <math>Z(\mathbf{e})</math> is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.<br />
<br />
Cho et al. propose to train the RNN Encoder–Decoder on a table of phrase pairs and use its scores as additional features in the log-linear model shown above when tuning the SMT decoder. For training the RNN Encoder–Decoder, phrase frequency is ignored for several reasons: to reduce computation time, to ensure the model does not simply rank phrases by frequency, and because frequency information is already encoded in the features of the SMT system (so it is better not to use the capacity of the RNN Encoder–Decoder redundantly).<br />
<br />
=Alternative Models=<br />
<br />
=Experiments =<br />
<br />
The model is evaluated on the English/French translation task of the WMT’14 workshop. In building the model, Cho et al. used a baseline phrase-based SMT system and a neural language model (CSLM)<ref><br />
Schwenk H, Costa-Jussa M R, Fonollosa J A R. Continuous space language models for the IWSLT 2006 task. In IWSLT, pages 166–173, 2006.<br />
</ref><br />
<br />
They tried the following combinations:<br />
1. Baseline configuration<br />
2. Baseline + RNN<br />
3. Baseline + CSLM + RNN<br />
4. Baseline + CSLM + RNN + Word penalty<br />
<br />
Results:<br />
<br />
<center><br />
[[File:encdec3.png |frame | center |Fig 3. BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, which penalizes the number of words unknown to the neural networks. ]]<br />
</center><br />
<br />
The best performance was achieved when both the CSLM and the phrase scores from the RNN Encoder–Decoder were used. This suggests that the contributions of the CSLM and the RNN Encoder–Decoder are not too correlated and that one can expect better results by improving each method independently.<br />
<br />
<br />
== Word and Phrase Representations ==<br />
<br />
As the presented model maps sentences into a continuous space vector and prior continuous space language models have been known to learn semantically meaningful embeddings, one could expect this to happen for the presented model, too. This is indeed the case. When projecting to a 2D space (with Barnes-Hut-SNE), semantically similar words are clearly clustered.<br />
<br />
[[File:Fig4.png]]<br />
<br />
Phrases are also clustered capturing both semantic and syntactic structures.<br />
<br />
[[File:Fig5.png]]<br />
<br />
= References=<br />
<references /></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Sparse_Rectifier_Neural_Networks&diff=27170deep Sparse Rectifier Neural Networks2015-12-12T17:20:19Z<p>Rtwang: /* Advantages of rectified linear units */</p>
<hr />
<div>= Introduction =<br />
<br />
Two trends in deep learning can be seen in terms of architecture improvements. The first is increasing sparsity (for example, see convolutional neural nets) and the second is increasing biological plausibility (biologically plausible sigmoid neurons performing better than tanh neurons). Rectified linear neurons are good for both sparsity and biological plausibility, and thus should increase performance.<br />
<br />
In this paper the authors show that two common gaps between computational neuroscience models and machine learning neural network models can be bridged by the rectifier activation function. One is between deep networks learnt with and without unsupervised pre-training; the other is between the activation function and sparsity in neural networks.<br />
<br />
== Biological Plausibility and Sparsity ==<br />
<br />
In the brain, neurons rarely fire at the same time, as a way to balance quality of representation and energy conservation. This is in stark contrast to sigmoid neurons, which fire at 1/2 of their maximum rate when their input is zero. A solution to this problem is to use a rectifier neuron, which does not fire at its zero value. This rectified linear unit is inspired by a common biological model of the neuron, the leaky integrate-and-fire model (LIF), described by Dayan and Abbott<ref><br />
Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems<br />
</ref>. Its function is illustrated in the figure below (middle).<br />
<br />
<gallery><br />
Image:sig_neuron.png|Sigmoid and TANH Neuron<br />
Image:lif_neuron.png|Leaky Integrate Fire Neuron<br />
Image:rect_neuron.png|Rectified Linear Neuron<br />
</gallery><br />
<br />
Given that the rectifier neuron maps a larger range of inputs to zero, its representation will be more sparse. In the paper, the two most salient advantages of sparsity are:<br />
<br />
- '''Information Disentangling''' As opposed to a dense representation, where every slight input change results in a considerable output change, the non-zero items of a sparse representation remain almost constant under slight input changes.<br />
<br />
- '''Variable Dimensionality''' A sparse representation can effectively choose how many dimensions to use to represent a variable, since it chooses how many non-zero elements contribute. Thus, the precision is variable, allowing for more efficient representation of complex items.<br />
<br />
Further benefits of a sparse representation, and of rectified linear neurons in particular, are better linear separability (because the input is represented in a higher-dimensional space) and lower computational complexity (most units are off, and for the units that are on only a linear function has to be computed).<br />
<br />
However, it should also be noted that sparsity reduces the capacity of the model because each unit takes part in the representation of fewer values.<br />
<br />
== Advantages of rectified linear units ==<br />
<br />
The rectifier activation function <math>\max(0, x)</math> allows a network to easily obtain sparse representations, since only a subset of hidden units will have a non-zero activation value for a given input, and this sparsity can be further increased through regularization methods. Therefore, the rectified linear activation function benefits from the advantages of sparsity listed in the previous section.<br />
<br />
For a given input, only a subset of hidden units in each layer will have non-zero activation values. The remaining hidden units output zero and are essentially turned off. Because the rectifier is linear on its active side, each hidden unit's pre-activation is a linear combination of the active (non-zero) hidden units in the previous layer. Repeating this through each layer, one can see that the neural network is actually an exponentially large number of linear models that share parameters, since the later layers use the same values from the earlier layers. Since the network is linear on the set of active units, the gradient is easy to compute and travels back through the active nodes without the vanishing gradient effect caused by saturating sigmoid or tanh non-linearities.<br />
<br />
The sparsity and linear model can be seen in the figure the researchers made:<br />
<br />
[[File:RLU.PNG]]<br />
<br />
Each layer is a linear combination of the previous layer.<br />
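<br />
The claim that a rectifier network behaves like a linear model on its set of active units can be checked numerically. The following is a minimal sketch on a toy one-hidden-layer network without biases; the layer sizes, the random weights and the names <code>W1</code>, <code>W2</code> and <code>W_eff</code> are assumptions chosen for illustration, not taken from the paper.<br />

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 5))   # input -> hidden weights (toy sizes)
W2 = rng.standard_normal((3, 8))   # hidden -> output weights
x = rng.standard_normal(5)

h = relu(W1 @ x)                   # sparse hidden representation
y = W2 @ h

active = h > 0                     # units that are switched on for this input
sparsity = 1.0 - active.mean()     # fraction of hidden units that are exactly zero

# Restricted to the active units, the whole network is a single linear map.
W_eff = W2[:, active] @ W1[active, :]
print("sparsity:", sparsity)
print("matches linear model:", np.allclose(y, W_eff @ x))
```

A different input generally selects a different active set, and therefore a different effective linear map built from the same shared weights.<br />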
<br />
== Potential problems of rectified linear units ==<br />
<br />
The zero derivative below zero in the rectified neurons blocks the back-propagation of the gradient during learning. This effect was investigated using a smooth variant of the rectification non-linearity (the softplus activation). Surprisingly, the results suggest that the hard rectification performs better. The authors hypothesize that the hard rectification is not a problem as long as the gradient can be propagated along some paths through the network, and that the complete shut-off with the hard rectification sharpens the credit attribution to neurons in the learning phase.<br />
<br />
Furthermore, the unbounded nature of the rectification non-linearity can lead to numerical instabilities if activations grow too large. To circumvent this, an <math>L_1</math> regularizer is used. Also, if symmetry is required, it can be obtained by using two rectifier units with shared parameters, but this requires twice as many hidden units as a network with a symmetric activation function.<br />
<br />
Finally, rectifier networks are subject to ill conditioning of the parametrization. Biases and weights can be scaled in different (and consistent) ways while preserving the same overall network function.<br />
<br />
= Experiments =<br />
<br />
Networks with rectifier neurons were applied to the domains of image recognition and sentiment analysis. The datasets for image recognition included both black and white (MNIST, NISTP), colour (CIFAR10) and stereo (NORB) images.<br />
<br />
The datasets for sentiment analysis were taken from opentable.com and Amazon. The task for both was to predict the star rating based on the text of the review.<br />
<br />
== Results ==<br />
<br />
'''Results from image classification'''<br />
[[File:rectifier_res_1.png]]<br />
<br />
'''Results from sentiment classification'''<br />
[[File:rectifier_res_2.png]]<br />
<br />
For the image recognition tasks, they find that there is almost no improvement when using unsupervised pre-training with rectifier activations, contrary to what is experienced using tanh or softplus. Moreover, the rectifier network achieves its best performance when it is trained without unsupervised pre-training.<br />
<br />
In the NORB and sentiment analysis cases, the network benefited greatly from pre-training. However, the benefit in NORB diminished as the training set size grew.<br />
<br />
The result from the Amazon dataset was 78.95%, while the state of the art was 73.72%.<br />
<br />
The sparsity achieved with the rectified linear neurons helps to diminish the gap between networks with unsupervised pre-training and no pre-training.<br />
<br />
== Discussion / Criticism ==<br />
<br />
* Rectifier neurons really aren't biologically plausible for a variety of reasons. Namely, the neurons in the cortex do not have tuning curves resembling the rectifier. Additionally, the ideal sparsity of the rectifier networks was from 50 to 80%, while the brain is estimated to have a sparsity of around 95 to 99%.<br />
<br />
* The sparsity property encouraged by ReLU is a double-edged sword. While sparsity encourages information disentangling, efficient variable-size representation, linear separability and increased robustness, as suggested by the authors of this paper, <ref>Szegedy, Christian, et al. "Going deeper with convolutions." arXiv preprint arXiv:1409.4842 (2014).</ref> argues that computing on sparse non-uniform data structures is very inefficient: the overhead and cache misses make sparse data structures computationally expensive to justify.<br />
<br />
* ReLU does not suffer from the vanishing gradient problem along active paths, since its derivative is 1 for positive inputs.<br />
<br />
* ReLU units can be prone to "die", in other words a unit may output the same value (zero) regardless of the input it is given. This occurs when a large negative bias is learnt, making the pre-activation negative for every input; the unit then stays stuck at zero because the gradient of the rectifier is zero for negative inputs. Techniques such as Leaky ReLU and Maxout mitigate this problem.<br />
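<br />
The "dying" behaviour in the last point can be illustrated with a single unit. The sketch below is a toy example under stated assumptions: the bias of -100, the random data and the leaky slope of 0.01 are made-up values, and <code>relu_grad</code> and <code>leaky_relu_grad</code> are hypothetical helper names.<br />

```python
import numpy as np

def relu_grad(a):
    """Derivative of max(0, a) with respect to a (taken as 0 at a = 0)."""
    return (a > 0).astype(float)

def leaky_relu_grad(a, slope=0.01):
    """Derivative of a leaky rectifier: 1 for a > 0, a small slope otherwise."""
    return np.where(a > 0, 1.0, slope)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))   # 100 toy inputs
w = rng.standard_normal(5)
b = -100.0                          # large negative bias (assumed for the demo)

pre = X @ w + b                     # pre-activation is negative for every input
print("ReLU gradient signal:", relu_grad(pre).sum())          # no update reaches w or b
print("leaky gradient signal:", leaky_relu_grad(pre).sum())   # small but non-zero
```

With the plain rectifier no gradient flows through the unit for any input, so it cannot recover; the leaky variant keeps a small gradient that lets the bias move back.<br />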
<br />
= Bibliography =<br />
<references /></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:RLU.PNG&diff=27169File:RLU.PNG2015-12-12T17:19:42Z<p>Rtwang: </p>
<hr />
<div></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_the_difficulty_of_training_recurrent_neural_networks&diff=27075on the difficulty of training recurrent neural networks2015-12-04T04:26:59Z<p>Rtwang: /* The Temporal Order Problem */</p>
<hr />
<div>= Introduction =<br />
<br />
Training recurrent neural networks (RNNs) is difficult, and two of the most prominent problems have been vanishing and exploding gradients <ref name="yoshua1993">Yoshua Bengio, Paolo Frasconi, and Patrice Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pages 1183–1188. IEEE, 1993.</ref>, which prevent neural networks from learning and fitting data with long-term dependencies. In this paper the authors propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem.<br />
<br />
= Background =<br />
<br />
[[Image:rnn_2.png|frame|center|400px|alt=| Recurrent Neural Network Unrolled in time <ref name="pascanu">Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.</ref>]]<br />
A generic recurrent neural network has the form:<br />
<br />
<math>\mathbf{x}_{t} = F(\mathbf{x}_{t -1}, \mathbf{u}_{t}, \theta)</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{x}_{t}</math> is the state at time <math>t</math></span><br />
* <span><math>\mathbf{u}_{t}</math> is the input at time <math>t</math></span><br />
* <span><math>\theta\,</math> are the parameters</span><br />
* <span><math>F()\,</math> is the function that represents a neuron</span><br />
<br />
In the theoretical sections the authors make use of a specific parameterization:<br />
<br />
<math>\mathbf{x}_{t} = \mathbf{W}_{rec} \sigma(\mathbf{x}_{t - 1}) + \mathbf{W}_{in} \mathbf{u}_{t} + b</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{W}_{rec}</math> is the RNN weight matrix</span><br />
* <span><math>\sigma()\,</math> is an element wise function</span><br />
* <span><math>b\,</math> is the bias</span><br />
* <span><math>\mathbf{W}_{in}</math> is the input weights matrix</span><br />
<br />
The following are gradient equations for using the Back Propagation Through Time (BPTT) algorithm. The authors rewrote the equations in order to highlight the exploding gradients problem:<br />
<br />
<math>\frac{\partial \varepsilon}{\partial \theta} = <br />
\sum_{1 \leq t \leq T} \frac{\partial \varepsilon_t}{\partial \theta}</math><br />
<br />
<math>\frac{\partial \varepsilon_{t}}{\partial \theta} = <br />
\sum_{1 \leq k \leq t} <br />
\left(<br />
\frac{\partial \varepsilon_{t}}{\partial x_{t}}<br />
\frac{\partial x_{t}}{\partial x_{k}}<br />
\frac{\partial^{+} x_{k}}{\partial \theta}<br />
\right)</math><br />
<br />
<math>\frac{\partial x_{t}}{\partial x_{k}} =<br />
\prod_{t \geq i > k} \frac{\partial x_{i}}{\partial x_{i - 1}} =<br />
\prod_{t \geq i > k} <br />
\mathbf{W}^{T}_{rec} \textit{diag}(\sigma^{\prime}(x_{i - 1}))</math><br />
<br />
Where:<br />
<br />
* <span><math>\varepsilon_{t}</math> is the error obtained from output at time <math>t</math></span><br />
* <span><math>\frac{\partial^{+} \mathbf{x}_{k}}{\partial \theta}</math> is the immediate partial derivative of state <math>\mathbf{x}_{k}</math> with respect to <math>\theta</math>, i.e. with <math>\mathbf{x}_{k-1}</math> treated as a constant</span>. For the parameterization above, <math>\frac{\partial^+ \mathbf{x}_k}{\partial \mathbf{W}_{rec}} = \sigma(\mathbf{x}_{k-1})</math>.<br />
<br />
The authors of this paper also distinguish between ''long-term'' and ''short-term'' contributions to the gradient with respect to <math>\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k}</math>. The contribution is ''long-term'' if <math>k \ll t</math>, and ''short-term'' otherwise.<br />
<br />
== Exploding and Vanishing Gradients ==<br />
=== The Mechanics ===<br />
<br />
It is known that <math> |\sigma'(x)| </math> is bounded. Let <math>\left|\left|diag(\sigma'(x_k))\right|\right| \leq \gamma \in \mathbb{R}</math>.<br />
<br />
The paper first proves that it is sufficient for <math> \lambda_1 < \frac{1}{\gamma} </math>, where <math> \lambda_1 </math> is the largest singular value of <math> \mathbf{W}_{rec} </math>, for the vanishing gradient problem to occur. The Jacobian matrix <math> \frac{\partial x_{k+1}}{\partial x_k} </math> is given by <math> \mathbf{W}_{rec}^{T}diag(\sigma'(x_k)) </math>. Then, the 2-norm of this Jacobian is bounded by the product of the norms of the two matrices. This leads to <math> \forall k, \left\|\frac{\partial{x_{k+1}}}{\partial x_k}\right\| \leq \left\|\mathbf{W}_{rec}^T\right\| \left\|diag(\sigma'(x_k))\right\| < \frac{1}{\gamma}\gamma = 1</math>.<br />
<br />
Let <math>\eta \in \mathbb{R}</math> be such that <math>\forall k, \left\|\frac{\partial {x_{k+1}}}{\partial x_k}\right\| \leq \eta < 1</math>. By induction over <math>i</math>, we can show that <math>\left\|\frac{\partial \varepsilon_t}{\partial x_t}\left(\prod_{i=k}^{t-1}{\frac{\partial x_{i+1}}{\partial x_i}}\right)\right\| \leq \eta^{t-k}\left\|\frac{\partial \varepsilon_t}{\partial x_t}\right\|</math>. Since <math> \eta < 1 </math>, the long-term contributions to the gradient (those with large <math> t-k </math>) go to 0 exponentially fast.<br />
<br />
By inverting this proof, one obtains a necessary condition for exploding gradients: the largest singular value <math>\lambda_1 </math> must be larger than <math> \frac{1}{\gamma}</math> (otherwise the long-term components would vanish rather than explode).<br />
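<br />
The sufficient condition for vanishing gradients can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' experiment: it uses <math>\sigma=\tanh</math> (so <math>\gamma=1</math>), no input and zero bias, a random <math>\mathbf{W}_{rec}</math> rescaled so that its largest singular value is 0.9, and a hypothetical state size of 10.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 10, 30

W = rng.standard_normal((n, n))
W *= 0.9 / np.linalg.norm(W, 2)     # largest singular value becomes 0.9 < 1/gamma
lam1 = np.linalg.norm(W, 2)         # matrix 2-norm = largest singular value

x = rng.standard_normal(n)          # initial state x_0
J = np.eye(n)                       # accumulates the product of Jacobian factors
norms = []
for _ in range(steps):
    # Factor W_rec^T diag(sigma'(x_{i-1})) with sigma = tanh, sigma' = 1 - tanh^2.
    factor = W.T @ np.diag(1.0 - np.tanh(x) ** 2)
    J = factor @ J                  # newest factor goes on the left
    x = W @ np.tanh(x)              # state update with no input and b = 0
    norms.append(np.linalg.norm(J, 2))

# Each norm is bounded by lam1 ** (number of factors), so it decays geometrically.
for i in (0, 9, 29):
    print(i + 1, norms[i], lam1 ** (i + 1))
```

The printed norms sit below the geometric bound <math>\lambda_1^{t-k}</math>, which is the decay used in the proof above.<br />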
<br />
=== From a dynamical systems perspective ===<br />
<br />
Drawing on a dynamical systems perspective similar to <ref name="yoshua1993"></ref><ref name="doya1993">Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on neural networks, 1:75–80, 1993.</ref>, dynamical systems theory states that as “<math>\theta</math> changes the asymptotic behaviour changes smoothly almost everywhere except for certain crucial points where drastic changes occur” <ref name="pascanu"></ref>. These points are called bifurcation points, and crossing them has the potential to cause gradients to explode <ref name="doya1993"></ref>.<br />
<br />
The authors of this paper argue, however, that crossing these bifurcation points does not guarantee a sudden change in gradients. Their idea is that a change to the model parameters can alter the attractor landscape in such a way that the basin of attraction corresponding to the current model state is unaltered. For example, a change to the model parameters might eliminate a basin of attraction in a portion of the model's state space that is very far from its current state. In this case, the bifurcation will have no effect on the asymptotic behaviour of the model, and there will accordingly be no gradient explosion. On the other hand, if a change to the model parameters substantially alters the final basin of attraction given the current state, then there will be a considerable effect on the asymptotic behaviour of the model, and the gradients will explode accordingly.<br />
<br />
[[Image:dynamic_perspective.png|frame|center|400px|alt=|Bifurcation diagram of Single Hidden Unit RNN <ref name="pascanu"></ref>]]<br />
<br />
The figure above depicts a bifurcation diagram for a single-unit RNN, where the x-axis is the parameter <math>b</math> (bias), the y-axis is the asymptotic state <math>x_{\infty}</math> (i.e. the equilibrium activation value of the unit), and the plot line is the movement of the final point attractor <math>x_{\infty}</math> as <math>b</math> changes. The figure shows the presence of two attractors, one emerging from <math>b_1</math> and another disappearing at <math>b_2</math>, as the value of <math>b</math> is decreased. Note that only one attractor exists when the value of <math>b</math> is outside of the interval between <math>b_1</math> and <math>b_2</math>, and that when two attractors exist, the attractor state towards which the unit ultimately gravitates is determined by its initial starting state. The boundary between these two basins of attraction is denoted with the dashed line: starting states on opposite sides of this boundary will gravitate towards different attractor states. The blue filled circles indicate a bifurcation point at which a small change to the value of <math>b</math> can have a drastic effect on the attractor landscape over the unit's state space. In short, the landscape shifts to include a single attractor state for a low value of <math>x</math>. The unfilled green circles represent Pascanu’s (2013) extension of Doya’s hypothesis, where if the model is in the boundary range at time <math>0</math>, a small change in <math>b</math> would result in a sudden large change in <math>x_{t}</math>.<br />
<br />
Overall, these remarks indicate that, when treated as a dynamical system, the behaviour of an RNN can be analyzed with respect to both changes to the parameter values that determine an attractor landscape over its state space (assuming a fixed starting state), and with respect to changes to the starting state (assuming a fixed attractor landscape).<br />
<br />
=== From a geometric perspective ===<br />
<br />
Aside from a dynamical systems perspective on exploding and vanishing gradients, the authors also considered a geometric perspective, using a simple RNN with a single hidden unit.<br />
<br />
[[Image:geometric_perspective.png|frame|center|400px|alt=|Error Loss surface of single hidden unit RNN <ref name="pascanu"></ref>]]<br />
<br />
Reusing the RNN state equation the authors defined:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
Assuming no input, <math>b = 0</math>, a linear model (<math>\sigma</math> the identity) and initial state <math>x_{0}</math>, the equation simplifies to:<br />
<br />
<math>x_{t} = W_{rec}^{t} x_{0}</math><br />
<br />
Differentiating the above equation with respect to <math>W_{rec}</math> to the first and second order gives:<br />
<br />
<math>\frac{\partial x_{t}}{\partial W_{rec}} = t W_{rec}^{t - 1} x_{0}</math><br />
<br />
<math>\frac{\partial^{2} x_{t}}{\partial W_{rec}^{2}} = t (t - 1) W_{rec}^{t - 2} x_{0}</math><br />
<br />
This implies that if the first-order derivative explodes, then so does the second-order derivative. In the general case, when the gradients explode they do so along some direction '''v''', i.e. there exists a vector '''v''' such that <math>\frac{\partial \varepsilon_t}{\partial \theta}\mathbf{v} \geq C\alpha^t</math> with <math>C, \alpha \in \mathbb{R}</math> and <math>\alpha > 1</math>. If this bound is tight, it is hypothesized that ''when gradients explode so does the curvature along'' '''v''''', leading to a wall in the error surface'', like the one seen above. If both the gradient and the leading eigenvector of the curvature are aligned with the exploding direction '''v''', it follows that the error surface has a steep wall perpendicular to '''v''' (and consequently to the gradient). This means that when stochastic gradient descent (SGD) reaches this wall and attempts to step into it, it will be deflected away, which can hinder the learning process (see figure above). Note that the gradient clipping solution proposed by the authors assumes that the valley bordered by a steep cliff in the value of the loss function is wide enough with respect to the clip being applied to the gradient; otherwise the deflection caused by an SGD update would still hinder the learning process despite clipping being used. The practical effectiveness of clipping provides some evidence in support of this assumption.<br />
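<br />
The growth of both derivatives in the linear one-unit case can be seen with a short calculation. This is a toy sketch: the horizon of 50 steps, the two weight values and the function names are assumptions picked for illustration.<br />

```python
def state(w, t, x0=1.0):
    """x_t = w^t * x_0 for the linear single-unit model with b = 0."""
    return x0 * w ** t

def first_derivative(w, t, x0=1.0):
    """d x_t / d w = t * w^(t-1) * x_0."""
    return t * x0 * w ** (t - 1)

def second_derivative(w, t, x0=1.0):
    """d^2 x_t / d w^2 = t * (t-1) * w^(t-2) * x_0."""
    return t * (t - 1) * x0 * w ** (t - 2)

# Compare a contracting weight (w < 1) with an expanding one (w > 1).
for w in (0.9, 1.1):
    print(w, first_derivative(w, 50), second_derivative(w, 50))
```

The ratio of the second to the first derivative is <math>(t-1)/W_{rec}</math>, so for a long horizon the curvature grows even faster than the gradient, which is consistent with the steep wall in the error surface described above.<br />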
<br />
= Dealing with Exploding and Vanishing Gradient =<br />
<br />
== Related Work ==<br />
<br />
* <span>'''L1/L2 Regularization''': Helps with exploding gradients, but limits the RNN to a single point attractor at the origin, and prevents the model from learning generator models or exhibiting long-term memory traces.</span><br />
* <span>'''Teacher Forcing''': Proposed by <ref name="doya1993"></ref>, it helps the model cross a bifurcation boundary if the model does not exhibit asymptotic behavior towards a desired target. This assumes the user knows what the behaviour might look like or how to initialize the model to reduce exploding gradients.</span><br />
* <span>'''LSTM''': The Long Short-Term Memory architecture by <ref name="graves2009">Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009.</ref><ref name="Hochreiter">Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.</ref> is an attempt to deal with the vanishing gradient problem by using a linear unit that feeds back to itself with a weight of <math>1</math>. This solution, however, does not deal with the exploding gradient problem.</span><br />
* <span>'''Hessian-Free optimizer with structural damping''': Proposed by <ref>Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.</ref>, it addresses both the vanishing and the exploding gradient problem. <ref name="pascanu"></ref> reasons that this approach solves the vanishing gradient problem because the high dimensionality of the spaces gives rise to a high probability that the long-term components are orthogonal to the short-term components. Additionally, for exploding gradients, the curvature is taken into account.</span><br />
* <span>'''Echo State Networks''': <ref>Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.</ref> avoid the exploding and vanishing gradient problem by not learning the input and recurrent weights; they are instead sampled from hand-crafted distributions. Since the spectral radius of the recurrent weight matrix is usually kept smaller than 1, information fed into the model dies out over time, which makes long-term dependencies hard to capture.</span><br />
<br />
== Scaling Down the Gradients ==<br />
<br />
[[Image:gradient_clipping.png|frame|center|400px]]<br />
<br />
The intuition behind this gradient clipping algorithm is simple: compute the norm of the gradient and, if it is larger than a set threshold, rescale the gradient by the threshold divided by the gradient norm. <ref name="pascanu"></ref> suggest choosing a threshold between half and ten times the average gradient norm observed over a sufficiently large number of updates.<br />
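<br />
The clipping rule described above can be sketched as follows. This is a minimal NumPy sketch, not the authors' code: the function name <code>clip_gradient_norm</code> and the representation of the gradient as a list of arrays (one per parameter) are assumptions of the example.<br />

```python
import numpy as np

def clip_gradient_norm(grads, threshold):
    """Rescale a list of gradient arrays if their joint L2 norm exceeds threshold.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm      # threshold divided by the gradient norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# Toy usage: a gradient of norm 5 is rescaled to norm 1; its direction is unchanged.
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm_before = clip_gradient_norm(grads, threshold=1.0)
print(norm_before, clipped)
```

Only the magnitude of the update changes; gradients whose norm is already below the threshold are returned untouched.<br />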
<br />
== Vanishing Gradient Regularization ==<br />
<br />
The vanishing gradient regularizer is as follows:<br />
<br />
<math>\Omega <br />
= \sum_{k} \Omega_{k} <br />
= \sum_{k} <br />
\left( <br />
\frac{<br />
\left\| <br />
\frac{\partial \varepsilon}{\partial x_{k + 1}} <br />
\frac{\partial x_{k + 1}}{\partial x_{k}}<br />
\right\|<br />
}<br />
{<br />
\left\|<br />
\frac{\partial \varepsilon}{\partial x_{k + 1}}<br />
\right\| <br />
} - 1<br />
\right)^{2}</math><br />
<br />
Vanishing gradients make it hard for the network to stay sensitive to inputs far in the past. The authors note that the sensitivity to all inputs <math>u_{t} \dots u_{k}</math> could be increased by increasing the norm of <math>\frac{\partial x_t}{\partial x_{k}}</math>. However, some of these inputs may be irrelevant and noisy, and the network must learn to ignore them, which it can only do in the presence of an error signal. Increasing the norm of <math>\frac{\partial x_t}{\partial x_{k}}</math> may therefore come at the expense of a larger error, so it cannot always be achieved by following a descent direction of the error; the authors thus argue that a regularizer is the more natural choice. The regularizer is a soft constraint that forces the Jacobian matrices <math>\frac{\partial x_{k + 1}}{\partial x_{k}}</math> to preserve norm in the direction of the error <math>\frac{\partial \varepsilon}{\partial x_{k + 1}}</math>.<br />
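<br />
To make the definition of <math>\Omega</math> concrete, the following sketch evaluates it from a list of error gradients and Jacobians that are assumed to be already available. The function name and the convention that each error gradient is a row vector multiplying the Jacobian from the left are assumptions of this example; it only evaluates the penalty and does not show how it is differentiated during training.<br />

```python
import numpy as np

def vanishing_gradient_regularizer(dE_dx, jacobians):
    """Evaluate Omega = sum_k (||g_k J_k|| / ||g_k|| - 1)^2.

    dE_dx[k]     : error gradient with respect to x_{k+1}, as a 1-D array
    jacobians[k] : Jacobian of x_{k+1} with respect to x_k, as a 2-D array
    """
    omega = 0.0
    for g, J in zip(dE_dx, jacobians):
        ratio = np.linalg.norm(g @ J) / np.linalg.norm(g)
        omega += (ratio - 1.0) ** 2
    return omega

g = [np.array([1.0, 2.0]), np.array([0.5, -1.0])]
print(vanishing_gradient_regularizer(g, [np.eye(2)] * 2))        # norm-preserving Jacobians
print(vanishing_gradient_regularizer(g, [0.5 * np.eye(2)] * 2))  # contracting Jacobians
```

Jacobians that preserve the norm of the back-propagated error incur no penalty, while contracting (or expanding) Jacobians are penalized quadratically.<br />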
<br />
= Experimental Results =<br />
<br />
== The Temporal Order Problem ==<br />
<br />
The authors used the temporal order problem as the prototypical pathological problem for validating the proposed clipping and regularization. The problem involves a long sequence of discrete symbols in which an <math>A</math> or a <math>B</math> symbol is placed at two positions, one near the beginning and one near the middle of the sequence. The task is to classify, at the end of the sequence, the order in which the <math>A</math> and <math>B</math> symbols appeared.<br />
<br />
Three different RNN initializations were used for the experiment:<br />
<br />
* <span>'''sigmoid unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
* <span>'''basic tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.1)</math></span><br />
* <span>'''smart tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
<br />
For each of the three RNNs, three different optimizer configurations were used:<br />
<br />
* <span>'''MSGD''': Mini-batch Stochastic Gradient Descent</span><br />
* <span>'''MSGD-C''': MSGD with Gradient Clipping</span><br />
* <span>'''MSGD-CR''': MSGD-C with Regularization</span><br />
<br />
Additional model parameters include:<br />
<br />
* <span>Spectral Radius of 0.95</span><br />
* <span><math>b = b_{out} = 0</math></span><br />
* <span>50 hidden neurons</span><br />
* <span>constant learning rate of 0.01</span><br />
* <span>clipping threshold of 1 (only for MSGD-C and MSGD-CR)</span><br />
* <span>regularization weight of 4 (MSGD-CR)</span><br />
<br />
The experiment was repeated 5 times, and the figure below shows the importance of gradient clipping and the regularizer. In all cases, the combination of the two methods yielded the best results regardless of which unit network was used. Furthermore, this experiment provided empirical evidence that exploding gradients correlate with tasks that require long memory traces: as the sequence length of the problem increases, clipping and regularization become more important. This is largely because longer memory requires a larger spectral radius, which in turn increases the likelihood of gradient explosion.<br />
<br />
[[Image:experimental_results.png|frame|center|400px]]<br />
<br />
== Other Pathological Problems ==<br />
<br />
The authors repeated other pathological problems from <ref name="Hochreiter"></ref>. The results are listed below:<br />
<br />
[[Image:experimental_results_2.png|image]]<br />
<br />
The authors did not discuss the experimental results in detail here.<br />
<br />
= Summary =<br />
<br />
The paper explains the exploding and vanishing gradient problems in training RNNs from two perspectives: dynamical systems and geometry. The authors mitigate these problems with gradient clipping and a vanishing gradient regularizer, and their experimental results showed that in all cases except for the Penn Treebank dataset, clipping and the regularizer improved the results of the RNNs in their respective experiments.</div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_the_difficulty_of_training_recurrent_neural_networks&diff=27074on the difficulty of training recurrent neural networks2015-12-04T04:24:47Z<p>Rtwang: /* Summary */</p>
<hr />
<div>= Introduction =<br />
<br />
Training a Recurrent Neural Network (RNN) is difficult, and two of the most prominent problems have been vanishing and exploding gradients <ref name="yoshua1993">Yoshua Bengio, Paolo Frasconi, and Patrice Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pages 1183–1188. IEEE, 1993.</ref>, which prevent neural networks from learning and fitting data with long-term dependencies. In this paper the authors propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem.<br />
<br />
= Background =<br />
<br />
[[Image:rnn_2.png|frame|center|400px|alt=| Recurrent Neural Network Unrolled in time <ref name="pascanu">Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.</ref>]]<br />
A generic recurrent neural network has the form:<br />
<br />
<math>x_{t} = F(\mathbf{x}_{t -1}, \mathbf{u}_{t}, \theta)</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{x}_{t}</math> is the state at time <math>t</math></span><br />
* <span><math>\mathbf{u}_{t}</math> is the input at time <math>t</math></span><br />
* <span><math>\theta\,</math> are the parameters</span><br />
* <span><math>F()\,</math> is the function that represents a neuron</span><br />
<br />
In the theoretical sections the authors made use of a specific parameterization:<br />
<br />
<math>\mathbf{x}_{t} = \mathbf{W}_{rec} \sigma(\mathbf{x}_{t - 1}) + \mathbf{W}_{in} \mathbf{u}_{t} + b</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{W}_{rec}</math> is the RNN weight matrix</span><br />
* <span><math>\sigma()\,</math> is an element wise function</span><br />
* <span><math>b\,</math> is the bias</span><br />
* <span><math>\mathbf{W}_{in}</math> is the input weights matrix</span><br />
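<br />
The parameterization above can be unrolled in time as follows. This is a minimal sketch under stated assumptions: the function name <code>rnn_states</code>, the choice of <code>tanh</code> for <math>\sigma</math>, and the layer sizes are illustrative and not taken from the paper.<br />

```python
import numpy as np

def rnn_states(W_rec, W_in, b, inputs, x0, sigma=np.tanh):
    """Iterate x_t = W_rec sigma(x_{t-1}) + W_in u_t + b and return all states."""
    x = x0
    states = []
    for u in inputs:
        x = W_rec @ sigma(x) + W_in @ u + b
        states.append(x)
    return np.stack(states)

rng = np.random.default_rng(0)
n_hidden, n_input, T = 3, 2, 5           # illustrative sizes
W_rec = rng.standard_normal((n_hidden, n_hidden)) * 0.1
W_in = rng.standard_normal((n_hidden, n_input)) * 0.1
b = np.zeros(n_hidden)
inputs = rng.standard_normal((T, n_input))
x0 = np.zeros(n_hidden)

states = rnn_states(W_rec, W_in, b, inputs, x0)
print(states.shape)  # one state vector per time step
```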
<br />
The following are gradient equations for using the Back Propagation Through Time (BPTT) algorithm. The authors rewrote the equations in order to highlight the exploding gradients problem:<br />
<br />
<math>\frac{\partial \varepsilon}{\partial \theta} = <br />
\sum_{1 \leq t \leq T} \frac{\partial \varepsilon_t}{\partial \theta}</math><br />
<br />
<math>\frac{\partial \varepsilon_{t}}{\partial \theta} = <br />
\sum_{1 \leq k \leq t} <br />
\left(<br />
\frac{\partial \varepsilon_{t}}{\partial x_{t}}<br />
\frac{\partial x_{t}}{\partial x_{k}}<br />
\frac{\partial^{+} x_{k}}{\partial \theta}<br />
\right)</math><br />
<br />
<math>\frac{\partial x_{t}}{\partial x_{k}} =<br />
\prod_{t \geq i > k} \frac{\partial x_{i}}{\partial x_{i - 1}} =<br />
\prod_{t \geq i > k} <br />
\mathbf{W}^{T}_{rec} \textit{diag}(\sigma^{\prime}(x_{i - 1}))</math><br />
<br />
Where:<br />
<br />
* <span><math>\varepsilon_{t}</math> is the error obtained from output at time <math>t</math></span><br />
* <span><math>\frac{\partial^{+} \mathbf{x}_{k}}{\partial \theta}</math> is the immediate partial derivative of state <math>\mathbf{x}_{k}</math></span>. For the parameterization above, <math>\frac{\partial^+ \mathbf{x}_k}{\partial \mathbf{W}_{rec}} = \sigma(\mathbf{x}_{k-1})</math>.<br />
<br />
The authors of this paper also distinguish between ''long-term'' and ''short-term'' contributions to the gradient with respect to <math>\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k}</math>. The contribution is ''long-term'' if <math>k \ll t</math>, and ''short-term'' otherwise.<br />
<br />
== Exploding and Vanishing Gradients ==<br />
=== The Mechanics ===<br />
<br />
It is known that <math> |\sigma^{\prime}(x)| </math> is bounded. Let <math>\left|\left|diag(\sigma^{\prime}(x_k))\right|\right| \leq \gamma \in \mathbb{R}</math>.<br />
<br />
The paper first proves that <math> \lambda_1 < \frac{1}{\gamma} </math>, where <math> \lambda_1 </math> is the largest singular value of <math> \bold{W}_{rec} </math>, is a sufficient condition for the vanishing gradient problem to occur. The Jacobian matrix <math> \frac{\partial x_{k+1}}{\partial x_k} </math> is given by <math> \bold{ W}_{rec}^{T}diag(\sigma^{\prime}(x_k)) </math>. The 2-norm of this Jacobian is bounded by the product of the norms of the two matrices, which leads to <math> \forall k, ||\frac{\partial{x_{k+1}}}{\partial x_k}|| \leq ||\bold{W}_{rec}^T||\,||diag(\sigma^{\prime}(x_k))|| < \frac{1}{\gamma}\gamma = 1</math>.<br />
<br />
Let <math>\eta \in \mathbb{R}</math> be such that <math>\forall k, ||\frac{\partial {x_{k+1}}}{\partial x_k}|| \leq \eta < 1</math>. By induction over <math>i</math>, we can show that <math>||\frac{\partial \varepsilon_t}{\partial x_t}(\prod_{i=k}^{t-1}{\frac{\partial x_{i+1}}{\partial x_i}})|| \leq \eta^{t-k}||\frac{\partial \varepsilon_t}{\partial x_t}||</math>. Since <math> \eta < 1 </math>, the long-term contributions to the gradient go to 0 exponentially fast as <math> t-k </math> grows.<br />
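<br />
The bound can be checked numerically. The sketch below assumes a <code>tanh</code> network (so <math>\gamma = 1</math>) and rescales a random matrix so that its largest singular value is 0.9; the sizes, the seed and the value 0.9 are illustrative choices, not values from the paper.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 8, 40
W = rng.standard_normal((n, n))
W *= 0.9 / np.linalg.norm(W, 2)   # largest singular value 0.9 < 1 / gamma, with gamma = 1 for tanh

x = rng.standard_normal(n)
product = np.eye(n)
norms = []
for _ in range(steps):
    # One-step Jacobian in the form used in the text: W_rec^T diag(sigma'(x_i)).
    jac = W.T @ np.diag(1.0 - np.tanh(x) ** 2)
    product = jac @ product
    norms.append(np.linalg.norm(product, 2))
    x = W @ np.tanh(x)            # advance the state (no input, zero bias)

bounds = [0.9 ** (i + 1) for i in range(steps)]
print(norms[-1], bounds[-1])
```

The norm of the product of Jacobians stays below <math>\eta^{t-k}</math> at every step and therefore shrinks at least geometrically.<br />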
<br />
By inverting this proof, we obtain a necessary condition for exploding gradients: the largest singular value <math>\lambda_1 </math> must be larger than <math> \frac{1}{\gamma}</math>.<br />
<br />
=== From a dynamical systems perspective ===<br />
<br />
The authors take a dynamical systems perspective similar to <ref name="yoshua1993"></ref><ref name="doya1993">Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on neural networks, 1:75–80, 1993.</ref>. Dynamical systems theory states that as “<math>\theta</math> changes the asymptotic behaviour changes smoothly almost everywhere except for certain crucial points where drastic changes occur” <ref name="pascanu"></ref>. These points are called bifurcation points, and crossing them has the potential to cause gradients to explode <ref name="doya1993"></ref>.<br />
<br />
The authors of this paper argue, however, that crossing these bifurcation points does not guarantee a sudden change in gradients. Their idea is that a change to the model parameters can alter the attractor landscape in such a way that the basin of attraction corresponding to the current model state is unaltered. For example, a change to the model parameters might eliminate a basin of attraction in a portion of the model's state space that is very far from its current state. In this case, the bifurcation will have no effect on the asymptotic behaviour of the model, and there will accordingly be no gradient explosion. On the other hand, if a change to the model parameters substantially alters the final basin of attraction given the current state, then there will be a considerable effect on the asymptotic behaviour of the model, and the gradients will explode accordingly. <br />
<br />
[[Image:dynamic_perspective.png|frame|center|400px|alt=|Bifurcation diagram of Single Hidden Unit RNN <ref name="pascanu"></ref>]]<br />
<br />
The figure above depicts a bifurcation diagram for a single-unit RNN, where the x-axis is the parameter <math>b</math> (bias) and the y-axis is the asymptotic state <math>x_{\infty}</math> (i.e. the equilibrium activation value of the unit), and the plot line is the movement of the final point attractor <math>x_{\infty}</math> as <math>b</math> changes. What this figure represents is the presence of two attractors, one emerging from <math>b_1</math> and another disappearing at <math>b_2</math>, as the value of <math>b</math> is decreased. Note that only one attractor exists when the value of <math>b</math> is outside of the interval between <math>b_1</math> and <math>b_2</math>, and that when two attractors exist, the attractor state towards which the unit ultimately gravitates is determined by its initial starting state. The boundary between these two basins of attraction is denoted with the dashed line; starting states on opposite sides of this boundary will gravitate towards different attractor states. The blue filled circles indicate bifurcation points at which a small change to the value of <math>b</math> can have a drastic effect on the attractor landscape over the unit's state space. In short, the landscape shifts to include a single attractor state for a low value of <math>x</math>. The unfilled green circles represent Pascanu’s (2013) extension of Doya’s hypothesis, where if the model is in the boundary range at time <math>0</math>, a small change in <math>b</math> would result in a sudden large change in <math>x_{t}</math>.<br />
<br />
Overall, these remarks indicate that, when treated as a dynamical system, the behaviour of an RNN can be analyzed with respect to both changes to the parameter values that determine an attractor landscape over its state space (assuming a fixed starting state) and changes to the starting state (assuming a fixed attractor landscape).<br />
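<br />
The dependence of the asymptotic state on both the bias and the starting state can be reproduced with a single-unit model. The sketch below assumes the update <math>x_{t} = w \sigma(x_{t-1}) + b</math> with a sigmoid <math>\sigma</math>; the weight <math>w = 5</math> and the two bias values are illustrative choices for this example and are not read off the figure.<br />

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def asymptotic_state(w, b, x0, steps=500):
    """Iterate x_t = w * sigmoid(x_{t-1}) + b and return the state after `steps` updates."""
    x = x0
    for _ in range(steps):
        x = w * sigmoid(x) + b
    return x

w = 5.0
results = {}
for b in (-2.5, 0.0):
    low = asymptotic_state(w, b, x0=-1.0)    # start below the basin boundary
    high = asymptotic_state(w, b, x0=1.0)    # start above the basin boundary
    results[b] = (low, high)
    print(b, low, high)
```

For the first bias value the two starting states settle on different attractors, while for the second both converge to the same one, which is the distinction between the two-attractor and single-attractor regimes described above.<br />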
<br />
=== From a geometric perspective ===<br />
<br />
Aside from the dynamical systems perspective on exploding and vanishing gradients, the authors also considered a geometric perspective, using a simple RNN with one hidden unit.<br />
<br />
[[Image:geometric_perspective.png|frame|center|400px|alt=|Error Loss surface of single hidden unit RNN <ref name="pascanu"></ref>]]<br />
<br />
Reusing the RNN state equation the authors defined:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
Assuming no input, <math>b = 0</math>, an initial state <math>x_{0}</math> and a linear activation (<math>\sigma</math> is the identity), the equation simplifies to:<br />
<br />
<math>x_{t} = W_{rec}^{t} x_{0}</math><br />
<br />
Differentiating the above equation with respect to <math>W_{rec}</math> to the first and second order gives:<br />
<br />
<math>\frac{\delta x_{t}}{\delta W_{rec}} = t W_{rec}^{t - 1} x_{0}</math><br />
<br />
<math>\frac{\delta^{2} x_{t}}{\delta W_{rec}^{2}} = t (t - 1) W_{rec}^{t - 2} x_{0}</math><br />
<br />
This implies that if the first-order derivative explodes, then so does the second-order derivative. In the general case, when the gradients explode they do so along some direction '''v'''. If the bound derived in the mechanics section above is tight, it is hypothesized that ''when gradients explode so does the curvature along'' '''v''''', leading to a wall in the error surface'', like the one seen above. If both the gradient and the leading eigenvector of the curvature are aligned with the exploding direction '''v''', it follows that the error surface has a steep wall perpendicular to '''v''' (and consequently to the gradient). This means that when Stochastic Gradient Descent (SGD) reaches this wall in the error surface and attempts to step into it, the update is deflected away, which can hinder the learning process (see figure above). Note that clipping the gradient as a remedy assumes that the valley bordered by the steep cliff in the loss function is wide enough relative to the clipped step size; otherwise, the deflection caused by an SGD update would still hinder learning despite clipping. The practical effectiveness of clipping provides some evidence in support of this assumption.<br />
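<br />
The one-unit example can be evaluated directly to see how the state, the gradient and the curvature grow together. This is a small sketch of the scalar case only; the function name and the values <math>w = 1.1</math>, <math>w = 0.9</math> and <math>t = 50</math> are illustrative assumptions.<br />

```python
def state_and_derivatives(w, x0, t):
    """Return x_t = w^t x0 together with its first and second derivatives in w."""
    x_t = w ** t * x0
    first = t * w ** (t - 1) * x0
    second = t * (t - 1) * w ** (t - 2) * x0
    return x_t, first, second

t, x0 = 50, 1.0
exploding = state_and_derivatives(1.1, x0, t)   # recurrent weight above 1
vanishing = state_and_derivatives(0.9, x0, t)   # recurrent weight below 1
print(exploding)
print(vanishing)
```

For a weight above 1 the second derivative is larger than the first by a factor of <math>(t - 1) / w</math>, which is the scalar version of the gradient and the curvature exploding together.<br />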
<br />
= Dealing with Exploding and Vanishing Gradient =<br />
<br />
== Related Work ==<br />
<br />
* <span>'''L1/L2 Regularization''': Helps with exploding gradients, but limits the RNN to a single point attractor at the origin and prevents the model from learning generator models or exhibiting long-term memory traces.</span><br />
* <span>'''Teacher Forcing''': Proposed by <ref name="doya1993"></ref>, it helps the model cross a bifurcation boundary if the model does not exhibit asymptotic behaviour towards a desired target. This assumes the user knows what the behaviour might look like or how to initialize the model to reduce exploding gradients.</span><br />
* <span>'''LSTM''': The Long Short-Term Memory architecture by <ref name="graves2009">Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009.</ref><ref name="Hochreiter">Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.</ref> is an attempt to deal with the vanishing gradient problem by using a linear unit that feeds back to itself with a weight of <math>1</math>. This solution, however, does not deal with the exploding gradient problem.</span><br />
* <span>'''Hessian-Free optimizer with structural damping''': Proposed by <ref>Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.</ref>, it addresses both the vanishing and exploding gradient problems. <ref name="pascanu"></ref> reasons that this approach handles the vanishing gradient problem because the high dimensionality of the space gives rise to a high probability that the long-term components are orthogonal to the short-term components. For exploding gradients, the curvature is additionally taken into account.</span><br />
* <span>'''Echo State Networks''': <ref>Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.</ref> avoid the exploding and vanishing gradient problems by not learning the input and recurrent weights. These weights are instead sampled from hand-crafted distributions that prevent information from getting lost, and the spectral radius of the recurrent weight matrix is usually smaller than 1.</span><br />
<br />
== Scaling Down the Gradients ==<br />
<br />
[[Image:gradient_clipping.png|frame|center|400px]]<br />
<br />
The intuition behind the gradient clipping algorithm is simple: compute the norm of the gradients and, if it is larger than a set threshold, scale the gradients by a constant equal to the threshold divided by the norm of the gradients. <ref name="pascanu"></ref> suggests using a threshold value from half to ten times the average gradient norm.<br />
<br />
== Vanishing Gradient Regularization ==<br />
<br />
The vanishing gradient regularizer is as follows:<br />
<br />
<math>\Omega <br />
= \sum_{k} \Omega_{k} <br />
= \sum_{k} <br />
\left( <br />
\frac{<br />
\| <br />
\frac{\delta \varepsilon}{\delta x_{k + 1}} <br />
\frac{\delta x_{k + 1}}{\delta x_{k}}<br />
\|<br />
}<br />
{<br />
\|<br />
\frac{\delta \varepsilon}{\delta x_{k + 1}}<br />
\| <br />
} - 1<br />
\right)^{2}</math><br />
<br />
The vanishing gradient problem arises when, at time <math>t</math>, the inputs <math>u</math> are irrelevant or noisy and the network learns to ignore them. However, this is not desirable, as the model will end up not learning anything. The authors found that the sensitivity to all inputs <math>u_{t} \dots u_{k}</math> could be increased by increasing the norm of <math>\frac{\delta x_t}{\delta x_{k}}</math>. This implies that in order to increase the norm of <math>\frac{\delta x_t}{\delta x_{k}}</math> the error must remain large, which would prevent the model from converging; the authors therefore argue that a regularizer is a more natural choice. The regularizer is a soft constraint that forces the Jacobian matrices <math>\frac{\delta x_{k + 1}}{\delta x_{k}}</math> to preserve norm in the direction of the error <math>\frac{\delta \varepsilon}{\delta x_{k + 1}}</math>.<br />
<br />
= Experimental Results =<br />
<br />
== The Temporal Order Problem ==<br />
<br />
The authors used the temporal order problem as the prototypical pathological problem for validating the proposed clipping and regularization. The problem involves a long sequence of discrete symbols in which an <math>A</math> or a <math>B</math> symbol is placed at two positions, one near the beginning and one near the middle of the sequence. The task is to classify, at the end of the sequence, the order in which the <math>A</math> and <math>B</math> symbols appeared.<br />
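<br />
A data generator for a task of this kind might look as follows. This is a hypothetical sketch: the distractor alphabet, the position ranges for the two marked symbols, and the label encoding are assumptions made for illustration and are not taken from the summary above. It also assumes a sequence length of at least 10.<br />

```python
import random

def make_temporal_order_example(length, rng, distractors="cdef"):
    """Build one sequence with two A/B symbols and return (sequence, class label).

    Assumed layout: the first marked symbol falls in the second tenth of the
    sequence and the second one in the sixth tenth; all other positions hold
    distractor symbols. Labels 0..3 encode the orders AA, AB, BA, BB.
    """
    seq = [rng.choice(distractors) for _ in range(length)]
    first_pos = rng.randrange(length // 10, 2 * length // 10)
    second_pos = rng.randrange(5 * length // 10, 6 * length // 10)
    first, second = rng.choice("AB"), rng.choice("AB")
    seq[first_pos], seq[second_pos] = first, second
    label = "AB".index(first) * 2 + "AB".index(second)
    return "".join(seq), label

rng = random.Random(0)
sequence, label = make_temporal_order_example(100, rng)
print(sequence, label)
```

Increasing <code>length</code> widens the gap between the marked symbols and the end of the sequence, which is what makes the task depend on long memory traces.<br />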
<br />
Three different RNN initializations were used for the experiment:<br />
<br />
* <span>'''sigmoid unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
* <span>'''basic tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.1)</math></span><br />
* <span>'''smart tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
<br />
For each of the three RNNs, three different optimizer configurations were used:<br />
<br />
* <span>'''MSGD''': Mini-batch Stochastic Gradient Descent</span><br />
* <span>'''MSGD-C''': MSGD with Gradient Clipping</span><br />
* <span>'''MSGD-CR''': MSGD-C with Regularization</span><br />
<br />
Additional model parameters include:<br />
<br />
* <span>Spectral Radius of 0.95</span><br />
* <span><math>b = b_{out} = 0</math></span><br />
* <span>50 hidden neurons</span><br />
* <span>constant learning rate of 0.01</span><br />
* <span>clipping threshold of 1 (only for MSGD-C and MSGD-CR)</span><br />
* <span>regularization weight of 4 (MSGD-CR)</span><br />
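<br />
One common way to obtain a recurrent weight matrix with a prescribed spectral radius is to rescale a random matrix by its largest eigenvalue magnitude. The sketch below shows this for the values listed above (50 hidden units, spectral radius 0.95). Whether the authors used this exact procedure is not stated here, and reading the second argument of <math>\mathcal{N}(0, 0.1)</math> as a standard deviation is an assumption of this example.<br />

```python
import numpy as np

def scale_to_spectral_radius(W, radius):
    """Rescale W so that its largest eigenvalue magnitude equals `radius`."""
    current = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (radius / current)

rng = np.random.default_rng(0)
W_init = rng.normal(0.0, 0.1, size=(50, 50))   # 50 hidden units, assumed std of 0.1
W_rec = scale_to_spectral_radius(W_init, 0.95)
rho = np.max(np.abs(np.linalg.eigvals(W_rec)))
print(rho)
```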
<br />
The experiment was repeated 5 times, and the figure below shows the importance of gradient clipping and the regularizer. In all cases, the combination of the two methods yielded the best results regardless of which unit network was used. Furthermore, this experiment provided empirical evidence that exploding gradients correlate with tasks that require long memory traces: as the sequence length of the problem increases, clipping and regularization become more important. This is largely because longer memory requires a larger spectral radius, which in turn increases the likelihood of gradient explosion.<br />
<br />
[[Image:experimental_results.png|frame|center|400px]]<br />
<br />
== Other Pathological Problems ==<br />
<br />
The authors repeated other pathological problems from <ref name="Hochreiter"></ref>. The results are listed below:<br />
<br />
[[Image:experimental_results_2.png|image]]<br />
<br />
The authors did not discuss the experimental results in detail here.<br />
<br />
= Summary =<br />
<br />
The paper explains the exploding and vanishing gradient problems in training RNNs from two perspectives: dynamical systems and geometry. The authors mitigate these problems with gradient clipping and a vanishing gradient regularizer, and their experimental results showed that in all cases except for the Penn Treebank dataset, clipping and the regularizer improved the results of the RNNs in their respective experiments.</div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_the_difficulty_of_training_recurrent_neural_networks&diff=27073on the difficulty of training recurrent neural networks2015-12-04T04:22:51Z<p>Rtwang: /* From a geometric perspective */</p>
<hr />
<div>= Introduction =<br />
<br />
Training a Recurrent Neural Network (RNN) is difficult, and two of the most prominent problems have been vanishing and exploding gradients <ref name="yoshua1993">Yoshua Bengio, Paolo Frasconi, and Patrice Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pages 1183–1188. IEEE, 1993.</ref>, which prevent neural networks from learning and fitting data with long-term dependencies. In this paper the authors propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem.<br />
<br />
= Background =<br />
<br />
[[Image:rnn_2.png|frame|center|400px|alt=| Recurrent Neural Network Unrolled in time <ref name="pascanu">Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.</ref>]]<br />
A generic recurrent neural network has the form:<br />
<br />
<math>x_{t} = F(\mathbf{x}_{t -1}, \mathbf{u}_{t}, \theta)</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{x}_{t}</math> is the state at time <math>t</math></span><br />
* <span><math>\mathbf{u}_{t}</math> is the input at time <math>t</math></span><br />
* <span><math>\theta\,</math> are the parameters</span><br />
* <span><math>F()\,</math> is the function that represents a neuron</span><br />
<br />
In the theoretical sections the authors made use of a specific parameterization:<br />
<br />
<math>\mathbf{x}_{t} = \mathbf{W}_{rec} \sigma(\mathbf{x}_{t - 1}) + \mathbf{W}_{in} \mathbf{u}_{t} + b</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{W}_{rec}</math> is the RNN weight matrix</span><br />
* <span><math>\sigma()\,</math> is an element wise function</span><br />
* <span><math>b\,</math> is the bias</span><br />
* <span><math>\mathbf{W}_{in}</math> is the input weights matrix</span><br />
<br />
The following are gradient equations for using the Back Propagation Through Time (BPTT) algorithm. The authors rewrote the equations in order to highlight the exploding gradients problem:<br />
<br />
<math>\frac{\partial \varepsilon}{\partial \theta} = <br />
\sum_{1 \leq t \leq T} \frac{\partial \varepsilon_t}{\partial \theta}</math><br />
<br />
<math>\frac{\partial \varepsilon_{t}}{\partial \theta} = <br />
\sum_{1 \leq k \leq t} <br />
\left(<br />
\frac{\partial \varepsilon_{t}}{\partial x_{t}}<br />
\frac{\partial x_{t}}{\partial x_{k}}<br />
\frac{\partial^{+} x_{k}}{\partial \theta}<br />
\right)</math><br />
<br />
<math>\frac{\partial x_{t}}{\partial x_{k}} =<br />
\prod_{t \geq i > k} \frac{\partial x_{i}}{\partial x_{i - 1}} =<br />
\prod_{t \geq i > k} <br />
\mathbf{W}^{T}_{rec} \textit{diag}(\sigma^{\prime}(x_{i - 1}))</math><br />
<br />
Where:<br />
<br />
* <span><math>\varepsilon_{t}</math> is the error obtained from output at time <math>t</math></span><br />
* <span><math>\frac{\partial^{+} \mathbf{x}_{k}}{\partial \theta}</math> is the immediate partial derivative of state <math>\mathbf{x}_{k}</math></span>. For the parameterization above, <math>\frac{\partial^+ \mathbf{x}_k}{\partial \mathbf{W}_{rec}} = \sigma(\mathbf{x}_{k-1})</math>.<br />
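<br />
The factorisation of <math>\frac{\partial x_{t}}{\partial x_{k}}</math> into one-step Jacobians can be checked against finite differences. The sketch below assumes a <code>tanh</code> network with no input and zero bias, and it writes each one-step Jacobian in column-vector layout as <math>\mathbf{W}_{rec} \, diag(\sigma^{\prime}(x_{i - 1}))</math>, which may differ from the expression above by a transpose or layout convention. The sizes, seed and step count are illustrative.<br />

```python
import numpy as np

def run(W, x, steps):
    """Apply x <- W tanh(x) for the given number of steps."""
    for _ in range(steps):
        x = W @ np.tanh(x)
    return x

rng = np.random.default_rng(1)
n, n_steps = 4, 6                       # n_steps plays the role of t - k
W = rng.standard_normal((n, n)) * 0.5
x_k = rng.standard_normal(n)

# Analytic Jacobian: product of one-step Jacobians, latest step on the left.
jac = np.eye(n)
x = x_k
for _ in range(n_steps):
    step = W @ np.diag(1.0 - np.tanh(x) ** 2)
    jac = step @ jac
    x = W @ np.tanh(x)

# Numerical Jacobian by central finite differences.
eps = 1e-6
numeric = np.zeros((n, n))
for j in range(n):
    d = np.zeros(n)
    d[j] = eps
    numeric[:, j] = (run(W, x_k + d, n_steps) - run(W, x_k - d, n_steps)) / (2 * eps)

max_error = np.max(np.abs(jac - numeric))
print(max_error)
```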
<br />
The authors of this paper also distinguish between ''long-term'' and ''short-term'' contributions to the gradient with respect to <math>\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k}</math>. The contribution is ''long-term'' if <math>k \ll t</math>, and ''short-term'' otherwise.<br />
<br />
== Exploding and Vanishing Gradients ==<br />
=== The Mechanics ===<br />
<br />
It is known that <math> |\sigma^{\prime}(x)| </math> is bounded. Let <math>\left|\left|diag(\sigma^{\prime}(x_k))\right|\right| \leq \gamma \in \mathbb{R}</math>.<br />
<br />
The paper first proves that <math> \lambda_1 < \frac{1}{\gamma} </math>, where <math> \lambda_1 </math> is the largest singular value of <math> \bold{W}_{rec} </math>, is a sufficient condition for the vanishing gradient problem to occur. The Jacobian matrix <math> \frac{\partial x_{k+1}}{\partial x_k} </math> is given by <math> \bold{ W}_{rec}^{T}diag(\sigma^{\prime}(x_k)) </math>. The 2-norm of this Jacobian is bounded by the product of the norms of the two matrices, which leads to <math> \forall k, ||\frac{\partial{x_{k+1}}}{\partial x_k}|| \leq ||\bold{W}_{rec}^T||\,||diag(\sigma^{\prime}(x_k))|| < \frac{1}{\gamma}\gamma = 1</math>.<br />
<br />
Let <math>\eta \in \mathbb{R}</math> be such that <math>\forall k, ||\frac{\partial {x_{k+1}}}{\partial x_k}|| \leq \eta < 1</math>. By induction over <math>i</math>, we can show that <math>||\frac{\partial \varepsilon_t}{\partial x_t}(\prod_{i=k}^{t-1}{\frac{\partial x_{i+1}}{\partial x_i}})|| \leq \eta^{t-k}||\frac{\partial \varepsilon_t}{\partial x_t}||</math>. Since <math> \eta < 1 </math>, the long-term contributions to the gradient go to 0 exponentially fast as <math> t-k </math> grows.<br />
<br />
By inverting this proof, we obtain a necessary condition for exploding gradients: the largest singular value <math>\lambda_1 </math> must be larger than <math> \frac{1}{\gamma}</math>.<br />
<br />
=== From a dynamical systems perspective ===<br />
<br />
The authors take a dynamical systems perspective similar to <ref name="yoshua1993"></ref><ref name="doya1993">Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on neural networks, 1:75–80, 1993.</ref>. Dynamical systems theory states that as “<math>\theta</math> changes the asymptotic behaviour changes smoothly almost everywhere except for certain crucial points where drastic changes occur” <ref name="pascanu"></ref>. These points are called bifurcation points, and crossing them has the potential to cause gradients to explode <ref name="doya1993"></ref>.<br />
<br />
The authors of this paper argue, however, that crossing these bifurcation points does not guarantee a sudden change in gradients. Their idea is that a change to the model parameters can alter the attractor landscape in such a way that the basin of attraction corresponding to the current model state is unaltered. For example, a change to the model parameters might eliminate a basin of attraction in a portion of the model's state space that is very far from its current state. In this case, the bifurcation will have no effect on the asymptotic behaviour of the model, and there will accordingly be no gradient explosion. On the other hand, if a change to the model parameters substantially alters the final basin of attraction given the current state, then there will be a considerable effect on the asymptotic behaviour of the model, and the gradients will explode accordingly. <br />
<br />
[[Image:dynamic_perspective.png|frame|center|400px|alt=|Bifurcation diagram of Single Hidden Unit RNN <ref name="pascanu"></ref>]]<br />
<br />
The figure above depicts a bifurcation diagram for a single-unit RNN, where the x-axis is the parameter <math>b</math> (bias) and the y-axis is the asymptotic state <math>x_{\infty}</math> (i.e. the equilibrium activation value of the unit), and the plot line is the movement of the final point attractor <math>x_{\infty}</math> as <math>b</math> changes. What this figure represents is the presence of two attractors, one emerging from <math>b_1</math> and another disappearing at <math>b_2</math>, as the value of <math>b</math> is decreased. Note that only one attractor exists when the value of <math>b</math> is outside of the interval between <math>b_1</math> and <math>b_2</math>, and that when two attractors exist, the attractor state towards which the unit ultimately gravitates is determined by its initial starting state. The boundary between these two basins of attraction is denoted with the dashed line; starting states on opposite sides of this boundary will gravitate towards different attractor states. The blue filled circles indicate bifurcation points at which a small change to the value of <math>b</math> can have a drastic effect on the attractor landscape over the unit's state space. In short, the landscape shifts to include a single attractor state for a low value of <math>x</math>. The unfilled green circles represent Pascanu’s (2013) extension of Doya’s hypothesis, where if the model is in the boundary range at time <math>0</math>, a small change in <math>b</math> would result in a sudden large change in <math>x_{t}</math>.<br />
<br />
Overall, these remarks indicate that, when treated as a dynamical system, the behaviour of an RNN can be analyzed with respect to both changes to the parameter values that determine an attractor landscape over its state space (assuming a fixed starting state) and changes to the starting state (assuming a fixed attractor landscape).<br />
<br />
=== From a geometric perspective ===<br />
<br />
Aside from the dynamical systems perspective on exploding and vanishing gradients, the authors also considered a geometric perspective, using a simple RNN with one hidden unit.<br />
<br />
[[Image:geometric_perspective.png|frame|center|400px|alt=|Error Loss surface of single hidden unit RNN <ref name="pascanu"></ref>]]<br />
<br />
Reusing the RNN state equation the authors defined:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
Assuming no input, <math>b = 0</math>, an initial state <math>x_{0}</math> and a linear activation (<math>\sigma</math> is the identity), the equation simplifies to:<br />
<br />
<math>x_{t} = W_{rec}^{t} x_{0}</math><br />
<br />
Differentiating the above equation with respect to <math>W_{rec}</math> to the first and second order gives:<br />
<br />
<math>\frac{\delta x_{t}}{\delta W_{rec}} = t W_{rec}^{t - 1} x_{0}</math><br />
<br />
<math>\frac{\delta^{2} x_{t}}{\delta W_{rec}^{2}} = t (t - 1) W_{rec}^{t - 2} x_{0}</math><br />
<br />
This implies that if the first-order derivative explodes, then so does the second-order derivative. In the general case, when the gradients explode they do so along some direction '''v'''. If the bound derived in the mechanics section above is tight, it is hypothesized that ''when gradients explode so does the curvature along'' '''v''''', leading to a wall in the error surface'', like the one seen above. If both the gradient and the leading eigenvector of the curvature are aligned with the exploding direction '''v''', it follows that the error surface has a steep wall perpendicular to '''v''' (and consequently to the gradient). This means that when Stochastic Gradient Descent (SGD) reaches this wall in the error surface and attempts to step into it, the update is deflected away, which can hinder the learning process (see figure above). Note that clipping the gradient as a remedy assumes that the valley bordered by the steep cliff in the loss function is wide enough relative to the clipped step size; otherwise, the deflection caused by an SGD update would still hinder learning despite clipping. The practical effectiveness of clipping provides some evidence in support of this assumption.<br />
<br />
= Dealing with Exploding and Vanishing Gradient =<br />
<br />
== Related Work ==<br />
<br />
* <span>'''L1/L2 Regularization''': Helps with exploding gradients, but limits the RNN to a single point attractor at the origin and prevents the model from learning generator models or exhibiting long-term memory traces.</span><br />
* <span>'''Teacher Forcing''': Proposed by <ref name="doya1993"></ref>, this technique helps the model cross a bifurcation boundary when it does not exhibit asymptotic behaviour towards a desired target. It assumes the user knows what that behaviour should look like, or how to initialize the model so as to reduce exploding gradients.</span><br />
* <span>'''LSTM''': The Long Short-Term Memory architecture by <ref name="graves2009">Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009.</ref><ref name="Hochreiter">Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.</ref> is an attempt to deal with the vanishing gradient problem by using a linear unit that feeds back into itself with a weight of <math>1</math>. This solution, however, does not deal with the exploding gradient problem.</span><br />
* <span>'''Hessian-Free optimizer with structural damping''': Proposed by <ref>Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011</ref>, this approach addresses both the vanishing and the exploding gradient problem. <ref name="pascanu"></ref> reasons that it handles vanishing gradients because the high dimensionality of the space gives a high probability that the long-term components are orthogonal to the short-term components, and that it handles exploding gradients because the curvature is taken into account.</span><br />
* <span>'''Echo State Networks''': <ref>Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.</ref> avoid the exploding and vanishing gradient problems by not learning the input and recurrent weights at all. These weights are instead sampled from hand-crafted distributions intended to keep information from being lost, with the spectral radius of the recurrent weight matrix usually smaller than 1.</span><br />
<br />
== Scaling Down the Gradients ==<br />
<br />
[[Image:gradient_clipping.png|frame|center|400px]]<br />
<br />
The intuition behind the gradient clipping algorithm is simple: compute the norm of the gradient and, if it is larger than a set threshold, scale the gradient by the threshold divided by that norm. <ref name="pascanu"></ref> suggests choosing a threshold between half and ten times the average gradient norm observed during training.<br />
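<br />
The rule described above can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions, not the authors' implementation: the function name <code>clip_gradient_norm</code> and the example gradients are hypothetical, and the norm is taken over all parameter arrays jointly.<br />
<br />
```python
import numpy as np


def clip_gradient_norm(grads, threshold):
    """Rescale a list of gradient arrays so that their joint L2 norm
    does not exceed `threshold`; the direction is left unchanged."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm


# Hypothetical gradients with joint norm sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_gradient_norm(grads, threshold=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, norm_after)
```
<br />
Gradients whose norm is already below the threshold pass through untouched, so clipping only intervenes when an explosion occurs.<br />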
<br />
== Vanishing Gradient Regularization ==<br />
<br />
The vanishing gradient regularizer is as follows:<br />
<br />
<math>\Omega 
= \sum_{k} \Omega_{k} 
= \sum_{k} 
\left( 
\frac{
\left\| 
\frac{\partial \varepsilon}{\partial x_{k + 1}} 
\frac{\partial x_{k + 1}}{\partial x_{k}}
\right\|
}
{
\left\|
\frac{\partial \varepsilon}{\partial x_{k + 1}}
\right\| 
} - 1
\right)^{2}</math><br />
<br />
The authors observe that increasing the norm of <math>\frac{\partial x_t}{\partial x_{k}}</math> makes the error at time <math>t</math> more sensitive to all inputs <math>u_{k}, \dots, u_{t}</math>. Some of these inputs may be irrelevant and noisy, and the network has to learn to ignore them, which it can only do if an error signal is present. Increasing the norm of <math>\frac{\partial x_t}{\partial x_{k}}</math> may therefore come at the expense of a temporarily larger error, so it cannot always be achieved by following a descent direction of the error; the authors argue that a regularizer is a more natural choice. The regularizer is a soft constraint that encourages the Jacobian matrices <math>\frac{\partial x_{k + 1}}{\partial x_{k}}</math> to preserve norm in the direction of the error <math>\frac{\partial \varepsilon}{\partial x_{k + 1}}</math>.<br />
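<br />
To make the formula concrete, the following sketch evaluates the value of <math>\Omega</math> from precomputed quantities. It assumes the error gradients and Jacobians are already available as NumPy arrays; the function name and the test inputs are hypothetical, and the sketch does not show how the authors differentiate the regularizer during training.<br />
<br />
```python
import numpy as np


def vanishing_gradient_regularizer(error_grads, jacobians):
    """Sum over k of (||g_{k+1} J_k|| / ||g_{k+1}|| - 1)^2, where
    g_{k+1} is the error gradient w.r.t. x_{k+1} (a row vector) and
    J_k is the Jacobian of x_{k+1} with respect to x_k."""
    omega = 0.0
    for g, jac in zip(error_grads, jacobians):
        ratio = np.linalg.norm(g @ jac) / np.linalg.norm(g)
        omega += (ratio - 1.0) ** 2
    return omega


rng = np.random.default_rng(0)
error_grads = [rng.standard_normal(4) for _ in range(3)]

# Norm-preserving Jacobians incur no penalty ...
print(vanishing_gradient_regularizer(error_grads, [np.eye(4)] * 3))
# ... while Jacobians that halve the norm cost 0.25 per time step.
print(vanishing_gradient_regularizer(error_grads, [0.5 * np.eye(4)] * 3))
```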
<br />
= Experimental Results =<br />
<br />
== The Temporal Order Problem ==<br />
<br />
The authors used the temporal order problem as the prototypical pathological problem for validating the proposed clipping and regularizer. The problem involves a long sequence of discrete symbols in which an <math>A</math> or a <math>B</math> symbol is placed at two positions, one near the beginning and one near the middle of the sequence. The task is to classify, at the end of the sequence, the order in which these symbols appeared.<br />
<br />
Three different RNN initializations were used for the experiment:<br />
<br />
* <span>'''sigmoid unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
* <span>'''basic tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.1)</math></span><br />
* <span>'''smart tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
<br />
For each of the three networks, three different optimizer configurations were used:<br />
<br />
* <span>'''MSGD''': Mini-batch Stochastic Gradient Descent</span><br />
* <span>'''MSGD-C''': MSGD with Gradient Clipping</span><br />
* <span>'''MSGD-CR''': MSGD-C with Regularization</span><br />
<br />
Additional model parameters include:<br />
<br />
* <span>Spectral Radius of 0.95</span><br />
* <span><math>b = b_{out} = 0</math></span><br />
* <span>50 hidden neurons</span><br />
* <span>constant learning rate of 0.01</span><br />
* <span>clipping threshold of 1 (only for MSGD-C and MSGD-CR)</span><br />
* <span>regularization weight of 4 (MSGD-CR)</span><br />
<br />
The experiment was repeated 5 times. The figure below shows the importance of gradient clipping and the regularizer: in all cases the combination of the two methods yielded the best results, regardless of which network was used. The experiment also provides empirical evidence that exploding gradients are associated with tasks that require long memory traces, since clipping and regularization become more important as the sequence length increases. This is largely because longer memory requires a larger spectral radius, which in turn increases the likelihood of gradient explosion.<br />
<br />
[[Image:experimental_results.png|frame|center|400px]]<br />
<br />
== Other Pathological Problems ==<br />
<br />
The authors repeated other pathological problems from <ref name="Hochreiter"></ref>. The results are listed below:<br />
<br />
[[Image:experimental_results_2.png|image]]<br />
<br />
The authors did not discuss the experimental results in detail here.<br />
<br />
= Summary =<br />
<br />
The paper explains the exploding and vanishing gradient problems in training RNNs from two perspectives, the dynamical systems approach and the geometric approach. The authors devised methods to mitigate these problems by introducing gradient clipping and a vanishing gradient regularizer. Their experimental results show that, in all cases except the Penn Treebank dataset, clipping combined with the regularizer outperformed the state of the art for RNNs in the respective experiments.</div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_the_difficulty_of_training_recurrent_neural_networks&diff=27072on the difficulty of training recurrent neural networks2015-12-04T04:21:00Z<p>Rtwang: /* From a dynamical systems perspective */</p>
<hr />
<div>= Introduction =<br />
<br />
Training a recurrent neural network (RNN) is difficult, and two of the most prominent problems are vanishing and exploding gradients <ref name="yoshua1993">Yoshua Bengio, Paolo Frasconi, and Patrice Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pages 1183–1188. IEEE, 1993.</ref>, which prevent neural networks from learning and fitting data with long-term dependencies. In this paper the authors propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem.<br />
<br />
= Background =<br />
<br />
[[Image:rnn_2.png|frame|center|400px|alt=| Recurrent neural network unrolled in time <ref name="pascanu">Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.</ref>]]<br />
A generic recurrent neural network has the form:<br />
<br />
<math>\mathbf{x}_{t} = F(\mathbf{x}_{t -1}, \mathbf{u}_{t}, \theta)</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{x}_{t}</math> is the state at time <math>t</math></span><br />
* <span><math>\mathbf{u}_{t}</math> is the input at time <math>t</math></span><br />
* <span><math>\theta\,</math> are the parameters</span><br />
* <span><math>F()\,</math> is the function that represents a neuron</span><br />
<br />
In the theoretical sections the authors make use of a specific parameterization:<br />
<br />
<math>\mathbf{x}_{t} = \mathbf{W}_{rec} \sigma(\mathbf{x}_{t - 1}) + \mathbf{W}_{in} \mathbf{u}_{t} + b</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{W}_{rec}</math> is the RNN weight matrix</span><br />
* <span><math>\sigma()\,</math> is an element-wise function</span><br />
* <span><math>b\,</math> is the bias</span><br />
* <span><math>\mathbf{W}_{in}</math> is the input weights matrix</span><br />
<br />
The following are gradient equations for using the Back Propagation Through Time (BPTT) algorithm. The authors rewrote the equations in order to highlight the exploding gradients problem:<br />
<br />
<math>\frac{\partial \varepsilon}{\partial \theta} = <br />
\sum_{1 \leq t \leq T} \frac{\partial \varepsilon_t}{\partial \theta}</math><br />
<br />
<math>\frac{\partial \varepsilon_{t}}{\partial \theta} = 
\sum_{1 \leq k \leq t} 
\left(
\frac{\partial \varepsilon_{t}}{\partial x_{t}}
\frac{\partial x_{t}}{\partial x_{k}}
\frac{\partial^{+} x_{k}}{\partial \theta}
\right)</math><br />
<br />
<math>\frac{\partial x_{t}}{\partial x_{k}} =
\prod_{t \geq i > k} \frac{\partial x_{i}}{\partial x_{i - 1}} =
\prod_{t \geq i > k} 
\mathbf{W}^{T}_{rec} \textit{diag}(\sigma^{\prime}(x_{i - 1}))</math><br />
<br />
Where:<br />
<br />
* <span><math>\varepsilon_{t}</math> is the error obtained from output at time <math>t</math></span><br />
* <span><math>\frac{\partial^{+} \mathbf{x}_{k}}{\partial \theta}</math> is the immediate partial derivative of state <math>\mathbf{x}_{k}</math></span>. For the parameterization above, <math>\frac{\partial^+ \mathbf{x}_k}{\partial \mathbf{W}_{rec}} = \sigma(\mathbf{x}_{k-1})</math>.<br />
<br />
The authors of this paper also distinguish between ''long-term'' and ''short-term'' contributions to the gradient with respect to <math>\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k}</math>. The contribution is ''long-term'' if <math>k \ll t</math>, and ''short-term'' otherwise.<br />
<br />
== Exploding and Vanishing Gradients ==<br />
=== The Mechanics ===<br />
<br />
It is known that <math> |\sigma^{\prime}(x)| </math> is bounded. Let <math>\left\|\textit{diag}(\sigma^{\prime}(x_k))\right\| \leq \gamma \in \mathbb{R}</math>.<br />
<br />
The paper first proves that <math> \lambda_1 < \frac{1}{\gamma} </math>, where <math> \lambda_1 </math> is the largest singular value of <math> \bold{W}_{rec} </math>, is a sufficient condition for the vanishing gradient problem to occur. The Jacobian matrix <math> \frac{\partial x_{k+1}}{\partial x_k} </math> is given by <math> \bold{W}_{rec}^{T} \textit{diag}(\sigma^{\prime}(x_k)) </math>, so its 2-norm is bounded by the product of the norms of the two matrices. This leads to <math> \forall k, \left\|\frac{\partial{x_{k+1}}}{\partial x_k}\right\| \leq \left\|\bold{W}_{rec}^T\right\| \left\|\textit{diag}(\sigma^{\prime}(x_k))\right\| < \frac{1}{\gamma}\gamma = 1</math>.<br />
<br />
Let <math>\eta \in \mathbb{R}</math> be such that <math>\forall k, \left\|\frac{\partial {x_{k+1}}}{\partial x_k}\right\| \leq \eta < 1</math>. By induction over <math>i</math>, we can show that <math>\left\|\frac{\partial \varepsilon_t}{\partial x_t}\left(\prod_{i=k}^{t-1}{\frac{\partial x_{i+1}}{\partial x_i}}\right)\right\| \leq \eta^{t-k}\left\|\frac{\partial \varepsilon_t}{\partial x_t}\right\|</math>. Since <math> \eta < 1 </math>, the long-term contributions to the gradient go to 0 exponentially fast as <math> t-k </math> grows.<br />
<br />
Inverting this argument gives a necessary condition for exploding gradients: the largest singular value <math>\lambda_1 </math> must be larger than <math> \frac{1}{\gamma}</math>, since otherwise the long-term components would vanish rather than explode.<br />
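<br />
The bound can be checked numerically. The sketch below is an illustration under stated assumptions rather than anything from the paper: it uses <code>tanh</code> units (so <math>\gamma = 1</math>), no input or bias, a hypothetical hidden size of 8, and a random <math>\mathbf{W}_{rec}</math> rescaled so that its largest singular value is 0.9. The Jacobian is written for row-vector gradients; since a matrix and its transpose share singular values, the bound is the same under either convention.<br />
<br />
```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 8, 30   # hypothetical hidden size and number of time steps
gamma = 1.0        # upper bound on |tanh'(x)|

# Rescale a random matrix so its largest singular value is 0.9 < 1/gamma.
W = rng.standard_normal((n, n))
W *= 0.9 / np.linalg.norm(W, 2)
eta = np.linalg.norm(W, 2) * gamma

# Forward pass of x_{k+1} = W tanh(x_k), storing every state x_k.
x = rng.standard_normal(n)
states = []
for _ in range(steps):
    states.append(x)
    x = W @ np.tanh(x)

# Backward pass: multiply a stand-in for dE_t/dx_t by successive Jacobians.
grad = rng.standard_normal(n)
g0 = np.linalg.norm(grad)
norms = []
for x_k in reversed(states):
    jac = W @ np.diag(1.0 - np.tanh(x_k) ** 2)   # d x_{k+1} / d x_k
    grad = grad @ jac
    norms.append(np.linalg.norm(grad))

for i in (0, 9, 29):
    print(f"t-k={i + 1:2d}  norm={norms[i]:.3e}  bound={eta ** (i + 1) * g0:.3e}")
```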
<br />
=== From a dynamical systems perspective ===<br />
<br />
Drawing on a dynamical systems perspective similar to <ref name="yoshua1993"></ref><ref name="doya1993">Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on neural networks, 1:75–80, 1993.</ref>, the authors note that as “<math>\theta</math> changes the asymptotic behaviour changes smoothly almost everywhere except for certain crucial points where drastic changes occur” <ref name="pascanu"></ref>. These crucial points are called bifurcation points, and crossing them has the potential to cause gradients to explode <ref name="doya1993"></ref>.<br />
<br />
The authors of this paper argue, however, that crossing these bifurcation points does not guarantee a sudden change in gradients. Their idea is that a change to the model parameters can alter the attractor landscape in such a way that the basin of attraction corresponding to the current model state is unaltered. For example, a change to the model parameters might eliminate a basin of attraction in a portion of the model's state space that is very far from its current state. In this case, the bifurcation will have no effect on the asymptotic behaviour of the model, and there will accordingly be no gradient explosion. On the other hand, if a change to the model parameters substantially alters the final basin of attraction given the current state, then there will be a considerable effect on the asymptotic behaviour of the model, and the gradients will explode accordingly. <br />
<br />
[[Image:dynamic_perspective.png|frame|center|400px|alt=|Bifurcation diagram of Single Hidden Unit RNN <ref name="pascanu"></ref>]]<br />
<br />
The figure above depicts a bifurcation diagram for a single-unit RNN, where the x-axis is the parameter <math>b</math> (bias), the y-axis is the asymptotic state <math>x_{\infty}</math> (i.e. the equilibrium activation value of the unit), and the plotted line traces the movement of the final point attractor <math>x_{\infty}</math> as <math>b</math> changes. The figure shows two attractors, one emerging at <math>b_1</math> and another disappearing at <math>b_2</math>, as the value of <math>b</math> is decreased. Note that only one attractor exists when the value of <math>b</math> is outside the interval between <math>b_1</math> and <math>b_2</math>, and that when two attractors exist, the attractor state towards which the unit ultimately gravitates is determined by its initial state. The boundary between these two basins of attraction is denoted by the dashed line: starting states on opposite sides of this boundary gravitate towards different attractor states. The blue filled circles indicate a bifurcation point at which a small change to the value of <math>b</math> can have a drastic effect on the attractor landscape over the unit's state space. In short, the landscape shifts to include a single attractor state for a low value of <math>x</math>. The unfilled green circles represent Pascanu’s (2013) extension of Doya’s hypothesis: if the model is in the boundary range at time <math>0</math>, a small change in <math>b</math> results in a sudden large change in <math>x_{t}</math>.<br />
<br />
Overall, these remarks indicate that, when treated as a dynamical system, the behaviour of an RNN can be analyzed with respect to both changes to the parameter values that determine an attractor landscape over its state space (assuming a fixed starting state), and with respect to changes to the starting state (assuming a fixed attractor landscape).<br />
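<br />
The dependence of the asymptotic state on both the bias and the starting state can be reproduced with a small simulation. The sketch below assumes a single sigmoid unit with a fixed recurrent weight of 5.0 and no input; the weight, the two bias values and the starting states are illustrative assumptions chosen so that one bias lies inside the two-attractor range and the other outside it.<br />
<br />
```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def asymptotic_state(b, x0, w=5.0, steps=500):
    """Iterate x <- w * sigmoid(x) + b from x0 and return the final state."""
    x = x0
    for _ in range(steps):
        x = w * sigmoid(x) + b
    return x


# Inside the two-attractor range: the final state depends on the start.
hi = asymptotic_state(b=-2.5, x0=1.0)
lo = asymptotic_state(b=-2.5, x0=-1.0)

# Outside that range: both starting states reach the same attractor.
same_a = asymptotic_state(b=0.0, x0=1.0)
same_b = asymptotic_state(b=0.0, x0=-1.0)

print(hi, lo, same_a, same_b)
```
<br />
With these assumed values, the two runs at <math>b = -2.5</math> settle on attractors of opposite sign, while the two runs at <math>b = 0</math> agree.<br />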
<br />
=== From a geometric perspective ===<br />
<br />
Aside from the dynamical systems perspective on exploding and vanishing gradients, the authors also considered a geometric perspective, using a simple RNN with a single hidden unit.<br />
<br />
[[Image:geometric_perspective.png|frame|center|400px|alt=|Error Loss surface of single hidden unit RNN <ref name="pascanu"></ref>]]<br />
<br />
Reusing the RNN state equation the authors defined:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
Assuming no input, a linear activation (<math>\sigma</math> is the identity), <math>b = 0</math> and initial state <math>x_{0}</math>, the equation simplifies to:<br />
<br />
<math>x_{t} = W_{rec}^{t} x_{0}</math><br />
<br />
Differentiating this equation once and twice with respect to <math>W_{rec}</math> (a scalar for a single hidden unit) gives:<br />
<br />
<math>\frac{\partial x_{t}}{\partial W_{rec}} = t W_{rec}^{t - 1} x_{0}</math><br />
<br />
<math>\frac{\partial^{2} x_{t}}{\partial W_{rec}^{2}} = t (t - 1) W_{rec}^{t - 2} x_{0}</math><br />
<br />
This implies that if the first derivative explodes, so does the second. In the general case, when gradients explode they do so along some direction '''v'''. If the bound on the gradient norm derived earlier is tight, the authors hypothesize that ''when gradients explode so does the curvature along'' '''v''''', leading to a wall in the error surface'', like the one seen in the figure above. If both the gradient and the leading eigenvector of the curvature are aligned with the exploding direction '''v''', it follows that the error surface has a steep wall perpendicular to '''v''' (and consequently to the gradient). When stochastic gradient descent (SGD) reaches this wall and attempts to step into it, the update is deflected away, possibly hindering the learning process. Note that gradient clipping, the remedy the authors propose, assumes that the valley bordered by the steep cliff is wide enough relative to the clipped step size; otherwise the deflection caused by an SGD update would still hinder learning, even when clipping is used. The practical effectiveness of clipping provides some evidence in support of this assumption.<br />
<br />
= Dealing with Exploding and Vanishing Gradient =<br />
<br />
== Related Work ==<br />
<br />
* <span>'''L1/L2 Regularization''': Helps with exploding gradients, but limits the RNN to a single point attractor at the origin and prevents the model from learning generator models or exhibiting long-term memory traces.</span><br />
* <span>'''Teacher Forcing''': Proposed by <ref name="doya1993"></ref>, this technique helps the model cross a bifurcation boundary when it does not exhibit asymptotic behaviour towards a desired target. It assumes the user knows what that behaviour should look like, or how to initialize the model so as to reduce exploding gradients.</span><br />
* <span>'''LSTM''': The Long Short-Term Memory architecture by <ref name="graves2009">Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009.</ref><ref name="Hochreiter">Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.</ref> is an attempt to deal with the vanishing gradient problem by using a linear unit that feeds back into itself with a weight of <math>1</math>. This solution, however, does not deal with the exploding gradient problem.</span><br />
* <span>'''Hessian-Free optimizer with structural damping''': Proposed by <ref>Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011</ref>, this approach addresses both the vanishing and the exploding gradient problem. <ref name="pascanu"></ref> reasons that it handles vanishing gradients because the high dimensionality of the space gives a high probability that the long-term components are orthogonal to the short-term components, and that it handles exploding gradients because the curvature is taken into account.</span><br />
* <span>'''Echo State Networks''': <ref>Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.</ref> avoid the exploding and vanishing gradient problems by not learning the input and recurrent weights at all. These weights are instead sampled from hand-crafted distributions intended to keep information from being lost, with the spectral radius of the recurrent weight matrix usually smaller than 1.</span><br />
<br />
== Scaling Down the Gradients ==<br />
<br />
[[Image:gradient_clipping.png|frame|center|400px]]<br />
<br />
The intuition behind the gradient clipping algorithm is simple: compute the norm of the gradient and, if it is larger than a set threshold, scale the gradient by the threshold divided by that norm. <ref name="pascanu"></ref> suggests choosing a threshold between half and ten times the average gradient norm observed during training.<br />
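<br />
Written as an update rule, with <math>\hat{g}</math> denoting the gradient of the error with respect to the parameters (the symbol <math>\hat{g}</math> and the numbers in the example are illustrative assumptions), the clipping step reads:<br />
<br />
<math>\hat{g} \leftarrow \frac{\partial \varepsilon}{\partial \theta}, \qquad \textit{if } \|\hat{g}\| \geq \textit{threshold}: \quad \hat{g} \leftarrow \frac{\textit{threshold}}{\|\hat{g}\|} \hat{g}</math><br />
<br />
For example, with a threshold of 1 and a gradient of norm 5, every component is multiplied by <math>\frac{1}{5}</math>, so the clipped gradient has norm 1 and points in the same direction as before.<br />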
<br />
== Vanishing Gradient Regularization ==<br />
<br />
The vanishing gradient regularizer is as follows:<br />
<br />
<math>\Omega 
= \sum_{k} \Omega_{k} 
= \sum_{k} 
\left( 
\frac{
\left\| 
\frac{\partial \varepsilon}{\partial x_{k + 1}} 
\frac{\partial x_{k + 1}}{\partial x_{k}}
\right\|
}
{
\left\|
\frac{\partial \varepsilon}{\partial x_{k + 1}}
\right\| 
} - 1
\right)^{2}</math><br />
<br />
The authors observe that increasing the norm of <math>\frac{\partial x_t}{\partial x_{k}}</math> makes the error at time <math>t</math> more sensitive to all inputs <math>u_{k}, \dots, u_{t}</math>. Some of these inputs may be irrelevant and noisy, and the network has to learn to ignore them, which it can only do if an error signal is present. Increasing the norm of <math>\frac{\partial x_t}{\partial x_{k}}</math> may therefore come at the expense of a temporarily larger error, so it cannot always be achieved by following a descent direction of the error; the authors argue that a regularizer is a more natural choice. The regularizer is a soft constraint that encourages the Jacobian matrices <math>\frac{\partial x_{k + 1}}{\partial x_{k}}</math> to preserve norm in the direction of the error <math>\frac{\partial \varepsilon}{\partial x_{k + 1}}</math>.<br />
<br />
= Experimental Results =<br />
<br />
== The Temporal Order Problem ==<br />
<br />
The authors used the temporal order problem as the prototypical pathological problem for validating the proposed clipping and regularizer. The problem involves a long sequence of discrete symbols in which an <math>A</math> or a <math>B</math> symbol is placed at two positions, one near the beginning and one near the middle of the sequence. The task is to classify, at the end of the sequence, the order in which these symbols appeared.<br />
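<br />
A data generator for a task of this shape might look as follows. This is a sketch under loudly stated assumptions: the distractor alphabet, the placement windows (10–20% and 50–60% of the sequence length) and the label encoding are hypothetical choices made for illustration, not details confirmed by this summary.<br />
<br />
```python
import numpy as np

SYMBOLS = ["A", "B"]
DISTRACTORS = ["c", "d", "e", "f"]   # hypothetical distractor alphabet


def make_temporal_order_example(length, rng):
    """Return (sequence, label) for one example; requires length >= 10.
    The label encodes the order of the two marked symbols:
    0 = AA, 1 = AB, 2 = BA, 3 = BB."""
    seq = [str(s) for s in rng.choice(DISTRACTORS, size=length)]
    # Assumed placement windows: near the beginning and near the middle.
    p1 = int(rng.integers(length // 10, length // 5))
    p2 = int(rng.integers(length // 2, length // 2 + length // 10))
    s1 = str(rng.choice(SYMBOLS))
    s2 = str(rng.choice(SYMBOLS))
    seq[p1], seq[p2] = s1, s2
    label = 2 * SYMBOLS.index(s1) + SYMBOLS.index(s2)
    return seq, label


rng = np.random.default_rng(0)
seq, label = make_temporal_order_example(100, rng)
print("".join(seq), label)
```
<br />
Increasing <code>length</code> widens the gap between the marked symbols and the end of the sequence, which is what makes the task a test of long-term memory.<br />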
<br />
Three different RNN initializations were used for the experiment:<br />
<br />
* <span>'''sigmoid unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
* <span>'''basic tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.1)</math></span><br />
* <span>'''smart tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
<br />
For each of the three networks, three different optimizer configurations were used:<br />
<br />
* <span>'''MSGD''': Mini-batch Stochastic Gradient Descent</span><br />
* <span>'''MSGD-C''': MSGD with Gradient Clipping</span><br />
* <span>'''MSGD-CR''': MSGD-C with Regularization</span><br />
<br />
Additional model parameters include:<br />
<br />
* <span>Spectral Radius of 0.95</span><br />
* <span><math>b = b_{out} = 0</math></span><br />
* <span>50 hidden neurons</span><br />
* <span>constant learning rate of 0.01</span><br />
* <span>clipping threshold of 1 (only for MSGD-C and MSGD-CR)</span><br />
* <span>regularization weight of 4 (MSGD-CR)</span><br />
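<br />
One common way to obtain a recurrent weight matrix with a prescribed spectral radius is to draw it from a Gaussian and rescale it. The sketch below assumes this procedure and assumes that the second argument of <math>\mathcal{N}</math> is a standard deviation; neither is confirmed by this summary, and the function name is hypothetical.<br />
<br />
```python
import numpy as np


def init_recurrent_weights(n_hidden, std, spectral_radius, rng):
    """Draw W_rec from N(0, std^2) and rescale it so that its largest
    absolute eigenvalue equals `spectral_radius`."""
    w_rec = rng.normal(0.0, std, size=(n_hidden, n_hidden))
    radius = np.max(np.abs(np.linalg.eigvals(w_rec)))
    return w_rec * (spectral_radius / radius)


rng = np.random.default_rng(0)
w_rec = init_recurrent_weights(n_hidden=50, std=0.01, spectral_radius=0.95, rng=rng)
print(np.max(np.abs(np.linalg.eigvals(w_rec))))
```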
<br />
The experiment was repeated 5 times. The figure below shows the importance of gradient clipping and the regularizer: in all cases the combination of the two methods yielded the best results, regardless of which network was used. The experiment also provides empirical evidence that exploding gradients are associated with tasks that require long memory traces, since clipping and regularization become more important as the sequence length increases. This is largely because longer memory requires a larger spectral radius, which in turn increases the likelihood of gradient explosion.<br />
<br />
[[Image:experimental_results.png|frame|center|400px]]<br />
<br />
== Other Pathological Problems ==<br />
<br />
The authors repeated other pathological problems from <ref name="Hochreiter"></ref>. The results are listed below:<br />
<br />
[[Image:experimental_results_2.png|image]]<br />
<br />
The authors did not discuss the experimental results in detail here.<br />
<br />
= Summary =<br />
<br />
The paper explains the exploding and vanishing gradient problems in training RNNs from two perspectives, the dynamical systems approach and the geometric approach. The authors devised methods to mitigate these problems by introducing gradient clipping and a vanishing gradient regularizer. Their experimental results show that, in all cases except the Penn Treebank dataset, clipping combined with the regularizer outperformed the state of the art for RNNs in the respective experiments.</div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_the_difficulty_of_training_recurrent_neural_networks&diff=27071on the difficulty of training recurrent neural networks2015-12-04T04:18:37Z<p>Rtwang: /* The Mechanics */</p>
<hr />
<div>= Introduction =<br />
<br />
Training a recurrent neural network (RNN) is difficult, and two of the most prominent problems are vanishing and exploding gradients <ref name="yoshua1993">Yoshua Bengio, Paolo Frasconi, and Patrice Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pages 1183–1188. IEEE, 1993.</ref>, which prevent neural networks from learning and fitting data with long-term dependencies. In this paper the authors propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem.<br />
<br />
= Background =<br />
<br />
[[Image:rnn_2.png|frame|center|400px|alt=| Recurrent neural network unrolled in time <ref name="pascanu">Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.</ref>]]<br />
A generic recurrent neural network has the form:<br />
<br />
<math>\mathbf{x}_{t} = F(\mathbf{x}_{t -1}, \mathbf{u}_{t}, \theta)</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{x}_{t}</math> is the state at time <math>t</math></span><br />
* <span><math>\mathbf{u}_{t}</math> is the input at time <math>t</math></span><br />
* <span><math>\theta\,</math> are the parameters</span><br />
* <span><math>F()\,</math> is the function that represents a neuron</span><br />
<br />
In the theoretical sections the authors make use of a specific parameterization:<br />
<br />
<math>\mathbf{x}_{t} = \mathbf{W}_{rec} \sigma(\mathbf{x}_{t - 1}) + \mathbf{W}_{in} \mathbf{u}_{t} + b</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{W}_{rec}</math> is the RNN weight matrix</span><br />
* <span><math>\sigma()\,</math> is an element-wise function</span><br />
* <span><math>b\,</math> is the bias</span><br />
* <span><math>\mathbf{W}_{in}</math> is the input weights matrix</span><br />
<br />
The following are gradient equations for using the Back Propagation Through Time (BPTT) algorithm. The authors rewrote the equations in order to highlight the exploding gradients problem:<br />
<br />
<math>\frac{\partial \varepsilon}{\partial \theta} = <br />
\sum_{1 \leq t \leq T} \frac{\partial \varepsilon_t}{\partial \theta}</math><br />
<br />
<math>\frac{\partial \varepsilon_{t}}{\partial \theta} = 
\sum_{1 \leq k \leq t} 
\left(
\frac{\partial \varepsilon_{t}}{\partial x_{t}}
\frac{\partial x_{t}}{\partial x_{k}}
\frac{\partial^{+} x_{k}}{\partial \theta}
\right)</math><br />
<br />
<math>\frac{\partial x_{t}}{\partial x_{k}} =
\prod_{t \geq i > k} \frac{\partial x_{i}}{\partial x_{i - 1}} =
\prod_{t \geq i > k} 
\mathbf{W}^{T}_{rec} \textit{diag}(\sigma^{\prime}(x_{i - 1}))</math><br />
<br />
Where:<br />
<br />
* <span><math>\varepsilon_{t}</math> is the error obtained from output at time <math>t</math></span><br />
* <span><math>\frac{\partial^{+} \mathbf{x}_{k}}{\partial \theta}</math> is the immediate partial derivative of state <math>\mathbf{x}_{k}</math></span>. For the parameterization above, <math>\frac{\partial^+ \mathbf{x}_k}{\partial \mathbf{W}_{rec}} = \sigma(\mathbf{x}_{k-1})</math>.<br />
<br />
The authors of this paper also distinguish between ''long-term'' and ''short-term'' contributions to the gradient with respect to <math>\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k}</math>. The contribution is ''long-term'' if <math>k \ll t</math>, and ''short-term'' otherwise.<br />
<br />
== Exploding and Vanishing Gradients ==<br />
=== The Mechanics ===<br />
<br />
It is known that <math> |\sigma^{\prime}(x)| </math> is bounded. Let <math>\left\|\textit{diag}(\sigma^{\prime}(x_k))\right\| \leq \gamma \in \mathbb{R}</math>.<br />
<br />
The paper first proves that <math> \lambda_1 < \frac{1}{\gamma} </math>, where <math> \lambda_1 </math> is the largest singular value of <math> \bold{W}_{rec} </math>, is a sufficient condition for the vanishing gradient problem to occur. The Jacobian matrix <math> \frac{\partial x_{k+1}}{\partial x_k} </math> is given by <math> \bold{W}_{rec}^{T} \textit{diag}(\sigma^{\prime}(x_k)) </math>, so its 2-norm is bounded by the product of the norms of the two matrices. This leads to <math> \forall k, \left\|\frac{\partial{x_{k+1}}}{\partial x_k}\right\| \leq \left\|\bold{W}_{rec}^T\right\| \left\|\textit{diag}(\sigma^{\prime}(x_k))\right\| < \frac{1}{\gamma}\gamma = 1</math>.<br />
<br />
Let <math>\eta \in \mathbb{R}</math> be such that <math>\forall k, \left\|\frac{\partial {x_{k+1}}}{\partial x_k}\right\| \leq \eta < 1</math>. By induction over <math>i</math>, we can show that <math>\left\|\frac{\partial \varepsilon_t}{\partial x_t}\left(\prod_{i=k}^{t-1}{\frac{\partial x_{i+1}}{\partial x_i}}\right)\right\| \leq \eta^{t-k}\left\|\frac{\partial \varepsilon_t}{\partial x_t}\right\|</math>. Since <math> \eta < 1 </math>, the long-term contributions to the gradient go to 0 exponentially fast as <math> t-k </math> grows.<br />
<br />
Inverting this argument gives a necessary condition for exploding gradients: the largest singular value <math>\lambda_1 </math> must be larger than <math> \frac{1}{\gamma}</math>, since otherwise the long-term components would vanish rather than explode.<br />
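<br />
As a worked example of how quickly the bound shrinks, assume <code>tanh</code> units, for which <math>\gamma = 1</math>, and a hypothetical largest singular value <math>\lambda_1 = 0.9</math>, so that <math>\eta = 0.9</math>. For a dependency spanning <math>t - k = 50</math> steps the bound becomes:<br />
<br />
<math>\left\|\frac{\partial \varepsilon_t}{\partial x_t} \frac{\partial x_t}{\partial x_k}\right\| \leq 0.9^{50} \left\|\frac{\partial \varepsilon_t}{\partial x_t}\right\| \approx 5.2 \times 10^{-3} \left\|\frac{\partial \varepsilon_t}{\partial x_t}\right\|</math><br />
<br />
Under these assumed values, the contribution of a state 50 steps in the past is therefore at most about half a percent of the error gradient at time <math>t</math>.<br />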
<br />
=== From a dynamical systems perspective ===<br />
<br />
Drawing on a dynamical systems perspective similar to <ref name="yoshua1993"></ref><ref name="doya1993">Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on neural networks, 1:75–80, 1993.</ref>, the authors note that as “<math>\theta</math> changes the asymptotic behaviour changes smoothly almost everywhere except for certain crucial points where drastic changes occur” <ref name="pascanu"></ref>. These crucial points are called bifurcation points, and crossing them has the potential to cause gradients to explode <ref name="doya1993"></ref>.<br />
<br />
The authors of this paper argue, however, that crossing these bifurcation points does not guarantee a sudden change in gradients. Their idea is that a change to the model parameters can alter the attractor landscape in such a way that the basin of attraction corresponding to the current model state is unaltered. For example, a change to the model parameters might eliminate a basin of attraction in a portion of the model's state space that is very far from its current state. In this case, the bifurcation will have no effect on the asymptotic behaviour of the model, and there will accordingly be no gradient explosion. On the other hand, if a change to the model parameters substantially alters the final basin of attraction given the current state, then there will be a considerable effect on the asymptotic behaviour of the model, and the gradients will accordingly explode. <br />
<br />
[[Image:dynamic_perspective.png|frame|center|400px|alt=|Bifurcation diagram of Single Hidden Unit RNN <ref name="pascanu"></ref>]]<br />
<br />
The figure above depicts a bifurcation diagram for a single-unit RNN, where the x-axis is the parameter <math>b</math> (bias), the y-axis is the asymptotic state <math>x_{\infty}</math> (i.e. the equilibrium activation value of the unit), and the plot line traces the movement of the final point attractor <math>x_{\infty}</math> as <math>b</math> changes. The figure shows the presence of two attractors, one emerging from <math>b_1</math> and another disappearing at <math>b_2</math>, as the value of <math>b</math> is decreased. Note that only one attractor exists when the value of <math>b</math> is outside of the interval between <math>b_1</math> and <math>b_2</math>, and that when two attractors exist, the attractor state towards which the unit ultimately gravitates is determined by its initial starting state. The boundary between these two basins of attraction is denoted with the dashed line: starting states on opposite sides of this boundary will gravitate towards different attractor states. The blue filled circles indicate a bifurcation point at which a small change to the value of <math>b</math> can have a drastic effect on the attractor landscape over the unit's state space. In short, the landscape shifts to include a single attractor state for a low value of <math>x</math>. The unfilled green circles represent Pascanu’s (2013) extension of Doya’s hypothesis, where if the model is in the boundary range at time <math>0</math>, a small change in <math>b</math> would result in a sudden large change in <math>x_{t}</math>.<br />
<br />
Overall, these remarks indicate that, when treated as a dynamical system, the behaviour of an RNN can be analyzed with respect to both changes to the parameter values that determine an attractor landscape over its state space (assuming a fixed starting state), and with respect to changes to the starting state (assuming a fixed attractor landscape). <br />
<br />
=== From a geometric perspective ===<br />
<br />
Aside from the dynamical systems perspective on exploding and vanishing gradients, the authors also considered a geometric perspective, using a simple RNN with a single hidden unit.<br />
<br />
[[Image:geometric_perspective.png|frame|center|400px|alt=|Error Loss surface of single hidden unit RNN <ref name="pascanu"></ref>]]<br />
<br />
Reusing the RNN state equation the authors defined:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
Assuming no input, <math>b = 0</math>, a linear activation (<math>\sigma</math> is the identity), and initial state <math>x_{0}</math>, the equation simplifies to:<br />
<br />
<math>x_{t} = W_{rec}^{t} x_{0}</math><br />
<br />
Differentiating this equation once and twice with respect to the scalar weight <math>W_{rec}</math> gives:<br />
<br />
<math>\frac{\partial x_{t}}{\partial W_{rec}} = t W_{rec}^{t - 1} x_{0}</math><br />
<br />
<math>\frac{\partial^{2} x_{t}}{\partial W_{rec}^{2}} = t (t - 1) W_{rec}^{t - 2} x_{0}</math><br />
<br />
This implies that if the first derivative explodes, so will the second derivative. In the general case, when the gradients explode they do so along some direction '''v'''. If the norm bound from the previous section is tight, it is hypothesized that ''when gradients explode so does the curvature along'' '''v''''', leading to a wall in the error surface'', like the one seen above. If both the gradient and the leading eigenvector of the curvature are aligned with the exploding direction '''v''', it follows that the error surface has a steep wall perpendicular to '''v''' (and consequently to the gradient). This means that when Stochastic Gradient Descent (SGD) reaches the wall and takes a gradient step, it will be deflected far away from the current region, possibly hindering the learning process (see figure above). Gradient clipping, introduced below, is motivated by this picture. Note that it assumes that the valley bordered by the steep cliff in the loss function is wide enough with respect to the clipping threshold applied to the gradient; otherwise, the deflection caused by an SGD update would still hinder the learning process, even when clipping is used. The practical effectiveness of clipping provides some evidence in support of this assumption.<br />
<br />
= Dealing with Exploding and Vanishing Gradient =<br />
<br />
== Related Work ==<br />
<br />
* <span>'''L1/L2 Regularization''': Helps with exploding gradients, but limits the RNN to a single point attractor at the origin, and prevents the model from learning generator models or exhibiting long-term memory traces.</span><br />
* <span>'''Teacher Forcing''': Proposed by <ref name="doya1993"></ref>, it helps the model cross a bifurcation boundary when it does not exhibit asymptotic behaviour towards a desired target. This assumes the user knows what the behaviour should look like or how to initialize the model to reduce exploding gradients.</span><br />
* <span>'''LSTM''': The Long Short-Term Memory architecture <ref name="graves2009">Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009.</ref><ref name="Hochreiter">Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.</ref> is an attempt to deal with the vanishing gradient problem by using a linear unit that feeds back to itself with a weight of <math>1</math>. This solution, however, does not deal with the exploding gradient problem.</span><br />
* <span>'''Hessian-Free optimizer with structural damping''': Proposed by <ref>Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.</ref>, this approach addresses both the vanishing and exploding gradient problems. <ref name="pascanu"></ref> reasons that it helps with the vanishing gradient problem because the high dimensionality of the space gives rise to a high probability that the long-term components are orthogonal to the short-term components. For exploding gradients, the curvature is additionally taken into account.</span><br />
* <span>'''Echo State Networks''': <ref>Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.</ref> avoid the exploding and vanishing gradient problems by not learning the input and recurrent weights. These weights are instead sampled from hand-crafted distributions that prevent information from getting lost; the spectral radius of the recurrent weight matrix is usually smaller than 1.</span><br />
<br />
== Scaling Down the Gradients ==<br />
<br />
[[Image:gradient_clipping.png|frame|center|400px]]<br />
<br />
The intuition behind this gradient clipping algorithm is simple: compute the norm of the gradient, and if it is larger than a set threshold, scale the gradient by the threshold divided by its norm, so that the clipped gradient keeps its direction and has norm equal to the threshold. <ref name="pascanu"></ref> suggests using a threshold from half to ten times the average gradient norm.<br />
<br />
== Vanishing Gradient Regularization ==<br />
<br />
The vanishing gradient regularizer is as follows:<br />
<br />
<math>\Omega <br />
= \sum_{k} \Omega_{k} <br />
= \sum_{k} <br />
\left( <br />
\frac{<br />
\| <br />
\frac{\delta \varepsilon}{\delta x_{k + 1}} <br />
\frac{\delta x_{k + 1}}{\delta x_{k}}<br />
\|<br />
}<br />
{<br />
\|<br />
\frac{\delta \varepsilon}{\delta x_{k + 1}}<br />
\| <br />
} - 1<br />
\right)^{2}</math><br />
<br />
The vanishing gradient problem occurs when, at time <math>t</math>, the inputs <math>u</math> may be irrelevant and noisy and the network starts to learn to ignore them. However, this is not desirable, as the model will end up not learning anything. The authors found that the sensitivity to all inputs <math>u_{t} \dots u_{k}</math> could be increased by increasing the norm of <math>\frac{\delta x_t}{\delta x_{k}}</math>. This implies that in order to increase the norm of <math>\frac{\delta x_t}{\delta x_{k}}</math> the error must remain large. This, however, would prevent the model from converging, so the authors argue that a regularizer is a more natural choice. The regularizer is a soft constraint that forces the Jacobian matrices <math>\frac{\delta x_{k + 1}}{\delta x_{k}}</math> to preserve norm in the direction of the error <math>\frac{\delta \varepsilon}{\delta x_{k + 1}}</math>.<br />
<br />
= Experimental Results =<br />
<br />
== The Temporal Order Problem ==<br />
<br />
The authors used the temporal order problem as the prototypical pathological problem for validating the clipping strategy and the regularizer. The temporal order problem involves generating a long sequence of discrete symbols in which an <math>A</math> or a <math>B</math> symbol is placed near the beginning and another near the middle of the sequence. The task is to correctly classify the order of the <math>A</math> and <math>B</math> symbols at the end of the sequence.<br />
<br />
Three different RNN initializations were used for the experiment:<br />
<br />
* <span>'''sigmoid unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
* <span>'''basic tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.1)</math></span><br />
* <span>'''smart tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
<br />
For each of the three networks, three different optimizer configurations were used:<br />
<br />
* <span>'''MSGD''': Mini-batch Stochastic Gradient Descent</span><br />
* <span>'''MSGD-C''': MSGD with Gradient Clipping</span><br />
* <span>'''MSGD-CR''': MSGD-C with Regularization</span><br />
<br />
Additional model parameters include:<br />
<br />
* <span>Spectral Radius of 0.95</span><br />
* <span><math>b = b_{out} = 0</math></span><br />
* <span>50 hidden neurons</span><br />
* <span>constant learning rate of 0.01</span><br />
* <span>clipping threshold of 1 (only for MSGD-C and MSGD-CR)</span><br />
* <span>regularization weight of 4 (MSGD-CR)</span><br />
<br />
The experiment was performed 5 times. From the figure below we can observe the importance of gradient clipping and the regularizer: in all cases the combination of the two methods yielded the best results, regardless of which unit network was used. Furthermore, this experiment provided empirical evidence that exploding gradients correlate with tasks that require long memory traces: as the sequence length of the problem increases, clipping and regularization become more important. This is largely because longer memory is associated with a larger spectral radius, which in turn increases the likelihood of gradient explosion.<br />
<br />
[[Image:experimental_results.png|frame|center|400px]]<br />
<br />
== Other Pathological Problems ==<br />
<br />
The authors repeated other pathological problems from <ref name="Hochreiter"></ref>. The results are listed below:<br />
<br />
[[Image:experimental_results_2.png|image]]<br />
<br />
The authors did not discuss the experimental results in detail here.<br />
<br />
= Summary =<br />
<br />
The paper explains the exploding and vanishing gradient problems in training RNNs from two different perspectives: dynamical systems and geometry. The authors devised methods to mitigate these problems by introducing gradient clipping and a vanishing gradient regularizer. Their experimental results show that, in all cases except for the Penn Treebank dataset, clipping combined with the regularizer bested the state of the art for RNNs in the respective experiments.</div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_the_difficulty_of_training_recurrent_neural_networks&diff=27070on the difficulty of training recurrent neural networks2015-12-04T04:17:27Z<p>Rtwang: /* Introduction */</p>
<hr />
<div>= Introduction =<br />
<br />
Training Recurrent Neural Networks (RNNs) is difficult, and two of the most prominent problems are vanishing and exploding gradients <ref name="yoshua1993">Yoshua Bengio, Paolo Frasconi, and Patrice Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pages 1183–1188. IEEE, 1993.</ref>, which prevent neural networks from learning and fitting data with long-term dependencies. In this paper the authors propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem.<br />
<br />
= Background =<br />
<br />
[[Image:rnn_2.png|frame|center|400px|alt=| Recurrent Neural Network Unrolled in time <ref name="pascanu">Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.</ref>]]<br />
A generic recurrent neural network has the form:<br />
<br />
<math>\mathbf{x}_{t} = F(\mathbf{x}_{t -1}, \mathbf{u}_{t}, \theta)</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{x}_{t}</math> is the state at time <math>t</math></span><br />
* <span><math>\mathbf{u}_{t}</math> is the input at time <math>t</math></span><br />
* <span><math>\theta\,</math> are the parameters</span><br />
* <span><math>F()\,</math> is the function that represents a neuron</span><br />
<br />
In the theoretical sections the authors made use of a specific parameterization:<br />
<br />
<math>\mathbf{x}_{t} = \mathbf{W}_{rec} \sigma(\mathbf{x}_{t - 1}) + \mathbf{W}_{in} \mathbf{u}_{t} + b</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{W}_{rec}</math> is the recurrent weight matrix</span><br />
* <span><math>\sigma()\,</math> is an element-wise function</span><br />
* <span><math>b\,</math> is the bias</span><br />
* <span><math>\mathbf{W}_{in}</math> is the input weights matrix</span><br />
<br />
The following are gradient equations for using the Back Propagation Through Time (BPTT) algorithm. The authors rewrote the equations in order to highlight the exploding gradients problem:<br />
<br />
<math>\frac{\partial \varepsilon}{\partial \theta} = <br />
\sum_{1 \leq t \leq T} \frac{\partial \varepsilon_t}{\partial \theta}</math><br />
<br />
<math>\frac{\partial \varepsilon_{t}}{\partial \theta} = <br />
\sum_{1 \leq k \leq t} <br />
\left(<br />
\frac{\partial \varepsilon_{t}}{\partial x_{t}}<br />
\frac{\partial x_{t}}{\partial x_{k}}<br />
\frac{\partial^{+} x_{k}}{\partial \theta}<br />
\right)</math><br />
<br />
<math>\frac{\partial x_{t}}{\partial x_{k}} =<br />
\prod_{t \geq i > k} \frac{\partial x_{i}}{\partial x_{i - 1}} =<br />
\prod_{t \geq i > k} <br />
\mathbf{W}^{T}_{rec} \textit{diag}(\sigma^{\prime}(x_{i - 1}))</math><br />
<br />
Where:<br />
<br />
* <span><math>\varepsilon_{t}</math> is the error obtained from output at time <math>t</math></span><br />
* <span><math>\frac{\partial^{+} \mathbf{x}_{k}}{\partial \theta}</math> is the immediate partial derivative of state <math>\mathbf{x}_{k}</math></span>. For the parameterization above, <math>\frac{\partial^+ \mathbf{x}_k}{\partial \mathbf{W}_{rec}} = \sigma(\mathbf{x}_{k-1})</math>.<br />
<br />
The authors of this paper also distinguish between ''long-term'' and ''short-term'' contributions to the gradient with respect to <math>\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k}</math>. The contribution is ''long-term'' if <math>k \ll t</math>, and ''short-term'' otherwise.<br />
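<br />
As a rough illustration of the three gradient equations above, the following Python sketch evaluates the BPTT sum for a scalar RNN and compares it with a finite-difference estimate. The tanh activation, the squared-error loss, the unit input weight, the zero bias, and all function names are assumptions made for this example; they are not taken from the paper.<br />

```python
import math


def forward(w, x0, inputs):
    """States x_0..x_T of the scalar RNN x_t = w * tanh(x_{t-1}) + u_t."""
    xs = [x0]
    for u in inputs:
        xs.append(w * math.tanh(xs[-1]) + u)
    return xs


def loss(w, x0, inputs, targets):
    """Total error: sum over t of 0.5 * (x_t - y_t)^2 (assumed loss)."""
    xs = forward(w, x0, inputs)
    return sum(0.5 * (x - y) ** 2 for x, y in zip(xs[1:], targets))


def bptt_grad(w, x0, inputs, targets):
    """d loss / d w as a sum over t and over k <= t of
    (d eps_t / d x_t) * (d x_t / d x_k) * (immediate d x_k / d w)."""
    xs = forward(w, x0, inputs)
    grad = 0.0
    for t in range(1, len(inputs) + 1):
        d_eps_d_xt = xs[t] - targets[t - 1]
        for k in range(1, t + 1):
            # d x_t / d x_k = product over t >= i > k of w * tanh'(x_{i-1})
            jac = 1.0
            for i in range(k + 1, t + 1):
                jac *= w * (1.0 - math.tanh(xs[i - 1]) ** 2)
            immediate = math.tanh(xs[k - 1])  # immediate partial of x_k w.r.t. w
            grad += d_eps_d_xt * jac * immediate
    return grad


w, x0 = 0.9, 0.2
inputs = [0.5, -0.3, 0.8, 0.1]
targets = [0.0, 1.0, 0.0, 1.0]
h = 1e-6
numeric = (loss(w + h, x0, inputs, targets) - loss(w - h, x0, inputs, targets)) / (2 * h)
print(bptt_grad(w, x0, inputs, targets), numeric)
```

Terms with <math>k</math> close to <math>t</math> involve a short Jacobian product (short-term contributions), while terms with <math>k \ll t</math> involve a long one (long-term contributions).<br />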
<br />
== Exploding and Vanishing Gradients ==<br />
=== The Mechanics ===<br />
<br />
It is known that <math> |\sigma'(x)| </math> is bounded. Let <math>\left|\left|\textit{diag}(\sigma'(x_k))\right|\right| \leq \gamma \in \mathbb{R}</math>.<br />
<br />
The paper first proves that <math> \lambda_1 < \frac{1}{\gamma} </math>, where <math> \lambda_1 </math> is the largest singular value of <math> \bold{W}_{rec} </math>, is a sufficient condition for the vanishing gradient problem to occur. The Jacobian matrix <math> \frac{\partial x_{k+1}}{\partial x_k} </math> is given by <math> \bold{W}_{rec}^{T}\textit{diag}(\sigma'(x_k)) </math>, so its 2-norm is bounded by the product of the norms of the two matrices. This leads to <math> \forall k, \left\|\frac{\partial x_{k+1}}{\partial x_k}\right\| \leq \left\|\bold{W}_{rec}^T\right\| \left\|\textit{diag}(\sigma'(x_k))\right\| < \frac{1}{\gamma}\gamma = 1</math><br />
<br />
Let <math>\eta \in \mathbb{R}</math> be such that <math>\forall k, \left\|\frac{\partial x_{k+1}}{\partial x_k}\right\| \leq \eta < 1</math>. By induction over <math>i</math>, we can show that <math>\left\|\frac{\partial \varepsilon_t}{\partial x_t}\left(\prod_{i=k}^{t-1}{\frac{\partial x_{i+1}}{\partial x_i}}\right)\right\| \leq \eta^{t-k}\left\|\frac{\partial \varepsilon_t}{\partial x_t}\right\|</math>. Since <math> \eta < 1 </math>, the long-term contributions to the gradient go to 0 exponentially fast as <math> t-k </math> grows.<br />
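<br />
The bound can be checked numerically in the simplest setting. The sketch below uses a scalar RNN with a tanh activation (so <math>\gamma = 1</math>), no input and no bias; the weight plays the role of <math>\bold{W}_{rec}</math> and its absolute value plays the role of <math>\lambda_1</math>. The specific values and the function name are assumptions chosen for illustration.<br />

```python
import math


def jacobian_product(w, x0, steps):
    """|d x_steps / d x_0| for the scalar RNN x_t = w * tanh(x_{t-1})."""
    x, prod = x0, 1.0
    for _ in range(steps):
        prod *= w * (1.0 - math.tanh(x) ** 2)  # d x_{i+1} / d x_i
        x = w * math.tanh(x)
    return abs(prod)


w, x0 = 0.9, 0.5  # |w| < 1 / gamma, with gamma = 1 for tanh
for steps in (1, 10, 50, 100):
    bound = abs(w) ** steps  # eta^(t - k) with eta = |w|
    print(steps, jacobian_product(w, x0, steps), bound)
```

In this setting the product of Jacobians stays below <math>\eta^{t-k}</math> and shrinks towards 0 as the number of steps grows.<br />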
<br />
By inverting this proof, we obtain a necessary condition for exploding gradients: the largest singular value <math>\lambda_1 </math> must be larger than <math> \frac{1}{\gamma}</math> (otherwise the long-term components would vanish instead of exploding).<br />
<br />
=== From a dynamical systems perspective ===<br />
<br />
The authors also draw on a dynamical systems perspective similar to <ref name="yoshua1993"></ref><ref name="doya1993">Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on Neural Networks, 1:75–80, 1993.</ref>. Dynamical systems theory states that as “<math>\theta</math> changes the asymptotic behaviour changes smoothly almost everywhere except for certain crucial points where drastic changes occur” <ref name="pascanu"></ref>. These crucial points are called bifurcation points, and crossing them has the potential to cause gradients to explode <ref name="doya1993"></ref>.<br />
<br />
The authors of this paper argue, however, that crossing these bifurcation points does not guarantee a sudden change in gradients. Their idea is that a change to the model parameters can alter the attractor landscape in such a way that the basin of attraction corresponding to the current model state is unaltered. For example, a change to the model parameters might eliminate a basin of attraction in a portion of the model's state space that is very far from its current state. In this case, the bifurcation will have no effect on the asymptotic behaviour of the model, and there will accordingly be no gradient explosion. On the other hand, if a change to the model parameters substantially alters the final basin of attraction given the current state, then there will be a considerable effect on the asymptotic behaviour of the model, and the gradients will accordingly explode. <br />
<br />
[[Image:dynamic_perspective.png|frame|center|400px|alt=|Bifurcation diagram of Single Hidden Unit RNN <ref name="pascanu"></ref>]]<br />
<br />
The figure above depicts a bifurcation diagram for a single-unit RNN, where the x-axis is the parameter <math>b</math> (bias), the y-axis is the asymptotic state <math>x_{\infty}</math> (i.e. the equilibrium activation value of the unit), and the plot line traces the movement of the final point attractor <math>x_{\infty}</math> as <math>b</math> changes. The figure shows the presence of two attractors, one emerging from <math>b_1</math> and another disappearing at <math>b_2</math>, as the value of <math>b</math> is decreased. Note that only one attractor exists when the value of <math>b</math> is outside of the interval between <math>b_1</math> and <math>b_2</math>, and that when two attractors exist, the attractor state towards which the unit ultimately gravitates is determined by its initial starting state. The boundary between these two basins of attraction is denoted with the dashed line: starting states on opposite sides of this boundary will gravitate towards different attractor states. The blue filled circles indicate a bifurcation point at which a small change to the value of <math>b</math> can have a drastic effect on the attractor landscape over the unit's state space. In short, the landscape shifts to include a single attractor state for a low value of <math>x</math>. The unfilled green circles represent Pascanu’s (2013) extension of Doya’s hypothesis, where if the model is in the boundary range at time <math>0</math>, a small change in <math>b</math> would result in a sudden large change in <math>x_{t}</math>.<br />
<br />
Overall, these remarks indicate that, when treated as a dynamical system, the behaviour of an RNN can be analyzed with respect to both changes to the parameter values that determine an attractor landscape over its state space (assuming a fixed starting state), and with respect to changes to the starting state (assuming a fixed attractor landscape). <br />
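<br />
The dependence of the asymptotic state on both the bias and the starting state can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the unit is taken to be <math>x_t = w \sigma(x_{t-1}) + b</math> with a logistic sigmoid and <math>w = 5</math>, and the bias values and starting states are chosen for illustration rather than read off the figure.<br />

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def asymptotic_state(b, x0, w=5.0, steps=500):
    """Iterate x_t = w * sigmoid(x_{t-1}) + b and return the final state."""
    x = x0
    for _ in range(steps):
        x = w * sigmoid(x) + b
    return x


# Inside the bistable range of b, the starting state decides the attractor.
low = asymptotic_state(b=-2.5, x0=-3.0)
high = asymptotic_state(b=-2.5, x0=3.0)
print(low, high)

# Outside that range, both starting states end up at the same attractor.
print(asymptotic_state(b=0.0, x0=-3.0), asymptotic_state(b=0.0, x0=3.0))
```

With these assumed values, the two starting states settle on different attractors for <math>b = -2.5</math> and on the same attractor for <math>b = 0</math>, which mirrors the two regimes described above.<br />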
<br />
=== From a geometric perspective ===<br />
<br />
Aside from the dynamical systems perspective on exploding and vanishing gradients, the authors also considered a geometric perspective, using a simple RNN with a single hidden unit.<br />
<br />
[[Image:geometric_perspective.png|frame|center|400px|alt=|Error Loss surface of single hidden unit RNN <ref name="pascanu"></ref>]]<br />
<br />
Reusing the RNN state equation the authors defined:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
Assuming no input, <math>b = 0</math>, a linear activation (<math>\sigma</math> is the identity), and initial state <math>x_{0}</math>, the equation simplifies to:<br />
<br />
<math>x_{t} = W_{rec}^{t} x_{0}</math><br />
<br />
Differentiating this equation once and twice with respect to the scalar weight <math>W_{rec}</math> gives:<br />
<br />
<math>\frac{\partial x_{t}}{\partial W_{rec}} = t W_{rec}^{t - 1} x_{0}</math><br />
<br />
<math>\frac{\partial^{2} x_{t}}{\partial W_{rec}^{2}} = t (t - 1) W_{rec}^{t - 2} x_{0}</math><br />
<br />
This implies that if the first derivative explodes, so will the second derivative. In the general case, when the gradients explode they do so along some direction '''v'''. If the norm bound from the previous section is tight, it is hypothesized that ''when gradients explode so does the curvature along'' '''v''''', leading to a wall in the error surface'', like the one seen above. If both the gradient and the leading eigenvector of the curvature are aligned with the exploding direction '''v''', it follows that the error surface has a steep wall perpendicular to '''v''' (and consequently to the gradient). This means that when Stochastic Gradient Descent (SGD) reaches the wall and takes a gradient step, it will be deflected far away from the current region, possibly hindering the learning process (see figure above). Gradient clipping, introduced below, is motivated by this picture. Note that it assumes that the valley bordered by the steep cliff in the loss function is wide enough with respect to the clipping threshold applied to the gradient; otherwise, the deflection caused by an SGD update would still hinder the learning process, even when clipping is used. The practical effectiveness of clipping provides some evidence in support of this assumption.<br />
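<br />
For the linear single-unit model above, the closed-form derivatives are easy to evaluate. The sketch below is a minimal illustration in which the scalar <code>w</code> stands for the single recurrent weight; the function names and the chosen values of <code>w</code>, <code>x0</code> and <code>t</code> are assumptions for this example.<br />

```python
def state(w, x0, t):
    """x_t = w^t * x_0 for the linear one-unit model with no input and b = 0."""
    return (w ** t) * x0


def first_derivative(w, x0, t):
    """d x_t / d w = t * w^(t - 1) * x_0."""
    return t * w ** (t - 1) * x0


def second_derivative(w, x0, t):
    """d^2 x_t / d w^2 = t * (t - 1) * w^(t - 2) * x_0."""
    return t * (t - 1) * w ** (t - 2) * x0


# With w slightly above 1, both derivatives grow quickly with t.
for t in (5, 20, 50):
    print(t, first_derivative(1.1, 1.0, t), second_derivative(1.1, 1.0, t))
```

The second derivative carries an extra factor of roughly <math>t</math> relative to the first, so whenever the gradient grows with <math>t</math> the curvature grows at least as fast.<br />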
<br />
= Dealing with Exploding and Vanishing Gradient =<br />
<br />
== Related Work ==<br />
<br />
* <span>'''L1/L2 Regularization''': Helps with exploding gradients, but limits the RNN to a single point attractor at the origin, and prevents the model from learning generator models or exhibiting long-term memory traces.</span><br />
* <span>'''Teacher Forcing''': Proposed by <ref name="doya1993"></ref>, it helps the model cross a bifurcation boundary when it does not exhibit asymptotic behaviour towards a desired target. This assumes the user knows what the behaviour should look like or how to initialize the model to reduce exploding gradients.</span><br />
* <span>'''LSTM''': The Long Short-Term Memory architecture <ref name="graves2009">Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009.</ref><ref name="Hochreiter">Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.</ref> is an attempt to deal with the vanishing gradient problem by using a linear unit that feeds back to itself with a weight of <math>1</math>. This solution, however, does not deal with the exploding gradient problem.</span><br />
* <span>'''Hessian-Free optimizer with structural damping''': Proposed by <ref>Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.</ref>, this approach addresses both the vanishing and exploding gradient problems. <ref name="pascanu"></ref> reasons that it helps with the vanishing gradient problem because the high dimensionality of the space gives rise to a high probability that the long-term components are orthogonal to the short-term components. For exploding gradients, the curvature is additionally taken into account.</span><br />
* <span>'''Echo State Networks''': <ref>Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.</ref> avoid the exploding and vanishing gradient problems by not learning the input and recurrent weights. These weights are instead sampled from hand-crafted distributions that prevent information from getting lost; the spectral radius of the recurrent weight matrix is usually smaller than 1.</span><br />
<br />
== Scaling Down the Gradients ==<br />
<br />
[[Image:gradient_clipping.png|frame|center|400px]]<br />
<br />
The intuition behind this gradient clipping algorithm is simple: compute the norm of the gradient, and if it is larger than a set threshold, scale the gradient by the threshold divided by its norm, so that the clipped gradient keeps its direction and has norm equal to the threshold. <ref name="pascanu"></ref> suggests using a threshold from half to ten times the average gradient norm.<br />
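<br />
The clipping rule described above can be sketched in plain Python as follows. The gradient is represented as a flat list of floats, and the function name <code>clip_gradient</code> is an assumption for this example; a real implementation would apply the same rescaling to all parameter gradients of the model jointly.<br />

```python
import math


def clip_gradient(grad, threshold):
    """Rescale grad to have norm `threshold` if its norm exceeds it."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm  # keeps the direction, shrinks the length
        return [g * scale for g in grad]
    return list(grad)  # small gradients are left untouched


print(clip_gradient([3.0, 4.0], threshold=1.0))  # norm 5 is rescaled to norm 1
print(clip_gradient([0.3, 0.4], threshold=1.0))  # norm 0.5 is unchanged
```

Because only the length of the gradient changes, the update still points in the descent direction computed by BPTT.<br />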
<br />
== Vanishing Gradient Regularization ==<br />
<br />
The vanishing gradient regularizer is as follows:<br />
<br />
<math>\Omega <br />
= \sum_{k} \Omega_{k} <br />
= \sum_{k} <br />
\left( <br />
\frac{<br />
\| <br />
\frac{\delta \varepsilon}{\delta x_{k + 1}} <br />
\frac{\delta x_{k + 1}}{\delta x_{k}}<br />
\|<br />
}<br />
{<br />
\|<br />
\frac{\delta \varepsilon}{\delta x_{k + 1}}<br />
\| <br />
} - 1<br />
\right)^{2}</math><br />
<br />
The vanishing gradient problem occurs when, at time <math>t</math>, the inputs <math>u</math> may be irrelevant and noisy and the network starts to learn to ignore them. However, this is not desirable, as the model will end up not learning anything. The authors found that the sensitivity to all inputs <math>u_{t} \dots u_{k}</math> could be increased by increasing the norm of <math>\frac{\delta x_t}{\delta x_{k}}</math>. This implies that in order to increase the norm of <math>\frac{\delta x_t}{\delta x_{k}}</math> the error must remain large. This, however, would prevent the model from converging, so the authors argue that a regularizer is a more natural choice. The regularizer is a soft constraint that forces the Jacobian matrices <math>\frac{\delta x_{k + 1}}{\delta x_{k}}</math> to preserve norm in the direction of the error <math>\frac{\delta \varepsilon}{\delta x_{k + 1}}</math>.<br />
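<br />
To make the formula for <math>\Omega</math> concrete, the sketch below evaluates it for given error gradients and Jacobians using plain Python lists. It only computes the value of the regularizer; how its gradient is obtained during training is not covered here. The helper names and the toy matrices are assumptions for illustration.<br />

```python
import math


def vec_mat(v, m):
    """Row vector v times matrix m (given as a list of rows)."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def vanishing_regularizer(error_grads, jacobians):
    """Omega = sum_k (||e_{k+1} J_k|| / ||e_{k+1}|| - 1)^2, where
    e_{k+1} = d eps / d x_{k+1} and J_k = d x_{k+1} / d x_k."""
    total = 0.0
    for e, jac in zip(error_grads, jacobians):
        ratio = norm(vec_mat(e, jac)) / norm(e)
        total += (ratio - 1.0) ** 2
    return total


e = [3.0, 4.0]
rotation = [[0.0, -1.0], [1.0, 0.0]]  # preserves the norm of e
shrink = [[0.5, 0.0], [0.0, 0.5]]     # halves the norm of e

print(vanishing_regularizer([e], [rotation]))
print(vanishing_regularizer([e, e], [shrink, shrink]))
```

A norm-preserving Jacobian contributes nothing to <math>\Omega</math>, whereas a Jacobian that shrinks the back-propagated error is penalized, which is the soft constraint described above.<br />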
<br />
= Experimental Results =<br />
<br />
== The Temporal Order Problem ==<br />
<br />
The authors used the temporal order problem as the prototypical pathological problem for validating the clipping strategy and the regularizer. The temporal order problem involves generating a long sequence of discrete symbols in which an <math>A</math> or a <math>B</math> symbol is placed near the beginning and another near the middle of the sequence. The task is to correctly classify the order of the <math>A</math> and <math>B</math> symbols at the end of the sequence.<br />
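<br />
A data generator for this kind of task might look like the sketch below. The distractor alphabet, the position windows for the two marked symbols (roughly the second tenth and the fifth tenth of the sequence), and the function name are assumptions made for illustration; the exact setup used in the experiments may differ.<br />

```python
import random


def make_sequence(length, rng):
    """Return (sequence, label) for one temporal-order example.

    Assumes length >= 10 so that both position windows are non-empty.
    """
    distractors = "cdef"  # assumed distractor symbols
    seq = [rng.choice(distractors) for _ in range(length)]
    first_pos = rng.randrange(length // 10, 2 * length // 10)
    second_pos = rng.randrange(4 * length // 10, 5 * length // 10)
    first, second = rng.choice("AB"), rng.choice("AB")
    seq[first_pos], seq[second_pos] = first, second
    # The label is the order of the two marked symbols: AA, AB, BA or BB.
    return "".join(seq), first + second


rng = random.Random(0)
sequence, label = make_sequence(50, rng)
print(sequence, label)
```

The network only sees the label-relevant symbols early in the sequence and must carry that information across the remaining distractors, which is what makes the task a test of long-term memory.<br />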
<br />
Three different RNN initializations were used for the experiment:<br />
<br />
* <span>'''sigmoid unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
* <span>'''basic tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.1)</math></span><br />
* <span>'''smart tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
<br />
For each of the three networks, three different optimizer configurations were used:<br />
<br />
* <span>'''MSGD''': Mini-batch Stochastic Gradient Descent</span><br />
* <span>'''MSGD-C''': MSGD with Gradient Clipping</span><br />
* <span>'''MSGD-CR''': MSGD-C with Regularization</span><br />
<br />
Additional model parameters include:<br />
<br />
* <span>Spectral Radius of 0.95</span><br />
* <span><math>b = b_{out} = 0</math></span><br />
* <span>50 hidden neurons</span><br />
* <span>constant learning rate of 0.01</span><br />
* <span>clipping threshold of 1 (only for MSGD-C and MSGD-CR)</span><br />
* <span>regularization weight of 4 (MSGD-CR)</span><br />
<br />
The experiment was performed 5 times. From the figure below we can observe the importance of gradient clipping and the regularizer: in all cases the combination of the two methods yielded the best results, regardless of which unit network was used. Furthermore, this experiment provided empirical evidence that exploding gradients correlate with tasks that require long memory traces: as the sequence length of the problem increases, clipping and regularization become more important. This is largely because longer memory is associated with a larger spectral radius, which in turn increases the likelihood of gradient explosion.<br />
<br />
[[Image:experimental_results.png|frame|center|400px]]<br />
<br />
== Other Pathological Problems ==<br />
<br />
The authors repeated other pathological problems from <ref name="Hochreiter"></ref>. The results are listed below:<br />
<br />
[[Image:experimental_results_2.png|image]]<br />
<br />
The authors did not discuss the experimental results in detail here.<br />
<br />
= Summary =<br />
<br />
The paper explains the exploding and vanishing gradient problems in training RNNs from two different perspectives: dynamical systems and geometry. The authors devised methods to mitigate these problems by introducing gradient clipping and a vanishing gradient regularizer. Their experimental results show that, in all cases except for the Penn Treebank dataset, clipping combined with the regularizer bested the state of the art for RNNs in the respective experiments.</div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_the_difficulty_of_training_recurrent_neural_networks&diff=27069on the difficulty of training recurrent neural networks2015-12-04T04:17:13Z<p>Rtwang: </p>
<hr />
<div>= Introduction =<br />
<br />
Training Recurrent Neural Networks (RNNs) is difficult, and two of the most prominent problems are vanishing and exploding gradients <ref name="yoshua1993">Yoshua Bengio, Paolo Frasconi, and Patrice Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pages 1183–1188. IEEE, 1993.</ref>, which prevent neural networks from learning and fitting data with long-term dependencies. In this paper the authors propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem.<br />
<br />
= Background =<br />
<br />
[[Image:rnn_2.png|frame|center|400px|alt=| Recurrent Neural Network unrolled in time <ref name="pascanu">Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.</ref>]]<br />
A generic recurrent neural network has the form:<br />
<br />
<math>\mathbf{x}_{t} = F(\mathbf{x}_{t -1}, \mathbf{u}_{t}, \theta)</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{x}_{t}</math> is the state at time <math>t</math></span><br />
* <span><math>\mathbf{u}_{t}</math> is the input at time <math>t</math></span><br />
* <span><math>\theta\,</math> are the parameters</span><br />
* <span><math>F()\,</math> is the state-transition function of the network</span><br />
<br />
In the theoretical sections the authors made use of a specific parameterization:<br />
<br />
<math>\mathbf{x}_{t} = \mathbf{W}_{rec} \sigma(\mathbf{x}_{t - 1}) + \mathbf{W}_{in} \mathbf{u}_{t} + b</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{W}_{rec}</math> is the recurrent weight matrix</span><br />
* <span><math>\sigma()\,</math> is an element-wise function</span><br />
* <span><math>b\,</math> is the bias</span><br />
* <span><math>\mathbf{W}_{in}</math> is the input weight matrix</span><br />
<br />
The following are gradient equations for using the Back Propagation Through Time (BPTT) algorithm. The authors rewrote the equations in order to highlight the exploding gradients problem:<br />
<br />
<math>\frac{\partial \varepsilon}{\partial \theta} = <br />
\sum_{1 \leq t \leq T} \frac{\partial \varepsilon_t}{\partial \theta}</math><br />
<br />
<math>\frac{\partial \varepsilon_{t}}{\partial \theta} = <br />
\sum_{1 \leq k \leq t} <br />
\left(<br />
\frac{\partial \varepsilon_{t}}{\partial x_{t}}<br />
\frac{\partial x_{t}}{\partial x_{k}}<br />
\frac{\partial^{+} x_{k}}{\partial \theta}<br />
\right)</math><br />
<br />
<math>\frac{\partial x_{t}}{\partial x_{k}} =<br />
\prod_{t \geq i > k} \frac{\partial x_{i}}{\partial x_{i - 1}} =<br />
\prod_{t \geq i > k} <br />
\mathbf{W}^{T}_{rec} \textit{diag}(\sigma^{\prime}(x_{i - 1}))</math><br />
<br />
Where:<br />
<br />
* <span><math>\varepsilon_{t}</math> is the error obtained from output at time <math>t</math></span><br />
* <span><math>\frac{\partial^{+} \mathbf{x}_{k}}{\partial \theta}</math> is the immediate partial derivative of state <math>\mathbf{x}_{k}</math></span>. For the parameterization above, <math>\frac{\partial^+ \mathbf{x}_k}{\partial \mathbf{W}_{rec}} = \sigma(\mathbf{x}_{k-1})</math>.<br />
<br />
The authors of this paper also distinguish between ''long-term'' and ''short-term'' contributions to the gradient with respect to <math>\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k}</math>. The contribution is ''long-term'' if <math>k \ll t</math>, and ''short-term'' otherwise.<br />
<br />
== Exploding and Vanishing Gradients ==<br />
=== The Mechanics ===<br />
<br />
It's known that <math> |\sigma^{\prime}(x)| </math> is bounded. Let <math>\left|\left|\textit{diag}(\sigma^{\prime}(x_k))\right|\right| \leq \gamma \in \mathbb{R}</math>.<br />
<br />
The paper first proves that <math> \lambda_1 < \frac{1}{\gamma} </math>, where <math> \lambda_1 </math> is the largest singular value of <math> \mathbf{W}_{rec} </math>, is a sufficient condition for the vanishing gradient problem to occur. The Jacobian matrix <math> \frac{\partial x_{k+1}}{\partial x_k} </math> is given by <math> \mathbf{W}_{rec}^{T}\textit{diag}(\sigma^{\prime}(x_k)) </math>, and its 2-norm is bounded by the product of the norms of the two matrices. This leads to <math> \forall k, \left\|\frac{\partial{x_{k+1}}}{\partial x_k}\right\| \leq \left\|\mathbf{W}_{rec}^T\right\| \left\|\textit{diag}(\sigma^{\prime}(x_k))\right\| < \frac{1}{\gamma}\gamma = 1</math>.<br />
<br />
Let <math>\eta \in \mathbb{R}</math> be such that <math>\forall k, ||\frac{\partial {x_{k+1}}}{\partial x_k}|| \leq \eta < 1</math>. By induction over <math>i</math>, we can show that <math>||\frac{\partial \varepsilon_t}{\partial x_t}(\prod_{i=k}^{t-1}{\frac{\partial x_{i+1}}{\partial x_i}})|| \leq \eta^{t-k}||\frac{\partial \varepsilon_t}{\partial x_t}||</math>. Since <math> \eta < 1 </math>, the long-term contributions to the gradient go to 0 exponentially fast as <math> t-k </math> grows.<br />
<br />
Inverting this proof gives a necessary (but not sufficient) condition for exploding gradients: the largest singular value <math>\lambda_1 </math> must be larger than <math> \frac{1}{\gamma}</math> (otherwise the long-term components would vanish instead of exploding).<br />
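<br />
The bound above can be checked numerically on a toy case. The following is a minimal sketch, not the authors' code: it assumes a scalar RNN <math>x_t = w \tanh(x_{t-1})</math> with no input and no bias, so that <math>\gamma = 1</math> and the single "singular value" is <math>|w|</math>. The function name <code>jacobian_product</code> is hypothetical.<br />

```python
import math

def jacobian_product(w_rec, x0, steps):
    """Return |dx_t/dx_0| for the scalar RNN x_t = w_rec * tanh(x_{t-1}).

    Assumption for illustration: one hidden unit, no input, no bias.
    """
    x = x0
    prod = 1.0
    for _ in range(steps):
        # dx_i/dx_{i-1} = w_rec * tanh'(x_{i-1}), and tanh'(x) = 1 - tanh(x)^2 <= 1
        prod *= w_rec * (1.0 - math.tanh(x) ** 2)
        x = w_rec * math.tanh(x)
    return abs(prod)

# With |w_rec| < 1 / gamma = 1 the product is bounded by |w_rec| ** steps.
for steps in (1, 10, 50):
    print(steps, jacobian_product(0.9, 0.5, steps), 0.9 ** steps)
```

In this sketch the measured product never exceeds <math>\eta^{t-k}</math> with <math>\eta = 0.9</math>, which is the exponential decay the proof describes.<br />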
<br />
=== From a dynamical systems perspective ===<br />
<br />
Drawing from a dynamical systems perspective similar to <ref name="yoshua1993"></ref><ref name="doya1993">Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on neural networks, 1:75–80, 1993.</ref>, dynamical systems theory states that as “<math>\theta</math> changes the asymptotic behaviour changes smoothly almost everywhere except for certain crucial points where drastic changes occur” <ref name="pascanu"></ref>. These crucial points are bifurcation points, and crossing them has the potential to cause gradients to explode <ref name="doya1993"></ref>.<br />
<br />
The authors of this paper argue, however, that crossing these bifurcation points does not guarantee a sudden change in gradients. Their idea is that a change to the model parameters can alter the attractor landscape in such a way that the basin of attraction corresponding to the current model state is unaltered. For example, a change to the model parameters might eliminate a basin of attraction in a portion of the model's state space that is very far from its current state. In this case, the bifurcation will have no effect on the asymptotic behaviour of the model, and there will accordingly be no gradient explosion. On the other hand, if a change to the model parameters substantially alters the final basin of attraction given the current state, then there will be a considerable effect on the asymptotic behaviour of the model, and the gradients will accordingly explode. <br />
<br />
[[Image:dynamic_perspective.png|frame|center|400px|alt=|Bifurcation diagram of Single Hidden Unit RNN <ref name="pascanu"></ref>]]<br />
<br />
The figure above depicts a bifurcation diagram for a single-unit RNN, where the x-axis is the parameter <math>b</math> (bias), the y-axis is the asymptotic state <math>x_{\infty}</math> (i.e. the equilibrium activation value of the unit), and the plot line is the movement of the final point attractor <math>x_{\infty}</math> as <math>b</math> changes. The figure shows the presence of two attractors, one emerging from <math>b_1</math> and another disappearing at <math>b_2</math>, as the value of <math>b</math> is decreased. Note that only one attractor exists when the value of <math>b</math> is outside of the interval between <math>b_1</math> and <math>b_2</math>, and that when two attractors exist, the attractor state towards which the unit ultimately gravitates is determined by its initial starting state. The boundary between these two basins of attraction is denoted with the dashed line - starting states on opposite sides of this boundary will gravitate towards different attractor states. The blue filled circles indicate a bifurcation point at which a small change to the value of <math>b</math> can have a drastic effect on the attractor landscape over the unit's state space. In short, the landscape shifts to include a single attractor state for a low value of <math>x</math>. The unfilled green circles represent Pascanu’s (2013) extension of Doya’s hypothesis, where if the model is in the boundary range at time <math>0</math>, a small change in <math>b</math> would result in a sudden large change in <math>x_{t}</math>.<br />
<br />
Overall, these remarks indicate that, when treated as a dynamical system, the behaviour of an RNN can be analyzed with respect to both changes to the parameter values that determine an attractor landscape over its state space (assuming a fixed starting state), and with respect to changes to the starting state (assuming a fixed attractor landscape). <br />
<br />
=== From a geometric perspective ===<br />
<br />
Aside from the dynamical systems perspective on exploding and vanishing gradients, the authors also considered a geometric perspective, using a simple RNN with one hidden unit.<br />
<br />
[[Image:geometric_perspective.png|frame|center|400px|alt=|Error Loss surface of single hidden unit RNN <ref name="pascanu"></ref>]]<br />
<br />
Reusing the RNN state equation the authors defined:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
By assuming a linear activation and no input, with <math>b = 0</math> and initial state <math>x_{0}</math>, the equation simplifies to:<br />
<br />
<math>x_{t} = W_{rec}^{t} x_{0}</math><br />
<br />
Differentiating the above equation with respect to <math>W_{rec}</math> to first and second order gives:<br />
<br />
<math>\frac{\partial x_{t}}{\partial W_{rec}} = t W_{rec}^{t - 1} x_{0}</math><br />
<br />
<math>\frac{\partial^{2} x_{t}}{\partial W_{rec}^{2}} = t (t - 1) W_{rec}^{t - 2} x_{0}</math><br />
<br />
This implies that if the first derivative explodes, so does the second derivative. In the general case, when the gradients explode they do so along some direction '''v'''. If the bound on the gradient norm is tight, it is hypothesized that ''when gradients explode so does the curvature along'' '''v''''', leading to a wall in the error surface'', like the one seen above. If both the gradient and the leading eigenvector of the curvature are aligned with the exploding direction '''v''', it follows that the error surface has a steep wall perpendicular to '''v''' (and consequently to the gradient). This means that when Stochastic Gradient Descent (SGD) reaches the wall in the error surface and attempts to step into it, it will be deflected away, possibly hindering the learning process (see figure above). Note that gradient clipping, the solution proposed in the next section, assumes that the valley bordered by a steep cliff in the value of the loss function is wide enough with respect to the clip being applied to the gradient - otherwise, the deflection caused by an SGD update would still hinder the learning process, even when clipping is used. The practical effectiveness of clipping provides some evidence in support of this assumption.<br />
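<br />
The linear single-unit case can be worked through numerically. The sketch below is illustrative only; it assumes the simplified model <math>x_t = W_{rec}^{t} x_0</math> with a scalar weight <code>w</code>, and the function name <code>linear_rnn_derivatives</code> is hypothetical.<br />

```python
def linear_rnn_derivatives(w, x0, t):
    """State and its first/second derivative w.r.t. w for x_t = w**t * x0.

    Assumption for illustration: scalar weight, linear unit, no input, b = 0.
    """
    x_t = (w ** t) * x0
    first = t * (w ** (t - 1)) * x0              # dx_t/dw
    second = t * (t - 1) * (w ** (t - 2)) * x0   # d^2 x_t/dw^2
    return x_t, first, second

# For w slightly above 1 both derivatives blow up together as t grows,
# which is the "wall" in the error surface described above.
for w in (0.9, 1.0, 1.1):
    print(w, linear_rnn_derivatives(w, 1.0, 50))
```

In this toy setting the second derivative is <math>(t-1)/W_{rec}</math> times the first, so whenever the gradient is large the curvature is larger still.<br />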
<br />
= Dealing with Exploding and Vanishing Gradient =<br />
<br />
== Related Work ==<br />
<br />
* <span>'''L1/L2 Regularization''': Helps with exploding gradients, but limits the RNN to a single point attractor at the origin, and prevents the model from learning generator models or exhibiting long-term memory traces.</span><br />
* <span>'''Teacher Forcing''': Proposed by <ref name="doya1993"></ref>, helps the model cross a bifurcation boundary if it does not exhibit asymptotic behaviour towards a desired target. This assumes the user knows what the behaviour might look like or how to initialize the model to reduce exploding gradients.</span><br />
* <span>'''LSTM''': The Long Short-Term Memory architecture by <ref name="graves2009">Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jurgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009.</ref><ref name="Hochreiter">Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.</ref> is an attempt to deal with the vanishing gradient problem by using a linear unit that feeds back to itself with a weight of <math>1</math>. This solution, however, does not deal with the exploding gradient problem.</span><br />
* <span>'''Hessian-Free optimizer with structural damping''': Proposed by <ref>Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.</ref>, this approach addresses both the vanishing and the exploding gradient problems. <ref name="pascanu"></ref> reasons that it solves the vanishing gradient problem because the high dimensionality of the spaces gives rise to a high probability for the long-term components to be orthogonal to the short-term components. Additionally, for exploding gradients, the curvature is taken into account.</span><br />
* <span>'''Echo State Networks''': <ref>Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.</ref> avoid the exploding and vanishing gradient problems by not learning the input and recurrent weights; these are instead sampled from hand-crafted distributions. Because the spectral radius of the recurrent weight matrix is usually smaller than 1 by construction, gradients cannot explode, but information fed into the model dies out exponentially fast.</span><br />
<br />
== Scaling Down the Gradients ==<br />
<br />
[[Image:gradient_clipping.png|frame|center|400px]]<br />
<br />
The intuition behind this gradient clipping algorithm is simple: compute the norm of the gradients, and if it is larger than a set threshold, scale the gradients by a constant defined as the threshold divided by the norm of the gradients. <ref name="pascanu"></ref> suggests using a threshold value from half to ten times the average gradient norm observed over a large number of updates.<br />
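<br />
The clipping rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: gradients are represented as a flat list of floats, and the name <code>clip_gradient_norm</code> is hypothetical.<br />

```python
import math

def clip_gradient_norm(grads, threshold):
    """Rescale a gradient vector so its L2 norm is at most `threshold`.

    Assumption for illustration: `grads` is a flat list of floats.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > threshold:
        scale = threshold / norm  # constant = threshold / norm of gradients
        return [g * scale for g in grads]
    return list(grads)  # small gradients are left unchanged

print(clip_gradient_norm([3.0, 4.0], 1.0))   # norm 5 is rescaled to norm 1
print(clip_gradient_norm([0.3, 0.4], 1.0))   # norm 0.5 is below the threshold
```

Because the whole vector is multiplied by one scalar, the direction of the update is preserved and only its length changes.<br />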
<br />
== Vanishing Gradient Regularization ==<br />
<br />
The vanishing gradient regularizer is as follows:<br />
<br />
<math>\Omega <br />
= \sum_{k} \Omega_{k} <br />
= \sum_{k} <br />
\left( <br />
\frac{<br />
\| <br />
\frac{\partial \varepsilon}{\partial x_{k + 1}} <br />
\frac{\partial x_{k + 1}}{\partial x_{k}}<br />
\|<br />
}<br />
{<br />
\|<br />
\frac{\partial \varepsilon}{\partial x_{k + 1}}<br />
\| <br />
} - 1<br />
\right)^{2}</math><br />
<br />
The regularizer is motivated as follows. Increasing the norm of <math>\frac{\partial x_t}{\partial x_{k}}</math> makes the error at time <math>t</math> more sensitive to all inputs <math>u_{t} \dots u_{k}</math>. In practice some of these inputs are irrelevant and noisy, and the network must learn to ignore them, but it cannot do so without an error signal. This implies that increasing the norm of <math>\frac{\partial x_t}{\partial x_{k}}</math> may come at the expense of a larger error, so it cannot always be done while following a descent direction of the error; the authors therefore argue that a regularizer is a more natural choice. The regularizer is a soft constraint that forces the Jacobian matrices <math>\frac{\partial x_{k + 1}}{\partial x_{k}}</math> to preserve norm in the direction of the error <math>\frac{\partial \varepsilon}{\partial x_{k + 1}}</math>.<br />
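<br />
To make the formula for <math>\Omega</math> concrete, the sketch below evaluates it for given error gradients and Jacobians. It is an illustration under stated assumptions, not the authors' code: each error gradient is a row vector (a list of floats), each Jacobian is a list of rows, and the helper names <code>norm</code>, <code>vec_mat</code> and <code>omega</code> are hypothetical. It only evaluates the penalty and does not differentiate it.<br />

```python
import math

def norm(v):
    """L2 norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def vec_mat(v, M):
    """Row vector v times matrix M (M is a list of rows)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def omega(error_grads, jacobians):
    """Sum over k of (||g_{k+1} J_k|| / ||g_{k+1}|| - 1)^2.

    error_grads[k] stands for dE/dx_{k+1}, jacobians[k] for dx_{k+1}/dx_k.
    """
    total = 0.0
    for g, J in zip(error_grads, jacobians):
        ratio = norm(vec_mat(g, J)) / norm(g)
        total += (ratio - 1.0) ** 2
    return total

identity = [[1.0, 0.0], [0.0, 1.0]]
shrink = [[0.5, 0.0], [0.0, 0.5]]
print(omega([[1.0, 2.0]], [identity]))  # norm preserved, no penalty
print(omega([[1.0, 2.0]], [shrink]))    # norm halved, penalty is positive
```

The penalty is zero exactly when each Jacobian preserves the norm of the back-propagated error, and it grows when the error signal shrinks or expands.<br />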
<br />
= Experimental Results =<br />
<br />
== The Temporal Order Problem ==<br />
<br />
The authors used the temporal order problem as the prototypical pathological problem for validating the clipping and regularizer they devised. The temporal order problem involves a long sequence of discrete symbols in which an <math>A</math> or a <math>B</math> symbol is placed at the beginning and at the middle of the sequence. The task is to correctly classify the order of <math>A</math> and <math>B</math> at the end of the sequence.<br />
<br />
Three different RNN initializations were used for the experiment:<br />
<br />
* <span>'''sigmoid unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
* <span>'''basic tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.1)</math></span><br />
* <span>'''smart tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
<br />
For each of the three RNNs, three different optimizer configurations were used:<br />
<br />
* <span>'''MSGD''': Mini-batch Stochastic Gradient Descent</span><br />
* <span>'''MSGD-C''': MSGD with Gradient Clipping</span><br />
* <span>'''MSGD-CR''': MSGD-C with Regularization</span><br />
<br />
Additional model parameters include:<br />
<br />
* <span>Spectral Radius of 0.95</span><br />
* <span><math>b = b_{out} = 0</math></span><br />
* <span>50 hidden neurons</span><br />
* <span>constant learning rate of 0.01</span><br />
* <span>clipping threshold of 1 (only for MSGD-C and MSGD-CR)</span><br />
* <span>regularization weight of 4 (MSGD-CR)</span><br />
<br />
The experiment was performed 5 times. The figure below shows the importance of gradient clipping and the regularizer: in all cases the combination of the two methods yielded the best results, regardless of which unit network was used. Furthermore, this experiment provided empirical evidence that exploding gradients correlate with tasks that require long memory traces: as the sequence length of the problem increases, clipping and regularization become more important. This is largely because increased memory yields a larger spectral radius, which in turn increases the likelihood of gradient explosion.<br />
<br />
[[Image:experimental_results.png|frame|center|400px]]<br />
<br />
== Other Pathological Problems ==<br />
<br />
The authors repeated other pathological problems from <ref name="Hochreiter"></ref>. The results are listed below:<br />
<br />
[[Image:experimental_results_2.png|image]]<br />
<br />
The authors did not discuss the experimental results in detail here.<br />
<br />
= Summary =<br />
<br />
The paper explores two different perspectives, the dynamical systems approach and the geometric approach, for explaining the exploding and vanishing gradient problems in training RNNs. The authors devised methods to mitigate these problems by introducing gradient clipping and a vanishing-gradient regularizer. Their experimental results show that, in all cases except the Penn Treebank dataset, clipping combined with the regularizer bested the state of the art for RNNs in the respective experiments.</div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=imageNet_Classification_with_Deep_Convolutional_Neural_Networks&diff=26908imageNet Classification with Deep Convolutional Neural Networks2015-11-26T23:26:14Z<p>Rtwang: /* Results */</p>
<hr />
<div>== Introduction ==<br />
<br />
In this paper, the authors trained a large, deep neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into 1000 different classes. To learn about thousands of objects from millions of images, a Convolutional Neural Network (CNN) is used because of its large learning capacity, relatively few connections and parameters, and outstanding performance on image classification.<br />
<br />
Moreover, current GPUs provide a powerful tool to facilitate the training of interestingly large CNNs. Thus, they trained one of the largest convolutional neural networks to date on the datasets of ILSVRC-2010 and ILSVRC-2012 and achieved the best results reported on these datasets at the time the paper was written.<br />
<br />
The code of their work is available here<ref><br />
[http://code.google.com/p/cuda-convnet/ "High-performance C++/CUDA implementation of convolutional neural networks"]<br />
</ref>.<br />
<br />
== Dataset ==<br />
<br />
ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has roughly 1.2 million labeled high-resolution training images, 50 thousand validation images, and 150 thousand testing images over 1000 categories.<br />
<br />
In this paper, the images in this dataset are down-sampled to a fixed resolution of 256 x 256. The only image pre-processing they used is subtracting the mean activity over the training set from each pixel.<br />
<br />
== Architecture ==<br />
<br />
=== ReLU Nonlinearity ===<br />
<br />
Non-saturating nonlinearity ''f(x) = max(0,x)'' also known as Rectified Linear Units (ReLUs)<ref><br />
Nair V, Hinton G E. [http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_NairH10.pdf Rectified linear units improve restricted boltzmann machines.] Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010: 807-814.<br />
</ref> is used as the nonlinearity function; networks with ReLUs train several times faster than equivalents with standard saturating neurons. Neural networks are usually ill-conditioned and converge very slowly. With rectifier nonlinearities, gradients flow along a few paths instead of all possible paths, resulting in faster convergence. Thus, better performance can be achieved by reducing the training time for each epoch and training on larger datasets to prevent overfitting. Deep convolutional neural networks with ReLUs train several times faster than their equivalents with tanh units. The following figure illustrates this: it shows the number of iterations required to reach 25% training error on the CIFAR-10 dataset for a particular four-layer convolutional network.<br />
<br />
[[File:Fig1.png]]<br />
<br />
A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons (dashed line). The learning rates for each network were chosen independently to make training as fast as possible. No regularization of any kind was employed. The magnitude of the effect demonstrated here varies with network architecture, but networks with ReLUs consistently learn several times faster than equivalents with saturating neurons.<br />
<br />
=== Training on Multiple GPUs ===<br />
<br />
They spread the net across two GPUs by putting half of the kernels (or neurons) on each GPU and letting the GPUs communicate only in certain layers. Choosing the pattern of connectivity is a problem for cross-validation, but this scheme allows them to tune the amount of communication precisely until it is an acceptable fraction of the amount of computation.<br />
<br />
=== Local Response Normalization ===<br />
<br />
ReLUs have the desirable property that they do not require input normalization to prevent them from saturating. However, they find that a local response normalization scheme after applying the ReLU nonlinearity can reduce their top-1 and top-5 error rates by 1.4% and 1.2%.<br />
<br />
The response normalization is given by the expression<br />
<br />
<math>b_{x,y}^{i}=a_{x,y}^{i}/\left ( k+\alpha \sum_{j=max\left ( 0,i-n/2 \right )}^{min\left ( N-1,i+n/2 \right )}\left ( a_{x,y}^{j} \right )^{2} \right )^{\beta }</math><br />
<br />
where the sum runs over n “adjacent” kernel maps at the same spatial position. This response normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities amongst neuron outputs computed using different kernels.<br />
<br />
The constants k, n, α, and β are hyper-parameters whose values are determined using a validation set; k = 2, n = 5, α = 10<sup>−4</sup>, and β = 0.75 were used in this research. This normalization was used after applying the ReLU nonlinearity in certain layers.<br />
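<br />
The normalization expression can be sketched for a single spatial position as follows. This is an illustrative sketch rather than the authors' CUDA code: the input is assumed to be a plain list holding the activations <math>a_{x,y}^{i}</math> of all <math>N</math> kernel maps at one position, <math>n/2</math> is assumed to mean integer division, and the name <code>local_response_norm</code> is hypothetical.<br />

```python
def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Local response normalization across kernel maps at one (x, y) position.

    Assumptions for illustration: `a` is a list of N activations, one per
    kernel map, and n / 2 in the formula is taken as integer division.
    """
    N = len(a)
    out = []
    for i in range(N):
        lo = max(0, i - n // 2)
        hi = min(N - 1, i + n // 2)
        # Sum of squared activations of the adjacent kernel maps j.
        s = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * s) ** beta)
    return out

print(local_response_norm([1.0, 50.0, 2.0, 0.5, 3.0, 1.5]))
```

A large activation in one kernel map increases the denominator for its neighbours, which is the lateral-inhibition effect described above.<br />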
<br />
=== Overlapping Pooling ===<br />
<br />
Unlike traditional non-overlapping pooling, they use overlapping pooling throughout their network, with pooling window size z = 3 and stride s = 2. This scheme reduces their top-1 and top-5 error rates by 0.4% and 0.3% and makes the network more difficult to overfit.<br />
<br />
=== Overall Architecture ===<br />
<br />
[[File:network.JPG | center]]<br />
<br />
As shown in the figure above, the net contains eight layers with 60 million parameters; the first five are convolutional and the remaining three are fully connected layers. The first convolutional layer filters the 224 × 224 × 3 input image with 96 kernels of size 11 × 11 × 3 with a stride of 4 pixels (this is the distance between the receptive field centers of neighboring neurons in a kernel map). The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5 × 5 × 48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The third convolutional layer has 384 kernels of size 3 × 3 × 256 connected to the (normalized, pooled) outputs of the second convolutional layer. The fourth convolutional layer has 384 kernels of size 3 × 3 × 192, and the fifth convolutional layer has 256 kernels of size 3 × 3 × 192. The fully-connected layers have 4096 neurons each. The output of the last layer is fed to a 1000-way softmax. Their network maximizes the average across training cases of the log-probability of the correct label under the prediction distribution.<br />
<br />
Response-normalization layers follow the first and second convolutional layers. Max-pooling layers follow both response-normalization layers as well as the fifth convolutional layer. The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.<br />
<br />
== Reducing overfitting ==<br />
<br />
=== Data Augmentation ===<br />
<br />
The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations. In this paper, the transformed images are generated on the CPU while the GPU is training on the previous batch, so they do not need to be stored on disk.<br />
<br />
The first form of data augmentation consists of generating image translations and horizontal reflections.<br />
They extract random 224 x 224 patches (and their horizontal reflections) from the 256 x 256 images and train the network on these extracted patches. The second form alters the intensities of the RGB channels: they perform principal components analysis (PCA) on the set of RGB pixel values. To each training image, multiples of the found principal components are added, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1. Therefore, to each RGB image pixel the following quantity is added:<br />
<br />
[[File:Fig2.png]]<br />
<br />
This scheme helps to capture the object identity invariant with respect to its intensity and color, which reduces the top-1 error rate by over 1%.<br />
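<br />
The first form of augmentation (random crops and horizontal reflections) can be sketched as below. This is a minimal illustration and not the authors' pipeline: an image is assumed to be a list of rows of pixel values, and the function name <code>random_crop_and_flip</code> and its <code>rng</code> parameter are hypothetical.<br />

```python
import random

def random_crop_and_flip(image, crop=224, rng=random):
    """Return a random crop x crop patch, mirrored horizontally half the time.

    Assumption for illustration: `image` is a list of rows, each row a list
    of pixels (a pixel may be a number or an (R, G, B) tuple).
    """
    height, width = len(image), len(image[0])
    top = rng.randrange(height - crop + 1)
    left = rng.randrange(width - crop + 1)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # horizontal reflection
    return patch

# Tiny example: an 8 x 8 "image" cropped to 4 x 4.
toy = [[r * 8 + c for c in range(8)] for r in range(8)]
for row in random_crop_and_flip(toy, crop=4):
    print(row)
```

Each call produces a different patch, so the network rarely sees exactly the same input twice even though no extra images are stored.<br />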
<br />
=== Dropout ===<br />
<br />
The “dropout” technique is implemented in the first two fully-connected layers by setting to zero the output of each hidden neuron with probability 0.5. This scheme roughly doubles the number of iterations required to converge. However, it forces the network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.<br />
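<br />
A minimal sketch of the dropout step described above follows. It is not the authors' implementation: activations are assumed to be a list of floats, and the names <code>dropout</code> and <code>scale_for_test</code> are hypothetical. The test-time scaling by 0.5 follows the original paper rather than the summary above.<br />

```python
import random

def dropout(activations, p_drop=0.5, rng=random):
    """Training time: zero each hidden activation with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]

def scale_for_test(activations, p_drop=0.5):
    """Test time: keep all neurons but scale outputs by (1 - p_drop)."""
    return [a * (1.0 - p_drop) for a in activations]

hidden = [0.2, 1.5, 0.0, 3.1, 0.7, 2.4]
print(dropout(hidden))         # a different random subset survives each call
print(scale_for_test(hidden))  # deterministic, used when predicting
```

Because a different subset of neurons is active on every presentation, no neuron can rely on the presence of particular other neurons.<br />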
<br />
== Details of learning ==<br />
<br />
They trained the network using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. The update rule for weight w was<br />
<br />
<math>v_{i+1}:=0.9\cdot v_{i}-0.0005\cdot \epsilon \cdot w_{i}-\epsilon \cdot \left \langle \frac{\partial L}{\partial w}|_{w_{i}} \right \rangle_{D_{i}}</math><br />
<br />
<math>w_{i+1}:=w_{i}+v_{i+1}</math><br />
<br />
where <math>i</math> is the iteration index, <math>v</math> is the momentum variable, <math>\epsilon</math> is the learning rate, and <math>\left \langle \frac{\partial L}{\partial w}|_{w_{i}} \right \rangle_{D_{i}}</math> is the average over the <math>i</math>-th batch <math>D_{i}</math> of the derivative of the objective with respect to <math>w</math>, evaluated at <math>w_{i}</math>. The weights in each layer are initialized from a zero-mean Gaussian distribution with standard deviation 0.01. The biases in the second, fourth, and fifth convolutional layers and in the fully-connected hidden layers are initialized to 1, while those in the remaining layers are initialized to 0. This initialization accelerates the early stages of learning by providing the ReLUs with positive inputs. Initializing the network with sparse weights is another factor that reduces the ill-conditioning issue and helps this network work well. An equal learning rate was used for all layers and was adjusted manually throughout training. The heuristic followed was to divide the learning rate by 10 when the validation error rate stopped improving with the current learning rate. The learning rate was initialized at 0.01 and reduced three times prior to termination. The network was trained for roughly 90 cycles through the training set of 1.2 million images, which took five to six days on two NVIDIA GTX 580 3GB GPUs.<br />
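<br />
The update rule above can be written out for a single scalar weight. This is a sketch for illustration, not the authors' code: the function name <code>sgd_step</code> is hypothetical, and <code>grad</code> stands for the batch-averaged derivative of the objective with respect to the weight.<br />

```python
def sgd_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One update of the momentum + weight-decay rule for a scalar weight.

    v_{i+1} = momentum * v_i - weight_decay * lr * w_i - lr * grad
    w_{i+1} = w_i + v_{i+1}
    """
    v_new = momentum * v - weight_decay * lr * w - lr * grad
    w_new = w + v_new
    return w_new, v_new

# A few steps on the toy objective L(w) = w^2 / 2, whose gradient is w.
w, v = 1.0, 0.0
for step in range(5):
    w, v = sgd_step(w, v, grad=w, lr=0.01)
    print(step, w, v)
```

The same function would be applied element-wise to every weight; the learning rate <code>lr</code> is passed in explicitly because it was lowered by hand during training.<br />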
<br />
== Results ==<br />
<br />
For the ILSVRC-2010 dataset, their network achieves top-1 and top-5 test set error rates of 37.5% and 17.0%, which was the state of the art at that time.<br />
<br />
The following table shows the results<br />
<br />
[[File:Tt1.png]]<br />
<br />
Comparison of results on the ILSVRC-2010 test set. In italics are the best results achieved by others.<br />
<br />
For the ILSVRC-2012 dataset, the CNN described in this paper achieves a top-5 error rate of 18.2%. Averaging the predictions of five similar CNNs gives an error rate of 16.4%. The following table summarizes the results for the ILSVRC-2012 dataset:<br />
<br />
[[File:Tt3.png]]<br />
<br />
<br />
<br />
The following figure shows the learnt kernels<br />
<br />
[[File:Figg3.png]]<br />
<br />
96 convolutional kernels of size 11×11×3 learned by the first convolutional layer on the 224×224×3 input images. The top 48 kernels were learned on GPU 1 while the bottom 48 kernels were learned on GPU 2. See Section 6.1 of the paper for details.<br />
<br />
=== Image Retrieval ===<br />
<br />
The convolutional network predicts the image's class based on the last hidden layer with 4096 nodes. If two different images have very similar activation values for these 4096 hidden nodes, then the network would predict the same class for both images and would treat them as very similar. Since the network is quite accurate based on the results, we can expect that if two images have very similar activations in the last hidden layer, they correspond to the same class.<br />
<br />
Based on this, the network provides an excellent way of mapping images to a 4096-dimensional vector such that images of the same class should have similar values. This means that after the network has been trained, images can be fed into the network and their 4096-dimensional vectors stored. Afterwards, image retrieval, i.e. retrieving similar images given a query image, is a simple matter of finding other images with similar vectors based on measures such as Euclidean distance. This can be seen in the following figure, for which the researchers retrieved, for several query images, the images whose vectors were closest in Euclidean distance:<br />
<br />
[[File:Similarimg.PNG]]<br />
<br />
This has a strong advantage over encoder methods in that it accounts for the meaning of the image, i.e. the type of object, rather than just similarities based on colour or shape. This can be seen above: despite large differences in shading, angle, and colour, the method still retrieved images containing the same object. An issue, though, is that calculating Euclidean distances for large numbers of 4096-dimensional vectors is not very efficient, and the researchers proposed training an auto-encoder to compress these vectors into short binary codes. This means that all images would be mapped to a binary code of 0s and 1s.<br />
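<br />
The retrieval step can be sketched as a nearest-neighbour search over stored feature vectors. This is a minimal illustration, not the authors' code: the short toy vectors stand in for the 4096-dimensional activations, and the names <code>euclidean</code>, <code>nearest_images</code> and the image keys are hypothetical.<br />

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_images(query, database, top_k=3):
    """Return the names of the top_k stored images closest to `query`.

    Assumption for illustration: `database` maps an image name to its
    stored feature vector (the last-hidden-layer activations).
    """
    ranked = sorted(database.items(), key=lambda item: euclidean(query, item[1]))
    return [name for name, _ in ranked[:top_k]]

# Toy 3-dimensional features standing in for 4096-dimensional ones.
features = {
    "elephant_1": [0.9, 0.1, 0.0],
    "elephant_2": [0.8, 0.2, 0.1],
    "pumpkin_1": [0.0, 0.9, 0.7],
}
print(nearest_images([0.85, 0.15, 0.05], features, top_k=2))
```

This brute-force search compares the query against every stored vector, which is the inefficiency that motivates the binary-code proposal mentioned above.<br />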
<br />
== Discussion ==<br />
<br />
1. The main techniques that allowed this success include the following: efficient GPU training, a large number of labeled examples, a convolutional architecture with max-pooling, rectifying non-linearities, careful initialization, careful parameter updates and adaptive learning rate heuristics, layerwise feature normalization, and a dropout trick based on injecting strong binary multiplicative noise on hidden units. <br />
<br />
2. It is notable that their network’s performance degrades if a single convolutional layer is removed. So the depth of the network is important for achieving their results.<br />
<br />
3. Their experiments suggest that the results can be improved simply by waiting for faster GPUs and bigger datasets to become available.<br />
<br />
== Bibliography ==<br />
<references /></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Similarimg.PNG&diff=26907File:Similarimg.PNG2015-11-26T23:21:42Z<p>Rtwang: </p>
<hr />
<div></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Neural_Nets_as_a_Method_for_Quantitative_Structure%E2%80%93Activity_Relationships&diff=26896deep Neural Nets as a Method for Quantitative Structure–Activity Relationships2015-11-26T22:04:07Z<p>Rtwang: /* Regularization */</p>
<hr />
<div>== Introduction ==<br />
This is a summary of the paper "Deep Neural Nets as a Method for Quantitative Structure−Activity Relationships" by Ma J. et al. <ref> Ma J, Sheridan R. et al. [http://pubs.acs.org/doi/pdf/10.1021/ci500747n.pdf "QSAR deep nets"] Journal of Chemical Information and Modeling. 2015, 55, 263-274</ref>. The paper applies machine learning methods, specifically Deep Neural Networks <ref> Hinton, G. E.; Osindero, S.; Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation 2006, 18, 1527−1554</ref> and Random Forest models <ref> Breiman L. Random Forests, Machine Learning. 2001, 45, 5-32</ref>, to problems in the pharmaceutical industry. Discovering a drug requires selecting, among chemical compounds with different molecular structures, the combination that achieves the best biological activity. Quantitative Structure-Activity Relationship (QSAR) models are now routinely used for this purpose. Structure-Activity Relationship (SAR) modelling, or Quantitative SAR (QSAR), is an approach designed to find relationships between the chemical structure and the biological activity (or target property) of the compounds studied. QSAR models are classification or regression models in which the predictors are physicochemical properties or theoretical molecular descriptors, and the response variable is a biological activity of the chemicals, such as the concentration of a substance required to give a certain biological response. The basic idea behind these methods is that the activity of a molecule is reflected in its structure, so similar molecules have similar activities. If we learn the activities of a set of molecular structures (or combinations of molecules), we can then predict the activities of similar molecules. 
QSAR methods are often computationally intensive or require the adjustment of many sensitive parameters to achieve good predictions. Machine learning methods can help here, and two of them, support vector machines (SVM) and random forests (RF), are commonly used <ref>Svetnik, V. et al., [http://pubs.acs.org/doi/pdf/10.1021/ci034160g.pdf Random forest: a classification and regression tool for compound classification and QSAR modeling], J. Chem. Inf. Comput. Sci. 2003, 43, 1947−1958 </ref>. In this paper the authors investigate the prediction performance of DNNs as a QSAR method and compare it with that of RF, which is widely regarded as a gold standard in this field. <br />
<br />
<br />
== Motivation ==<br />
At the first stage of drug discovery there is a huge number of candidate compounds that can be combined to produce a new drug. The process may involve a large number of compounds (>100 000) with different biological activities and a large number of descriptors (several thousand). Measuring every biological activity for every compound would require a very large number of experiments. In silico discovery and the use of optimization algorithms can substantially reduce the experimental work that needs to be done. The authors hypothesized that DNN models outperform RF models. <br />
<br />
== Methods ==<br />
To compare the prediction performance of the two methods, DNNs and RFs were fitted to 15 data sets from the pharmaceutical company Merck (referred to below as the Kaggle data sets). The smallest data set has 2092 molecules with 4596 unique AP and DP descriptors. Each molecule is represented by a list of features, called “descriptors” in QSAR nomenclature. The descriptors are substructure descriptors (e.g., atom pairs (AP), MACCS keys, circular fingerprints, etc.) and donor-acceptor pair (DP) descriptors. Both kinds of descriptor have the following form:<br />
<br />
atom type i − (distance in bonds) − atom type j<br />
<br />
Here, for AP, the atom type includes the element, the number of nonhydrogen neighbors, and the number of pi electrons. For DP, the atom type is one of seven: cation, anion, neutral donor, neutral acceptor, polar, hydrophobe, and other. A separate group of 15 data sets, the Additional Data Sets, was used to validate the conclusions drawn from the Kaggle data sets. Each data set was split into a training set and a test set. The metric used to evaluate prediction performance is the coefficient of determination (<math>R^2</math>). <br />
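As an illustration of the metric, the sketch below computes <math>R^2</math> as the squared Pearson correlation between observed and predicted activities. This is one common definition; that it matches the exact formula used in the paper is an assumption, and the function name and toy numbers are hypothetical.

```python
def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values.

    Assumption: R^2 is taken to be the squared correlation; other
    definitions (e.g. 1 - SS_res / SS_tot) can give different numbers.
    """
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)

# Hypothetical activities: the predictions are perfectly correlated with the observations.
print(r_squared([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
```

Under this definition a model is rewarded for ranking and spacing compounds correctly even if its predictions are off by a constant scale or offset.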
<br />
To run RF, 100 trees were generated with m/3 descriptors used at each branch point, where m is the number of unique descriptors in the training set. Tree nodes with 5 or fewer molecules were not split further. Tree construction was parallelized to run one tree per processor on a cluster, so that larger data sets could be handled in a reasonable time.<br />
<br />
The DNNs take the descriptors X of a molecule as input, and each neuron produces an output of the form <math>O=f(\sum_{i=1}^{N} w_ix_i+b)</math>, where the <math>x_i</math> are the neuron's inputs, the <math>w_i</math> are its weights, b is its bias, and f is the activation function. Because many parameters, such as the number of layers and the number of neurons, influence the performance of a deep neural net, Ma and his colleagues carried out a sensitivity analysis. They trained 71 DNNs with different parameter settings for each data set. The parameters they considered relate to: <br />
<br />
-Data: descriptor transformation (no transformation, logarithmic transformation, or binary transformation). <br />
<br />
-Network architecture: number of hidden layers, number of neurons in each hidden layer.<br />
<br />
-Activation functions: sigmoid or rectified linear unit.<br />
<br />
-The DNN training strategy: single training set or joint from multiple sets, percentage of neurons to drop-out in each layer.<br />
<br />
-The mini-batched stochastic gradient descent procedure in the backpropagation (BP) algorithm: the minibatch size, number of epochs.<br />
<br />
-Control of the gradient descent optimization procedure: learning rate, momentum strength, and weight cost strength.<br />
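Two of the items above, the descriptor transformation and the activation function, can be made concrete with a small sketch. The formulas log(x + 1) for the logarithmic transformation and "1 if x > 0, else 0" for the binary transformation are assumptions about what those options mean here, and all function names are hypothetical.

```python
import math

def log_transform(descriptors):
    """Assumed logarithmic transformation: y = log(x + 1)."""
    return [math.log(x + 1.0) for x in descriptors]

def binary_transform(descriptors):
    """Assumed binary transformation: y = 1 if the descriptor is present, else 0."""
    return [1.0 if x > 0 else 0.0 for x in descriptors]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x, w, b, f):
    """O = f(sum_i w_i * x_i + b) for a single neuron."""
    return f(sum(wi * xi for wi, xi in zip(w, x)) + b)

counts = [0.0, 3.0, 1.0]                      # hypothetical descriptor counts for one molecule
print(binary_transform(counts))
print(neuron_output(log_transform(counts), [0.5, -1.0, 2.0], 0.1, relu))
print(neuron_output(log_transform(counts), [0.5, -1.0, 2.0], 0.1, sigmoid))
```

Both transformations compress the range of raw descriptor counts, which is one plausible reason they are treated as tunable choices.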
<br />
In addition to the effect of these parameters on the DNN, the authors were interested in the consistency of results across a diverse set of QSAR tasks. Because evaluating the effect of a large number of adjustable parameters is time-consuming, a manageable number of parameter settings was selected by adjusting the values of one or two parameters at a time, and <math>R^2</math> was then calculated for DNNs trained with the selected settings. These results allowed the authors to focus on a smaller number of parameters and, finally, to produce a set of recommended values for all algorithmic parameters that can lead to consistently good predictions. <br />
<br />
=== Regularization ===<br />
<br />
A very common problem with deep neural networks is overfitting, since the number of weights grows rapidly as layers and nodes are added. The researchers considered two methods for dealing with this: dropout, which was described in a previous summary, and pre-training.<br />
<br />
The general method for pre-training goes as follows:<br />
<br />
1. Break down the deep neural network into its successive layers.<br />
<br />
2. For each layer, take the input (either the data or the previous layer's output) and train the layer to project the input in a way that captures the maximum amount of variation, similar to dimension reduction techniques such as PCA. This is usually done either with auto-encoders, which encode the input in a lower dimension, or with Restricted Boltzmann machines.<br />
<br />
3. After each layer has been trained this way, the parameters of the model are now initialized with some set of weights that depend on the data.<br />
<br />
The regularizing effect works as follows. Consider the surface of the objective function as a function of the weights. Because of the complexity of neural networks, this surface varies significantly and contains many local minima. Gradient descent tends to get trapped in local minima, and it can be difficult to reach a better minimum starting from random weights. The hope is that, by first training the deep neural network to capture almost all of the variation in the data, the resulting weights will lie near a good local minimum, from which gradient descent can then fine-tune towards a better solution. This is similar to the idea of combining PCA with some other classifier: first map the points to a subspace in which they are more easily linearly separable, so that the classifier has an easier task. Equivalently, once the first few layers have projected the points to a more easily separable subspace, subsequent layers in the network can work on classifying the projected points. If the pre-trained weights are near a local minimum, gradient descent heavily restricts their range of values, since it moves towards that minimum immediately, and this restriction acts as a regularizer on the whole neural network.<br />
<br />
However, when the researchers tried this with some modifications to accommodate their code, it did not improve results.<br />
<br />
== Results ==<br />
<br />
The first objective of the paper was to compare the performance of DNNs with that of RF, so over 50 DNNs were trained using different parameter settings. The settings were arbitrarily selected, but chosen to cover a sufficient range of values for each adjustable parameter. Figure 1 shows the difference in <math>R^2</math> between DNNs and RF for each Kaggle data set. Each column represents a QSAR data set, and each circle represents the improvement of a DNN over RF.<br />
<br />
<br />
<center><br />
[[File: fig1.PNG | frame | center |Figure 1. Overall DNN vs RF using arbitrarily selected parameter values. Each column represents a QSAR data set, and each circle represents the improvement, measured in <math>R^2</math>, of a DNN over RF ]]<br />
</center><br />
<br />
Comparing the performance of the different models shows that even when the worst DNN parameter setting was used for each QSAR task, the average <math>R^2</math> degraded only from 0.423 to 0.412, a reduction of merely 2.6%. These results suggest that DNNs can generally outperform RF (see the table below).<br />
<br />
<br />
<center><br />
[[File: table1.PNG | frame | center |Table 1. Comparing test <math>R^2</math> of different models ]]<br />
</center><br />
<br />
The difference in <math>R^2</math> between DNN and RF as the network architecture changes is shown in Figure 2. To limit the number of parameter combinations, the authors used the same number of neurons in every hidden layer. Thirty-two DNNs were trained for each data set by varying the number of hidden layers and the number of neurons per layer while the other key adjustable parameters were kept unchanged. When there are two hidden layers, having only a small number of neurons per layer degrades the predictive capability of the DNNs. It can also be seen that, for any number of hidden layers, once the number of neurons per layer is sufficiently large, increasing it further has only a marginal benefit. Figure 2 also shows that a neural network with only one hidden layer of 12 neurons achieved the same average predictive capability as RF. A network of this size is comparable to the classical neural networks used in QSAR.<br />
<br />
<center><br />
[[File: fig2.PNG | frame | center |Figure 2. Impacts of Network Architecture. Each marker in the plot represents a choice of DNN network architecture. Markers that share the same number of hidden layers are connected with a line. The measurement (i.e., y-axis) is the difference of the mean <math>R^2</math> between DNNs and RF. ]]<br />
</center><br />
<br />
To decide which activation function, Sigmoid or ReLU, performs better, at least 15 pairs of DNNs were trained for each data set. The two DNNs in each pair shared the same adjustable parameter settings, except that one used ReLU as the activation function and the other used Sigmoid. The difference was tested with a one-sample Wilcoxon test. In Figure 3, the data sets where ReLU is significantly better than Sigmoid are colored in blue and marked at the bottom with “+”s, while the data set where Sigmoid is significantly better than ReLU is colored in black and marked at the bottom with “−”s. In 8 out of 15 data sets (53.3%), ReLU is statistically significantly better than Sigmoid. Overall, ReLU improves the average <math>R^2</math> over Sigmoid by 0.016. <br />
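To illustrate the kind of test involved, here is a minimal sketch of a one-sample Wilcoxon signed-rank test applied to paired <math>R^2</math> differences (ReLU minus Sigmoid). The differences are made-up numbers, the function names are hypothetical, and the exact one-sided p-value by enumeration is an implementation choice for this sketch, not necessarily what the authors did.

```python
from itertools import product

def signed_ranks(diffs):
    """Rank |d| over the non-zero differences (average ranks for ties), keeping the sign."""
    nonzero = [d for d in diffs if d != 0]
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and abs(nonzero[order[end + 1]]) == abs(nonzero[order[start]])):
            end += 1
        average_rank = (start + end) / 2 + 1      # ranks are 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1
    return [(rank, d > 0) for rank, d in zip(ranks, nonzero)]

def wilcoxon_one_sided(diffs):
    """Return (W_plus, exact p-value) for the alternative 'differences tend to be positive'."""
    pairs = signed_ranks(diffs)
    ranks = [rank for rank, _ in pairs]
    w_plus = sum(rank for rank, positive in pairs if positive)
    # Under the null hypothesis every sign pattern is equally likely: enumerate all 2^n.
    at_least = sum(
        1 for signs in product((0, 1), repeat=len(ranks))
        if sum(rank for rank, s in zip(ranks, signs) if s) >= w_plus
    )
    return w_plus, at_least / 2 ** len(ranks)

# Hypothetical R^2 differences (ReLU minus Sigmoid) for five DNN pairs on one data set.
print(wilcoxon_one_sided([0.01, 0.02, -0.005, 0.03, 0.015]))
```

Enumerating all sign patterns is feasible for the roughly 15 pairs per data set mentioned above (2^15 patterns); for much larger samples a normal approximation would normally be used instead.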
<br />
<center><br />
[[File: fig3.PNG | frame | center |Figure 3. Choice of activation functions. Each column represents a QSAR data set, and each circle represents the difference, measured in <math>R^2</math>, of a pair of DNNs trained with ReLU and Sigmoid, respectively ]]<br />
</center><br />
<br />
Figure 4 presents the difference between joint DNNs trained on multiple data sets and individual DNNs trained on single data sets. Averaged over all data sets, joint DNNs appear to perform better than individually trained ones. However, the size of the training set plays a critical role in whether a joint DNN is beneficial. For the two largest data sets (3A4 and LOGD), the individual DNNs seem better, indicating that joint DNNs are more suitable for data sets that are not very large. <br />
<br />
<center><br />
[[File: fig4.PNG | frame | center |Figure 4. Difference between joint DNNs trained with multiple data sets and the individual DNNs trained with single data sets ]]<br />
</center><br />
<br />
The authors then refined their selection of DNN adjustable parameters by studying the results of the previous runs. They used the logarithmic transformation, two hidden layers with at least 250 neurons each, and ReLU as the activation function. The results are shown in Figure 5. Comparing these results with those in Figure 1 shows that DNNs now outperform RF even with the “worst” parameter setting in 9 out of 15 data sets, compared with 4 out of 15 before. The <math>R^2</math> averaged over all DNNs and all 15 data sets is 0.051 higher than that of RF.<br />
<br />
<center><br />
[[File: fig5.PNG | frame | center |Figure 5. DNN vs RF with refined parameter settings ]]<br />
</center><br />
<br />
As a conclusion of the sensitivity analysis carried out in this work, the authors recommended the following settings for the adjustable parameters of DNNs:<br />
-Logarithmic transformation. <br />
<br />
-Four hidden layers, with 4000, 2000, 1000, and 1000 neurons, respectively.<br />
<br />
-Dropout rates of 0 in the input layer, 25% in the first 3 hidden layers, and 10% in the last hidden layer.<br />
<br />
-The activation function of ReLU.<br />
<br />
-No unsupervised pretraining. The network parameters should be initialized as random values.<br />
<br />
-Large number of epochs.<br />
<br />
-Learning rate of 0.05, momentum strength of 0.9, and weight cost strength of 0.0001.<br />
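The recommended settings can be collected in a small configuration, which also makes it easy to see how large the resulting network is. The dictionary keys and the helper function are hypothetical names; the input size of 4596 is borrowed from the smallest data set described in the Methods, and a single output unit is an assumption. No epoch count is included because the recommendation above only says "large".

```python
# Recommended DNN settings as listed above (key names are illustrative only).
RECOMMENDED = {
    "descriptor_transform": "log",
    "hidden_layers": [4000, 2000, 1000, 1000],
    "dropout": {"input": 0.0, "hidden": [0.25, 0.25, 0.25, 0.10]},
    "activation": "relu",
    "unsupervised_pretraining": False,   # weights start from random values
    "learning_rate": 0.05,
    "momentum": 0.9,
    "weight_cost": 0.0001,
}

def count_parameters(n_inputs, hidden_layers, n_outputs=1):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    sizes = [n_inputs] + list(hidden_layers) + [n_outputs]
    weights = sum(a * b for a, b in zip(sizes, sizes[1:]))
    biases = sum(sizes[1:])
    return weights + biases

# Assumed example: 4596 input descriptors and one output unit.
print(count_parameters(4596, RECOMMENDED["hidden_layers"]))  # 29393001
```

With roughly 29 million parameters against a few thousand molecules in the smallest data set, it is plausible that dropout and weight cost matter a great deal for such a network.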
<br />
To check the consistency of DNN predictions, which was one of the authors' concerns, they compared the performance of RF and DNN on 15 additional QSAR data sets. Each additional data set was time-split into training and test sets in the same way as the Kaggle data sets. Individual DNNs were trained on the training set using the recommended parameters, and the <math>R^2</math> of the DNN and of RF were calculated on the test sets. The table below presents the results for the additional data sets. The DNN with recommended parameters outperforms RF in 13 out of the 15 additional data sets. The mean <math>R^2</math> of DNNs is 0.411, while that of RFs is 0.361, an improvement of 13.9%.<br />
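The two percentages quoted in this section follow directly from the reported mean <math>R^2</math> values; the short check below reproduces the arithmetic (the helper name is hypothetical).

```python
def relative_change(old, new):
    """Change from `old` to `new`, expressed as a fraction of `old`."""
    return (new - old) / old

# Worst-case DNN setting versus the reference DNN average: 0.423 -> 0.412.
print(round(100 * relative_change(0.423, 0.412), 1))   # -2.6
# Mean R^2 on the additional data sets: RF 0.361 -> DNN 0.411.
print(round(100 * relative_change(0.361, 0.411), 1))   # 13.9
```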
<br />
<center><br />
[[File: table2.PNG | frame | center |Comparing RF with DNN trained using recommended parameter settings on 15 additional datasets]]<br />
</center><br />
<br />
Both RF and DNN can be efficiently sped up using high-performance computing technologies, but in different ways because of the inherent differences in their algorithms. RF can be accelerated using coarse parallelization on a cluster by giving one tree per node. In contrast, DNN can efficiently make use of the parallel computation capability of a modern GPU. With the dramatic advance in GPU hardware and the increasing availability of GPU computing resources, DNN can become comparable to RF, if not more advantageous, in various aspects, including ease of implementation, computation time, and hardware cost.<br />
<br />
== Discussion ==<br />
This paper demonstrates that DNNs can in most cases be used as a practical QSAR method in place of RF, which is currently regarded as a gold standard in the field of drug discovery. Although the improvement in the coefficient of determination relative to RF is small for some data sets, on average DNNs perform better than RF. The paper recommends a set of values for all DNN algorithmic parameters that are appropriate for large QSAR data sets in an industrial drug discovery environment. The authors also comment on how RF and DNN can be sped up using high-performance computing technologies: RF can be accelerated using coarse parallelization on a cluster by giving one tree per node, whereas DNN can make efficient use of the parallel computation capability of a modern GPU. <br />
<br />
== Future Works ==<br />
<br />
Contrary to the expectation that unsupervised pretraining plays a critical role in the success of DNNs, in this study it had an adverse effect on performance in QSAR tasks, which needs further investigation.<br />
Although the paper makes some recommendations about the adjustable parameters of DNNs, there is still a need for an effective and efficient strategy for refining these parameters for each particular QSAR task.<br />
The results of the paper suggest that cross-validation was not effective for fine-tuning the algorithmic parameters. Therefore, rather than relying on automatic methods for tuning DNN parameters, new approaches need to be developed that better indicate a DNN’s predictive capability on a time-split test set.<br />
<br />
== Bibliography ==<br />
<references /></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Neural_Nets_as_a_Method_for_Quantitative_Structure%E2%80%93Activity_Relationships&diff=26895deep Neural Nets as a Method for Quantitative Structure–Activity Relationships2015-11-26T22:03:46Z<p>Rtwang: </p>
<hr />
<div>== Introduction ==<br />
This is a summary of the paper "Deep Neural Nets as a Method for Quantitative Structure−Activity Relationships" by Ma J. et al. <ref> Ma J, Sheridan R. et al. [http://pubs.acs.org/doi/pdf/10.1021/ci500747n.pdf "QSAR deep nets"] Journal of Chemical Information and Modeling. 2015, 55, 263-274</ref>. The paper applies machine learning methods, specifically Deep Neural Networks <ref> Hinton, G. E.; Osindero, S.; Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation 2006, 18, 1527−1554</ref> and Random Forest models <ref> Breiman L. Random Forests, Machine Learning. 2001, 45, 5-32</ref>, to problems in the pharmaceutical industry. Discovering a drug requires selecting, among chemical compounds with different molecular structures, the combination that achieves the best biological activity. Quantitative Structure-Activity Relationship (QSAR) models are now routinely used for this purpose. Structure-Activity Relationship (SAR) modelling, or Quantitative SAR (QSAR), is an approach designed to find relationships between the chemical structure and the biological activity (or target property) of the compounds studied. QSAR models are classification or regression models in which the predictors are physicochemical properties or theoretical molecular descriptors, and the response variable is a biological activity of the chemicals, such as the concentration of a substance required to give a certain biological response. The basic idea behind these methods is that the activity of a molecule is reflected in its structure, so similar molecules have similar activities. If we learn the activities of a set of molecular structures (or combinations of molecules), we can then predict the activities of similar molecules. 
QSAR methods are often computationally intensive or require the adjustment of many sensitive parameters to achieve good predictions. Machine learning methods can help here, and two of them, support vector machines (SVM) and random forests (RF), are commonly used <ref>Svetnik, V. et al., [http://pubs.acs.org/doi/pdf/10.1021/ci034160g.pdf Random forest: a classification and regression tool for compound classification and QSAR modeling], J. Chem. Inf. Comput. Sci. 2003, 43, 1947−1958 </ref>. In this paper the authors investigate the prediction performance of DNNs as a QSAR method and compare it with that of RF, which is widely regarded as a gold standard in this field. <br />
<br />
<br />
== Motivation ==<br />
At the first stage of drug discovery there is a huge number of candidate compounds that can be combined to produce a new drug. The process may involve a large number of compounds (>100 000) with different biological activities and a large number of descriptors (several thousand). Measuring every biological activity for every compound would require a very large number of experiments. In silico discovery and the use of optimization algorithms can substantially reduce the experimental work that needs to be done. The authors hypothesized that DNN models outperform RF models. <br />
<br />
== Methods ==<br />
To compare the prediction performance of the two methods, DNNs and RFs were fitted to 15 data sets from the pharmaceutical company Merck (referred to below as the Kaggle data sets). The smallest data set has 2092 molecules with 4596 unique AP and DP descriptors. Each molecule is represented by a list of features, called “descriptors” in QSAR nomenclature. The descriptors are substructure descriptors (e.g., atom pairs (AP), MACCS keys, circular fingerprints, etc.) and donor-acceptor pair (DP) descriptors. Both kinds of descriptor have the following form:<br />
<br />
atom type i − (distance in bonds) − atom type j<br />
<br />
Here, for AP, the atom type includes the element, the number of nonhydrogen neighbors, and the number of pi electrons. For DP, the atom type is one of seven: cation, anion, neutral donor, neutral acceptor, polar, hydrophobe, and other. A separate group of 15 data sets, the Additional Data Sets, was used to validate the conclusions drawn from the Kaggle data sets. Each data set was split into a training set and a test set. The metric used to evaluate prediction performance is the coefficient of determination (<math>R^2</math>). <br />
<br />
To run RF, 100 trees were generated with m/3 descriptors used at each branch point, where m is the number of unique descriptors in the training set. Tree nodes with 5 or fewer molecules were not split further. Tree construction was parallelized to run one tree per processor on a cluster, so that larger data sets could be handled in a reasonable time.<br />
<br />
The DNNs take the descriptors X of a molecule as input, and each neuron produces an output of the form <math>O=f(\sum_{i=1}^{N} w_ix_i+b)</math>, where the <math>x_i</math> are the neuron's inputs, the <math>w_i</math> are its weights, b is its bias, and f is the activation function. Because many parameters, such as the number of layers and the number of neurons, influence the performance of a deep neural net, Ma and his colleagues carried out a sensitivity analysis. They trained 71 DNNs with different parameter settings for each data set. The parameters they considered relate to: <br />
<br />
-Data: descriptor transformation (no transformation, logarithmic transformation, or binary transformation). <br />
<br />
-Network architecture: number of hidden layers, number of neurons in each hidden layer.<br />
<br />
-Activation functions: sigmoid or rectified linear unit.<br />
<br />
-The DNN training strategy: single training set or joint from multiple sets, percentage of neurons to drop-out in each layer.<br />
<br />
-The mini-batched stochastic gradient descent procedure in the backpropagation (BP) algorithm: the minibatch size, number of epochs.<br />
<br />
-Control of the gradient descent optimization procedure: learning rate, momentum strength, and weight cost strength.<br />
<br />
In addition to the effect of these parameters on the DNN, the authors were interested in the consistency of results across a diverse set of QSAR tasks. Because evaluating the effect of a large number of adjustable parameters is time-consuming, a manageable number of parameter settings was selected by adjusting the values of one or two parameters at a time, and <math>R^2</math> was then calculated for DNNs trained with the selected settings. These results allowed the authors to focus on a smaller number of parameters and, finally, to produce a set of recommended values for all algorithmic parameters that can lead to consistently good predictions. <br />
<br />
=== Regularization ===<br />
<br />
A very common problem with deep neural networks is overfitting, since the number of weights grows rapidly as layers and nodes are added. The researchers considered two methods for dealing with this: dropout, which was described in a previous summary, and pre-training.<br />
<br />
The general method for pre-training goes as follows:<br />
1. Break down the deep neural network into its successive layers.<br />
2. For each layer, take the input (either the data or the previous layer's output) and train the layer to project the input in a way that captures the maximum amount of variation, similar to dimension reduction techniques such as PCA. This is usually done either with auto-encoders, which encode the input in a lower dimension, or with Restricted Boltzmann machines.<br />
3. After each layer has been trained this way, the parameters of the model are now initialized with some set of weights that depend on the data.<br />
<br />
The regularizing effect works as follows. Consider the surface of the objective function as a function of the weights. Because of the complexity of neural networks, this surface varies significantly and contains many local minima. Gradient descent tends to get trapped in local minima, and it can be difficult to reach a better minimum starting from random weights. The hope is that, by first training the deep neural network to capture almost all of the variation in the data, the resulting weights will lie near a good local minimum, from which gradient descent can then fine-tune towards a better solution. This is similar to the idea of combining PCA with some other classifier: first map the points to a subspace in which they are more easily linearly separable, so that the classifier has an easier task. Equivalently, once the first few layers have projected the points to a more easily separable subspace, subsequent layers in the network can work on classifying the projected points. If the pre-trained weights are near a local minimum, gradient descent heavily restricts their range of values, since it moves towards that minimum immediately, and this restriction acts as a regularizer on the whole neural network.<br />
<br />
However, when the researchers tried this with some modifications to accommodate their code, it did not improve results.<br />
<br />
== Results ==<br />
<br />
The first objective of the paper was to compare the performance of DNNs with that of RF, so over 50 DNNs were trained using different parameter settings. The settings were arbitrarily selected, but chosen to cover a sufficient range of values for each adjustable parameter. Figure 1 shows the difference in <math>R^2</math> between DNNs and RF for each Kaggle data set. Each column represents a QSAR data set, and each circle represents the improvement of a DNN over RF.<br />
<br />
<br />
<center><br />
[[File: fig1.PNG | frame | center |Figure 1. Overall DNN vs RF using arbitrarily selected parameter values. Each column represents a QSAR data set, and each circle represents the improvement, measured in <math>R^2</math>, of a DNN over RF ]]<br />
</center><br />
<br />
Comparing the performance of the different models shows that even when the worst DNN parameter setting was used for each QSAR task, the average <math>R^2</math> degraded only from 0.423 to 0.412, a reduction of merely 2.6%. These results suggest that DNNs can generally outperform RF (see the table below).<br />
<br />
<br />
<center><br />
[[File: table1.PNG | frame | center |Table 1. Comparing test <math>R^2</math> of different models ]]<br />
</center><br />
<br />
The difference in <math>R^2</math> between DNN and RF as the network architecture changes is shown in Figure 2. To limit the number of parameter combinations, the authors used the same number of neurons in every hidden layer. Thirty-two DNNs were trained for each data set by varying the number of hidden layers and the number of neurons per layer while the other key adjustable parameters were kept unchanged. When there are two hidden layers, having only a small number of neurons per layer degrades the predictive capability of the DNNs. It can also be seen that, for any number of hidden layers, once the number of neurons per layer is sufficiently large, increasing it further has only a marginal benefit. Figure 2 also shows that a neural network with only one hidden layer of 12 neurons achieved the same average predictive capability as RF. A network of this size is comparable to the classical neural networks used in QSAR.<br />
<br />
<center><br />
[[File: fig2.PNG | frame | center |Figure 2. Impacts of Network Architecture. Each marker in the plot represents a choice of DNN network architecture. Markers that share the same number of hidden layers are connected with a line. The measurement (i.e., y-axis) is the difference of the mean <math>R^2</math> between DNNs and RF. ]]<br />
</center><br />
<br />
To decide which activation function, Sigmoid or ReLU, performs better, at least 15 pairs of DNNs were trained for each data set. The two DNNs in each pair shared the same adjustable parameter settings, except that one used ReLU as the activation function and the other used Sigmoid. The difference was tested with a one-sample Wilcoxon test. In Figure 3, the data sets where ReLU is significantly better than Sigmoid are colored in blue and marked at the bottom with “+”s, while the data set where Sigmoid is significantly better than ReLU is colored in black and marked at the bottom with “−”s. In 8 out of 15 data sets (53.3%), ReLU is statistically significantly better than Sigmoid. Overall, ReLU improves the average <math>R^2</math> over Sigmoid by 0.016. <br />
<br />
<center><br />
[[File: fig3.PNG | frame | center |Figure 3. Choice of activation functions. Each column represents a QSAR data set, and each circle represents the difference, measured in <math>R^2</math>, of a pair of DNNs trained with ReLU and Sigmoid, respectively ]]<br />
</center><br />
<br />
Figure 4 presents the difference between joint DNNs trained on multiple data sets and individual DNNs trained on single data sets. Averaged over all data sets, joint DNNs appear to perform better than individually trained ones. However, the size of the training set plays a critical role in whether a joint DNN is beneficial. For the two largest data sets (3A4 and LOGD), the individual DNNs seem better, indicating that joint DNNs are more suitable for data sets that are not very large. <br />
<br />
<center><br />
[[File: fig4.PNG | frame | center |Figure 4. Difference between joint DNNs trained with multiple data sets and individual DNNs trained with single data sets ]]<br />
</center><br />
<br />
The authors refined their selection of DNN adjustable parameters by studying the results of the previous runs. They used the logarithmic transformation, two hidden layers, at least 250 neurons in each hidden layer, and ReLU as the activation function. The results are shown in Figure 5. Comparing these results with those in Figure 1 shows that DNNs now outperform RF even with the “worst” parameter setting on 9 out of 15 data sets, compared with 4 out of 15 before. The <math>R^2</math> averaged over all DNNs and all 15 data sets is 0.051 higher than that of RF.<br />
<br />
<center><br />
[[File: fig5.PNG | frame | center |Figure 5. DNN vs RF with refined parameter settings ]]<br />
</center><br />
<br />
As a conclusion of the sensitivity analysis carried out in this work, the authors recommend the following adjustable parameter settings for DNNs:<br />
-Logarithmic transformation. <br />
<br />
-Four hidden layers, with 4000, 2000, 1000, and 1000 neurons, respectively.<br />
<br />
-Dropout rates of 0 in the input layer, 25% in the first 3 hidden layers, and 10% in the last hidden layer.<br />
<br />
-The activation function of ReLU.<br />
<br />
-No unsupervised pretraining. The network parameters should be initialized as random values.<br />
<br />
-Large number of epochs.<br />
<br />
-Learning rate of 0.05, momentum strength of 0.9, and weight cost strength of 0.0001.<br />
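The recommended settings above can be collected in a single configuration object. The following is only an illustrative sketch: the dictionary and its key names are hypothetical and do not come from the paper or from any particular library; only the values are taken from the list above.<br />

```python
# Hypothetical configuration; key names are illustrative assumptions,
# values are the settings recommended in the list above.
recommended_dnn_settings = {
    "target_transform": "log",                      # logarithmic transformation
    "hidden_layer_sizes": [4000, 2000, 1000, 1000],  # four hidden layers
    "dropout_rates": {
        "input": 0.0,                               # no dropout on the input layer
        "hidden": [0.25, 0.25, 0.25, 0.10],         # one rate per hidden layer
    },
    "activation": "relu",
    "unsupervised_pretraining": False,              # weights start from random values
    "learning_rate": 0.05,
    "momentum": 0.9,
    "weight_cost": 0.0001,
    # The text only says "large number of epochs", so no value is set here.
}

# Sanity check: there is exactly one dropout rate per hidden layer.
n_layers = len(recommended_dnn_settings["hidden_layer_sizes"])
n_rates = len(recommended_dnn_settings["dropout_rates"]["hidden"])
print(n_layers, n_rates)
```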
<br />
One of the authors' concerns was the consistency of DNN predictions. To check it, they compared the performance of RF and DNN on 15 additional QSAR data sets. Each additional data set was time-split into training and test sets in the same way as the Kaggle data sets. Individual DNNs were trained on the training set using the recommended parameters. The <math>R^2</math> of the DNN and of RF were calculated on the test sets. The table below presents the results for the additional data sets. The DNN with recommended parameters outperforms RF in 13 out of the 15 additional data sets. The mean <math>R^2</math> of DNNs is 0.411, while that of RFs is 0.361, which is an improvement of 13.9%.<br />
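The 13.9% figure is the relative improvement of the mean <math>R^2</math>, which can be checked with a few lines of arithmetic. The variable names below are illustrative only.<br />

```python
# Mean R^2 values reported for the 15 additional data sets.
mean_r2_dnn = 0.411
mean_r2_rf = 0.361

# Relative improvement of DNN over RF.
relative_improvement = (mean_r2_dnn - mean_r2_rf) / mean_r2_rf
print(f"{relative_improvement:.1%}")  # prints 13.9%
```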
<br />
<center><br />
[[File: table2.PNG | frame | center |Comparing RF with DNN trained using recommended parameter settings on 15 additional datasets]]<br />
</center><br />
<br />
Both RF and DNN can be sped up efficiently using high-performance computing technologies, but in different ways because of the inherent differences in their algorithms. RF can be accelerated using coarse parallelization on a cluster by giving one tree per node. In contrast, DNN can efficiently make use of the parallel computation capability of a modern GPU. With the dramatic advance in GPU hardware and increasing availability of GPU computing resources, DNN can become comparable, if not more advantageous, to RF in various aspects, including ease of implementation, computation time, and hardware cost.<br />
<br />
== Discussion ==<br />
This paper demonstrates that in most cases DNNs can be used as a practical QSAR method in place of RF, which is currently the gold standard in the field of drug discovery. Although the change in the coefficient of determination relative to RF is small for some data sets, on average the DNN is better than RF. The paper recommends a set of values for all DNN algorithmic parameters that are appropriate for large QSAR data sets in an industrial drug discovery environment. The authors also gave some recommendations on how RF and DNN can be efficiently sped up using high-performance computing technologies. They suggest that RF can be accelerated using coarse parallelization on a cluster by giving one tree per node. In contrast, DNN can efficiently make use of the parallel computation capability of a modern GPU. <br />
<br />
== Future Works ==<br />
<br />
Contrary to the expectation that unsupervised pretraining plays a critical role in the success of DNNs, in this study it had an adverse effect on the performance of QSAR tasks, which needs further investigation.<br />
Although the paper makes some recommendations about the adjustable parameters of DNNs, there is still a need for an effective and efficient strategy for refining these parameters for each particular QSAR task.<br />
The results of the paper suggest that cross-validation is not effective for fine-tuning the algorithmic parameters. Therefore, instead of using automatic methods for tuning DNN parameters, new approaches that can better indicate a DNN’s predictive capability on a time-split test set need to be developed.<br />
<br />
== Bibliography ==<br />
<references /></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26657dropout2015-11-19T21:35:18Z<p>Rtwang: /* Bayesian Neural Networks and Dropout */</p>
<hr />
<div>= Introduction =<br />
Dropout<ref>https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf</ref> is a technique for preventing overfitting in deep neural networks that contain a large number of parameters. The key idea is to randomly drop units from the neural network during training. During training, dropout samples from an exponential number of different “thinned” networks. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability p, independently of other units (p can be chosen using a validation set, or can simply be set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layers. Let <math>\bold{z}^{(l)} </math> denote the vector of inputs into layer <math> l </math> and <math>\bold{y}^{(l)} </math> denote the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^{(l)}+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math> , where <math> f </math> is the activation function.<br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables each of which has probability of <math>p </math> of being 1. <math>\tilde {\bold y} </math> is the input after we drop some hidden units. The rest of the model remains the same as the regular feed-forward neural network.<br />
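The feed-forward equations above can be sketched for a single layer in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name <code>dropout_layer</code>, the list-of-lists weight layout and the example numbers are all hypothetical.<br />

```python
import random


def relu(x):
    return max(0.0, x)


def dropout_layer(y, weights, biases, p, activation, rng):
    """One dropout feed-forward step from layer l to layer l+1.

    y       : outputs y^(l) of the previous layer
    weights : one weight row w_i^(l+1) per unit of the next layer
    biases  : one bias b_i^(l+1) per unit of the next layer
    p       : probability of retaining a unit
    """
    # r_j ~ Bernoulli(p): 1 keeps unit j, 0 drops it.
    r = [1.0 if rng.random() < p else 0.0 for _ in y]
    # Thinned output: element-wise product r * y.
    y_tilde = [r_j * y_j for r_j, y_j in zip(r, y)]
    # z_i = w_i . y_tilde + b_i
    z = [
        sum(w_ij * y_j for w_ij, y_j in zip(w_i, y_tilde)) + b_i
        for w_i, b_i in zip(weights, biases)
    ]
    # y_i = f(z_i)
    return [activation(z_i) for z_i in z], r


y = [1.0, 2.0, 3.0]
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
b = [0.0, 0.1]

out, mask = dropout_layer(y, W, b, p=0.5, activation=relu, rng=random.Random(0))
# With p = 1 no unit is dropped and the layer behaves like a standard layer.
out_full, mask_full = dropout_layer(y, W, b, p=1.0, activation=relu, rng=random.Random(0))
```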
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
A dropout neural network can be trained using stochastic gradient descent in a manner similar to a standard neural network. The only difference is that, for each training case, we back-propagate only through the sampled thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
Dropout can also be applied to finetune nets that have been pretrained using stacks of RBMs, autoencoders or Deep Boltzmann Machines. Pretraining followed by finetuning with backpropagation has been shown to give significant performance boosts over finetuning from random initializations in certain cases. This is done by performing the regular pretraining methods (RBMs, autoencoders, ... etc). After pretraining, the weights are scaled up by factor <math> 1/p </math>, and then dropout finetuning is applied. The learning rate should be a smaller one to retain the information in the pretrained weights.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization, large decaying learning rates and high momentum provides a significant boost over just using dropout. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be bounded above by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we impose the constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constraint is that it makes it possible to use a large learning rate without the risk of the weights blowing up. <br />
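One common way to enforce such a constraint is to rescale a weight vector back onto the ball of radius <math>c </math> whenever an update pushes it outside. The sketch below illustrates that idea; the function name <code>max_norm_project</code> is hypothetical.<br />

```python
import math


def max_norm_project(w, c):
    """Return w unchanged if ||w||_2 <= c, otherwise rescale it to norm c."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= c:
        return list(w)
    scale = c / norm
    return [x * scale for x in w]


inside = max_norm_project([0.3, 0.4], c=1.0)   # norm 0.5, left as is
outside = max_norm_project([3.0, 4.0], c=1.0)  # norm 5, rescaled to norm 1
```

In training, such a projection would typically be applied to each hidden unit's incoming weight vector after every gradient step.<br />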
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has n units; then there are <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. The figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
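The weight-scaling rule can be checked on a toy example: for a single linear pre-activation, averaging over all <math>2^n </math> masks gives exactly the same value as multiplying the weights by <math>p </math>. The sketch below enumerates the masks explicitly; the function names and numbers are illustrative assumptions. Once a nonlinearity is applied, the scaled network is only an approximation of the average.<br />

```python
from itertools import product


def expected_preactivation(w, y, p):
    """Exact expectation of w . (r * y) over all Bernoulli(p) masks r."""
    total = 0.0
    for mask in product([0, 1], repeat=len(y)):
        # Probability of this particular mask.
        prob = 1.0
        for m in mask:
            prob *= p if m else (1.0 - p)
        total += prob * sum(w_i * m * y_i for w_i, m, y_i in zip(w, mask, y))
    return total


def scaled_preactivation(w, y, p):
    """Pre-activation of the test-time network with weights multiplied by p."""
    return sum(p * w_i * y_i for w_i, y_i in zip(w, y))


w = [0.5, -0.2, 0.1]
y = [1.0, 2.0, 3.0]
averaged = expected_preactivation(w, y, p=0.5)
scaled = scaled_preactivation(w, y, p=0.5)
```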
<br />
''' Multiplicative Gaussian Noise '''<br />
<br />
Dropout uses Bernoulli distributed random variables, which take the value 1 with probability <math>p </math> and 0 otherwise. This idea can be generalized by multiplying the activations by a random variable drawn from <math>\mathcal{N}(1, 1) </math>. It works just as well as, or perhaps better than, Bernoulli noise. That is, each hidden activation <math>h_i </math> is perturbed to <math>h_i+h_ir </math> where <math>r \sim \mathcal{N}(0,1) </math>, which is equal to <math>h_ir' </math> where <math>r' \sim \mathcal{N}(1, 1) </math>. We can generalize this to <math>r' \sim \mathcal{N}(1, \sigma^2) </math>, where <math>\sigma^2</math> is a hyperparameter to tune.<br />
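A minimal sketch of this multiplicative Gaussian noise is given below. The function name is hypothetical; note that the code takes the standard deviation <math>\sigma</math> rather than the variance <math>\sigma^2</math>.<br />

```python
import random


def multiplicative_gaussian_noise(h, sigma, rng):
    """Perturb each activation h_i to h_i * r', with r' ~ N(1, sigma^2)."""
    return [h_i * rng.gauss(1.0, sigma) for h_i in h]


rng = random.Random(0)
h = [0.5, 2.0, 0.0]
noisy = multiplicative_gaussian_noise(h, sigma=1.0, rng=rng)

# The noise has mean 1, so the perturbed activation is unbiased:
# its average over many draws stays close to the original value.
samples = [multiplicative_gaussian_noise([2.0], 1.0, rng)[0] for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
```

Because the noise has mean 1, the expected activation is unchanged.<br />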
<br />
== Applying dropout to linear regression ==<br />
<br />
Let <math>X \in \mathbb{R}^{N\times D}</math> be a data matrix of N data points and let <math>\mathbf{y}\in \mathbb{R}^N</math> be a vector of targets. Linear regression tries to find a <math>\mathbf{w}\in \mathbb{R}^D</math> that minimizes <math>\parallel \mathbf{y}-X\mathbf{w}\parallel^2</math>.<br />
<br />
When the input <math>X</math> is dropped out such that any input dimension is retained with probability <math>p</math>, the input can be expressed as <math>R*X</math> where <math>R\in \{0,1\}^{N\times D}</math> is a random matrix with <math>R_{ij}\sim Bernoulli(p)</math> and <math>*</math> denotes element-wise product. Marginalizing the noise, the objective function becomes<br />
<br />
<math>\min_{\mathbf{w}} \mathbb{E}_{R\sim Bernoulli(p)}[\parallel \mathbf{y}-(R*X)\mathbf{w}\parallel^2 ]<br />
</math><br />
<br />
This reduces to <br />
<br />
<math>\min_{\mathbf{w}} \parallel \mathbf{y}-pX\mathbf{w}\parallel^2+p(1-p)\parallel \Gamma\mathbf{w}\parallel^2<br />
</math><br />
<br />
where <math>\Gamma=(diag(X^TX))^{\frac{1}{2}}</math>. Therefore, dropout with linear regression is equivalent to ridge regression with a particular form for <math>\Gamma</math>. This form of <math>\Gamma</math> essentially scales the weight cost for weight <math>w_i</math> by the standard deviation of the <math>i</math>th dimension of the data. If a particular data dimension varies a lot, the regularizer tries to squeeze its weight more.<br />
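The reduction to the ridge-like objective can be verified numerically on a tiny example by enumerating every dropout mask and comparing the exact expectation with the closed form. The sketch below uses plain Python lists; the function names and the example data are illustrative assumptions.<br />

```python
from itertools import product


def expected_dropout_loss(X, y, w, p):
    """Exact E_R ||y - (R * X) w||^2, enumerating all Bernoulli(p) masks R."""
    n, d = len(X), len(X[0])
    total = 0.0
    for bits in product([0, 1], repeat=n * d):
        prob = 1.0
        for bit in bits:
            prob *= p if bit else (1.0 - p)
        loss = 0.0
        for i in range(n):
            pred = sum(bits[i * d + j] * X[i][j] * w[j] for j in range(d))
            loss += (y[i] - pred) ** 2
        total += prob * loss
    return total


def ridge_form_loss(X, y, w, p):
    """||y - p X w||^2 + p (1 - p) ||Gamma w||^2 with Gamma = diag(X^T X)^(1/2)."""
    n, d = len(X), len(X[0])
    fit = sum(
        (y[i] - p * sum(X[i][j] * w[j] for j in range(d))) ** 2 for i in range(n)
    )
    # ||Gamma w||^2 = sum_j (sum_i X_ij^2) * w_j^2
    penalty = sum(sum(X[i][j] ** 2 for i in range(n)) * w[j] ** 2 for j in range(d))
    return fit + p * (1.0 - p) * penalty


X = [[1.0, 2.0], [3.0, -1.0]]
y = [0.5, 1.5]
w = [0.2, -0.4]
lhs = expected_dropout_loss(X, y, w, p=0.7)
rhs = ridge_form_loss(X, y, w, p=0.7)
```

The enumeration grows as <math>2^{ND}</math>, so this check is only practical for very small matrices.<br />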
<br />
== Bayesian Neural Networks and Dropout ==<br />
<br />
For some data set <math>\,{(x_i,y_i)}^n_{i=1}</math>, the Bayesian approach to estimating <math>\,y_{n+1}</math> given <math>\,x_{n+1}</math> is to pick some prior distribution, <math>\,P(\theta)</math>, and assign probabilities for <math>\,y_{n+1}</math> using the posterior distribution based on the prior distribution and the data set. <br />
<br />
The general formula is:<br />
<br />
<math>\,P(y_{n+1}|y_1,\dots,y_n,x_1,\dots,x_n,x_{n+1})=\int P(y_{n+1}|x_{n+1},\theta)P(\theta|y_1,\dots,y_n,x_1,\dots,x_n)d\theta</math><br />
<br />
To obtain a prediction, it is common to take the expected value of this distribution to get the formula:<br />
<br />
<math>\,\hat y_{n+1}=\int y_{n+1}P(y_{n+1}|x_{n+1},\theta)P(\theta|y_1,\dots,y_n,x_1,\dots,x_n)d\theta</math><br />
<br />
This formula can be applied to a neural network by thinking of <math>\,\theta</math> as all of the parameters in the neural network, while <math>\,P(y_{n+1}|x_{n+1},\theta)</math> can be thought of as the output of the neural network given some set of weights and the input. Since the output of a neural network is fixed given its weights and input, so that the probability is 1 for a single output and 0 for all other possible outputs, the formula can be rewritten as:<br />
<br />
<math>\,\hat y_{n+1}=\int f(x_{n+1},\theta)P(\theta|y_1,\dots,y_n,x_1,\dots,x_n)d\theta</math><br />
<br />
Where <math>\,f(x_{n+1},\theta)</math> is the output of the neural network given some weights and input. By taking a closer look at this expected values formula, it is essentially the average of infinitely many possible neural network outputs weighted by its probability of occurring given the data set.<br />
<br />
In the dropout model, the researchers do something very similar: they average the outputs of a wide variety of models with different weights. However, unlike Bayesian neural networks, where each of these outputs and its respective model is weighted by its proper probability of occurring, the dropout model assigns equal probability to each model. This necessarily affects the accuracy of dropout neural networks compared to Bayesian neural networks, but dropout has very strong advantages in training speed and ability to scale.<br />
<br />
Despite the erroneous probability weighting relative to Bayesian neural networks, the researchers compared the two models and found that, while dropout is less accurate, it is still better than standard neural network models. This can be seen in their chart below (higher is better):<br />
<br />
[[File:BNN.PNG]]<br />
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
In a standard neural network, units may change in a way that fixes up the mistakes of other units, which may lead to complex co-adaptations and overfitting, because these co-adaptations do not generalize to unseen data. Dropout breaks the co-adaptations between hidden units by making the presence of other units unreliable. Figure 7a shows that, without dropout, the hidden units have co-adapted in order to produce good reconstructions; each hidden unit on its own does not detect a meaningful feature. In Figure 7b, with dropout, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps prevent overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations, we can see that fewer hidden units have high activations in Figure 8b than in Figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper ran experiments to study the effect of the tunable hyperparameter <math>p </math>. The comparison is done in two situations:<br />
1. The number of hidden units is held constant. (fixed n)<br />
2. The expected number of hidden units that will be retained after dropout is held constant. (fixed <math>pn </math> )<br />
The optimal <math>p </math> in case 1 lies between 0.4 and 0.8, while in case 2 it is about 0.6. The usual default value in practice is 0.5, which is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing the data set size when dropout is used with feed-forward networks. Figure 10 shows that dropout does not give any improvement on small data sets (100, 500). As the size of the data set increases, the gain from dropout increases up to a point and then declines. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all other chosen methods. The result is below:<br />
<br />
[[File:Comparison.png]]<br />
<br />
= Result =<br />
<br />
The authors applied dropout to the MNIST data and compared different methods. The MNIST data set consists of 28 X 28 pixel handwritten digit images. The task is to classify the images into 10 digit classes. From the result table, the Deep Boltzmann Machine with dropout finetuning performs best, with an error rate of only 0.79%. <br />
<br />
[[File:Result.png]]<br />
<br />
In order to test the robustness of dropout, they did classification experiments with networks of many different architectures keeping all hyperparameters fixed. The figure below shows the test error rates obtained for these different architectures as training progresses. Dropout gives a huge improvement across all architectures.<br />
<br />
[[File:dropout.PNG]]<br />
<br />
The authors also applied dropout to many neural networks and tested them on different data sets, such as Street View House Numbers (SVHN), CIFAR, ImageNet and TIMIT. In these experiments, adding dropout reduced the error rate and further improved the performance of the neural networks.<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique for preventing overfitting in deep neural networks that have a large number of parameters. It can also be extended to Restricted Boltzmann Machines and other models (e.g., convolutional networks). One drawback of dropout is that it increases training time. This creates a trade-off between overfitting and training time.<br />
<br />
=Reference=<br />
<references /></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=dropout&diff=26656dropout2015-11-19T21:34:20Z<p>Rtwang: /* Model */</p>
<hr />
<div>= Introduction =<br />
Dropout<ref>https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf</ref> is a technique for preventing overfitting in deep neural networks that contain a large number of parameters. The key idea is to randomly drop units from the neural network during training. During training, dropout samples from an exponential number of different “thinned” networks. At test time, we approximate the effect of averaging the predictions of all these thinned networks. <br />
<br />
[[File:intro.png]]<br />
<br />
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability p, independently of other units (p can be chosen using a validation set, or can simply be set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).<br />
<br />
= Model =<br />
<br />
Consider a neural network with <math>\ L </math> hidden layers. Let <math>\bold{z}^{(l)} </math> denote the vector of inputs into layer <math> l </math> and <math>\bold{y}^{(l)} </math> denote the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:<br />
<br />
:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math><br />
<br />
:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product. <br />
<br />
:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^{(l)}+b^{(l+1)}_i </math><br />
<br />
:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math> , where <math> f </math> is the activation function.<br />
<br />
<br />
<br />
For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables each of which has probability of <math>p </math> of being 1. <math>\tilde {\bold y} </math> is the input after we drop some hidden units. The rest of the model remains the same as the regular feed-forward neural network.<br />
<br />
'''Backpropagation in Dropout Case (Training)'''<br />
<br />
A dropout neural network can be trained using stochastic gradient descent in a manner similar to a standard neural network. The only difference is that, for each training case, we back-propagate only through the sampled thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.<br />
<br />
Dropout can also be applied to finetune nets that have been pretrained using stacks of RBMs, autoencoders or Deep Boltzmann Machines. Pretraining followed by finetuning with backpropagation has been shown to give significant performance boosts over finetuning from random initializations in certain cases. This is done by performing the regular pretraining methods (RBMs, autoencoders, ... etc). After pretraining, the weights are scaled up by factor <math> 1/p </math>, and then dropout finetuning is applied. The learning rate should be a smaller one to retain the information in the pretrained weights.<br />
<br />
''' Max-norm Regularization '''<br />
<br />
Using dropout along with max-norm regularization, large decaying learning rates and high momentum provides a significant boost over just using dropout. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be bounded above by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we impose the constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constraint is that it makes it possible to use a large learning rate without the risk of the weights blowing up. <br />
<br />
'''Test Time'''<br />
<br />
Suppose a neural net has n units; then there are <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. The figure below shows the intuition.<br />
<br />
[[File:test.png]]<br />
<br />
''' Multiplicative Gaussian Noise '''<br />
<br />
Dropout uses Bernoulli distributed random variables, which take the value 1 with probability <math>p </math> and 0 otherwise. This idea can be generalized by multiplying the activations by a random variable drawn from <math>\mathcal{N}(1, 1) </math>. It works just as well as, or perhaps better than, Bernoulli noise. That is, each hidden activation <math>h_i </math> is perturbed to <math>h_i+h_ir </math> where <math>r \sim \mathcal{N}(0,1) </math>, which is equal to <math>h_ir' </math> where <math>r' \sim \mathcal{N}(1, 1) </math>. We can generalize this to <math>r' \sim \mathcal{N}(1, \sigma^2) </math>, where <math>\sigma^2</math> is a hyperparameter to tune.<br />
<br />
== Applying dropout to linear regression ==<br />
<br />
Let <math>X \in \mathbb{R}^{N\times D}</math> be a data matrix of N data points and let <math>\mathbf{y}\in \mathbb{R}^N</math> be a vector of targets. Linear regression tries to find a <math>\mathbf{w}\in \mathbb{R}^D</math> that minimizes <math>\parallel \mathbf{y}-X\mathbf{w}\parallel^2</math>.<br />
<br />
When the input <math>X</math> is dropped out such that any input dimension is retained with probability <math>p</math>, the input can be expressed as <math>R*X</math> where <math>R\in \{0,1\}^{N\times D}</math> is a random matrix with <math>R_{ij}\sim Bernoulli(p)</math> and <math>*</math> denotes element-wise product. Marginalizing the noise, the objective function becomes<br />
<br />
<math>\min_{\mathbf{w}} \mathbb{E}_{R\sim Bernoulli(p)}[\parallel \mathbf{y}-(R*X)\mathbf{w}\parallel^2 ]<br />
</math><br />
<br />
This reduces to <br />
<br />
<math>\min_{\mathbf{w}} \parallel \mathbf{y}-pX\mathbf{w}\parallel^2+p(1-p)\parallel \Gamma\mathbf{w}\parallel^2<br />
</math><br />
<br />
where <math>\Gamma=(diag(X^TX))^{\frac{1}{2}}</math>. Therefore, dropout with linear regression is equivalent to ridge regression with a particular form for <math>\Gamma</math>. This form of <math>\Gamma</math> essentially scales the weight cost for weight <math>w_i</math> by the standard deviation of the <math>i</math>th dimension of the data. If a particular data dimension varies a lot, the regularizer tries to squeeze its weight more.<br />
<br />
== Bayesian Neural Networks and Dropout ==<br />
<br />
For some data set <math>\,{(x_i,y_i)}^n_{i=1}</math>, the Bayesian approach to estimating <math>\,y_{n+1}</math> given <math>\,x_{n+1}</math> is to pick some prior distribution, <math>\,P(\theta)</math>, and assign probabilities for <math>\,y_{n+1}</math> using the posterior distribution based on the prior distribution and the data set. <br />
<br />
The general formula is:<br />
<br />
<math>\,P(y_{n+1}|y_1,\dots,y_n,x_1,\dots,x_n,x_{n+1})=\int P(y_{n+1}|x_{n+1},\theta)P(\theta|y_1,\dots,y_n,x_1,\dots,x_n)d\theta</math><br />
<br />
To obtain a prediction, it is common to take the expected value of this distribution to get the formula:<br />
<br />
<math>\,\hat y_{n+1}=\int y_{n+1}P(y_{n+1}|x_{n+1},\theta)P(\theta|y_1,\dots,y_n,x_1,\dots,x_n)d\theta</math><br />
<br />
This formula can be applied to a neural network by thinking of <math>\,\theta</math> as all of the parameters in the neural network, while <math>\,P(y_{n+1}|x_{n+1},\theta)</math> can be thought of as the output of the neural network given some set of weights and the input. Since the output of a neural network is fixed given its weights and input, so that the probability is 1 for a single output and 0 for all other possible outputs, the formula can be rewritten as:<br />
<br />
<math>\,\hat y_{n+1}=\int f(x_{n+1},\theta)P(\theta|y_1,\dots,y_n,x_1,\dots,x_n)d\theta</math><br />
<br />
Where <math>\,f(x_{n+1},\theta)</math> is the output of the neural network given some weights and input. By taking a closer look at this expected values formula, it is essentially the average of infinitely many possible neural network outputs weighted by its probability of occurring given the data set.<br />
<br />
In the dropout model, the researchers do something very similar: they average the outputs of a wide variety of models with different weights. However, unlike Bayesian neural networks, where each of these outputs and its respective model is weighted by its proper probability of occurring, the dropout model assigns equal probability to each model. This necessarily affects the accuracy of dropout neural networks compared to Bayesian neural networks, but dropout has very strong advantages in training speed and ability to scale.<br />
<br />
Despite the erroneous probability weighting relative to Bayesian neural networks, the researchers compared the two models and found that, while dropout is less accurate, it is still better than standard neural network models. This can be seen in their chart below (higher is better):<br />
<br />
[[File:BNN.PNG]]<br />
<br />
= Effects of Dropout =<br />
<br />
''' Effect on Features '''<br />
<br />
In a standard neural network, units may change in a way that fixes up the mistakes of other units, which may lead to complex co-adaptations and overfitting, because these co-adaptations do not generalize to unseen data. Dropout breaks the co-adaptations between hidden units by making the presence of other units unreliable. Figure 7a shows that, without dropout, the hidden units have co-adapted in order to produce good reconstructions; each hidden unit on its own does not detect a meaningful feature. In Figure 7b, with dropout, the hidden units seem to detect edges, strokes and spots in different parts of the image. <br />
[[File:feature.png]]<br />
<br />
''' Effect on Sparsity '''<br />
<br />
Sparsity helps prevent overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations, we can see that fewer hidden units have high activations in Figure 8b than in Figure 8a.<br />
[[File:sparsity.png]]<br />
<br />
'''Effect of Dropout Rate'''<br />
<br />
The paper ran experiments to study the effect of the tunable hyperparameter <math>p </math>. The comparison is done in two situations:<br />
1. The number of hidden units is held constant. (fixed n)<br />
2. The expected number of hidden units that will be retained after dropout is held constant. (fixed <math>pn </math> )<br />
The optimal <math>p </math> in case 1 lies between 0.4 and 0.8, while in case 2 it is about 0.6. The usual default value in practice is 0.5, which is close to optimal. <br />
[[File:pvalue.png]]<br />
<br />
'''Effect of Data Set Size'''<br />
<br />
This section explores the effect of changing the data set size when dropout is used with feed-forward networks. Figure 10 shows that dropout does not give any improvement on small data sets (100, 500). As the size of the data set increases, the gain from dropout increases up to a point and then declines. <br />
<br />
[[File:Datasize.png]]<br />
<br />
= Comparison =<br />
<br />
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all other chosen methods. The result is below:<br />
<br />
[[File:Comparison.png]]<br />
<br />
= Result =<br />
<br />
The authors applied dropout to the MNIST data and compared different methods. The MNIST data set consists of 28 X 28 pixel handwritten digit images. The task is to classify the images into 10 digit classes. From the result table, the Deep Boltzmann Machine with dropout finetuning performs best, with an error rate of only 0.79%. <br />
<br />
[[File:Result.png]]<br />
<br />
In order to test the robustness of dropout, they did classification experiments with networks of many different architectures keeping all hyperparameters fixed. The figure below shows the test error rates obtained for these different architectures as training progresses. Dropout gives a huge improvement across all architectures.<br />
<br />
[[File:dropout.PNG]]<br />
<br />
The authors also applied dropout to many neural networks and tested them on different data sets, such as Street View House Numbers (SVHN), CIFAR, ImageNet and TIMIT. In these experiments, adding dropout reduced the error rate and further improved the performance of the neural networks.<br />
<br />
=Conclusion=<br />
<br />
Dropout is a technique for preventing overfitting in deep neural networks that have a large number of parameters. It can also be extended to Restricted Boltzmann Machines and other models (e.g., convolutional networks). One drawback of dropout is that it increases training time. This creates a trade-off between overfitting and training time.<br />
<br />
=Reference=<br />
<references /></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:BNN.PNG&diff=26655File:BNN.PNG2015-11-19T21:34:02Z<p>Rtwang: </p>
<hr />
<div></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=continuous_space_language_models&diff=26536continuous space language models2015-11-18T22:08:43Z<p>Rtwang: /* Sorting and Bunch */</p>
<hr />
<div>= Introduction =<br />
In certain fields of study such as speech recognition or machine translation, for some acoustic signal <math>\,x</math> or the source sentence to be translated <math>\,e</math>, it is common to model these problems as finding the sequence of words <math>\,w^*</math> that has the highest probability of occurring given <math>\,x</math> or <math>\,e</math>. This can be written as:<br />
<br />
<math>w^* = arg\ \underset {w}{max} P(w|x) = arg\ \underset{w}{max} P(x|w)P(w)</math><br />
<br />
An acoustic or translation model can then be used for <math>\,P(x|w)</math>, similar to the idea behind LDA and QDA, and it remains to create a language model <math>\,P(w)</math> to estimate the probability of any sequence of words <math>\,w</math>.<br />
<br />
This is commonly done through the back-off n-grams model and the purpose behind this research paper is to use a neural network to better estimate <math>\,P(w)</math>.<br />
<br />
= Back-off n-grams Model =<br />
<br />
A sequence of words will be defined as <math>\,w^i_1=(w_1,w_2,\dots,w_i)</math> and the formula for the probability <math>\,P(w)</math> can be rewritten as:<br />
<br />
<math>P(w^n_1)=P(w_1,w_2,\dots,w_n)=P(w_1)\prod_{i=2}^n P(w_i|w^{i-1}_1)</math><br />
<br />
It is common to estimate <math>\,P(w_i|w^{i-1}_1)</math> through:<br />
<br />
<math>\,P(w_i|w^{i-1}_1)\approx\frac{\mbox{number of occurrences of the sequence}\ (w_1,\dots,w_i)}{\mbox{number of occurrences of the sequence}\ (w_1,\dots,w_{i-1})}</math><br />
<br />
However, it is practically impossible to have a training set large enough to contain every possible sequence of words once the sequences are long enough, so some sequences would be assigned an incorrect probability of 0 simply because they do not appear in the training set. This is known as the data sparseness problem. It is commonly mitigated by considering only the last n-1 words instead of the whole context. However, even for small n, certain sequences could still be missing.<br />
<br />
To solve this issue, a technique called back-off n-grams is used and the general formula goes as follows:<br />
<br />
<math>\,P(w_i|w^{i-1}_1) = \begin{cases} <br />
\frac{\mbox{number of occurrences of the sequence}\ (w_1,\dots,w_i)}{\mbox{number of occurrences of the sequence}\ (w_1,\dots,w_{i-1})}, & \mbox{if the number of occurrences of}\ (w_1,\dots,w_i)\ \mbox{is greater than some constant K} \\<br />
\alpha P(w_i|w^{i-1}_2), & \mbox{otherwise} <br />
\end{cases}</math><br />
<br />
<math>\,\alpha</math> is typically a discounting factor that is less than 1 to account for the lack of direct data. It usually depends on the word sequence.<br />
<br />
The general algorithm is as follows: if the training set contains the sequence more than K times, calculate the probability directly from the counts. Otherwise, apply a discounting factor and calculate the conditional probability with the first word of the context removed. For example, if the word sequence was "The dog barked" and it did not exist in the training set, then the formula would be written as:<br />
<br />
<math>\,P(\mbox{barked}|\mbox{the,dog}) \approx \alpha P(\mbox{barked}|\mbox{dog})</math><br />
<br />
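The back-off rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the estimator used in the paper: the function names, the fixed discount <code>alpha=0.4</code> and the threshold <code>k=0</code> are assumptions made for the example, and with a fixed discount the result is not a properly normalized distribution.<br />

```python
from collections import Counter

def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def backoff_prob(word, context, counts, total, k=0, alpha=0.4):
    """Back-off estimate of P(word | context).

    k and alpha are illustrative constants (assumptions); in practice
    the discount usually depends on the word sequence.
    """
    context = tuple(context)
    seq = context + (word,)
    if not context:
        # Unigram base case: relative frequency of the word.
        return counts[seq] / total
    if counts[seq] > k:
        # The sequence was seen often enough: use the count ratio directly.
        return counts[seq] / counts[context]
    # Otherwise drop the first context word and apply the discount.
    return alpha * backoff_prob(word, context[1:], counts, total, k, alpha)

tokens = "a dog barked and the cat slept".split()
counts = ngram_counts(tokens, 3)
# "the dog barked" never occurs, so the estimate backs off to P(barked | dog).
p = backoff_prob("barked", ("the", "dog"), counts, len(tokens))
```

Here the trigram is unseen, so <code>p</code> equals the discount times the bigram estimate of "barked" given "dog".<br />
<br />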
= Model =<br />
<br />
The researchers sought a better model for this probability than the back-off n-grams model. Their approach was to map the sequence of n-1 context words onto a multi-dimensional continuous space using one neural network layer, followed by further layers that estimate the probabilities of all possible next words. The formulas and model are as follows:<br />
<br />
For a sequence of n-1 words, encode each word using 1-of-K encoding, i.e. a vector with a 1 at the word's index and zeros everywhere else. Denote these encodings by <math>(w_{j-n+1},\dots,w_{j-1})</math>, the n-1 words preceding the j'th word in some larger context.<br />
<br />
Let P be a projection matrix common to all n-1 words and let<br />
<br />
<math>\,a_i=Pw_{j-n+i},i=1,\dots,n-1</math><br />
<br />
Let H be the weight matrix from the projection layer to the hidden layer. The state of the hidden layer is then:<br />
<br />
<math>\,h=\tanh(Ha + b)</math> where <math>\,a</math> is the concatenation of all <math>\,a_i</math> and <math>\,b</math> is a bias vector<br />
<br />
Finally, the output vector would be:<br />
<br />
<math>\,o=Vh+k</math> where V is the weight matrix from the hidden layer to the output layer and k is another bias vector. <math>\,o</math> is a vector whose dimension equals the total vocabulary size, and the probabilities are obtained from <math>\,o</math> by applying the softmax function.<br />
<br />
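The forward pass described in this section can be sketched with NumPy. The layer sizes and the random initialization below are assumptions chosen only to make the example runnable; the symbols P, H, b, V and k follow the definitions above.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes (assumptions): vocabulary, projection, hidden, n-gram order.
vocab_size, proj_dim, hidden_dim, n = 50, 8, 16, 4

P = rng.normal(scale=0.1, size=(proj_dim, vocab_size))            # shared projection matrix
H = rng.normal(scale=0.1, size=(hidden_dim, (n - 1) * proj_dim))  # projection -> hidden
b = np.zeros(hidden_dim)
V = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))          # hidden -> output
k = np.zeros(vocab_size)

def forward(context_ids):
    """Return next-word probabilities for a context of n-1 word indices."""
    # P times a 1-of-K vector is just a column of P, so index instead of multiplying.
    a = np.concatenate([P[:, w] for w in context_ids])
    h = np.tanh(H @ a + b)
    o = V @ h + k
    e = np.exp(o - o.max())  # subtract the max for numerical stability
    return e / e.sum()       # softmax

probs = forward([3, 17, 42])
```

The result <code>probs</code> has one entry per vocabulary word and sums to 1. Replacing the matrix product with a column lookup is why the projection layer is cheap to evaluate.<br />
<br />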
= Optimization and Training =<br />
Training was done with standard back-propagation, minimizing the error function:<br />
<br />
<math>\,E=-\sum_{i=1}^N t_i \log p_i + \epsilon(\sum_{i,j}h^2_{ij}+\sum_{i,j}v^2_{ij})</math><br />
<br />
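<math>\,t_i</math> is the i'th component of the desired output vector (1 for the correct next word and 0 otherwise), <math>\,p_i</math> is the corresponding predicted probability, and the summations inside the epsilon bracket are regularization terms to prevent overfitting of <math>\,H</math> and <math>\,V</math>.<br />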
<br />
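As a small numerical illustration, the error for a single training example can be computed as below. This sketch assumes the usual cross-entropy sign convention (a negative log-likelihood, so that smaller is better); the function name and the value of <code>eps</code> are assumptions made for the example.<br />

```python
import numpy as np

def error(p, target_index, H, V, eps=1e-5):
    """Cross-entropy for one example plus weight decay on H and V."""
    # With a 1-of-K target t, -sum_i t_i log p_i reduces to -log p[target_index].
    cross_entropy = -np.log(p[target_index])
    weight_decay = eps * (np.sum(H ** 2) + np.sum(V ** 2))
    return cross_entropy + weight_decay

p = np.array([0.1, 0.7, 0.2])  # predicted probabilities over a 3-word vocabulary
H = np.ones((2, 2))            # toy weight matrices
V = np.ones((3, 2))
E = error(p, 1, H, V, eps=0.01)
```

Because the target vector has a single non-zero entry, only the predicted probability of the correct word contributes to the first term.<br />
<br />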
The researchers used stochastic gradient descent, which avoids summing the error over millions of examples before each update and therefore speeds up training.<br />
<br />
One issue the researchers ran into was that this model takes much longer to compute language model probabilities than a traditional back-off n-grams model, which reduces its suitability for real-time prediction. To address this, several optimization techniques were used.<br />
<br />
===Lattice rescoring===<br />
<br />
It is common to keep track of several candidate solutions, rather than only the single most likely one, in a lattice structure, i.e. a tree-like structure where branches can merge and each path represents a possible solution. For example, using a tri-gram model (which predicts the third word from the first two), the paper gives the following lattice:<br />
<br />
[[File:Lattice.PNG]]<br />
<br />
Branches whose nodes share the same context words can be merged. For example, "a,problem" was merged in the middle of the lattice because the tri-gram model would estimate the same probability at that point for both branches. In contrast, "that_is,not" and "there_is,not" cannot be merged because the two preceding words used for prediction are different.<br />
<br />
After this structure is created with a traditional back-off n-grams model, the neural network is then used to re-score the lattice and the re-scored lattice is used to make predictions.<br />
<br />
===Short List===<br />
<br />
In any language, there is usually a small set of commonly used words that accounts for almost all written or spoken text. The short-list idea is that, rather than calculating a probability for every word including the rarest ones, the neural network calculates probabilities only for a small subset of the most common words. This way, the output vector can be shrunk significantly, from <math>\,\mbox{N}</math> to some much smaller number <math>\,\mbox{S}</math>.<br />
<br />
If any rare words do occur, their probabilities are calculated using the traditional back-off n-grams model. The formula then goes as follows from the paper:<br />
<br />
[[File:shortlist.PNG]]<br />
<br />
Where L is the event that <math>\,w_t</math> is in the short-list.<br />
<br />
===Sorting and Bunch===<br />
<br />
The neural network predicts the probabilities of all next words given some context. Suppose the probabilities of two different sequences are required, sequence 1, <math>\,w=(w_1,\dots,w_{i-1},w_i)</math>, and sequence 2, <math>\,w'=(w_1,\dots,w_{i-1},w'_i)</math>, which differ only in the last word. Then only a single pass through the neural network is required, because the output vector for the context <math>\,(w_1,\dots,w_{i-1})</math> contains the probabilities of both <math>\,w_i</math> and <math>\,w'_i</math> being next. It is therefore efficient to merge all requests that share the same context.<br />
<br />
Modern computers are also highly optimized for linear algebra, so it is more efficient to pass multiple examples through the matrix equations at the same time. The researchers called this bunching, and simple testing showed that it decreased processing time by a factor of 10 when using 128 examples at once compared to 1.<br />
<br />
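Both ideas can be sketched together: requests are first grouped by context so each distinct context is evaluated once, and the distinct contexts are then processed as one matrix-matrix product. The helper name, the request format (tuples of word indices) and the random matrices are assumptions made for illustration.<br />

```python
from collections import defaultdict
import numpy as np

def group_by_context(requests):
    """Map each distinct context to the list of next words requested for it."""
    grouped = defaultdict(list)
    for *context, word in requests:
        grouped[tuple(context)].append(word)
    return grouped

# Each request is (context word ids..., next word id); the first two share a context.
requests = [(3, 17, 5), (3, 17, 9), (4, 2, 5)]
grouped = group_by_context(requests)
contexts = list(grouped)

rng = np.random.default_rng(0)
H = rng.normal(size=(16, 8))             # toy projection -> hidden weights
A = rng.normal(size=(8, len(contexts)))  # one projected context per column
# Bunching: one matrix-matrix product replaces len(contexts) matrix-vector products.
hidden = np.tanh(H @ A)
```

Three requests collapse to two network evaluations here, and each column of <code>hidden</code> matches what a separate single-example pass would produce.<br />
<br />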
= Training and Usage =<br />
<br />
The researchers used numerous optimization techniques during training and their results were summarized in the paper as follows:<br />
<br />
[[File:fast_training.PNG]]<br />
<br />
Since the model is trained to predict from the last n-1 words, there are points, such as the start of a sentence, where fewer than n-1 context words are available and adjustments must be made. The researchers considered two possibilities: using traditional models for these shorter n-grams, or padding the context with a filler word up to n-1 words. After some testing, they found that requests for such short n-gram probabilities were rare, so they decided to use the traditional back-off n-gram model for these cases.<br />
<br />
= Results =<br />
<br />
Overall, the results were favourable. When this hybrid of the neural network and back-off n-grams was used in combination with a number of acoustic speech recognition models, perplexity (lower is better) decreased by about 10% in a number of cases compared with a traditional back-off n-grams model alone. Some of the results are summarized as follows:<br />
<br />
[[File:results1.PNG]]<br />
<br />
[[File:results2.PNG]]<br />
<br />
= Source =<br />
Schwenk, H. Continuous space language models. Computer Speech<br />
Lang. 21, 492–518 (2007).</div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=continuous_space_language_models&diff=26535continuous space language models2015-11-18T22:07:39Z<p>Rtwang: /* Back-off n-grams Model */</p>
<hr />
<div>= Introduction =<br />
In certain fields of study such as speech recognition or machine translation, for some acoustic signal <math>\,x</math> or the source sentence to be translated <math>\,e</math>, it is common to model these problems as finding the sequence of words <math>\,w^*</math> that has the highest probability of occurring given <math>\,x</math> or <math>\,e</math>. This can be written as:<br />
<br />
<math>w^* = arg\ \underset {w}{max} P(w|x) = arg\ \underset{w}{max} P(x|w)P(w)</math><br />
<br />
An acoustic or translation model can then be used for <math>\,P(x|w)</math>, similar to the idea behind LDA and QDA, and it remains to create a language model <math>\,P(w)</math> to estimate the probability of any sequence of words <math>\,w</math>.<br />
<br />
This is commonly done through the back-off n-grams model and the purpose behind this research paper is to use a neural network to better estimate <math>\,P(w)</math>.<br />
<br />
= Back-off n-grams Model =<br />
<br />
A sequence of words will be defined as <math>\,w^i_1=(w_1,w_2,\dots,w_i)</math> and the formula for the probability <math>\,P(w)</math> can be rewritten as:<br />
<br />
<math>P(w^n_1)=P(w_1,w_2,\dots,w_n)=P(w_1)\prod_{i=2}^n P(w_i|w^{i-1}_1)</math><br />
<br />
It is common to estimate <math>\,P(w_i|w^{i-1}_1)</math> through:<br />
<br />
<math>\,P(w_i|w^{i-1}_1)\approx\frac{\mbox{number of occurrence of the sequence} (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence} (w_1,\dots,w_{i-1})}</math><br />
<br />
However, it is practically impossible to have a training set large enough to contain every possible sequence of words if the sequence is long enough and some sequences would have an incorrect probability of 0 simply because it is not in the training set. This is known as the data sparseness problem. This problem is commonly resolved by considering only the last n-1 words instead of the whole context. However, even for small n, certain sequences could still be missing.<br />
<br />
To solve this issue, a technique called back-off n-grams is used and the general formula goes as follows:<br />
<br />
<math>\,P(w_i|w^{i-1}_1) = \begin{cases} <br />
\frac{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_{i-1})}, & \mbox{if number of occurrence of}\ (w_1,\dots,w_i)\ \mbox{is greater than some constant K} \\<br />
\alpha P(w_i|w^{i-1}_2), & \mbox{otherwise} <br />
\end{cases}</math><br />
<br />
<math>\,\alpha</math> is typically a discounting factor that is less than 1 to account for the lack of direct data. It usually depends on the word sequence.<br />
<br />
The general algorithm is then, if the data set does contain the sequence then calculate probability directly. Otherwise, apply a discounting factor and calculate the conditional probability with the first word in the sequence removed. For example, if the word sequence was "The dog barked" and it did not exist in the training set then the formula would be written as:<br />
<br />
<math>\,P(\mbox{barked}|\mbox{the,dog}) \approx \alpha P(\mbox{barked}|\mbox{dog})</math><br />
<br />
= Model =<br />
<br />
The researchers for this paper sought to find a better model for this probability than the back-off n-grams model. Their approach was to map the n-1 words sequence onto a multi-dimension continuous space using a layer of neural network followed by another layer to estimate the probabilities of all possible next words. The formulas and model goes as follows:<br />
<br />
For some sequence of n-1 words, encode each word using 1 of K encoding, i.e. 1 where the word is indexed and zero everywhere else. Label each 1 of K encoding by <math>(w_{j-n+1},\dots,w_j)</math> for some n-1 word sequence at the j'th word in some larger context.<br />
<br />
Let P be a projection matrix common to all n-1 words and let<br />
<br />
<math>\,a_i=Pw_{j-n+i},i=1,\dots,n-1</math><br />
<br />
Let H be the weight matrix from the projection layer to the hidden layer. The state of the hidden layer is then:<br />
<br />
<math>\,h=\tanh(Ha + b)</math> where <math>\,a</math> is the concatenation of all <math>\,a_i</math> and <math>\,b</math> is a bias vector<br />
<br />
Finally, the output vector would be:<br />
<br />
<math>\,o=Vh+k</math> where V is the weight matrix from hidden to output and k is another bias vector. <math>\,o</math> would be a vector with same dimensions as the total vocabulary size and the probabilities can be calculated from <math>\,o</math> by applying the softmax function.<br />
<br />
= Optimization and Training =<br />
The training was done with standard back-propagation on minimizing the error function:<br />
<br />
<math>\,E=-\sum_{i=1}^N t_i \log p_i + \epsilon(\sum_{i,j}h^2_{ij}+\sum_{i,j}v^2_{ij})</math><br />
<br />
<math>\,t_i</math> is the desired output vector and the summations inside the epsilon bracket are regularization terms to prevent overfitting of <math>\,H</math> and <math>\,V</math>.<br />
<br />
The researchers used stochastic gradient descent to prevent having to sum over millions of examples worth of error and this sped up training time.<br />
<br />
An issue the researchers ran into using this model was that it took a long time to calculate language model probabilities compared to traditional back-off n-grams model and reduced its suitability for real time predictions. To solve this issue, several optimization techniques were used.<br />
<br />
===Lattice rescoring===<br />
<br />
It is common to keep track of additional possible solutions instead of just the most obviously likely solution in a lattice structure, i.e. a tree like structure where branches can merge and each branch represents a possible solution. For example from the paper using a tri-gram model, i.e. predict third word from first two words, the following lattice structure was formed:<br />
<br />
[[File:Lattice.PNG]]<br />
<br />
Branches whose nodes share the same context words can be merged. For example, "a,problem" was merged in the middle of the lattice because the tri-gram model would estimate the same probability at that point for both branches. In contrast, "that_is,not" and "there_is,not" cannot be merged because the two preceding words used for prediction are different.<br />
<br />
After this structure is created with a traditional back-off n-grams model, the neural network is then used to re-score the lattice and the re-scored lattice is used to make predictions.<br />
<br />
===Short List===<br />
<br />
In any language, there is usually a small set of commonly used words that form almost all of written or spoken thought. The short-list idea is that rather than calculating every single probability for even the rarest words, the neural network only calculates a small subset of the most common words. This way, the output vector can be significantly shrunk from <math>\,\mbox{N}</math> to some much smaller number <math>\,\mbox{S}</math>.<br />
<br />
If any rare words do occur, their probabilities are calculated using the traditional back-off n-grams model. The formula then goes as follows from the paper:<br />
<br />
[[File:shortlist.PNG]]<br />
<br />
Where L is the event that <math>\,w_t</math> is in the short-list.<br />
<br />
===Sorting and Bunch===<br />
<br />
The neural network predicts all the probabilities based on some sequence of words. If the probability of two different sequences of words are required but their relationship is such that for sequence 1, <math>\,w=(w_1,\dots,w_{i-1},w_i)</math> and sequence 2, <math>\,w^'=(w_1,\dots,w_{i-1},w^'_i)</math>, they differ only in the last word. Then only a single feed through the neural network is required. This is because the output vector using the context <math>\,(w_1,\dots,w_{i-1})</math> would predict the probabilities for both <math>\,w_i</math> and <math>\,w^'_i</math> being next. Therefore it is efficient to merge any sequence who have the same context.<br />
<br />
Modern day computers are very optimized for linear algebra and it is more efficient to run multiple examples at the same time through the matrix equations. The researchers called this bunching and simple testing showed that this decreased processing time by a factor of 10 when using 128 examples at once compared to 1.<br />
<br />
= Training and Usage =<br />
<br />
The researchers used numerous optimization techniques during training and their results were summarized in the paper as follows:<br />
<br />
[[File:fast_training.PNG]]<br />
<br />
Since the model only trains to predict based on the last n-1 words, at certain points there will be less than n-1 words and adjustments must be made. The researchers considered two possibilities, using traditional models for these n-grams or filling up the n-k words with some filler word up to n-1. After some testing, they found that requests for small n-gram probabilities were pretty low and they decided to use traditional back-off n-gram model for these cases.<br />
<br />
= Results =<br />
<br />
In general the results were quite good. When this neural network + back-off n-grams hybrid was used in combination with a number of acoustic speech recognition models, they found that perplexity, lower the better, decreased by about 10% in a number of cases compared with traditional back-off n-grams only model. Some of their results are summarized as follows:<br />
<br />
[[File:results1.PNG]]<br />
<br />
[[File:results2.PNG]]<br />
<br />
= Source =<br />
Schwenk, H. Continuous space language models. Computer Speech<br />
Lang. 21, 492–518 (2007). ISIArticle</div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=continuous_space_language_models&diff=26534continuous space language models2015-11-18T22:06:52Z<p>Rtwang: </p>
<hr />
<div>= Introduction =<br />
In certain fields of study such as speech recognition or machine translation, for some acoustic signal <math>\,x</math> or the source sentence to be translated <math>\,e</math>, it is common to model these problems as finding the sequence of words <math>\,w^*</math> that has the highest probability of occurring given <math>\,x</math> or <math>\,e</math>. This can be written as:<br />
<br />
<math>w^* = arg\ \underset {w}{max} P(w|x) = arg\ \underset{w}{max} P(x|w)P(w)</math><br />
<br />
An acoustic or translation model can then be used for <math>\,P(x|w)</math>, similar to the idea behind LDA and QDA, and it remains to create a language model <math>\,P(w)</math> to estimate the probability of any sequence of words <math>\,w</math>.<br />
<br />
This is commonly done through the back-off n-grams model and the purpose behind this research paper is to use a neural network to better estimate <math>\,P(w)</math>.<br />
<br />
= Back-off n-grams Model =<br />
<br />
A sequence of words will be defined as <math>\,w^i_1=(w_1,w_2,\dots,w_i)</math> and the formula for the probability <math>\,P(w)</math> can be rewritten as:<br />
<br />
<math>P(w^n_1)=P(w_1,w_2,\dots,w_n)=P(w_1)\prod_{i=2}^n P(w_i|w^{i-1}_1)</math><br />
<br />
It is common to estimate <math>\,P(w_i|w^{i-1}_1)</math> through:<br />
<br />
<math>\,P(w_i|w^{i-1}_1)\approx\frac{\mbox{number of occurrence of the sequence} (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence} (w_1,\dots,w_{i-1})}</math><br />
<br />
However, it is practically impossible to have a training set large enough to contain every possible sequence of words if the sequence is long enough and some sequences would have an incorrect probability of 0 simply because it is not in the training set. This is known as the data sparseness problem. This problem is commonly resolved by considering only the last n-1 words instead of the whole context. However, even for small n, certain sequences could still be missing.<br />
<br />
To solve this issue, a technique called back-off n-grams is used and the general formula goes as follows:<br />
<br />
<math>\,P(w_i|w^{i-1}_1) = \begin{cases} <br />
\frac{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_{i-1})}, & \mbox{if number of occurrence of}\ (w_1,\dots,w_i)\ \mbox{is greater than some constant K} \\<br />
\alpha P(w_i|w^{i-1}_2), & \mbox{otherwise} <br />
\end{cases}</math><br />
<br />
<math>\,\alpha</math> is typically a discounting factor that is less than 1 to account for the lack of direct data. It usually depends on the word sequence.<br />
<br />
The general algorithm is then, if the data set does contain the sequence then calculate probability directly. Otherwise, apply a discounting factor and calculate the conditional probability with the first word in the sequence removed. For example, if the word sequence was "The dog barked" and it did not exist in the training set then the formula would be written as:<br />
<br />
<math>\,P(\mbox{barked}|\mbox{the,dog}) \approx \alpha P(\mbox{barked}|\mbox{dog})</math><br />
<br />
= Model =<br />
<br />
The researchers for this paper sought to find a better model for this probability than the back-off n-grams model. Their approach was to map the n-1 words sequence onto a multi-dimension continuous space using a layer of neural network followed by another layer to estimate the probabilities of all possible next words. The formulas and model goes as follows:<br />
<br />
For some sequence of n-1 words, encode each word using 1 of K encoding, i.e. 1 where the word is indexed and zero everywhere else. Label each 1 of K encoding by <math>(w_{j-n+1},\dots,w_j)</math> for some n-1 word sequence at the j'th word in some larger context.<br />
<br />
Let P be a projection matrix common to all n-1 words and let<br />
<br />
<math>\,a_i=Pw_{j-n+i},i=1,\dots,n-1</math><br />
<br />
Let H be the weight matrix from the projection layer to the hidden layer. The state of the hidden layer is then:<br />
<br />
<math>\,h=\tanh(Ha + b)</math> where <math>\,a</math> is the concatenation of all <math>\,a_i</math> and <math>\,b</math> is a bias vector<br />
<br />
Finally, the output vector would be:<br />
<br />
<math>\,o=Vh+k</math> where V is the weight matrix from hidden to output and k is another bias vector. <math>\,o</math> would be a vector with same dimensions as the total vocabulary size and the probabilities can be calculated from <math>\,o</math> by applying the softmax function.<br />
<br />
= Optimization and Training =<br />
The training was done with standard back-propagation on minimizing the error function:<br />
<br />
<math>\,E=-\sum_{i=1}^N t_i \log p_i + \epsilon(\sum_{i,j}h^2_{ij}+\sum_{i,j}v^2_{ij})</math><br />
<br />
<math>\,t_i</math> is the desired output vector and the summations inside the epsilon bracket are regularization terms to prevent overfitting of <math>\,H</math> and <math>\,V</math>.<br />
<br />
The researchers used stochastic gradient descent to prevent having to sum over millions of examples worth of error and this sped up training time.<br />
<br />
An issue the researchers ran into using this model was that it took a long time to calculate language model probabilities compared to traditional back-off n-grams model and reduced its suitability for real time predictions. To solve this issue, several optimization techniques were used.<br />
<br />
===Lattice rescoring===<br />
<br />
It is common to keep track of additional possible solutions instead of just the most obviously likely solution in a lattice structure, i.e. a tree like structure where branches can merge and each branch represents a possible solution. For example from the paper using a tri-gram model, i.e. predict third word from first two words, the following lattice structure was formed:<br />
<br />
[[File:Lattice.PNG]]<br />
<br />
Branches whose nodes share the same context words can be merged. For example, "a,problem" was merged in the middle of the lattice because the tri-gram model would estimate the same probability at that point for both branches. In contrast, "that_is,not" and "there_is,not" cannot be merged because the two preceding words used for prediction are different.<br />
<br />
After this structure is created with a traditional back-off n-grams model, the neural network is then used to re-score the lattice and the re-scored lattice is used to make predictions.<br />
<br />
===Short List===<br />
<br />
In any language, there is usually a small set of commonly used words that form almost all of written or spoken thought. The short-list idea is that rather than calculating every single probability for even the rarest words, the neural network only calculates a small subset of the most common words. This way, the output vector can be significantly shrunk from <math>\,\mbox{N}</math> to some much smaller number <math>\,\mbox{S}</math>.<br />
<br />
If any rare words do occur, their probabilities are calculated using the traditional back-off n-grams model. The formula then goes as follows from the paper:<br />
<br />
[[File:shortlist.PNG]]<br />
<br />
Where L is the event that <math>\,w_t</math> is in the short-list.<br />
<br />
===Sorting and Bunch===<br />
<br />
The neural network predicts all the probabilities based on some sequence of words. If the probability of two different sequences of words are required but their relationship is such that for sequence 1, <math>\,w=(w_1,\dots,w_{i-1},w_i)</math> and sequence 2, <math>\,w^'=(w_1,\dots,w_{i-1},w^'_i)</math>, they differ only in the last word. Then only a single feed through the neural network is required. This is because the output vector using the context <math>\,(w_1,\dots,w_{i-1})</math> would predict the probabilities for both <math>\,w_i</math> and <math>\,w^'_i</math> being next. Therefore it is efficient to merge any sequence who have the same context.<br />
<br />
Modern day computers are very optimized for linear algebra and it is more efficient to run multiple examples at the same time through the matrix equations. The researchers called this bunching and simple testing showed that this decreased processing time by a factor of 10 when using 128 examples at once compared to 1.<br />
<br />
= Training and Usage =<br />
<br />
The researchers used numerous optimization techniques during training and their results were summarized in the paper as follows:<br />
<br />
[[File:fast_training.PNG]]<br />
<br />
Since the model only trains to predict based on the last n-1 words, at certain points there will be less than n-1 words and adjustments must be made. The researchers considered two possibilities, using traditional models for these n-grams or filling up the n-k words with some filler word up to n-1. After some testing, they found that requests for small n-gram probabilities were pretty low and they decided to use traditional back-off n-gram model for these cases.<br />
<br />
= Results =<br />
<br />
In general the results were quite good. When this neural network + back-off n-grams hybrid was used in combination with a number of acoustic speech recognition models, they found that perplexity, lower the better, decreased by about 10% in a number of cases compared with traditional back-off n-grams only model. Some of their results are summarized as follows:<br />
<br />
[[File:results1.PNG]]<br />
<br />
[[File:results2.PNG]]<br />
<br />
= Source =<br />
Schwenk, H. Continuous space language models. Computer Speech<br />
Lang. 21, 492–518 (2007). ISIArticle</div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Results2.PNG&diff=26533File:Results2.PNG2015-11-18T22:06:08Z<p>Rtwang: </p>
<hr />
<div></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Results1.PNG&diff=26532File:Results1.PNG2015-11-18T22:05:52Z<p>Rtwang: </p>
<hr />
<div></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Fast_training.PNG&diff=26531File:Fast training.PNG2015-11-18T21:57:02Z<p>Rtwang: </p>
<hr />
<div></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Shortlist.PNG&diff=26530File:Shortlist.PNG2015-11-18T21:44:03Z<p>Rtwang: </p>
<hr />
<div></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=continuous_space_language_models&diff=26529continuous space language models2015-11-18T21:36:13Z<p>Rtwang: </p>
<hr />
<div>= Introduction =<br />
In certain fields of study such as speech recognition or machine translation, for some acoustic signal <math>\,x</math> or the source sentence to be translated <math>\,e</math>, it is common to model these problems as finding the sequence of words <math>\,w^*</math> that has the highest probability of occurring given <math>\,x</math> or <math>\,e</math>. This can be written as:<br />
<br />
<math>w^* = arg\ \underset {w}{max} P(w|x) = arg\ \underset{w}{max} P(x|w)P(w)</math><br />
<br />
An acoustic or translation model can then be used for <math>\,P(x|w)</math>, similar to the idea behind LDA and QDA, and it remains to create a language model <math>\,P(w)</math> to estimate the probability of any sequence of words <math>\,w</math>.<br />
<br />
This is commonly done through the back-off n-grams model and the purpose behind this research paper is to use a neural network to better estimate <math>\,P(w)</math>.<br />
<br />
= Back-off n-grams Model =<br />
<br />
A sequence of words will be defined as <math>\,w^i_1=(w_1,w_2,\dots,w_i)</math> and the formula for the probability <math>\,P(w)</math> can be rewritten as:<br />
<br />
<math>P(w^n_1)=P(w_1,w_2,\dots,w_n)=P(w_1)\prod_{i=2}^n P(w_i|w^{i-1}_1)</math><br />
<br />
It is common to estimate <math>\,P(w_i|w^{i-1}_1)</math> through:<br />
<br />
<math>\,P(w_i|w^{i-1}_1)\approx\frac{\mbox{number of occurrence of the sequence} (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence} (w_1,\dots,w_{i-1})}</math><br />
<br />
However, it is practically impossible to have a training set large enough to contain every possible sequence of words if the sequence is long enough and some sequences would have an incorrect probability of 0 simply because it is not in the training set. This is known as the data sparseness problem. This problem is commonly resolved by considering only the last n-1 words instead of the whole context. However, even for small n, certain sequences could still be missing.<br />
<br />
To solve this issue, a technique called back-off n-grams is used and the general formula goes as follows:<br />
<br />
<math>\,P(w_i|w^{i-1}_1) = \begin{cases} <br />
\frac{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_{i-1})}, & \mbox{if number of occurrence of}\ (w_1,\dots,w_i)\ \mbox{is greater than some constant K} \\<br />
\alpha P(w_i|w^{i-1}_2), & \mbox{otherwise} <br />
\end{cases}</math><br />
<br />
<math>\,\alpha</math> is typically a discounting factor that is less than 1 to account for the lack of direct data. It usually depends on the word sequence.<br />
<br />
The general algorithm is then: if the sequence occurs often enough in the training set, calculate the probability directly from the counts. Otherwise, apply a discounting factor and calculate the conditional probability with the first word of the context removed. For example, if the word sequence "the dog barked" did not occur in the training set, then the formula would be written as:<br />
<br />
<math>\,P(\mbox{barked}|\mbox{the,dog}) \approx \alpha P(\mbox{barked}|\mbox{dog})</math><br />
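<br />
The back-off rule can be sketched in Python as follows. This is only a minimal illustration, not the estimator used in the paper: the toy corpus, the function names <code>count_ngrams</code> and <code>backoff_prob</code>, the threshold <code>K = 0</code> and the constant <code>alpha = 0.4</code> are all assumptions made for the example (in practice <math>\,\alpha</math> depends on the context and is chosen so that the probabilities sum to one).<br />

```python
from collections import Counter


def count_ngrams(tokens, max_n):
    """Count every n-gram of length 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def backoff_prob(word, context, counts, total_tokens, K=0, alpha=0.4):
    """Estimate P(word | context) with a simplified back-off rule."""
    context = tuple(context)
    if not context:
        # Base case: unigram relative frequency.
        return counts[(word,)] / total_tokens
    full = context + (word,)
    if counts[full] > K:
        # Seen often enough: use the ratio of counts directly.
        return counts[full] / counts[context]
    # Otherwise drop the first context word and discount by alpha.
    return alpha * backoff_prob(word, context[1:], counts, total_tokens, K, alpha)


# Hypothetical toy corpus, used only for illustration.
tokens = "the dog barked and a dog barked again".split()
counts = count_ngrams(tokens, 3)

# "the dog barked" occurs in the corpus, so the count ratio is used.
p_seen = backoff_prob("barked", ["the", "dog"], counts, len(tokens))
# "and dog barked" does not occur, so the estimate backs off to alpha * P(barked | dog).
p_unseen = backoff_prob("barked", ["and", "dog"], counts, len(tokens))
```

The recursion mirrors the "otherwise" branch of the formula: each back-off step shortens the context by one word and multiplies by the discount.<br />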
<br />
= Model =<br />
<br />
The researchers for this paper sought to find a better model for this probability than the back-off n-grams model. Their approach was to map the n-1 words sequence onto a multi-dimension continuous space using a layer of neural network followed by another layer to estimate the probabilities of all possible next words. The formulas and model goes as follows:<br />
<br />
For some sequence of n-1 words, encode each word using 1-of-K encoding, i.e. a vector with 1 at the word's index and 0 everywhere else. Label the 1-of-K encodings by <math>(w_{j-n+1},\dots,w_{j-1})</math> for the n-1 word context preceding the j'th word in some larger text.<br />
<br />
Let P be a projection matrix common to all n-1 words and let<br />
<br />
<math>\,a_i=Pw_{j-n+i},i=1,\dots,n-1</math><br />
<br />
Let H be the weight matrix from the projection layer to the hidden layer; the state of the hidden layer is then:<br />
<br />
<math>\,h=\tanh(Ha + b)</math> where <math>\,a</math> is the concatenation of all <math>\,a_i</math> and <math>\,b</math> is some bias vector<br />
<br />
Finally, the output vector would be:<br />
<br />
<math>\,o=Vh+k</math> where V is the weight matrix from hidden to output and k is another bias vector. <math>\,o</math> is a vector with as many dimensions as there are words in the vocabulary, and the probabilities are obtained from <math>\,o</math> by applying the softmax function.<br />
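<br />
The forward pass through the projection, hidden and output layers can be sketched in plain Python as below. The layer sizes, the random weights and the helper names (<code>matvec</code>, <code>one_hot</code>, <code>forward</code>) are assumptions chosen for illustration and are not taken from the paper; only the three formulas for <math>\,a_i</math>, <math>\,h</math> and <math>\,o</math> come from the text.<br />

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(o):
    """Numerically stable softmax."""
    top = max(o)
    exps = [math.exp(x - top) for x in o]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(index, size):
    """1-of-K encoding: 1 at the word's index and 0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def forward(context_ids, P, H, b, V, k):
    vocab_size = len(P[0])
    # a_i = P w_i for each context word, concatenated into one vector a.
    a = []
    for idx in context_ids:
        a.extend(matvec(P, one_hot(idx, vocab_size)))
    # h = tanh(H a + b)
    h = [math.tanh(x + b_i) for x, b_i in zip(matvec(H, a), b)]
    # o = V h + k, followed by a softmax over the vocabulary.
    o = [x + k_i for x, k_i in zip(matvec(V, h), k)]
    return softmax(o)


# Hypothetical sizes: 6-word vocabulary, trigram model (n = 3).
rng = random.Random(0)
vocab_size, proj_dim, hidden_dim, n = 6, 3, 4, 3
P = random_matrix(proj_dim, vocab_size, rng)
H = random_matrix(hidden_dim, (n - 1) * proj_dim, rng)
b = [0.0] * hidden_dim
V = random_matrix(vocab_size, hidden_dim, rng)
k = [0.0] * vocab_size

probs = forward([2, 5], P, H, b, V, k)  # distribution over the next word
```

Because each input is a 1-of-K vector, multiplying by P simply selects one column of P, so in practice the projection is a table lookup rather than a full matrix product.<br />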
<br />
= Optimization and Training =<br />
The training was done with standard back-propagation on minimizing the error function:<br />
<br />
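<math>\,E=-\sum_{i=1}^N t_i \log p_i + \epsilon(\sum_{i,j}h^2_{ij}+\sum_{i,j}v^2_{ij})</math><br />
<br />
<math>\,t_i</math> is the desired output for word <math>\,i</math> (1 for the correct next word and 0 otherwise), <math>\,p_i</math> is the corresponding softmax output, and the sums multiplied by <math>\,\epsilon</math> are regularization terms over the entries of <math>\,H</math> and <math>\,V</math> that discourage overfitting.<br />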
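<br />
As a small worked example, the error for a single training example can be computed as below. The cross-entropy term is written here as a negative log-likelihood so that smaller values are better; the vectors, the tiny weight matrices and the value of <code>epsilon</code> are made-up numbers for illustration only.<br />

```python
import math


def regularized_error(t, p, H, V, epsilon):
    """Cross-entropy between target t and prediction p plus an L2 penalty on H and V."""
    # Only terms with t_i > 0 contribute, which also avoids evaluating log on unused entries.
    cross_entropy = -sum(t_i * math.log(p_i) for t_i, p_i in zip(t, p) if t_i > 0)
    penalty = sum(w * w for row in H for w in row) + sum(w * w for row in V for w in row)
    return cross_entropy + epsilon * penalty


# Hypothetical values: 3-word vocabulary, the second word is the correct one.
t = [0.0, 1.0, 0.0]
p = [0.2, 0.5, 0.3]
H = [[1.0, -1.0]]
V = [[0.5], [0.5], [0.0]]

E = regularized_error(t, p, H, V, epsilon=0.1)
```

Here the cross-entropy term is log 2 (about 0.693) and the penalty is 0.1 × 2.5 = 0.25.<br />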
<br />
The researchers used stochastic gradient descent to avoid summing the error over millions of examples for every update, which sped up training.<br />
<br />
An issue the researchers ran into with this model was that it took a long time to calculate language model probabilities compared to the traditional back-off n-grams model, which reduced its suitability for real-time prediction. To solve this issue, several optimization techniques were used.<br />
<br />
===Lattice rescoring===<br />
<br />
It is common to keep track of additional candidate solutions, instead of just the single most likely one, in a lattice structure, i.e. a tree-like structure where branches can merge and each path represents a possible solution. For example, using a trigram model, i.e. predicting the third word from the first two, the paper shows the following lattice structure:<br />
<br />
[[File:Lattice.PNG]]</div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Lattice.PNG&diff=26528File:Lattice.PNG2015-11-18T21:35:29Z<p>Rtwang: </p>
<hr />
<div></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=continuous_space_language_models&diff=26514continuous space language models2015-11-18T20:31:53Z<p>Rtwang: </p>
<hr />
<div>= Introduction =<br />
In certain fields of study such as speech recognition or machine translation, for some acoustic signal <math>\,x</math> or the source sentence to be translated <math>\,e</math>, it is common to model these problems as finding the sequence of words <math>\,w^*</math> that has the highest probability of occurring given <math>\,x</math> or <math>\,e</math>. This can be written as:<br />
<br />
<math>w^* = arg\ \underset {w}{max} P(w|x) = arg\ \underset{w}{max} P(x|w)P(w)</math><br />
<br />
An acoustic or translation model can then be used for <math>\,P(x|w)</math>, similar to the idea behind LDA and QDA, and it remains to create a language model <math>\,P(w)</math> to estimate the probability of any sequence of words <math>\,w</math>.<br />
<br />
This is commonly done through the back-off n-grams model and the purpose behind this research paper is to use a neural network to better estimate <math>\,P(w)</math>.<br />
<br />
= Back-off n-grams Model =<br />
<br />
A sequence of words will be defined as <math>\,w^i_1=(w_1,w_2,\dots,w_i)</math> and the formula for the probability <math>\,P(w)</math> can be rewritten as:<br />
<br />
<math>P(w^n_1)=P(w_1,w_2,\dots,w_n)=P(w_1)\prod_{i=2}^n P(w_i|w^{i-1}_1)</math><br />
<br />
It is common to estimate <math>\,P(w_i|w^{i-1}_1)</math> through:<br />
<br />
<math>\,P(w_i|w^{i-1}_1)\approx\frac{\mbox{number of occurrence of the sequence} (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence} (w_1,\dots,w_{i-1})}</math><br />
<br />
However, it is practically impossible to have a training set large enough to contain every possible sequence of words once the sequences are long enough, so some sequences would be assigned an incorrect probability of 0 simply because they do not appear in the training set. This is known as the data sparseness problem. It is commonly mitigated by conditioning only on the last n-1 words instead of the whole context. However, even for small n, certain sequences could still be missing.<br />
<br />
To solve this issue, a technique called back-off n-grams is used and the general formula goes as follows:<br />
<br />
<math>\,P(w_i|w^{i-1}_1) = \begin{cases} <br />
\frac{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_{i-1})}, & \mbox{if number of occurrence of}\ (w_1,\dots,w_i)\ \mbox{is greater than some constant K} \\<br />
\alpha P(w_i|w^{i-1}_2), & \mbox{otherwise} <br />
\end{cases}</math><br />
<br />
<math>\,\alpha</math> is typically a discounting factor that is less than 1 to account for the lack of direct data. It usually depends on the word sequence.<br />
<br />
The general algorithm is then: if the sequence occurs often enough in the training set, calculate the probability directly from the counts. Otherwise, apply a discounting factor and calculate the conditional probability with the first word of the context removed. For example, if the word sequence "the dog barked" did not occur in the training set, then the formula would be written as:<br />
<br />
<math>\,P(\mbox{barked}|\mbox{the,dog}) \approx \alpha P(\mbox{barked}|\mbox{dog})</math><br />
<br />
= Model =<br />
<br />
The researchers for this paper sought to find a better model for this probability than the back-off n-grams model. Their approach was to map the n-1 words sequence onto a multi-dimension continuous space using a layer of neural network followed by another layer to estimate the probabilities of all possible next words. The formulas and model goes as follows:<br />
<br />
For some sequence of n-1 words, encode each word using 1-of-K encoding, i.e. a vector with 1 at the word's index and 0 everywhere else. Label the 1-of-K encodings by <math>(w_{j-n+1},\dots,w_{j-1})</math> for the n-1 word context preceding the j'th word in some larger text.<br />
<br />
Let P be a projection matrix common to all n-1 words and let<br />
<br />
<math>\,a_i=Pw_{j-n+i},i=1,\dots,n-1</math><br />
<br />
Let H be the weight matrix from the projection layer to the hidden layer; the state of the hidden layer is then:<br />
<br />
<math>\,h=\tanh(Ha + b)</math> where <math>\,a</math> is the concatenation of all <math>\,a_i</math> and <math>\,b</math> is some bias vector<br />
<br />
Finally, the output vector would be:<br />
<br />
<math>\,o=Vh+k</math> where V is the weight matrix from hidden to output and k is another bias vector. <math>\,o</math> is a vector with as many dimensions as there are words in the vocabulary, and the probabilities are obtained from <math>\,o</math> by applying the softmax function.<br />
<br />
= Optimization and Training =<br />
The training was done with standard back-propagation on minimizing the error function:<br />
<br />
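<math>\,E=-\sum_{i=1}^N t_i \log p_i + \epsilon(\sum_{i,j}h^2_{ij}+\sum_{i,j}v^2_{ij})</math><br />
<br />
<math>\,t_i</math> is the desired output for word <math>\,i</math> (1 for the correct next word and 0 otherwise), <math>\,p_i</math> is the corresponding softmax output, and the sums multiplied by <math>\,\epsilon</math> are regularization terms over the entries of <math>\,H</math> and <math>\,V</math> that discourage overfitting.<br />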
<br />
The researchers applied several optimization techniques to speed up the training process.</div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=continuous_space_language_models&diff=26501continuous space language models2015-11-18T19:35:33Z<p>Rtwang: </p>
<hr />
<div>= Introduction =<br />
In certain fields of study such as speech recognition or machine translation, for some acoustic signal <math>\,x</math> or the source sentence to be translated <math>\,e</math>, it is common to model these problems as finding the sequence of words <math>\,w^*</math> that has the highest probability of occurring given <math>\,x</math> or <math>\,e</math>. This can be written as:<br />
<br />
<math>w^* = arg\ \underset {w}{max} P(w|x) = arg\ \underset{w}{max} P(x|w)P(w)</math><br />
<br />
An acoustic or translation model can then be used for <math>\,P(x|w)</math>, similar to the idea behind LDA and QDA, and it remains to create a language model <math>\,P(w)</math> to estimate the probability of any sequence of words <math>\,w</math>.<br />
<br />
This is commonly done through the back-off n-grams model and the purpose behind this research paper is to use a neural network to better estimate <math>\,P(w)</math>.<br />
<br />
= Back-off n-grams Model =<br />
<br />
A sequence of words will be defined as <math>\,w^i_1=(w_1,w_2,\dots,w_i)</math> and the formula for the probability <math>\,P(w)</math> can be rewritten as:<br />
<br />
<math>P(w^n_1)=P(w_1,w_2,\dots,w_n)=P(w_1)\prod_{i=2}^n P(w_i|w^{i-1}_1)</math><br />
<br />
It is common to estimate <math>\,P(w_i|w^{i-1}_1)</math> through:<br />
<br />
<math>\,P(w_i|w^{i-1}_1)\approx\frac{\mbox{number of occurrence of the sequence} (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence} (w_1,\dots,w_{i-1})}</math><br />
<br />
However, it is practically impossible to have a training set large enough to contain every possible sequence of words once the sequences are long enough, so some sequences would be assigned an incorrect probability of 0 simply because they do not appear in the training set. This is known as the data sparseness problem. It is commonly mitigated by conditioning only on the last n-1 words instead of the whole context. However, even for small n, certain sequences could still be missing.<br />
<br />
To solve this issue, a technique called back-off n-grams is used and the general formula goes as follows:<br />
<br />
<math>\,P(w_i|w^{i-1}_1) = \begin{cases} <br />
\frac{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_{i-1})}, & \mbox{if number of occurrence of}\ (w_1,\dots,w_i)\ \mbox{is greater than some constant K} \\<br />
\alpha P(w_i|w^{i-1}_2), & \mbox{otherwise} <br />
\end{cases}</math><br />
<br />
<math>\,\alpha</math> is typically a discounting factor that is less than 1 to account for the lack of direct data. It usually depends on the word sequence.<br />
<br />
The general algorithm is then: if the sequence occurs often enough in the training set, calculate the probability directly from the counts. Otherwise, apply a discounting factor and calculate the conditional probability with the first word of the context removed. For example, if the word sequence "the dog barked" did not occur in the training set, then the formula would be written as:<br />
<br />
<math>\,P(\mbox{barked}|\mbox{the,dog}) \approx \alpha P(\mbox{barked}|\mbox{dog})</math><br />
<br />
= Model =<br />
<br />
The researchers for this paper sought to find a better model for this probability than the back-off n-grams model. Their approach was to map the n-1 words sequence onto a multi-dimension continuous space using a layer of neural network followed by another layer to estimate the probabilities of all possible next words. The formulas and model goes as follows:<br />
<br />
For some sequence of n-1 words, encode each word using 1-of-K encoding, i.e. a vector with 1 at the word's index and 0 everywhere else. Label the 1-of-K encodings by <math>(w_{j-n+1},\dots,w_{j-1})</math> for the n-1 word context preceding the j'th word in some larger text.<br />
<br />
Let P be a projection matrix common to all n-1 words and let<br />
<br />
<math>\,a_i=Pw_{j-n+i},i=1,\dots,n-1</math></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=continuous_space_language_models&diff=26488continuous space language models2015-11-18T19:04:41Z<p>Rtwang: Created page with "= Introduction = In certain fields of study such as speech recognition or machine translation, for some acoustic signal <math>\,x</math> or the source sentence to be translated <..."</p>
<hr />
<div>= Introduction =<br />
In certain fields of study such as speech recognition or machine translation, for some acoustic signal <math>\,x</math> or the source sentence to be translated <math>\,e</math>, it is common to model these problems as finding the sequence of words <math>\,w^*</math> that has the highest probability of occurring given <math>\,x</math> or <math>\,e</math>. This can be written as:<br />
<br />
<math>w^* = arg\ \underset {w}{max} P(w|x) = arg\ \underset{w}{max} P(x|w)P(w)</math><br />
<br />
An acoustic or translation model can then be used for <math>\,P(x|w)</math>, similar to the idea behind LDA and QDA, and it remains to create a language model <math>\,P(w)</math> to estimate the probability of any sequence of words <math>\,w</math>.<br />
<br />
This is commonly done through the back-off n-grams model and the purpose behind this research paper is to use a neural network to better estimate <math>\,P(w)</math>.<br />
<br />
= Back-off n-grams Model =<br />
<br />
A sequence of words will be defined as <math>\,w^i_1=(w_1,w_2,\dots,w_i)</math> and the formula for the probability <math>\,P(w)</math> can be rewritten as:<br />
<br />
<math>P(w^n_1)=P(w_1,w_2,\dots,w_n)=P(w_1)\prod_{i=2}^n P(w_i|w^{i-1}_1)</math><br />
<br />
It is common to estimate <math>\,P(w_i|w^{i-1}_1)</math> through:<br />
<br />
<math>\,P(w_i|w^{i-1}_1)\approx\frac{number\ of\ occurrence\ of\ the\ sequence\ (w_1,\dots,w_i)}{number\ of\ occurrence\ of\ the\ sequence\ (w_1,\dots,w_{i-1})}</math><br />
<br />
However, it is practically impossible to have a training set large enough to contain every possible sequence of words once the sequences are long enough, so some sequences would be assigned an incorrect probability of 0 simply because they do not appear in the training set. This is known as the data sparseness problem. It is commonly mitigated by conditioning only on the last n-1 words instead of the whole context. However, even for small n, certain sequences could still be missing.<br />
<br />
To solve this issue, a technique called back-off n-grams is used and the general formula goes as follows:<br />
<br />
<math>\,P(w_i|w^{i-1}_1) = \begin{cases} <br />
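\frac{\mbox{number of occurrences of the sequence}\ (w_1,\dots,w_i)}{\mbox{number of occurrences of the sequence}\ (w_1,\dots,w_{i-1})}, & \mbox{if number of occurrences of}\ (w_1,\dots,w_i)\ \mbox{is greater than some constant K} \\<br />
\alpha P(w_i|w^{i-1}_2), & \mbox{otherwise} <br />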
\end{cases}</math><br />
<br />
= Theory =</div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=show,_Attend_and_Tell:_Neural_Image_Caption_Generation_with_Visual_Attention&diff=26467show, Attend and Tell: Neural Image Caption Generation with Visual Attention2015-11-18T16:51:18Z<p>Rtwang: /* Decoder: Long Short-Term Memory Network */</p>
<hr />
<div>= Introduction =<br />
<br />
This paper<ref><br />
Xu, Kelvin, et al. [http://arxiv.org/pdf/1502.03044v2.pdf "Show, attend and tell: Neural image caption generation with visual attention."] arXiv preprint arXiv:1502.03044 (2015).<br />
</ref> introduces an attention based model that automatically learns to describe the content of images. It is able to focus on salient parts of the image while generating the corresponding word in the output sentence. A visualization is provided showing which part of the image was attended to to generate each specific word in the output. This can be used to get a sense of what is going on in the model and is especially useful for understanding the kinds of mistakes it makes. The model is tested on three datasets, Flickr8k, Flickr30k, and MS COCO.<br />
<br />
= Motivation =<br />
Caption generation, which requires compressing large amounts of salient visual information into descriptive language, was recently improved by combining convolutional neural networks with recurrent neural networks. Using representations from the top layer of a convolutional net, which distill the information in an image down to its most salient objects, can lose information that would be useful for richer, more descriptive captions. Retaining this information by using lower-level representations was the motivation for the current work.<br />
<br />
= Contributions = <br />
<br />
* Two attention-based image caption generators using a common framework. A "soft" deterministic attention mechanism and a "hard" stochastic mechanism.<br />
* Show how to gain insight and interpret results of this framework by visualizing "where" and "what" the attention focused on.<br />
* Quantitatively validate the usefulness of attention in caption generation with state of the art performance on three datasets (Flickr8k, Flickr30k, and MS COCO)<br />
<br />
= Model =<br />
<br />
[[File:AttentionNetwork.png]]<br />
<br />
== Encoder: Convolutional Features ==<br />
<br />
The model takes in a single image and generates a caption of arbitrary length. The caption is a sequence of [http://stackoverflow.com/questions/17469835/one-hot-encoding-for-machine-learning one-hot encoded words] (binary vectors) from a given vocabulary.<br />
<br />
== Decoder: Long Short-Term Memory Network ==<br />
<br />
[[File:AttentionLSTM.png]]<br />
<br />
The purpose of the LSTM is to output a sequence of 1-of-K encodings represented as:<br />
<br />
<math>y=\{y_1,\dots,y_C\},y_i\in\mathbb{R}^K</math>, where C is the length of the caption and K is the vocabulary size<br />
<br />
To generate this sequence of outputs, a set of feature vectors was extracted from the image using a convolutional neural network and represented as:<br />
<br />
<math>a=\{a_1,\dots,a_L\},a_i\in\mathbb{R}^D</math>, where L is the number of feature vectors and D is the dimension of each feature vector extracted by the convolutional neural network<br />
<br />
Let <math>T_{s,t} : \mathbb{R}^s \to \mathbb{R}^t </math> be a simple affine transformation, i.e. <math>\,Wx + b</math> for some projection weight matrix W and some bias vector b learned as parameters in the LSTM.<br />
<br />
The equations for the LSTM can then be simplified as:<br />
<br />
<math>\begin{pmatrix}i_t\\f_t\\o_t\\g_t\end{pmatrix}=\begin{pmatrix}\sigma\\\sigma\\\sigma\\tanh\end{pmatrix}T_{D+m+n,n}\begin{pmatrix}Ey_{t-1}\\h_{t-1}\\\hat z_{t}\end{pmatrix}</math><br />
<br />
<math>c_t=f_t\odot c_{t-1} + i_t \odot g_t</math><br />
<br />
<math>h_t=o_t \odot tanh(c_t)</math><br />
<br />
where <math>\,i_t,f_t,o_t,g_t,c_t,h_t</math> correspond to the values and gate labels in the diagram. Additionally, <math>\,\sigma</math> is the logistic sigmoid function, and both it and <math>\,tanh</math> are applied element-wise in the first equation.<br />
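<br />
One step of these LSTM equations can be sketched in plain Python as below. This is an illustrative sketch rather than the authors' implementation: the dimensions, the random weights and the function names are assumptions, and the single affine map is assumed to produce all four gate pre-activations at once (a weight matrix with 4n rows).<br />

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, x):
    """Compute W x + b for a matrix given as a list of rows."""
    return [sum(w * x_j for w, x_j in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def lstm_step(Ey_prev, h_prev, z_hat, c_prev, W, b):
    """One LSTM step on the concatenated input (E y_{t-1}, h_{t-1}, z_hat_t)."""
    n = len(h_prev)
    x = Ey_prev + h_prev + z_hat               # list concatenation
    pre = affine(W, b, x)                      # 4n pre-activations
    i = [sigmoid(v) for v in pre[0:n]]         # input gate
    f = [sigmoid(v) for v in pre[n:2 * n]]     # forget gate
    o = [sigmoid(v) for v in pre[2 * n:3 * n]] # output gate
    g = [math.tanh(v) for v in pre[3 * n:4 * n]]  # candidate values
    c = [f_j * c_j + i_j * g_j for f_j, c_j, i_j, g_j in zip(f, c_prev, i, g)]
    h = [o_j * math.tanh(c_j) for o_j, c_j in zip(o, c)]
    return h, c


# Hypothetical sizes: embedding m = 2, LSTM state n = 3, feature dimension D = 4.
rng = random.Random(0)
m, n, D = 2, 3, 4
W = [[rng.uniform(-0.5, 0.5) for _ in range(m + n + D)] for _ in range(4 * n)]
b = [0.0] * (4 * n)

h, c = lstm_step(
    Ey_prev=[0.1, -0.2],
    h_prev=[0.0] * n,
    z_hat=[0.3, 0.0, -0.1, 0.2],
    c_prev=[0.0] * n,
    W=W,
    b=b,
)
```

Since the output gate lies in (0, 1) and tanh lies in (-1, 1), every entry of the new hidden state stays strictly between -1 and 1.<br />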
<br />
<br />
At each time step, the LSTM outputs the relative probability of every word in the vocabulary given the context vector, the previous hidden state and the previously generated word. This is done through additional feedforward layers between the LSTM layers and the output layer, known as a deep output layer setup, which take the LSTM state <math>\,h_t</math> and apply additional transformations to get the relative probabilities:<br />
<br />
<math>p(y_t|a,y_1^{t-1})\propto \exp(L_o(Ey_{t-1}+L_hh_t+L_z\hat z_t))</math><br />
<br />
where <math>L_o\in\mathbb{R}^{K\times m},L_h\in\mathbb{R}^{m\times n},L_z\in\mathbb{R}^{m\times D},E\in\mathbb{R}^{m\times K}</math> are randomly initialized parameters that are learned through training the LSTM, with m the embedding dimension and n the LSTM dimension. This series of matrix and vector multiplications results in a vector of dimension K where each element represents the relative probability of the word indexed by that element being next in the sequence of outputs.<br />
<br />
<br />
<math>\hat{z}</math> is the context vector, which is a function of the feature vectors <math>a=\{a_1,\dots,a_L\}</math> and the attention model as discussed in the next section.<br />
<br />
== Attention: Two Variants ==<br />
<br />
The attention algorithm is one of the arguments that influences the state of the LSTM. There are two variants of the attention algorithm used: stochastic "hard" and deterministic "soft" attention. The visual differences between the two can be seen in the "Properties" section.<br />
<br />
Stochastic "hard" attention builds the context vector <math>\hat{z}</math> from a one-hot encoded location variable <math>s_{t,i}</math> and the extracted features <math>a_{i}</math>, so that a single feature location is selected at each time step. It is called "hard" attention because a hard choice is made among the feature locations, and it is stochastic because <math>s_{t,i}</math> is sampled from a multinoulli distribution [http://cs.brown.edu/courses/cs195-5/spring2012/lectures/2012-01-31_probabilityDecisions.pdf (see page 11 of this link for an explanation of the distribution)].<br />
<br />
Deterministic "soft" attention instead uses the expectation of the context vector, i.e. a weighted average of all the feature vectors. It is deterministic since <math>s_{t,i}</math> is not sampled from a distribution, and it is soft since no individual location is chosen; the whole distribution over locations is used instead.<br />
<br />
The actual optimization methods for both of these attention methods are outside the scope of this summary.<br />
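<br />
Leaving the optimization aside, the way the two variants form the context vector can be sketched as below. The attention weights <code>alpha</code> are simply assumed to be given here (in the paper they are produced by the attention model), and the feature vectors, weights and function names are made up for illustration.<br />

```python
import random


def soft_context(alpha, features):
    """Soft attention: expected context vector, a weighted average of all features."""
    dim = len(features[0])
    return [sum(a_i * f[d] for a_i, f in zip(alpha, features)) for d in range(dim)]


def hard_context(alpha, features, rng):
    """Hard attention: sample one location from the distribution alpha and use its feature."""
    idx = rng.choices(range(len(features)), weights=alpha, k=1)[0]
    return list(features[idx]), idx


# Hypothetical example with L = 3 locations and D = 2 feature dimensions.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha = [0.5, 0.25, 0.25]  # assumed attention weights, summing to 1

z_soft = soft_context(alpha, features)
z_hard, chosen = hard_context(alpha, features, random.Random(0))
```

The soft context is the same on every call, while the hard context changes with the sampled location; averaging many hard samples would approach the soft context.<br />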
<br />
== Properties ==<br />
<br />
"where" the network looks next depends on the sequence of words that has already been generated.<br />
<br />
The attention framework learns latent alignments from scratch instead of explicitly using object detectors. This allows the model to go beyond "objectness" and learn to attend to abstract concepts.<br />
<br />
[[File:AttentionHighlights.png]]<br />
<br />
== Training ==<br />
<br />
Two regularization techniques were used: dropout and early stopping on the BLEU score.<br />
<br />
The MS COCO dataset has more than 5 reference sentences for some of the images, while the Flickr datasets have exactly 5. For consistency, the reference sentences for all images in the MS COCO dataset were truncated to 5. Some basic tokenization was also applied to the MS COCO dataset to be consistent with the tokenization in the Flickr datasets.<br />
<br />
On the largest dataset (MS COCO), the attention model took less than 3 days to train on an NVIDIA Titan Black GPU.<br />
<br />
= Results =<br />
<br />
Results are reported with the [https://en.wikipedia.org/wiki/BLEU BLEU] and [https://en.wikipedia.org/wiki/METEOR METEOR] metrics. BLEU is one of the most common metrics for translation tasks, but because it has drawn some criticism, METEOR is reported as well. Both of these metrics are designed for evaluating machine translation, which is typically from one language to another. Caption generation can be thought of as analogous to translation, where the image is a sentence in the original 'language' and the caption is its translation to English (or another language, but in this case the captions are only in English). <br />
<br />
[[File:AttentionResults.png]]<br />
<br />
[[File:AttentionGettingThingsRight.png]]<br />
<br />
[[File:AttentionGettingThingsWrong.png]]<br />
<br />
=References=<br />
<references /></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=show,_Attend_and_Tell:_Neural_Image_Caption_Generation_with_Visual_Attention&diff=26465show, Attend and Tell: Neural Image Caption Generation with Visual Attention2015-11-18T16:48:01Z<p>Rtwang: /* Decoder: Long Short-Term Memory Network */</p>
<hr />
<div>= Introduction =<br />
<br />
This paper<ref><br />
Xu, Kelvin, et al. [http://arxiv.org/pdf/1502.03044v2.pdf "Show, attend and tell: Neural image caption generation with visual attention."] arXiv preprint arXiv:1502.03044 (2015).<br />
</ref> introduces an attention based model that automatically learns to describe the content of images. It is able to focus on salient parts of the image while generating the corresponding word in the output sentence. A visualization is provided showing which part of the image was attended to to generate each specific word in the output. This can be used to get a sense of what is going on in the model and is especially useful for understanding the kinds of mistakes it makes. The model is tested on three datasets, Flickr8k, Flickr30k, and MS COCO.<br />
<br />
= Motivation =<br />
Caption generation, which requires compressing large amounts of salient visual information into descriptive language, was recently improved by combining convolutional neural networks with recurrent neural networks. Using representations from the top layer of a convolutional net, which distill the information in an image down to its most salient objects, can lose information that would be useful for richer, more descriptive captions. Retaining this information by using lower-level representations was the motivation for the current work.<br />
<br />
= Contributions = <br />
<br />
* Two attention-based image caption generators using a common framework. A "soft" deterministic attention mechanism and a "hard" stochastic mechanism.<br />
* Show how to gain insight and interpret results of this framework by visualizing "where" and "what" the attention focused on.<br />
* Quantitatively validate the usefulness of attention in caption generation with state of the art performance on three datasets (Flickr8k, Flickr30k, and MS COCO)<br />
<br />
= Model =<br />
<br />
[[File:AttentionNetwork.png]]<br />
<br />
== Encoder: Convolutional Features ==<br />
<br />
The model takes in a single image and generates a caption of arbitrary length. The caption is a sequence of [http://stackoverflow.com/questions/17469835/one-hot-encoding-for-machine-learning one-hot encoded words] (binary vectors) from a given vocabulary.<br />
<br />
== Decoder: Long Short-Term Memory Network ==<br />
<br />
[[File:AttentionLSTM.png]]<br />
<br />
The purpose of the LSTM is to output a sequence of 1-of-K encodings represented as:<br />
<br />
<math>y=\{y_1,\dots,y_C\},y_i\in\mathbb{R}^K</math>, where C is the length of the caption and K is the vocabulary size<br />
<br />
To generate this sequence of outputs, a set of feature vectors was extracted from the image using a convolutional neural network and represented as:<br />
<br />
<math>a=\{a_1,\dots,a_L\},a_i\in\mathbb{R}^D</math>, where L is the number of feature vectors and D is the dimension of each feature vector extracted by the convolutional neural network<br />
<br />
Let <math>T_{s,t} : \mathbb{R}^s \to \mathbb{R}^t </math> be a simple affine transformation, i.e. <math>\,Wx + b</math> for some projection weight matrix W and some bias vector b learned as parameters in the LSTM.<br />
<br />
The equations for the LSTM can then be simplified as:<br />
<br />
<math>\begin{pmatrix}i_t\\f_t\\o_t\\g_t\end{pmatrix}=\begin{pmatrix}\sigma\\\sigma\\\sigma\\tanh\end{pmatrix}T_{D+m+n,n}\begin{pmatrix}Ey_{t-1}\\h_{t-1}\\\hat z_{t}\end{pmatrix}</math><br />
<br />
<math>c_t=f_t\odot c_{t-1} + i_t \odot g_t</math><br />
<br />
<math>h_t=o_t \odot tanh(c_t)</math><br />
<br />
where <math>\,i_t,f_t,o_t,g_t,c_t,h_t</math> correspond to the values and gate labels in the diagram. Additionally, <math>\,\sigma</math> is the logistic sigmoid function, and both it and <math>\,tanh</math> are applied element-wise in the first equation.<br />
<br />
<br />
At each time step, the LSTM outputs the relative probability of every word in the vocabulary given the context vector, the previous hidden state and the previously generated word. This is done through additional feedforward layers between the LSTM layers and the output layer, known as a deep output layer setup, which take the LSTM state <math>\,h_t</math> and apply additional transformations to get the relative probabilities:<br />
<br />
<math>p(y_t|a,y_1^{t-1})\propto \exp(L_o(Ey_{t-1}+L_hh_t+L_z\hat z_t))</math><br />
<br />
where <math>L_o\in\mathbb{R}^{K\times m},L_h\in\mathbb{R}^{m\times n},L_z\in\mathbb{R}^{m\times D},E\in\mathbb{R}^{m\times K}</math> are randomly initialized parameters that are learned through training the LSTM, with m the embedding dimension and n the LSTM dimension. This series of matrix and vector multiplications results in a vector of dimension K where each element represents the relative probability of the word indexed by that element being next in the sequence of outputs.<br />
<br />
<br />
<math>\hat{z}</math> is the context vector, which is a function of the feature vectors <math>a=\{a_1,\dots,a_L\}</math> and the attention model as discussed in the next section.<br />
<br />
== Attention: Two Variants ==<br />
<br />
The attention algorithm is one of the arguments that influences the state of the LSTM. There are two variants of the attention algorithm used: stochastic "hard" and deterministic "soft" attention. The visual differences between the two can be seen in the "Properties" section.<br />
<br />
Stochastic "hard" attention builds the context vector <math>\hat{z}</math> from a one-hot encoded location variable <math>s_{t,i}</math> and the extracted features <math>a_{i}</math>, so that a single feature location is selected at each time step. It is called "hard" attention because a hard choice is made among the feature locations, and it is stochastic because <math>s_{t,i}</math> is sampled from a multinoulli distribution [http://cs.brown.edu/courses/cs195-5/spring2012/lectures/2012-01-31_probabilityDecisions.pdf (see page 11 of this link for an explanation of the distribution)].<br />
<br />
Deterministic "soft" attention instead uses the expectation of the context vector, i.e. a weighted average of all the feature vectors. It is deterministic since <math>s_{t,i}</math> is not sampled from a distribution, and it is soft since no individual location is chosen; the whole distribution over locations is used instead.<br />
<br />
The actual optimization methods for both of these attention methods are outside the scope of this summary.<br />
<br />
== Properties ==<br />
<br />
"where" the network looks next depends on the sequence of words that has already been generated.<br />
<br />
The attention framework learns latent alignments from scratch instead of explicitly using object detectors. This allows the model to go beyond "objectness" and learn to attend to abstract concepts.<br />
<br />
[[File:AttentionHighlights.png]]<br />
<br />
== Training ==<br />
<br />
Two regularization techniques were used: dropout and early stopping on the BLEU score.<br />
<br />
The MS COCO dataset has more than 5 reference sentences for some of the images, while the Flickr datasets have exactly 5. For consistency, the reference sentences for all images in the MS COCO dataset were truncated to 5. Some basic tokenization was also applied to the MS COCO dataset to be consistent with the tokenization in the Flickr datasets.<br />
<br />
On the largest dataset (MS COCO), the attention model took less than 3 days to train on an NVIDIA Titan Black GPU.<br />
<br />
= Results =<br />
<br />
Results are reported with the [https://en.wikipedia.org/wiki/BLEU BLEU] and [https://en.wikipedia.org/wiki/METEOR METEOR] metrics. BLEU is one of the most common metrics for translation tasks, but because it has drawn some criticism, METEOR is reported as well. Both of these metrics are designed for evaluating machine translation, which is typically from one language to another. Caption generation can be thought of as analogous to translation, where the image is a sentence in the original 'language' and the caption is its translation to English (or another language, but in this case the captions are only in English). <br />
<br />
[[File:AttentionResults.png]]<br />
<br />
[[File:AttentionGettingThingsRight.png]]<br />
<br />
[[File:AttentionGettingThingsWrong.png]]<br />
<br />
=References=<br />
<references /></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=show,_Attend_and_Tell:_Neural_Image_Caption_Generation_with_Visual_Attention&diff=26343show, Attend and Tell: Neural Image Caption Generation with Visual Attention2015-11-17T02:11:08Z<p>Rtwang: /* Decoder: Long Short-Term Memory Network */</p>
<hr />
<div>= Introduction =<br />
<br />
This paper<ref><br />
Xu, Kelvin, et al. [http://arxiv.org/pdf/1502.03044v2.pdf "Show, attend and tell: Neural image caption generation with visual attention."] arXiv preprint arXiv:1502.03044 (2015).<br />
</ref> introduces an attention based model that automatically learns to describe the content of images. It is able to focus on salient parts of the image while generating the corresponding word in the output sentence. A visualization is provided showing which part of the image was attended to in order to generate each specific word in the output. This can be used to get a sense of what is going on in the model and is especially useful for understanding the kinds of mistakes it makes. The model is tested on three datasets: Flickr8k, Flickr30k, and MS COCO.<br />
<br />
= Motivation =<br />
Caption generation, which requires compressing huge amounts of salient visual information into descriptive language, was recently improved by combining convolutional neural networks with recurrent neural networks. Using representations from the top layer of a convolutional net, which distill the information in an image down to the most salient objects, can lose information that would be useful for richer, more descriptive captions. Retaining this information by using lower-level representations was the motivation for the current work.<br />
<br />
= Contributions = <br />
<br />
* Two attention-based image caption generators using a common framework. A "soft" deterministic attention mechanism and a "hard" stochastic mechanism.<br />
* Show how to gain insight and interpret results of this framework by visualizing "where" and "what" the attention focused on.<br />
* Quantitatively validate the usefulness of attention in caption generation with state of the art performance on three datasets (Flickr8k, Flickr30k, and MS COCO)<br />
<br />
= Model =<br />
<br />
[[File:AttentionNetwork.png]]<br />
<br />
== Encoder: Convolutional Features ==<br />
<br />
The model takes in a single image and generates a caption of arbitrary length. The caption is a sequence of [http://stackoverflow.com/questions/17469835/one-hot-encoding-for-machine-learning one-hot encoded words] (binary vectors) from a given vocabulary.<br />
<br />
== Decoder: Long Short-Term Memory Network ==<br />
<br />
[[File:AttentionLSTM.png]]<br />
<br />
The purpose of the LSTM is to output a sequence of 1-of-K encodings represented as:<br />
<br />
<math>y=\{y_1,\dots,y_C\},y_i\in\mathbb{R}^K</math>, where C is the length of the caption and K is the vocabulary size.<br />
<br />
To generate this sequence of outputs, a set of feature vectors was extracted from the image using a convolutional neural network and represented as:<br />
<br />
<math>a=\{a_1,\dots,a_L\},a_i\in\mathbb{R}^D</math>, where L is the number of feature vectors (one per image location) and D is the dimension of each feature vector extracted by the convolutional neural network.<br />
<br />
<br />
At each time step, the LSTM outputs the relative probability of every single word in the vocabulary given a context vector, the previous hidden state and the previously generated word:<br />
<br />
<math>p(y_t|a,y_1^{t-1})\propto \exp(L_o(Ey_{t-1}+L_hh_t+L_z\hat z_t))</math><br />
<br />
where <math>L_o\in\mathbb{R}^{K\times m},L_h\in\mathbb{R}^{m\times n},L_z\in\mathbb{R}^{m\times D},E\in\mathbb{R}^{m\times K}</math> are randomly initialized parameters that are learned through training the LSTM. The researchers also included additional feedforward layers between the LSTM layers and the output layer that take the output of the LSTM and apply additional transformations. This series of matrix and vector multiplications then results in a vector of dimension K where each element represents the relative probability of the word indexed by that element being next in the sequence of outputs.<br />
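<br />
As a rough illustration of the output computation above, the following sketch evaluates <math>L_o(Ey_{t-1}+L_hh_t+L_z\hat z_t)</math> and normalizes it with a softmax. The tiny dimensions (K=3, m=2, n=2, D=2) and all numeric values are made up for the example, and the helper names are hypothetical; the additional feedforward layers mentioned above are left out.<br />
<br />
```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def vadd(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(xs) for xs in zip(*vectors)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def word_distribution(L_o, L_h, L_z, E, y_prev, h_t, z_t):
    """Normalized version of exp(L_o(E y_prev + L_h h_t + L_z z_t))."""
    inner = vadd(matvec(E, y_prev), matvec(L_h, h_t), matvec(L_z, z_t))
    return softmax(matvec(L_o, inner))

# Toy parameters (assumed values, chosen only to show the shapes).
E = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]       # m x K
L_h = [[0.5, -0.2], [0.1, 0.3]]               # m x n
L_z = [[0.2, 0.0], [-0.3, 0.6]]               # m x D
L_o = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # K x m

y_prev = [0.0, 1.0, 0.0]   # one-hot encoding of the previous word
h_t = [0.4, -0.1]          # hidden state
z_t = [0.7, 0.2]           # context vector

probs = word_distribution(L_o, L_h, L_z, E, y_prev, h_t, z_t)
```
<br />
The result <code>probs</code> has one entry per vocabulary word and sums to one, which matches the description of a K-dimensional vector of relative word probabilities.<br />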
<br />
<br />
<math>\hat{z}</math> is the context vector, which is a function of the feature vectors <math>a=\{a_1,\dots,a_L\}</math> and the attention model as discussed in the next section.<br />
<br />
== Attention: Two Variants ==<br />
<br />
The attention algorithm is one of the arguments that influences the state of the LSTM. There are two variants of the attention algorithm used: stochastic "hard" and deterministic "soft" attention. The visual differences between the two can be seen in the "Properties" section.<br />
<br />
Stochastic "hard" attention means learning to maximize the context vector <math>\hat{z}</math> from a combination of a one-hot encoded variable <math>s_{t,i}</math> and the extracted features <math>a_{i}</math>. This is called "hard" attention because a hard choice is made at each feature; it is stochastic since <math>s_{t,i}</math> is sampled from a multinoulli distribution [http://cs.brown.edu/courses/cs195-5/spring2012/lectures/2012-01-31_probabilityDecisions.pdf (see page 11 of this link for an explanation of the distribution)].<br />
<br />
Deterministic "soft" attention means learning by maximizing the expectation of the context vector. It is deterministic since <math>s_{t,i}</math> is not sampled from a distribution, and it is soft since the whole distribution is optimized rather than the individual choices.<br />
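<br />
The difference between the two variants can be sketched as follows. This is only an illustration: it assumes the attention weights (called <code>alpha</code> here, a name not used in this summary) come from a softmax over some attention scores, and the scores and feature vectors are made-up values. The hard variant samples one location, while the soft variant takes the weighted average of all locations.<br />
<br />
```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_context(features, alpha):
    """Expected context vector: sum over i of alpha_i * a_i."""
    dim = len(features[0])
    return [sum(w * a[d] for w, a in zip(alpha, features)) for d in range(dim)]

def hard_context(features, alpha, rng):
    """Sample one location i from a multinoulli with weights alpha, return a_i."""
    index = rng.choices(range(len(features)), weights=alpha, k=1)[0]
    return features[index]

# L = 3 image locations, D = 2 feature dimensions (toy values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha = softmax([2.0, 0.5, 0.5])   # hypothetical attention scores

z_soft = soft_context(features, alpha)
z_hard = hard_context(features, alpha, random.Random(0))
```
<br />
Here <code>z_soft</code> blends all three feature vectors, whereas <code>z_hard</code> is always exactly one of them, which is why the hard variant needs a sampling-based training procedure.<br />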
<br />
The actual optimization methods for both of these attention methods are outside the scope of this summary.<br />
<br />
== Properties ==<br />
<br />
"Where" the network looks next depends on the sequence of words that has already been generated.<br />
<br />
The attention framework learns latent alignments from scratch instead of explicitly using object detectors. This allows the model to go beyond "objectness" and learn to attend to abstract concepts.<br />
<br />
[[File:AttentionHighlights.png]]<br />
<br />
== Training ==<br />
<br />
Two regularization techniques were used: dropout and early stopping on BLEU score.<br />
<br />
The MS COCO dataset has more than 5 reference sentences for some of the images, while the Flickr datasets have exactly 5. For consistency, the reference sentences for all images in the MS COCO dataset were truncated to 5. Some basic tokenization was also applied to the MS COCO dataset to be consistent with the tokenization in the Flickr datasets.<br />
<br />
On the largest dataset (MS COCO), the attention model took less than 3 days to train on an NVIDIA Titan Black GPU.<br />
<br />
= Results =<br />
<br />
Results are reported with the [https://en.wikipedia.org/wiki/BLEU BLEU] and [https://en.wikipedia.org/wiki/METEOR METEOR] metrics. BLEU is one of the most common metrics for translation tasks, but because it has drawn some criticism, a second metric is used as well. Both of these metrics are designed for evaluating machine translation, which is typically from one language to another. Caption generation can be thought of as analogous to translation, where the image is a sentence in the original 'language' and the caption is its translation to English (or another language, but in this case the captions are only in English).<br />
<br />
[[File:AttentionResults.png]]<br />
<br />
[[File:AttentionGettingThingsRight.png]]<br />
<br />
[[File:AttentionGettingThingsWrong.png]]<br />
<br />
=References=<br />
<references /></div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=f15Stat946PaperSignUp&diff=26262f15Stat946PaperSignUp2015-11-15T18:03:07Z<p>Rtwang: </p>
<hr />
<div> <br />
=[https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/listofpapers1.pdf List of Papers]=<br />
<br />
= Record your contributions [https://docs.google.com/spreadsheets/d/1A_0ej3S6ns3bBMwWLS4pwA6zDLz_0Ivwujj-d1Gr9eo/edit?usp=sharing here:]=<br />
<br />
Use the following notations:<br />
<br />
S: You have written a summary on the paper<br />
<br />
T: You had technical contribution on a paper (excluding the paper that you present from set A or critique from set B)<br />
<br />
E: You had editorial contribution on a paper (excluding the paper that you present from set A or critique from set B)<br />
<br />
[http://goo.gl/forms/RASFRZXoxJ Your feedback on presentations]<br />
<br />
<br />
=Set A=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="400pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Oct 16 || Pascal Poupart || || Guest Lecturer||||<br />
|-<br />
|Oct 16 ||Pascal Poupart || ||Guest Lecturer ||||<br />
|-<br />
|Oct 23 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 23 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 23 ||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Oct 23 || Deepak Rishi || || Parsing natural scenes and natural language with recursive neural networks || [http://www-nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf Paper] || [[Parsing natural scenes and natural language with recursive neural networks | Summary]]<br />
|-<br />
|Oct 30 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 30 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 30 ||Rui Qiao || ||Going deeper with convolutions || [http://arxiv.org/pdf/1409.4842v1.pdf Paper]|| [[GoingDeeperWithConvolutions|Summary]]<br />
|-<br />
|Oct 30 ||Amirreza Lashkari|| 21 ||Overfeat: integrated recognition, localization and detection using convolutional networks. || [http://arxiv.org/pdf/1312.6229v4.pdf Paper]|| [[Overfeat: integrated recognition, localization and detection using convolutional networks|Summary]]<br />
|-<br />
|Makeup Class (TBA) || Peter Blouw|| ||Memory Networks.|| [http://arxiv.org/abs/1410.3916 Paper]|| [[Memory Networks|Summary]]<br />
|-<br />
|Nov 6 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Nov 6 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Nov 6 || Anthony Caterini ||56 || Human-level control through deep reinforcement learning ||[http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf Paper]|| [[Human-level control through deep reinforcement learning|Summary]]<br />
|-<br />
|Nov 6 || Sean Aubin || ||Learning Hierarchical Features for Scene Labeling ||[http://yann.lecun.com/exdb/publis/pdf/farabet-pami-13.pdf Paper]||[[Learning Hierarchical Features for Scene Labeling|Summary]]<br />
|-<br />
|Nov 13|| Mike Hynes || 12 ||Speech recognition with deep recurrent neural networks || [http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf Paper] || [[Graves et al., Speech recognition with deep recurrent neural networks|Summary]]<br />
|-<br />
|Nov 13 || Tim Tse || || Question Answering with Subgraph Embeddings || [http://arxiv.org/pdf/1406.3676v3.pdf Paper] || [[Question Answering with Subgraph Embeddings | Summary ]]<br />
|-<br />
|Nov 13 || Maysum Panju || ||Neural machine translation by jointly learning to align and translate ||[http://arxiv.org/pdf/1409.0473v6.pdf Paper] || [[Neural Machine Translation: Jointly Learning to Align and Translate|Summary]]<br />
|-<br />
|Nov 13 || Abdullah Rashwan || || Deep neural networks for acoustic modeling in speech recognition. ||[http://research.microsoft.com/pubs/171498/HintonDengYuEtAl-SPM2012.pdf paper]|| [[Deep neural networks for acoustic modeling in speech recognition| Summary]]<br />
|-<br />
|Nov 20 || Valerie Platsko || ||Natural language processing (almost) from scratch. ||[http://arxiv.org/pdf/1103.0398.pdf Paper]|| [[Natural language processing (almost) from scratch. | Summary]]<br />
|-<br />
|Nov 20 || Brent Komer || ||Show, Attend and Tell: Neural Image Caption Generation with Visual Attention || [http://arxiv.org/pdf/1502.03044v2.pdf Paper]||[[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention|Summary]]<br />
|-<br />
|Nov 20 || Luyao Ruan || || Dropout: A Simple Way to Prevent Neural Networks from Overfitting || [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf Paper]|| [[dropout | Summary]]<br />
|-<br />
|Nov 20 || Ali Mahdipour || || The human splicing code reveals new insights into the genetic determinants of disease ||[https://www.sciencemag.org/content/347/6218/1254806.full.pdf Paper] || [[Genetics | Summary]]<br />
|-<br />
|Nov 27 ||Mahmood Gohari || ||Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships ||[http://pubs.acs.org/doi/abs/10.1021/ci500747n.pdf Paper]||<br />
|-<br />
|Nov 27 || Derek Latremouille || ||The Wake-Sleep Algorithm for Unsupervised Neural Networks || [http://www.gatsby.ucl.ac.uk/~dayan/papers/hdfn95.pdf Paper] ||<br />
|-<br />
|Nov 27 ||Xinran Liu || ||ImageNet Classification with Deep Convolutional Neural Networks ||[http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Paper]||[[ImageNet Classification with Deep Convolutional Neural Networks|Summary]]<br />
|-<br />
|Nov 27 ||Ali Sarhadi|| ||Strategies for Training Large Scale Neural Network Language Models||||<br />
|-<br />
|Dec 4 || Chris Choi || || On the difficulty of training recurrent neural networks || [http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf Paper] || [[On the difficulty of training recurrent neural networks | Summary]]<br />
|-<br />
|Dec 4 || Fatemeh Karimi || ||MULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION||[http://arxiv.org/pdf/1412.7755v2.pdf Paper]||<br />
|-<br />
|Dec 4 || Jan Gosmann || || On the Number of Linear Regions of Deep Neural Networks || [http://arxiv.org/abs/1402.1869 Paper] || [[On the Number of Linear Regions of Deep Neural Networks | Summary]]<br />
|-<br />
|Dec 4 || Dylan Drover || || Towards AI-complete question answering: a set of prerequisite toy tasks || [http://arxiv.org/pdf/1502.05698.pdf Paper] ||<br />
|-<br />
|}<br />
|}<br />
<br />
=Set B=<br />
<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="400pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Anthony Caterini ||15 ||The Manifold Tangent Classifier ||[http://papers.nips.cc/paper/4409-the-manifold-tangent-classifier.pdf Paper]||<br />
|-<br />
|Jan Gosmann || || Neural Turing machines || [http://arxiv.org/abs/1410.5401 Paper] || [[Neural Turing Machines|Summary]]<br />
|-<br />
|Brent Komer || || Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers || [http://arxiv.org/pdf/1202.2160v2.pdf Paper] ||<br />
|-<br />
|Sean Aubin || || Deep Sparse Rectifier Neural Networks || [http://jmlr.csail.mit.edu/proceedings/papers/v15/glorot11a/glorot11a.pdf Paper] || [[Deep Sparse Rectifier Neural Networks|Summary]]<br />
|-<br />
|Peter Blouw|| || Generating text with recurrent neural networks || [http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf Paper] ||<br />
|-<br />
|Tim Tse|| || From Machine Learning to Machine Reasoning || [http://research.microsoft.com/pubs/206768/mlj-2013.pdf Paper] || [[From Machine Learning to Machine Reasoning | Summary ]]<br />
|-<br />
|Rui Qiao|| || Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation || [http://arxiv.org/pdf/1406.1078v3.pdf Paper] || [[Learning Phrase Representations|Summary]]<br />
|-<br />
|Fatemeh Karimi|| 23 || Very Deep Convolutional Networks for Large-Scale Image Recognition || [http://arxiv.org/pdf/1409.1556.pdf Paper] || [[Very Deep Convoloutional Networks for Large-Scale Image Recognition|Summary]]<br />
|-<br />
|Amirreza Lashkari|| 43 || Distributed Representations of Words and Phrases and their Compositionality || [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf Paper] || [[Distributed Representations of Words and Phrases and their Compositionality|Summary]]<br />
|-<br />
|Xinran Liu|| 19 || Joint training of a convolutional network and a graphical model for human pose estimation || [http://papers.nips.cc/paper/5573-joint-training-of-a-convolutional-network-and-a-graphical-model-for-human-pose-estimation.pdf Paper] || [[Joint training of a convolutional network and a graphical model for human pose estimation|Summary]]<br />
|-<br />
|Chris Choi|| || Learning Long-Range Vision for Autonomous Off-Road Driving || [http://yann.lecun.com/exdb/publis/pdf/hadsell-jfr-09.pdf Paper] || [[Learning Long-Range Vision for Autonomous Off-Road Driving|Summary]]<br />
|-<br />
|Luyao Ruan|| || Deep Learning of the tissue-regulated splicing code || [http://bioinformatics.oxfordjournals.org/content/30/12/i121.full.pdf+html Paper] || [[Deep Learning of the tissue-regulated splicing code| Summary]]<br />
|-<br />
|Abdullah Rashwan|| || Deep Convolutional Neural Networks For LVCSR || [http://www.cs.toronto.edu/~asamir/papers/icassp13_cnn.pdf paper] || [[Deep Convolutional Neural Networks For LVCSR| Summary]]<br />
|-<br />
|Mahmood Gohari||37 || On using very large target vocabulary for neural machine translation || [http://arxiv.org/pdf/1412.2007v2.pdf paper] || [[On using very large target vocabulary for neural machine translation| Summary]]<br />
|-<br />
|Valerie Platsko|| || Learning Convolutional Feature Hierarchies for Visual Recognition || [http://papers.nips.cc/paper/4133-learning-convolutional-feature-hierarchies-for-visual-recognition Paper] || [[Learning Convolutional Feature Hierarchies for Visual Recognition | Summary]]<br />
|-<br />
|Derek Latremouille|| || Learning fast approximations of sparse coding || [http://yann.lecun.com/exdb/publis/pdf/gregor-icml-10.pdf Paper] || [[Learning fast approximations of sparse coding | Summary]]<br />
|-<br />
|Ri Wang|| || Continuous space language models || [https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenSemester2_2009_10/sdarticle.pdf Paper] || [[Continuous space language models | Summary]]</div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946f15/Sequence_to_sequence_learning_with_neural_networks&diff=25376stat946f15/Sequence to sequence learning with neural networks2015-10-17T17:35:23Z<p>Rtwang: </p>
<hr />
<div>= Introduction =<br />
The emergence of the Internet and other modern technology has greatly increased people's ability to communicate across vast distances and barriers. However, the fundamental barrier of language remains and, as anyone who has attempted to learn a new language can attest, it takes a tremendous amount of work to learn more than one language past childhood. The ability to translate between languages quickly and efficiently would therefore be of great importance. This is an extremely difficult problem, however, as languages can have varying grammar and context always plays an important role. For example, the word "back" means entirely different things in the following two sentences,<br />
<br />
<blockquote><br />
I am in the back of the car.<br />
</blockquote><br />
<br />
<blockquote><br />
My back hurts.<br />
</blockquote><br />
<br />
Deep neural networks have proven to be very capable of solving other difficult problems, such as reproducing sound waves from videos (need source), and a sufficiently complex neural network might provide an excellent solution in this case as well. The purpose of the paper is to apply multi-layer long short-term memory neural networks to this machine translation problem and assess the translation accuracy of this approach.<br />
<br />
= Model =<br />
=== Long Short-Term Memory Recurrent Neural Network (LSTM) ===<br />
Recurrent neural networks are a variation of deep neural networks that are capable of storing information about previous hidden states in special memory layers. Unlike feed forward neural networks, which take in a single fixed length vector as input and produce a fixed length vector as output, recurrent neural networks can take in a sequence of fixed length vectors as input because of their ability to store information and maintain a connection between inputs through this memory layer. Previous inputs have no impact on the current output of a feed forward neural network, whereas they can influence the current output of a recurrent neural network.<br />
<br />
<br />
This form of input fits naturally with language translation since sentences are sequences of words, and many problems regarding representing variable length sentences as fixed length vectors can be avoided. However, training recurrent neural networks to learn long time lag dependencies, where inputs many time steps back can heavily influence the current output, is difficult and generally results in exploding or vanishing gradients. A variation of recurrent neural networks, the long short-term memory neural network, was used instead in this paper as it does not suffer as much from the vanishing gradient problem.<br />
<br />
<br />
The purpose of LSTM in this case is to estimate the conditional probability of the output sequence, <math>\,(y_1,\cdots,y_{T'})</math>, based on the input sequence, <math>\,(x_1,\cdots,x_{T})</math>, where <math>\,T</math> does not have to equal <math>\,T'</math><br />
<br />
<br />
Let <math>\,v</math> represent the state of hidden layers after <math>\,(x_1,\cdots,x_{T})</math> have been inputted into the LSTM, i.e. what has been stored in the neural network's memory, then<br />
<br />
<math>\,p(y_1,\cdots,y_{T'}|x_1,\cdots,x_{T})=\prod_{t=1}^{T'} p(y_t|v,y_1,\cdots,y_{t-1})</math><br />
<br />
For each <math>\,p(y_t|v,y_1,\cdots,y_{t-1})</math>, the LSTM neural network at time step <math>\,t</math>, after <math>\,(x_1,\cdots,x_T,y_1,\cdots,y_{t-1})</math> have been inputted, outputs the relative probability of each word in the vocabulary, and the softmax function, <math>\,\frac{e^{x_b}}{\sum_{j=1}^N e^{x_j}}\,</math>, can be applied to this output vector to generate the corresponding probabilities, where <math>\,x_b</math> is the output for word <math>\,b</math> and <math>\,N</math> is the vocabulary size. From this, we can calculate any <math>\,p(y_1,\cdots,y_{T'}|x_1,\cdots,x_{T})</math> by repeatedly adding <math>\,y_t</math> as input into the LSTM neural network to calculate the new set of probabilities.<br />
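<br />
The factorization above can be illustrated with a short sketch. It assumes the per-step output vectors of the network are already available as plain lists (the values below are invented), applies the softmax at each step, and sums the log probabilities of the chosen words, which equals the log of the product in the formula.<br />
<br />
```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(step_outputs, target_ids):
    """Sum over t of log p(y_t | v, y_1..y_{t-1}).

    step_outputs[t] is the network's output vector at step t (one score
    per vocabulary word) and target_ids[t] is the index of word y_t.
    """
    total = 0.0
    for scores, word_id in zip(step_outputs, target_ids):
        probs = softmax(scores)
        total += math.log(probs[word_id])
    return total

# Toy example: vocabulary of 2 words, output sequence of length 2.
uniform_outputs = [[0.0, 0.0], [0.0, 0.0]]
log_p = sequence_log_prob(uniform_outputs, [0, 1])
```
<br />
With uniform scores over two words, each step contributes <math>\,\log 0.5</math>, so the sequence probability is 0.25.<br />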
<br />
The objective function used during the training process was:<br />
<br />
<math>\,\frac{1}{|T_r|}\sum_{(S,T)\in T_r} \log p(T|S)\,</math><br />
<br />
where <math>\,S</math> is the base/source sentence, <math>\,T</math> is the paired translated sentence and <math>\,T_r</math> is the total training set. Training maximizes this objective, which is the average log probability of a correct translation <math>\,T</math> given the base/source sentence <math>\,S</math> over the entire training set.<br />
<br />
=== Input and Output Data Transformation ===<br />
About 12 million English-French sentence pairs were used during training, with a vocabulary of 160,000 words for English and 80,000 words for French. Any unknown words were replaced with a special token. An <EOS> token was appended to every sentence to indicate the end of the sentence.<br />
<br />
Additionally, input sentences were entered backwards, as the researchers found this significantly increased accuracy. For example, using the sentence "Today I went to lectures.", the input order would be "lectures,to,went,I,Today". They suspect this is because reversing the input reduces the time lag between the beginning of the input sentence and the beginning of the output sentence.<br />
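<br />
A minimal sketch of this preprocessing step is shown below. The function name is hypothetical, and placing the <EOS> token after the reversed source words is an assumption made for the example rather than a detail stated in this summary.<br />
<br />
```python
def prepare_pair(source_tokens, target_tokens, eos="<EOS>"):
    """Reverse the source sentence and append an end-of-sentence token to both."""
    reversed_source = list(reversed(source_tokens)) + [eos]  # assumed EOS placement
    target = list(target_tokens) + [eos]
    return reversed_source, target

source, target = prepare_pair(
    ["Today", "I", "went", "to", "lectures"],
    ["Aujourd'hui", "je", "suis", "allé", "aux", "cours"],  # illustrative translation
)
```
<br />
After reversal the first source word, "Today", sits right next to the start of the target sentence, which is the reduction in time lag that the researchers point to.<br />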
<br />
To decode a translation after training, a simple left-to-right beam search algorithm is used. A small number of partial translations with the highest probabilities are kept at each step. Each partial translation is re-entered into the LSTM independently and extended with possible next words, and only the small set of extensions with the highest probabilities is retained. Whenever the <EOS> token is appended to a partial translation, that completed sentence is moved to the final translation set. Once the search ends, the final set is ranked and the highest ranking translation is chosen.<br />
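<br />
The decoding procedure can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: <code>next_log_probs</code> stands in for the trained LSTM, and the toy probability table, the beam size and the length limit are all invented for the example.<br />
<br />
```python
import math

def beam_search(next_log_probs, beam_size, eos, max_len):
    """Left-to-right beam search.

    next_log_probs(prefix) must return a dict mapping each candidate
    next token to its log probability given the prefix (a tuple).
    Returns the best (tokens, log_probability) pair found.
    """
    beams = [((), 0.0)]   # partial translations still being extended
    finished = []         # translations that already ended with eos
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, log_p in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + log_p))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                finished.append((prefix, score))   # leaves the beam
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # keep hypotheses cut off by max_len
    return max(finished, key=lambda c: c[1])

def toy_model(prefix):
    """Hypothetical stand-in for the LSTM's next-word distribution."""
    table = {
        (): {"le": 0.6, "la": 0.4},
        ("le",): {"chat": 0.55, "<EOS>": 0.45},
        ("la",): {"maison": 0.9, "<EOS>": 0.1},
    }
    dist = table.get(prefix, {"<EOS>": 1.0})
    return {token: math.log(p) for token, p in dist.items()}

best_tokens, best_score = beam_search(toy_model, beam_size=2, eos="<EOS>", max_len=5)
```
<br />
In this toy example a greedy decoder would start with "le" because it is the most likely first word, but keeping two hypotheses lets the search find "la maison <EOS>", whose overall probability (0.36) is higher than that of "le chat <EOS>" (0.33).<br />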
<br />
<br />
= Training and Results =<br />
=== Training Method ===<br />
Two LSTM neural networks were used overall: one to generate a fixed vector representation from the input sequence and another to generate the output sequence from the fixed vector representation. Each neural network had 4 layers and 1000 cells per layer, and <math>\,v</math> can be represented by the 8000 real numbers stored across the cells' memory after the input sequence has been entered. Stochastic gradient descent with a batch size of 128 and a learning rate of 0.7 was used. Initial parameters were drawn from a uniform distribution between -0.08 and 0.08.<br />
=== Results ===<br />
The resulting LSTM neural networks outperformed a standard SMT system with a BLEU score of 34.8 against 33.3 and, with certain heuristics or modifications, came very close to matching the best performing system. Additionally, they could recognize sentences in active and passive voice as being similar.<br />
<blockquote><br />
Active Voice: I ate an apple.<br />
</blockquote><br />
<blockquote><br />
Passive Voice: The apple was eaten by me.<br />
</blockquote><br />
<br />
Lastly, the model proved quite capable of translating long sentences despite the potentially long delay between input time steps.<br />
<br />
= Source =<br />
Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning
with neural networks. In Proc. Advances in Neural Information<br />
Processing Systems 27 3104–3112 (2014).</div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946f15/Sequence_to_sequence_learning_with_neural_networks&diff=25366stat946f15/Sequence to sequence learning with neural networks2015-10-17T00:38:57Z<p>Rtwang: </p>
<hr />
<div>= Introduction =<br />
The emergence of the Internet and other modern technology has greatly increased people's ability to communicate across vast distances and barriers. However, the fundamental barrier of language remains and, as anyone who has attempted to learn a new language can attest, it takes a tremendous amount of work to learn more than one language past childhood. The ability to translate between languages quickly and efficiently would therefore be of great importance. This is an extremely difficult problem, however, as languages can have varying grammar and context always plays an important role. For example, the word "back" means entirely different things in the following two sentences,<br />
<br />
<blockquote><br />
I am in the back of the car.<br />
</blockquote><br />
<br />
<blockquote><br />
My back hurts.<br />
</blockquote><br />
<br />
Deep neural networks have proven to be very capable of solving other difficult problems, such as reproducing sound waves from videos (need source), and a sufficiently complex neural network might provide an excellent solution in this case as well. The purpose of the paper is to apply multi-layer long short-term memory neural networks to this machine translation problem and assess the translation accuracy of this approach.<br />
<br />
= Model =<br />
=== Long Short-Term Memory Recurrent Neural Network (LSTM) ===<br />
Recurrent neural networks are a variation of deep neural networks that are capable of storing information about previous hidden states in special memory layers. Unlike feed forward neural networks, which take in a single fixed length vector as input and produce a fixed length vector as output, recurrent neural networks can take in a sequence of fixed length vectors as input because of their ability to store information and maintain a connection between inputs through this memory layer. Previous inputs have no impact on the current output of a feed forward neural network, whereas they can influence the current output of a recurrent neural network.<br />
<br />
<br />
This form of input fits naturally with language translation since sentences are sequences of words, and many problems regarding representing variable length sentences as fixed length vectors can be avoided. However, training recurrent neural networks to learn long time lag dependencies, where inputs many time steps back can heavily influence the current output, is difficult and generally results in exploding or vanishing gradients. A variation of recurrent neural networks, the long short-term memory neural network, was used instead in this paper as it does not suffer as much from the vanishing gradient problem.<br />
<br />
<br />
The purpose of LSTM in this case is to estimate the conditional probability of the output sequence, <math>\,(y_1,\cdots,y_{T'})</math>, based on the input sequence, <math>\,(x_1,\cdots,x_{T})</math>, where <math>\,T</math> does not have to equal <math>\,T'</math><br />
<br />
<br />
Let <math>\,v</math> represent the state of hidden layers after <math>\,(x_1,\cdots,x_{T})</math> have been inputted into the LSTM, i.e. what has been stored in the neural network's memory, then<br />
<br />
<math>\,p(y_1,\cdots,y_{T'}|x_1,\cdots,x_{T})=\prod_{t=1}^{T'} p(y_t|v,y_1,\cdots,y_{t-1})</math><br />
<br />
For each <math>\,p(y_t|v,y_1,\cdots,y_{t-1})</math>, the LSTM neural network at time step <math>\,t</math>, after <math>\,(x_1,\cdots,x_T,y_1,\cdots,y_{t-1})</math> have been inputted, outputs the relative probability of each word in the vocabulary, and the softmax function, <math>\,\frac{e^{x_b}}{\sum_{j=1}^N e^{x_j}}\,</math>, can be applied to this output vector to generate the corresponding probabilities, where <math>\,x_b</math> is the output for word <math>\,b</math> and <math>\,N</math> is the vocabulary size. From this, we can calculate any <math>\,p(y_1,\cdots,y_{T'}|x_1,\cdots,x_{T})</math> by repeatedly adding <math>\,y_t</math> as input into the LSTM neural network to calculate the new set of probabilities.<br />
<br />
The objective function used during the training process was:<br />
<br />
<math>\,\frac{1}{|T_r|}\sum_{(S,T)\in T_r} \log p(T|S)\,</math><br />
<br />
where <math>\,S</math> is the base/source sentence, <math>\,T</math> is the paired translated sentence and <math>\,T_r</math> is the total training set. Training maximizes this objective, which is the average log probability of a correct translation <math>\,T</math> given the base/source sentence <math>\,S</math> over the entire training set.<br />
<br />
=== Input and Output Data Transformation ===<br />
About 12 million English-French sentence pairs were used during training, with a vocabulary of 160,000 words for English and 80,000 words for French. Any unknown words were replaced with a special token. An <EOS> token was appended to every sentence to indicate the end of the sentence.<br />
<br />
Additionally, input sentences were entered backwards, as the researchers found this significantly increased accuracy. For example, using the sentence "Today I went to lectures.", the input order would be "lectures,to,went,I,Today". They suspect this is because reversing the input reduces the time lag between the beginning of the input sentence and the beginning of the output sentence.<br />
<br />
To decode a translation after training, a simple left-to-right beam search algorithm is used. A small number of partial translations with the highest probabilities are kept at each step. Each partial translation is re-entered into the LSTM independently and extended with possible next words, and only the small set of extensions with the highest probabilities is retained. Whenever the <EOS> token is appended to a partial translation, that completed sentence is moved to the final translation set. Once the search ends, the final set is ranked and the highest ranking translation is chosen.<br />
<br />
<br />
= Training and Results =<br />
=== Training Method ===<br />
Two LSTM neural networks were used overall: one to generate a fixed vector representation from the input sequence and another to generate the output sequence from the fixed vector representation. Each neural network had 4 layers and 1000 cells per layer, and <math>\,v</math> can be represented by the 8000 real numbers stored across the cells' memory after the input sequence has been entered. Stochastic gradient descent with a batch size of 128 and a learning rate of 0.7 was used. Initial parameters were drawn from a uniform distribution between -0.08 and 0.08.<br />
=== Results ===<br />
The resulting LSTM neural networks outperformed a standard SMT system, with a BLEU score of 34.8 against 33.3, and with certain heuristics or modifications came very close to matching the best-performing system. Additionally, the model could recognize sentences in active and passive voice as being similar.<br />
<blockquote><br />
Active Voice: I ate an apple.<br />
</blockquote><br />
<blockquote><br />
Passive Voice: The apple was eaten by me.<br />
</blockquote><br />
<br />
Lastly, it proved quite capable of translating long sentences despite potentially long delay between input time steps.</div>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946f15/Sequence_to_sequence_learning_with_neural_networks&diff=25365stat946f15/Sequence to sequence learning with neural networks2015-10-17T00:14:10Z<p>Rtwang: </p>
<hr />
<div>= Introduction =<br />
The emergence of the Internet and other modern technology has greatly increased people's ability to communicate across vast distances and barriers. However, the fundamental barrier of language remains, and as anyone who has attempted to learn a new language can attest, it takes a tremendous amount of work to learn more than one language past childhood. The ability to translate between languages quickly and efficiently would therefore be of great importance. This is an extremely difficult problem, however, as languages can have varying grammar and context always plays an important role. For example, the word "back" means entirely different things in the following two sentences:<br />
<br />
<blockquote><br />
I am in the back of the car.<br />
</blockquote><br />
<br />
<blockquote><br />
My back hurts.<br />
</blockquote><br />
<br />
Deep neural networks have proven to be very capable of solving other difficult problems, such as reproducing sound waves from videos (need source), and a sufficiently complex neural network might provide an excellent solution in this case as well. The purpose of the paper is to apply multi-layer long short-term memory neural networks to the machine translation problem and to assess the translation accuracy of this approach.<br />
<br />
= Model =<br />
=== Long Short-Term Memory Recurrent Neural Network (LSTM) ===<br />
Recurrent neural networks are a variation of deep neural networks that can store information about previous hidden states in special memory layers. Unlike feed-forward neural networks, which take a single fixed-length vector as input and produce a fixed-length vector as output, recurrent neural networks can take a sequence of fixed-length vectors as input, because the memory layer lets them store information and maintain a connection between inputs. In a feed-forward neural network, previous inputs have no impact on the current output, whereas in a recurrent neural network they can.<br />
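The role of the memory can be illustrated with a toy example. The sketch below uses a plain scalar recurrent unit rather than an LSTM, and its weights are arbitrary values chosen for illustration; it only shows that an earlier input changes a later state and that inputs of any length can be processed.<br />

```python
import math

def rnn_final_state(inputs, w_in=0.5, w_rec=0.9):
    """Scalar recurrent unit: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h = 0.0   # the memory starts empty
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
    return h

# The last input is 0.0 in both cases, but the final states differ because
# the first input is still carried in the hidden state.
print(rnn_final_state([1.0, 0.0]))
print(rnn_final_state([0.0, 0.0]))

# Sequences of different lengths are handled by the same function.
print(rnn_final_state([1.0, 0.5, -0.5, 0.0]))
```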
<br />
<br />
This form of input fits naturally with language translation, since sentences are sequences of words, and it avoids many of the problems of representing variable-length sentences as fixed-length vectors. However, training recurrent neural networks to learn long time-lag dependencies, where inputs many time steps back can heavily influence the current output, is difficult and generally results in exploding or vanishing gradients. A variation of recurrent neural networks, the long short-term memory (LSTM) neural network, was used instead in this paper, as it does not suffer as much from the vanishing gradient problem.<br />
<br />
<br />
The purpose of the LSTM in this case is to estimate the conditional probability of the output sequence, <math>\,(y_1,\cdots,y_{T'})</math>, given the input sequence, <math>\,(x_1,\cdots,x_{T})</math>, where <math>\,T</math> does not have to equal <math>\,T'</math>.<br />
<br />
<br />
Let <math>\,v</math> represent the state of the hidden layers after <math>\,(x_1,\cdots,x_{T})</math> has been fed into the LSTM, i.e. what has been stored in the neural network's memory. Then<br />
<br />
<math>\,p(y_1,\cdots,y_{T'}|x_1,\cdots,x_{T})=\prod_{t=1}^{T'} p(y_t|v,y_1,\cdots,y_{t-1})</math><br />
<br />
For each <math>\,p(y_t|v,y_1,\cdots,y_{t-1})</math>, the LSTM neural network at step <math>\,t</math>, after <math>\,(x_1,\cdots,x_T,y_1,\cdots,y_{t-1})</math> has been fed in, outputs an unnormalized score for each word in the vocabulary. The softmax function, <math>\,\frac{e^{x_b}}{\sum_{j=1}^N e^{x_j}}\,</math>, where <math>\,x_b</math> is the score of word <math>\,b</math> and <math>\,N</math> is the vocabulary size, is applied to this output vector to obtain the corresponding probabilities. From this, we can calculate any <math>\,p(y_1,\cdots,y_{T'}|x_1,\cdots,x_{T})</math> by repeatedly feeding <math>\,y_t</math> into the LSTM neural network to calculate the next set of probabilities.<br />
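The softmax and the product of per-step probabilities can be written out as a short sketch. The score vectors below are hypothetical placeholders for what the LSTM would output at each step; subtracting the maximum score before exponentiating is a standard numerical safeguard that does not change the result.<br />

```python
import math

def softmax(scores):
    """Turn a vector of scores into probabilities that sum to 1."""
    m = max(scores)   # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(step_scores, target_ids):
    """Sum of log p(y_t | v, y_1, ..., y_{t-1}) over all output steps.

    step_scores[t] holds the scores over the vocabulary at step t and
    target_ids[t] is the index of the word actually produced at step t.
    """
    return sum(
        math.log(softmax(scores)[y])
        for scores, y in zip(step_scores, target_ids)
    )

# Hypothetical scores for a 3-word vocabulary over 2 output steps.
step_scores = [[2.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
target_ids = [0, 2]
print(math.exp(sequence_log_prob(step_scores, target_ids)))
```

Summing log probabilities is equivalent to taking the product in the formula above, and it is the form that the training objective maximizes.<br />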
<br />
The objective function used<br />
<br />
=== Input and Output Data Transformation ===<br />
About 12 million English-French sentence pairs were used for training, with a vocabulary of 160,000 words for English and 80,000 words for French. Any word outside these vocabularies was replaced with a special unknown-word token, and an <EOS> token was appended to every sentence to mark its end.<br />
<br />
Additionally, the words of each input sentence were fed in reverse order, as the researchers found that this significantly increased accuracy. For example, the sentence "Today I went to lectures." would be entered as "lectures, to, went, I, Today". They suspect that this helps because it shortens the time lag between the beginning of the source sentence and the beginning of its translation.<br />
<br />
To decode a translation after training, a simple left-to-right beam search is used. The search starts with a small number of partial translations that have the highest probabilities. Each partial translation is then fed back into the LSTM independently and extended with the most probable next words, and only a small set of the most probable extended translations is kept. Whenever the <EOS> token is chosen, the completed translation is added to the final translation set. Once the search ends, the final set is ranked and the highest-ranking translation is chosen.<br />
<br />
<br />
= Training and Results =<br />
=== Training Method ===<br />
Two LSTM neural networks were used overall: one to generate a fixed vector representation from the input sequence and another to generate the output sequence from that representation. Each network had 4 layers with 1000 cells per layer, and <math>\,v</math> is represented by the 8000 real numbers held in the cells' memory after the input sequence has been entered. Training used stochastic gradient descent with a batch size of 128 and a learning rate of 0.7. Initial parameters were drawn from a uniform distribution between -0.08 and 0.08.<br />
=== Results ===</div>Rtwang