proposal for STAT946 (Deep Learning) final projects Fall 2015 (2015-12-18)
<hr />
<div>'''Project 0:''' (This is just an example)<br />
<br />
'''Group members:''' first name family name, first name family name, first name family name<br />
<br />
'''Title:''' Sentiment Analysis on Movie Reviews<br />
<br />
'''Description:''' The idea and data for this project are taken from http://www.kaggle.com/c/sentiment-analysis-on-movie-reviews.<br />
Sentiment analysis is the problem of determining whether a given string contains positive or negative sentiment. For example, “A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story” contains negative sentiment, but it is not immediately clear which parts of the sentence make it so.<br />
This competition seeks to implement machine learning algorithms that can determine the sentiment of a movie review.<br />
<br />
'''Project 1:'''<br />
<br />
'''Group members:''' Sean Aubin, Brent Komer<br />
<br />
'''Title:''' Convolutional Neural Networks in SLAM<br />
<br />
'''Description:''' We will try to replicate the results reported in [http://arxiv.org/abs/1411.1509 Convolutional Neural Networks-based Place Recognition] using [http://caffe.berkeleyvision.org/ Caffe] and [http://arxiv.org/abs/1409.4842 GoogLeNet]. As a "stretch" goal, we will try to convert the CNN to a spiking neural network (a technique created by Eric Hunsberger) for greater biological plausibility and easier integration with other cognitive systems using Nengo. This work will help Brent start his PhD investigating cognitive localisation systems and object manipulation.<br />
<br />
'''Project 2:'''<br />
<br />
'''Group members:''' Xinran Liu, Fatemeh Karimi, Deepak Rishi & Chris Choi<br />
<br />
'''Title:''' Image Classification with Deep Learning<br />
<br />
'''Description:''' Our aim is to participate in the Digit Recognizer Kaggle challenge, which involves correctly classifying handwritten digits from the Modified National Institute of Standards and Technology (MNIST) dataset. As a first approach, we propose using a simple feed-forward neural network to form a baseline for comparison. We then plan to experiment with different aspects of the network, such as its architecture and activation functions, and to incorporate a variety of training methods.<br />
<br />
'''Project 3'''<br />
<br />
'''Group members:''' Ri Wang, Maysum Panju, Mahmood Gohari<br />
<br />
'''Title:''' Machine Translation Using Neural Networks<br />
<br />
'''Description:''' The goal of this project is to translate languages using different types of neural networks and the algorithms described in "Sequence to sequence learning with neural networks." and "Neural machine translation by jointly learning to align and translate". Different vector representations for input sentences (word frequency, Word2Vec, etc) will be used and all combinations of algorithms will be ranked in terms of accuracy.<br />
Our data will mainly be from [http://www.statmt.org/europarl/ Europarl] and [https://tatoeba.org/eng Tatoeba]. The common target language will be English to allow for easier judgement of translation quality.<br />
<br />
'''Project 4'''<br />
<br />
'''Group members:''' Peter Blouw, Jan Gosmann<br />
<br />
'''Title:''' Using Structured Representations in Memory Networks to Perform Question Answering<br />
<br />
'''Description:''' Memory networks are machine learning systems that combine memory and inference to perform tasks that involve sophisticated reasoning (see [http://arxiv.org/pdf/1410.3916.pdf here] and [http://arxiv.org/pdf/1502.05698v7.pdf here]). Our goal in this project is to first implement a memory network that replicates prior performance on the bAbI question-answering tasks described in [http://arxiv.org/pdf/1502.05698v7.pdf Weston et al. (2015)]. Then, we hope to improve upon this baseline performance by using more sophisticated representations of the sentences that encode questions being posed to the network. Current implementations often use a bag-of-words encoding, which throws out important syntactic information that is relevant to determining what a particular question is asking. As such, we will explore the use of POS tags, n-gram information, and parse trees to augment memory network performance.<br />
<br />
'''Project 5'''<br />
<br />
'''Group members:''' Tim Tse<br />
<br />
'''Title:''' The Allen AI Science Challenge<br />
<br />
'''Description:''' The goal of this project is to create an artificial intelligence model that can answer multiple-choice questions from a grade 8 science exam with a success rate better than the best 8th graders. A deep neural network will serve as the underlying model, to help parse the large amount of information needed to answer these questions. The model should also learn, over time, to answer better as it acquires more data. This is a Kaggle challenge (see [https://www.kaggle.com/c/the-allen-ai-science-challenge here]), and the data used to build the model will come from the Kaggle website.<br />
<br />
'''Project 6''' <br />
<br />
'''Group members:''' Valerie Platsko<br />
<br />
'''Title:''' Classification for P300-Speller Using Convolutional Neural Networks <br />
<br />
'''Description:''' The goal of this project is to replicate (and possibly extend) the results in [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5492691 Convolutional Neural Networks for P300 Detection with Application to Brain-Computer Interfaces], which used convolutional neural networks to recognize P300 responses in recorded EEG and additionally to correctly recognize attended targets. (In the P300-Speller application, letters flash in rows and columns, so a single P300 response is associated with multiple potential targets.) The data in the paper came from http://www.bbci.de/competition/iii/ (dataset II), and there is an additional P300 Speller dataset available from [http://www.bbci.de/competition/ii/ a previous version of the competition].<br />
<br />
'''Project 7''' <br />
<br />
'''Group members:''' Amirreza Lashkari, Derek Latremouille, Rui Qiao and Luyao Ruan<br />
<br />
'''Title:''' What's Cooking?<br />
<br />
'''Description:''' Although the best way to distinguish different types of cuisine is to smell and taste, our goal is to predict the type of a cuisine from its ingredients. Since the data is text-based, different methods will first be used to transform it into representations suitable for various classification techniques. Different deep neural network algorithms will then be implemented, and we will compare their accuracy and complexity. This is a Kaggle challenge (see [https://www.kaggle.com/c/whats-cooking here]).<br />
<br />
'''Project 8'''<br />
<br />
'''Group members:''' Abdullah Rashwan and Priyank Jaini<br />
<br />
'''Title:''' Learning the Parameters for Continuous Distribution Sum-Product Networks using Bayesian Moment Matching<br />
<br />
'''Description:''' Sum-Product Networks have generated interest due to their ability to perform exact inference in linear time with respect to the size of the network. Parameter learning, however, is still a problem. We have proposed an online Bayesian Moment Matching algorithm to learn the parameters for discrete distributions; in this work, we extend the algorithm to learn the parameters for continuous distributions as well.<br />
<br />
'''Project 9'''<br />
<br />
'''Group members:''' Anthony Caterini<br />
<br />
'''Title:''' Critical Analysis of the Manifold Tangent Classifier<br />
<br />
'''Description:''' This project aims to thoroughly analyze the [http://papers.nips.cc/paper/4409-the-manifold-tangent-classifier.pdf Manifold Tangent Classifier]. The goal is to implement the classifier as described in the paper, and to attempt to formalize some of the geometric interpretation of the algorithm's formulation.</div>

generating text with recurrent neural networks (2015-12-14)
<hr />
<div>= Introduction =<br />
<br />
The goal of this paper is to introduce a new type of recurrent neural network for character-level language modelling that allows the input character at a given timestep to multiplicatively gate the connections that make up the hidden-to-hidden layer weight matrix. The paper also introduces a solution to the problem of vanishing and exploding gradients by applying a technique called Hessian-Free optimization to effectively train a recurrent network that, when unrolled in time, has approximately 500 layers. At the date of publication, this network was arguably the deepest neural network ever trained successfully. <br />
<br />
Strictly speaking, a language model is a probability distribution over sequences of words or characters, and such models are typically used to predict the next character or word in a sequence given some number of preceding characters or words. Recurrent neural networks are naturally applicable to this task, since they make predictions based on a current input and a hidden state whose value is determined by some number of previous inputs. Alternative methods that the authors compare their results to include a hierarchical Bayesian model called a 'sequence memoizer' <ref> Wood, F., C. Archambeau, J. Gasthaus, L. James, and Y.W. Teh. [http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/airg/readings/2012_02_28_a_stochastic_memoizer.pdf "A Stochastic Memoizer for Sequence Data"] ICML, (2009) </ref> and a mixture of context models referred to as PAQ <ref> Mahoney, M. [https://repository.lib.fit.edu/bitstream/handle/11141/154/cs-2005-16.pdf?sequence=1&isAllowed=y "Adaptive Weighing of Context Models for Lossless Data Compression"], Florida Institute of Technology Technical Report, (2005) </ref>, which actually includes word-level information (rather than strictly character-level information). The multiplicative RNN introduced in this paper improves on the state-of-the-art for solely character-level language modelling, but is somewhat worse than the state-of-the-art for text compression. <br />
<br />
To give a brief review, an ordinary recurrent neural network is parameterized by three weight matrices, <math>\ W_{hi} </math>, <math>\ W_{hh} </math>, and <math>\ W_{oh} </math>, and functions to map a sequence of <math> N </math> input states <math>\ [i_1, ... , i_N] </math> to a sequence of hidden states <math>\ [h_1, ... , h_N] </math> and a sequence of output states <math>\ [o_1, ... , o_N] </math>. The matrix <math>\ W_{hi} </math> parameterizes the mapping from the current input state to the current hidden state, while the matrix <math>\ W_{hh} </math> parameterizes the mapping from the previous hidden state to the current hidden state, such that the current hidden state is a function of the previous hidden state and the current input state. Finally, the matrix <math>\ W_{oh} </math> parameterizes the mapping from the current hidden state to the current output state. So, at a given timestep <math>\ t </math>, the values of the hidden state and output state are as follows:<br />
<br />
<br />
:<math>\ h_t = \tanh(W_{hi}i_t + W_{hh}h_{t-1} + b_h) </math><br />
<br />
<br />
:<math>\ o_t = W_{oh}h_t + b_o </math> <br />
<br />
<br />
where <math>\ b_o</math> and <math>\ b_h</math> are bias vectors. Typically, the output state is converted into a probability distribution over characters or words using the softmax function. The network can then be treated as a generative model of text by sampling from this distribution and providing the sampled output as the input to the network at the next timestep.<br />
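The recurrent update and the generative sampling loop described above can be sketched in NumPy. The dimensions and random weights here are purely illustrative (they are not the trained model's parameters); the point is the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 86-character vocabulary (as in the paper), small hidden layer.
n_in, n_hid = 86, 32

W_hi = rng.normal(0, 0.1, (n_hid, n_in))   # input-to-hidden
W_hh = rng.normal(0, 0.1, (n_hid, n_hid))  # hidden-to-hidden
W_oh = rng.normal(0, 0.1, (n_in, n_hid))   # hidden-to-output
b_h = np.zeros(n_hid)
b_o = np.zeros(n_in)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def step(i_t, h_prev):
    """One timestep: h_t = tanh(W_hi i_t + W_hh h_{t-1} + b_h), then a
    probability distribution over the next character via softmax."""
    h_t = np.tanh(W_hi @ i_t + W_hh @ h_prev + b_h)
    o_t = W_oh @ h_t + b_o
    return h_t, softmax(o_t)

# Generative mode: sample a character and feed it back in as the next input.
h = np.zeros(n_hid)
i_t = np.zeros(n_in); i_t[0] = 1.0          # one-hot start character
generated = []
for _ in range(20):
    h, p = step(i_t, h)
    c = rng.choice(n_in, p=p)               # sample from the output distribution
    generated.append(int(c))
    i_t = np.zeros(n_in); i_t[c] = 1.0      # sampled output becomes next input
```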
<br />
Recurrent networks are known to be very difficult to train due to the existence of a highly unstable relationship between a network's parameters and the gradient of its cost function. Intuitively, the surface of the cost function is intermittently punctuated by abrupt changes (giving rise to exploding gradients) and nearly flat plateaus (giving rise to vanishing gradients) that can effectively become poor local minima when a network is trained through gradient descent. Techniques for improving training include the use of Long Short-Term Memory networks <ref> Hochreiter, Sepp, and Jürgen Schmidhuber. [http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf "Long short-term memory."] Neural computation 9.8 (1997): 1735-1780. </ref>, in which memory units are used to selectively preserve information from previous states, and the use of Echo State networks <ref> Jaeger, H. and H. Haas. [http://www.sciencemag.org/content/304/5667/78.short "Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication."] Science 304.5667 (2004): 78-80. </ref>, which learn only the output weights on a network with recurrent connections that implement a wide range of time-varying patterns. In this paper, the method of Hessian-free optimization is used instead of these alternatives. <br />
<br />
[[File:RNN.png | frame | centre | A depiction of a recurrent neural network unrolled through three time steps.]]<br />
<br />
= Hessian-Free Optimization = <br />
<br />
While this optimization technique is described in detail in Martens (2010) <ref> Martens, J. [http://icml2010.haifa.il.ibm.com/papers/458.pdf "Deep learning via Hessian-free optimization."] ICML, (2010) </ref>, its use is essential to obtaining the successful results reported in this paper. In brief, the technique uses information about the second derivatives of the cost function to perform more intelligent parameter updates. This information is helpful because, in cases where the gradient is changing very slowly along a particular dimension, it is more efficient to take larger steps in the direction of descent along that dimension. Alternatively, if the gradient is changing very rapidly along a particular dimension, then it makes sense to take smaller steps to avoid 'bouncing' off of a steep incline in the cost function and moving to a less desirable location in parameter space. The relevant second-order information is computed using the method of finite differences to avoid computing the Hessian of the cost function explicitly. <br />
<br />
What is important about this technique is that it provides a solution to the problem of vanishing and exploding gradients during the training of recurrent neural networks. Vanishing gradients are accommodated by descending much more rapidly along the cost function in areas of relatively low curvature (e.g., where the cost function is nearly flat), while exploding gradients are accommodated by descending much more slowly in areas of relatively high curvature (e.g., where there is a steep cliff). The figure below illustrates how Hessian-free optimization improves the training of neural networks in general. <br />
<br />
[[File:HFF.png | frame | centre | On the left is training with naive gradient descent, and on the right is training via the use of 2nd order information about the cost function.]]<br />
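The benefit of curvature-scaled steps can be illustrated with a toy one-dimensional sketch. This is not the actual Hessian-free algorithm (which uses conjugate gradients and finite differences rather than explicit second derivatives); it is only a damped Newton step applied to two quadratic bowls of very different curvature, to show why dividing by curvature handles both regimes at once.

```python
# f(x) = c * x^2: a steep bowl (large curvature, exploding-gradient
# analogue) and a nearly flat bowl (tiny curvature, vanishing-gradient
# analogue). A single fixed learning rate cannot handle both well.
def grad(x, c):
    return 2 * c * x      # first derivative

def curv(x, c):
    return 2 * c          # second derivative (constant for a quadratic)

lam = 1e-4                # damping term, in the spirit of HF methods

def curvature_scaled_step(x, c):
    # Divide the gradient by (damped) curvature: large steps where the
    # surface is flat, small steps where it is steep.
    return x - grad(x, c) / (curv(x, c) + lam)

x_steep, x_flat = 1.0, 1.0
for _ in range(5):
    x_steep = curvature_scaled_step(x_steep, c=1000.0)  # steep bowl
    x_flat = curvature_scaled_step(x_flat, c=1e-3)      # flat bowl

# Both trajectories approach the minimum at 0 within a handful of steps.
```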
<br />
= Multiplicative Recurrent Neural Networks = <br />
<br />
The authors report that using a standard neural network trained via Hessian-free optimization produces only mediocre results. As such, they introduce a new architecture called a multiplicative recurrent neural network (MRNN). The motivating intuition behind this architecture is that the input at a given time step should both additively contribute to the hidden state (through the mapping performed by the input-to-hidden weights) and additionally determine the weights on the recurrent connections to the hidden state. This approach came from viewing an RNN as a model of a tree in which each node is a hidden state vector and each edge is labelled by a character that determines how the parent node gives rise to the child node. In other words, the idea is to define a unique weight matrix <math>\ W_{hh} </math> for each possible input. The reason this design is hypothesized to improve the predictive adequacy of the model is the idea that the ''conjunction'' of the input at one time step and the hidden state at the previous time step is important. Capturing this conjunction requires the input to influence the contribution of the previous hidden state to the current hidden state. Otherwise, the previous hidden state and the current input will make entirely independent contributions to the calculation of the current hidden state. Formally, this changes the calculation of the hidden state at a given time step as follows:<br />
<br />
<br />
:<math>\ h_t = \tanh(W_{hi}i_t + W^{i_t}_{hh}h_{t-1} + b_h) </math><br />
<br />
<br />
where <math>\ W^{i_t}_{hh} </math> is an input-specific hidden-to-hidden weight matrix. As a first approach to implementing this MRNN, the authors suggest using a tensor of rank 3 to store the hidden-to-hidden weights. The idea is that the tensor stores one weight matrix per possible input; when the input is provided as a one-hot vector, tensor contraction (i.e. a generalization of matrix multiplication) can be used to extract the 'slice' of the tensor that contains the appropriate set of weights. One problem with this approach is that it quickly becomes impractical to store the hidden-to-hidden weights as a tensor if the hidden state has a large number of dimensions. For instance, if a network's hidden layer encodes a vector with 1000 dimensions, then the number of parameters in the tensor that need to be learned will be equal to <math>\ 1000^2 \times N </math>, where <math>\ N </math> is the vocabulary size. In short, this method will add many millions of parameters to a model for a non-trivially sized vocabulary. <br />
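The tensor formulation and its parameter blow-up can be sketched as follows (the sizes are illustrative; the hidden dimension is kept small so the example runs quickly):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, n_hid = 86, 8   # 86-character vocabulary, tiny hidden layer

# One hidden-to-hidden weight matrix per possible input character,
# stored as a rank-3 tensor of shape (vocab, n_hid, n_hid).
W_hh_tensor = rng.normal(0, 0.1, (vocab, n_hid, n_hid))

# A one-hot input vector selects its slice via tensor contraction.
i_t = np.zeros(vocab); i_t[5] = 1.0
W_hh_it = np.tensordot(i_t, W_hh_tensor, axes=([0], [0]))
assert np.allclose(W_hh_it, W_hh_tensor[5])   # same as indexing slice 5

# The parameter count grows as n_hid^2 * vocab, which is prohibitive
# for realistic hidden sizes: 1000**2 * 86 = 86 million parameters
# for a 1000-unit hidden layer over this vocabulary.
n_params = W_hh_tensor.size   # vocab * n_hid * n_hid
```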
<br />
To fix this problem, the tensor is factored using a technique described in Taylor & Hinton (2009) <ref>Taylor, G. and G. Hinton. [http://www.cs.toronto.edu/~fritz/absps/fcrbm_icml.pdf "Factored Conditional Restricted Boltzmann Machines for Modeling Motion Style"] ICML (2009) </ref>. The idea is to define three matrices <math>\ W_{fh} </math>, <math>\ W_{fi} </math>, and <math>\ W_{hf} </math> that approximate the use of a tensor in determining the value of <math>\ W^{i_t}_{hh} </math> as follows:<br />
<br />
<br />
:<math>\ W^{i_t}_{hh} = W_{hf} \cdot diag(W_{fi}i_t) \cdot W_{fh} </math><br />
<br />
<br />
Intuitively, this factorization produces two vectors from the current input state and the previous hidden state, takes their element-wise product, and applies a linear transformation to produce the input to the hidden layer at the current timestep. The triangle units in the figure below indicate where the element-wise product occurs, and the connections into and out of these units are parameterized by the matrices <math>\ W_{fh} </math>, <math>\ W_{fi} </math>, and <math>\ W_{hf} </math>. The element-wise multiplication is implemented by diagonalizing the matrix-vector product <math>\ W_{fi}i_t </math>, and if the dimensionality of this matrix-vector product (i.e. the dimensionality of the layer of multiplicative units) is allowed to be arbitrarily large, then this factorization is just as expressive as using a tensor to store the hidden-to-hidden weights. <br />
<br />
[[File:MRNN.png | frame | centre | A depiction of a multiplicative recurrent neural network unrolled through three time steps.]]<br />
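The factored update can be checked numerically against the explicit tensor-slice formulation: the element-wise product of the two factor vectors reproduces <math>\ W_{hf} \cdot diag(W_{fi}i_t) \cdot W_{fh} \cdot h_{t-1} </math>. The weights and sizes below are illustrative, not the paper's trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, n_hid, n_fac = 86, 32, 32   # illustrative sizes

W_hi = rng.normal(0, 0.1, (n_hid, vocab))   # input-to-hidden
W_fi = rng.normal(0, 0.1, (n_fac, vocab))   # input -> factor units
W_fh = rng.normal(0, 0.1, (n_fac, n_hid))   # hidden -> factor units
W_hf = rng.normal(0, 0.1, (n_hid, n_fac))   # factor units -> hidden
b_h = np.zeros(n_hid)

def mrnn_step(i_t, h_prev):
    # Element-wise product of the two factor vectors: the input
    # multiplicatively gates the hidden-to-hidden contribution.
    f_t = (W_fi @ i_t) * (W_fh @ h_prev)
    return np.tanh(W_hi @ i_t + W_hf @ f_t + b_h)

# Agreement with building the input-specific matrix W_hh^{i_t} explicitly:
i_t = np.zeros(vocab); i_t[3] = 1.0
h_prev = rng.normal(size=n_hid)
W_hh_it = W_hf @ np.diag(W_fi @ i_t) @ W_fh
explicit = np.tanh(W_hi @ i_t + W_hh_it @ h_prev + b_h)
assert np.allclose(mrnn_step(i_t, h_prev), explicit)
```

Note that the factored form needs only <math>\ O(n_{hid} \cdot n_{fac}) </math> recurrent parameters per matrix rather than <math>\ n_{hid}^2 </math> per vocabulary item.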
<br />
In the experiments described below, an MRNN is trained via Hessian-free optimization on sequences of 250 characters. The first 50 characters are used to condition the hidden state, so only 200 predictions are generated per sequence. 1500 hidden units were used, along with 1500 factors (i.e. multiplicative gates, or the triangles in the figure above), yielding an unrolled network of 500 layers if the multiplicative units are treated as forming a layer. Training was performed with a parallelized system consisting of 8 GPUs. A vocabulary of 86 characters was used in all cases.<br />
<br />
= Quantitative Experiments =<br />
<br />
To compare the performance of the MRNN to that of the sequence memoizer and PAQ, three 100 MB datasets were used: a selection of Wikipedia articles, a selection of New York Times articles, and a corpus of all available articles published in NIPS and JMLR. The last 10 million characters of each dataset were held out for testing. Additionally, the MRNN was trained on the larger corpora from which the Wikipedia text and NYT articles were drawn (i.e. all of Wikipedia, and the entire set of NYT articles). <br />
<br />
The models were evaluated by calculating the number of bits per character achieved by each model on the 3 test sets. This metric is essentially a measure of model perplexity, which measures how well a given model predicts the data it is being tested on. If the number of bits per character is high, the model is, on average, highly uncertain about the value of each character in the test set; if it is low, the model is less uncertain. One way to think about this quantity is as the average amount of additional information (in bits) needed by the model to exactly identify the value of each character in the test set. So, a lower measure is better, indicating that the model achieves a good representation of the underlying data. (It is sometimes helpful to think of a language model as a compressed representation of a text corpus.) <br />
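The bits-per-character metric is straightforward to compute from the probabilities a model assigns to the characters that actually occurred in the test set (a minimal sketch):

```python
import numpy as np

def bits_per_character(probs):
    """Average negative log2 probability the model assigned to the
    observed test characters; lower is better."""
    return -np.mean(np.log2(probs))

# A model that assigns each observed character probability 1/2 needs
# exactly 1 extra bit per character; probability 1/4 costs 2 bits.
bpc_half = bits_per_character(np.full(100, 0.5))      # 1.0
bpc_quarter = bits_per_character(np.full(100, 0.25))  # 2.0
```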
<br />
As illustrated in the table below, the MRNN achieves a lower number of bits per character than the hierarchical Bayesian model, but a higher number than the PAQ model (which, recall, is not a strictly character-level model). The numbers in brackets indicate the bits per character achieved on the training data, and the column labelled 'Full Set' reports the results of training the MRNN on the full Wikipedia and NYT corpora. <br />
<br />
[[File:bits.png | frame | centre | Bits per character achieved by each model on each dataset.]]<br />
<br />
These results indicate that the MRNN beat the existing state-of-the-art for pure character-level language modelling at the time of publication. <br />
<br />
= Qualitative Experiments =<br />
<br />
By examining the output of the MRNN, it is possible to see what kinds of linguistic patterns it is able to learn. Most striking is the fact that the model consistently produces correct words from a fairly sophisticated vocabulary. The model is also able to balance parentheses and quotation marks over many time steps, and it occasionally produces plausible non-words such as 'cryptoliation' and 'homosomalist'. The text in the figure below was produced by running the model in generative mode fewer than 10 times using the phrase 'The meaning of life is' as an initial input, and then selecting the most interesting output sequence. The model was trained on Wikipedia to produce these results. The character '?' indicates an unknown item, and some of the spacing and punctuation oddities are due to preprocessing and are apparently common in the dataset. <br />
<br />
[[File:text.png | frame | centre | A selection of text generated by an MRNN initialized with the sequence "The meaning of life is...".]]<br />
<br />
Another interesting qualitative demonstration of the model's abilities involves initializing the model with a more complicated sequence and seeing what sort of continuations it produces. In the figure below, a number of sampled continuations of the phrase 'England, Spain, France, Germany' are shown. Generally, the model is able to provide continuations that preserve the list-like structure of the phrase. Moreover, the model is also able to recognize that the list is a list of locations, and typically offers additional locations as its predicted continuation of the sequence. <br />
<br />
[[File:locations.png | frame | centre | Selections of text generated by an MRNN initialized with the sequence "England, Spain, France, Germany".]]<br />
<br />
What is particularly impressive about these results is the fact that the model is learning a distribution over sequences of characters only. From this distribution, a broad range of syntactic and lexical knowledge emerges. It is also worth noting that it is much more efficient to train a model with a small character-level vocabulary than it is to train a model with a word-level vocabulary (which can have tens of thousands of items). As such, the character-level MRNN is able to scale to large datasets quite well.<br />
<br />
Moreover, the authors find that the MRNN remains sensitive to notation such as an opening bracket even when the surrounding string does not occur in the training set. They argue that any method based on precise context matches is fundamentally incapable of exploiting long contexts, because the probability that a long context occurs more than once is very small.<br />
<br />
= Discussion =<br />
<br />
One aspect of this work worth considering is the degree to which input-dependent gating of the information passed between hidden states actually improves results over a standard recurrent neural network. Presumably, Hessian-free optimization would allow a standard network to be trained successfully as well, so it would be helpful to compare its results directly against those obtained with an MRNN; otherwise, it is hard to discern the relative importance of the optimization technique and the network architecture in achieving the good language modelling results reported in this paper. That said, MRNNs already learn surprisingly good language models using only 1500 hidden units, and unlike other approaches such as the sequence memoizer and PAQ, they are easy to extend along various dimensions.<br />
<br />
The MRNN assigns probability to plausible words that do not exist in the training set. This is a useful property, as it enables the MRNN to deal with real words that it did not see during training. One advantage of this model is that it avoids using a huge softmax over all known words by predicting the next word through a sequence of character predictions, whereas some word-level language models instead make up binary spellings of words so that they can predict them one bit at a time.<br />
<br />
Although this paper is strictly concerned with producing very strong character-level models, I think it would be interesting to combine the outputs of a character-level and a word-level RNN into a larger text-generation model. The text generated by this model handles grammar and unseen words well, but it seems to struggle with carrying a "train of thought": the sentences it generates are not very sensible in most cases, and a higher-level word model might be able to control some of that. I am not sure how best to combine the word-level with the character-level model, but a convolutional approach could plausibly work.<br />
<br />
= Bibliography = <br />
<references /></div>
<hr />
<div>= Introduction =<br />
<br />
The goal of this paper is to introduce a new type of recurrent neural network for character-level language modelling that allows the input character at a given timestep to multiplicatively gate the connections that make up the hidden-to-hidden layer weight matrix. The paper also introduces a solution to the problem of vanishing and exploding gradients by applying a technique called Hessian-Free optimization to effectively train a recurrent network that, when unrolled in time, has approximately 500 layers. At the date of publication, this network was arguably the deepest neural network ever trained successfully. <br />
<br />
Strictly speaking, a language model is a probability distribution over sequences of words or characters, and such models are typically used to predict the next character or word in a sequence given some number of preceding characters or words. Recurrent neural networks are naturally applicable to this task, since they make predictions based on a current input and a hidden state whose value is determined by some number of previous inputs. Alternative methods that the authors compare their results to include a hierarchical Bayesian model called a 'sequence memoizer' <ref> Wood, F., C. Archambeau, J. Gasthaus, L. James, and Y.W. The. [http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/airg/readings/2012_02_28_a_stochastic_memoizer.pdf "A Stochastic Memoizer for Sequence Data"] ICML, (2009) </ref> and a mixture of context models referred to as PAQ <ref> Mahoney, M. [https://repository.lib.fit.edu/bitstream/handle/11141/154/cs-2005-16.pdf?sequence=1&isAllowed=y "Adaptive Weighing of Context Models for Lossless Data Compression"], Florida Institute of Technology Technical Report, (2005) </ref>, which actually includes word-level information (rather strictly character-level information). The multiplicative RNN introduced in this paper improves on the state-of-the-art for solely character-level language modelling, but is somewhat worse than the state-of-the-art for text compression. <br />
<br />
To give a brief review, an ordinary recurrent neural network is parameterized by three weight matrices, <math>\ W_{hi} </math>, <math>\ W_{hh} </math>, and <math>\ W_{oh} </math>, and functions to map a sequence of <math> N </math> input states <math>\ [i_1, ... , i_N] </math> to a sequence of hidden states <math>\ [h_1, ... , h_N] </math> and a sequence of output states <math>\ [o_1, ... , o_N] </math>. The matrix <math>\ W_{hi} </math> parameterizes the mapping from the current input state to the current hidden state, while the matrix <math>\ W_{hh} </math> parameterizes the mapping from the previous hidden state to current hidden state, such that the current hidden state is function of the previous hidden state and the current input state. Finally, the matrix <math>\ W_{oh} </math> parameterizes the mapping from the current hidden state to the current output state. So, at a given timestep <math>\ t </math>, the values of the hidden state and output state are as follows:<br />
<br />
<br />
:<math>\ h_t = \tanh(W_{hi}i_t + W_{hh}h_{t-1} + b_h) </math><br />
<br />
<br />
:<math>\ o_t = W_{oh}h_t + b_o </math> <br />
<br />
<br />
where <math>\ b_o</math> and <math>\ b_h</math> are bias vectors. Typically, the output state is converted into a probability distribution over characters or words using the softmax function. The network can then be treated as a generative model of text by sampling from this distribution and providing the sampled output as the input to the network at the next timestep.<br />
<br />
Recurrent networks are known to be very difficult to train due to the existence a highly unstable relationship between a network's parameters and the gradient of its cost function. Intuitively, the surface of the cost function is intermittently punctuated by abrupt changes (giving rise to exploding gradients) and nearly flat plateaus (giving rise to vanishing gradients) that can effectively become poor local minima when a network is trained through gradient descent. Techniques for improving training include the use of Long Short-Term Memory networks <ref> Hochreiter, Sepp, and Jürgen Schmidhuber. [http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf "Long short-term memory."] Neural computation 9.8 (1997): 1735-1780. </ref>, in which memory units are used to selectively preserve information from previous states, and the use of Echo State networks, <ref> Jaeger, H. and H. Haas. [http://www.sciencemag.org/content/304/5667/78.short "Harnassing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication."] Science, 204.5667 (2004): 78-80. </ref> which learn only the output weights on a network with recurrent connections that implement a wide range of time-varying patterns. In this paper, the method of Hessian free optimization is used instead of these alternatives. <br />
<br />
[[File:RNN.png | frame | centre | A depiction of a recurrent neural network unrolled through three time steps.]]<br />
<br />
= Hessian-Free Optimization = <br />
<br />
While this optimization technique is described elsewhere in Martens (2010) <ref> Martens, J. [http://icml2010.haifa.il.ibm.com/papers/458.pdf "Deep learning via Hessian-free optimization."] ICML, (2010) </ref>, its use is essential to obtaining the successful results reported in this paper. In brief, the technique involves using information about the 2nd derivatives of the cost function to perform more intelligent parameter updates. This information is helpful because in cases where the gradient is changing very slowly on a particular dimension, it is more efficient to take larger steps in the direction of descent along this dimension. Alternatively, if the gradient is changing very rapidly on a particular dimension, then it makes sense to take smaller steps to avoid 'bouncing' off of a steep incline in the cost function and moving to a less desirable location in parameter space. The relevant 2nd order information is computed using the method of finite differences to avoid explicitly computing the Hessian of the cost function. <br />
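The finite-difference trick can be illustrated on a toy quadratic cost; the matrix and vectors below are hypothetical, chosen only to make the curvature easy to verify:<br />

```python
import numpy as np

# Toy quadratic cost f(theta) = 0.5 * theta' A theta, whose gradient is A theta.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def grad(theta):
    return A @ theta

def hessian_vector_product(grad_fn, theta, v, eps=1e-6):
    # Curvature information along direction v without ever forming the Hessian:
    #   H v  ~=  (grad(theta + eps*v) - grad(theta)) / eps
    return (grad_fn(theta + eps * v) - grad_fn(theta)) / eps

theta = np.array([1.0, -1.0])
v = np.array([0.5, 2.0])
hv = hessian_vector_product(grad, theta, v)   # approximately A @ v
```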
<br />
What is important about this technique is that it provides a solution to the problem of vanishing and exploding gradients during the training of recurrent neural networks. Vanishing gradients are accommodated by descending much more rapidly along the cost function in areas where it has relatively low curvature (e.g., when the cost function is nearly flat), while exploding gradients are accommodated by descending much more slowly in areas where it has relatively high curvature (e.g., when there is a steep cliff). The figure below illustrates how Hessian-free optimization improves the training of neural networks in general. <br />
<br />
[[File:HFF.png | frame | centre | On the left is training with naive gradient descent, and on the right is training via the use of 2nd order information about the cost function.]]<br />
<br />
= Multiplicative Recurrent Neural Networks = <br />
<br />
The authors report that using a standard recurrent neural network trained via Hessian-free optimization produces only mediocre results. As such, they introduce a new architecture called a multiplicative recurrent neural network (MRNN). The motivating intuition behind this architecture is that the input at a given time step should both additively contribute to the hidden state (through the mapping performed by the input-to-hidden weights) and additionally determine the weights on the recurrent connections to the hidden state. This approach came from viewing an RNN as a model of a tree in which each node is a hidden state vector and each edge is labelled by a character that determines how the parent node gives rise to the child node. In other words, the idea is to define a unique weight matrix <math>\ W_{hh} </math> for each possible input. The reason this design is hypothesized to improve the predictive adequacy of the model is that the ''conjunction'' of the input at one time step and the hidden state at the previous time step is important. Capturing this conjunction requires the input to influence the contribution of the previous hidden state to the current hidden state. Otherwise, the previous hidden state and the current input will make entirely independent contributions to the calculation of the current hidden state. Formally, this changes the calculation of the hidden state at a given time step as follows:<br />
<br />
<br />
:<math>\ h_t = \tanh(W_{hi}i_t + W^{i_t}_{hh}h_{t-1} + b_h) </math><br />
<br />
<br />
where <math>\ W^{i_t}_{hh} </math> is an input-specific hidden-to-hidden weight matrix. As a first approach to implementing this MRNN, the authors suggest using a tensor of rank 3 to store the hidden-to-hidden weights. The idea is that the tensor stores one weight matrix per possible input; when the input is provided as a one-hot vector, tensor contraction (i.e. a generalization of matrix multiplication) can be used to extract the 'slice' of the tensor that contains the appropriate set of weights. One problem with this approach is that it quickly becomes impractical to store the hidden-to-hidden weights as a tensor if the hidden state has a large number of dimensions. For instance, if a network's hidden layer encodes a vector with 1000 dimensions, then the number of parameters in the tensor that need to be learned will be equal to <math>\ 1000^2 \times N </math>, where <math>\ N </math> is the vocabulary size. In short, this method will add many millions of parameters to a model for a non-trivially sized vocabulary. <br />
<br />
To fix this problem, the tensor is factored using a technique described in Taylor & Hinton (2009) <ref>Taylor, G. and G. Hinton. [http://www.cs.toronto.edu/~fritz/absps/fcrbm_icml.pdf "Factored Conditional Restricted Boltzmann Machines for Modeling Motion Style"] ICML (2009) </ref>. The idea is to define three matrices <math>\ W_{fh} </math>, <math>\ W_{fi} </math>, and <math>\ W_{hf} </math> that approximate the use of a tensor in determining the value of <math>\ W^{i_t}_{hh} </math> as follows:<br />
<br />
<br />
:<math>\ W^{i_t}_{hh} = W_{hf} \cdot diag(W_{fi}i_t) \cdot W_{fh} </math><br />
<br />
<br />
Intuitively, this factorization produces two vectors from the current input state and the previous hidden state, takes their element-wise product, and applies a linear transformation to produce the input to the hidden layer at the current timestep. The triangle units in the figure below indicate where the element-wise product occurs, and the connections into and out of these units are parameterized by the matrices <math>\ W_{fh} </math>, <math>\ W_{fi} </math>, and <math>\ W_{hf} </math>. The element-wise multiplication is implemented by diagonalizing the matrix-vector product <math>\ W_{fi}i_t </math>, and if the dimensionality of this matrix-vector product (i.e. the dimensionality of the layer of multiplicative units) is allowed to be arbitrarily large, then this factorization is just as expressive as using a tensor to store the hidden-to-hidden weights. <br />
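The factored update can be sketched as follows; the element-wise product of the two factor vectors applies the factored weight matrix to the previous hidden state without ever materializing the input-specific matrix. All sizes and the random initialization below are hypothetical:<br />

```python
import numpy as np

# Hypothetical sizes: 86-character vocab, 100 hidden units, 100 factors.
rng = np.random.default_rng(1)
V, H, F = 86, 100, 100
W_hi = rng.normal(0.0, 0.01, (H, V))
W_fi = rng.normal(0.0, 0.01, (F, V))   # input -> factor gates
W_fh = rng.normal(0.0, 0.01, (F, H))   # previous hidden state -> factors
W_hf = rng.normal(0.0, 0.01, (H, F))   # gated factors -> hidden state
b_h = np.zeros(H)

def mrnn_step(i_t, h_prev):
    # f_t = (W_fi i_t) * (W_fh h_prev) is the element-wise product computed at
    # the multiplicative (triangle) units; it equals diag(W_fi i_t) W_fh h_prev.
    f_t = (W_fi @ i_t) * (W_fh @ h_prev)
    return np.tanh(W_hi @ i_t + W_hf @ f_t + b_h)
```

As a sanity check, this gives the same result as explicitly building the input-specific matrix <code>W_hf @ diag(W_fi i_t) @ W_fh</code> and multiplying it with the previous hidden state.<br />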
<br />
[[File:MRNN.png | frame | centre | A depiction of a multiplicative recurrent neural network unrolled through three time steps.]]<br />
<br />
In the experiments described below, an MRNN is trained via Hessian-free optimization on sequences of 250 characters. The first 50 characters were used to condition the hidden state, so only 200 predictions are generated per sequence. 1500 hidden units were used, along with 1500 factors (i.e. multiplicative gates, or the triangles in the figure above), yielding an unrolled network of 500 layers if the multiplicative units are treated as forming a layer. Training was performed with a parallelized system consisting of 8 GPUs. A vocabulary of 86 characters was used in all cases.<br />
<br />
= Quantitative Experiments =<br />
<br />
To compare the performance of the MRNN to that of the sequence memoizer and PAQ, three 100 MB datasets were used: a selection of Wikipedia articles, a selection of New York Times articles, and a corpus of all available articles published in NIPS and JMLR. The last 10 million characters in each dataset were held out for testing. Additionally, the MRNN was trained on the larger corpora from which the Wikipedia text and NYT articles were drawn (i.e. all of Wikipedia, and the entire set of NYT articles). <br />
<br />
The models were evaluated by calculating the number of bits per character achieved by each model on the 3 test sets. This metric is essentially a measure of model perplexity, which defines how well a given model predicts the data it is being tested on. If the number of bits per character is high, this means that the model is, on average, highly uncertain about the value of each character in the test set. If the number of bits per character is low, then the model is less uncertain about the value of each character in the test set. One way to think about this quantity is as the average amount of additional information (in bits) needed by the model to exactly identify the value of each character in the test set. So, a lower measure is better, indicating that the model achieves a good representation of the underlying data. (It is sometimes helpful to think of a language model as a compressed representation of a text corpus.) <br />
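As a small sketch of the metric (the probabilities below are hypothetical): the bits-per-character score is just the average negative base-2 log probability that the model assigned to each observed character.<br />

```python
import numpy as np

def bits_per_character(probs):
    """Average -log2 of the probabilities the model assigned to the observed characters."""
    return float(-np.log2(np.asarray(probs)).mean())

# A model that assigns probability 0.5 to every observed character needs
# exactly one extra bit per character to identify it.
bpc = bits_per_character([0.5, 0.5, 0.5, 0.5])   # 1.0
```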
<br />
As illustrated in the table below, the MRNN achieves a lower number of bits per character than the hierarchical Bayesian model, but a higher number of bits per character than the PAQ model (which, recall, is not a strictly character-level model). The numbers in brackets indicate the bits per character achieved on the training data, and the column labelled 'Full Set' reports the results of training the MRNN on the full Wikipedia and NYT corpora. <br />
<br />
[[File:bits.png | frame | centre | Bits per character achieved by each model on each dataset.]]<br />
<br />
These results indicate that the MRNN beat the existing state-of-the-art for pure character-level language modelling at the time of publication. <br />
<br />
= Qualitative Experiments =<br />
<br />
By examining the output of the MRNN, it is possible to see what kinds of linguistic patterns it is able to learn. Most striking is the fact that the model consistently produces correct words from a fairly sophisticated vocabulary. The model is also able to balance parentheses and quotation marks over many time steps, and it occasionally produces plausible non-words such as 'cryptoliation' and 'homosomalist'. The text in the figure below was produced by running the model in generative mode less than 10 times using the phrase 'The meaning of life is' as an initial input, and then selecting the most interesting output sequence. The model was trained on Wikipedia to produce the results in the figure below. The character '?' indicates an unknown item, and some of the spacing and punctuation oddities are due to preprocessing and are apparently common in the dataset. <br />
<br />
[[File:text.png | frame | centre | A selection of text generated by an MRNN initialized with the sequence "The meaning of life is...".]]<br />
<br />
Another interesting qualitative demonstration of the model's abilities involves initializing the model with a more complicated sequence and seeing what sort of continuations it produces. In the figure below, a number of sampled continuations of the phrase 'England, Spain, France, Germany' are shown. Generally, the model is able to provide continuations that preserve the list-like structure of the phrase. Moreover, the model is also able to recognize that the list is a list of locations, and typically offers additional locations as its predicted continuation of the sequence. <br />
<br />
[[File:locations.png | frame | centre | Selections of text generated by an MRNN initialized with the sequence "England, Spain, France, Germany".]]<br />
<br />
What is particularly impressive about these results is the fact that the model is learning a distribution over sequences of characters only. From this distribution, a broad range of syntactic and lexical knowledge emerges. It is also worth noting that it is much more efficient to train a model with a small character-level vocabulary than it is to train a model with a word-level vocabulary (which can have tens of thousands of items). As such, the character-level MRNN is able to scale to large datasets quite well.<br />
<br />
Moreover, the authors find that the MRNN is sensitive to notation such as an opening bracket even when the exact surrounding string does not occur in the training set. They claim that any method based on precise context matches is fundamentally incapable of utilizing long contexts, because the probability that a long context occurs more than once is very small.<br />
<br />
= Discussion =<br />
<br />
One aspect of this work worth considering is the degree to which the input-dependent gating of the information passed from hidden state to hidden state actually improves results over and above a standard recurrent neural network. Presumably, Hessian-free optimization would also allow a standard RNN to be trained successfully, so it would be helpful to see a direct comparison between the two architectures trained with the same method. Otherwise, it is hard to discern the relative importance of the optimization technique and the network architecture in achieving the good language modelling results reported in this paper. That said, MRNNs already learn surprisingly good language models using only 1500 hidden units, and unlike other approaches such as the sequence memoizer and PAQ, they are easy to extend along various dimensions.<br />
<br />
The MRNN also assigns probability to plausible words that do not exist in the training set. This is a useful property that enables the MRNN to deal with real words it did not see during training. A further advantage of the character-level approach is that it avoids using a huge softmax over all known words by predicting the next word through a sequence of character predictions, whereas some word-level language models instead make up binary spellings of words so that they can be predicted one bit at a time.<br />
<br />
= Bibliography = <br />
<references /></div>
<hr />
<div>= Neural Turing Machines =<br />
<br />
Even though recurrent neural networks (RNNs) are [https://en.wikipedia.org/wiki/Turing_completeness Turing complete] in theory, the control of logical flow and the usage of external memory have been largely ignored in the machine learning literature. This might be due to the fact that an RNN has to be wired properly to achieve Turing completeness, which is not necessarily easy to do in practice. By adding an addressable memory, Graves et al. try to overcome this limitation and name their approach the Neural Turing Machine (NTM), in analogy to [https://en.wikipedia.org/wiki/Turing_machine Turing machines], which are finite-state machines extended with an infinite memory<ref name="main">Graves, A., Wayne, G., & Danihelka, I. (2014). [http://arxiv.org/abs/1410.5401 Neural Turing Machines.] arXiv preprint arXiv:1410.5401.</ref>. Furthermore, every component of an NTM is differentiable, implying that each component can be learned.<br />
<br />
== Theoretical Background == <br />
<br />
The authors state that the design of the NTM is inspired by past research spanning the disciplines of neuroscience, psychology, cognitive science and linguistics, and that the NTM can be thought of as a working memory system of the sort described in various accounts of cognitive architecture. However, the authors propose to ignore the known capacity limitations of working memory, and to introduce sophisticated gating and memory addressing operations that are typically absent in models of the sort developed throughout the computational neuroscience literature. <br />
<br />
With respect to historical precedents in the cognitive science and linguistics literature, the authors relate their work to a longstanding debate concerning the effectiveness of neural networks for cognitive modeling. They present their work as advancing a line of research on encoding recursively-structured representations in neural networks that stemmed out of criticisms presented by Fodor and Pylyshyn in 1988 <ref name=fodor>Fodor, J. A., & Pylyshyn, Z. W. (1988). [http://www.sciantaanalytics.com/sites/default/files/fodor-pylyshyn.pdf Connectionism and cognitive architecture: A critical analysis.] Cognition, 28(1), 3-71.</ref> (though it is worth pointing out that the authors give an incorrect summary of these criticisms - they state that Fodor and Pylyshyn argued that neural networks could not implement variable binding or perform tasks involving variable-length structures, when in fact they argued that successful models of cognition require representations with constituent structure and processing mechanisms that are strictly structure sensitive - see <ref name=fodor></ref> for details). The NTM is able to deal with variable-length inputs and arguably performs variable binding in the sense that the memory slots in the NTM can be treated as variables to which data is bound, but the authors do not revisit these issues in any detail after presenting the results of their simulations with the NTM.<br />
<br />
= Architecture =<br />
<br />
A Neural Turing Machine consists of a memory and a controller neural network. The controller receives input and produces output with help of the memory that is addressed with a content- and location based addressing mechanism. Figure 1 presents a high-level diagram of the NTM architecture.<br />
<br />
<center><br />
[[File:Pre_11.PNG | frame | center |Figure 1: Neural Turing Machine Architecture. During each update cycle, the controller network receives inputs from an external environment and emits outputs in response. It also reads to and writes from a memory matrix via a set of parallel read and write heads. The dashed line indicates the division between the NTM circuit and the outside world. ]]<br />
</center><br />
<br />
<br />
== Memory ==<br />
<br />
The memory at time <math>t</math> is given by an <math>N \times M</math> matrix <math>M_t</math>, where <math>N</math> is the number of memory locations and <math>M</math> is the vector size at each memory location. To address memory locations for reading or writing, an <math>N</math>-element vector <math>w_t</math> is used. The elements in this vector need to satisfy <math>0 \leq w_t(i) \leq 1</math> and have to sum to 1. Thus, it gives a weighting over memory locations, and the access to a location may be blurry.<br />
<br />
=== Reading ===<br />
<br />
Given an address <math>w_t</math> the read vector is just the weighted sum of memory locations:<br />
<br />
<math>r_t \leftarrow \sum_i w_t(i) M_t(i)</math><br />
<br />
which is clearly differentiable with respect to both the memory and the weighting.<br />
<br />
=== Writing ===<br />
<br />
The write process is split up into an erase and an add operation (inspired by the input and forget gates in LSTM). This allows the NTM to either overwrite or add to a memory location in a single time step. Otherwise, it would be necessary to perform a read for one of the operations before the updated result could be written.<br />
<br />
The erase update is given by<br />
<br />
<math>\tilde{M}_t(i) \leftarrow M_{t-1}(i) [1 - w_t(i) e_t]</math><br />
<br />
with an <math>M</math>-element ''erase vector'' <math>e_t</math> with elements in the range <math>(0, 1)</math> selecting which vector elements to reset at the memory locations selected by <math>w_t</math>.<br />
<br />
Afterwards, an ''add vector'' <math>a_t</math> is added according to<br />
<br />
<math>M_t(i) \leftarrow \tilde{M}_t(i) + w_t(i) a_t.</math><br />
<br />
The order in which the adds are performed by multiple heads is irrelevant. The combined erase and add operations of all the write heads produce the final content of the memory at time ''t''.<br />
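A minimal NumPy sketch of the read and erase/add write operations for a single head; the memory size, weighting, and vectors below are hypothetical:<br />

```python
import numpy as np

N, M = 8, 4                                # hypothetical: 8 locations, 4 elements each
Mem = np.zeros((N, M))
w = np.zeros(N); w[2] = 1.0                # a sharply focused address weighting
e = np.full(M, 0.5)                        # erase vector, elements in (0, 1)
a = np.array([1.0, 2.0, 3.0, 4.0])         # add vector

def read(Mem, w):
    return w @ Mem                         # r_t = sum_i w_t(i) M_t(i)

def write(Mem, w, e, a):
    Mem = Mem * (1.0 - np.outer(w, e))     # erase: M~(i) = M(i)[1 - w(i) e]
    return Mem + np.outer(w, a)            # add:   M(i) = M~(i) + w(i) a

Mem = write(Mem, w, e, a)
r = read(Mem, w)                           # with this sharp weighting, r equals a
```

With a blurry weighting (mass spread over several locations), the read becomes a weighted average of rows and the write partially touches every weighted row, which is what makes both operations differentiable.<br />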
<br />
== Addressing Mechanisms ==<br />
<br />
Two methods, content-based addressing and location-based addressing, are employed to generate the read/write weightings <math>w_t</math>. Depending on the task either mechanism can be more appropriate. These two methods of addressing are summarized in the figure below.<br />
<br />
[[File:flow_diagram_addressing_mechanism.JPG | center]]<br />
<br />
=== Content-based addressing ===<br />
<br />
For content-based addressing, each head (whether employed for reading or writing) first produces a length-<math>M</math> key vector <math>k_t</math> that is compared to each vector <math>M_t(i)</math> by a similarity measure <math>K[\cdot,\cdot]</math>. The content-based system produces a normalised weighting <math>w_t^c</math> based on the similarity and a positive key strength, <math>\beta_t</math>, which can amplify or attenuate the precision of the focus:<br />
<br />
<br />
<math><br />
w_t^c(i) \leftarrow \frac{\exp(\beta_t K[k_t,M_t(i)])}{\sum_{j} \exp(\beta_t K[k_t,M_t(j)])}<br />
</math><br />
<br />
In this current implementation, the similarity measure is cosine similarity:<br />
<br />
<math><br />
K[u,v] = \frac{u \cdot v}{\|u\| \, \|v\|}<br />
</math><br />
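Putting the key strength and cosine similarity together, content-based addressing can be sketched as follows; the memory contents, key, and strength below are hypothetical:<br />

```python
import numpy as np

def content_address(Mem, k, beta):
    # Cosine similarity of the key with every memory row, sharpened by the
    # key strength beta and normalised with a softmax.
    sims = Mem @ k / (np.linalg.norm(Mem, axis=1) * np.linalg.norm(k) + 1e-8)
    z = np.exp(beta * sims)
    return z / z.sum()

Mem = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.7, 0.7]])
w_c = content_address(Mem, np.array([1.0, 0.0]), beta=5.0)
# A larger beta concentrates the weighting more sharply on the best match.
```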
<br />
=== Location-based addressing ===<br />
<br />
The location-based addressing mechanism is designed to facilitate both simple iterations across the locations of the memory and random-access jumps. It does so by implementing a rotational shift of a weighting. Prior to rotation, each head emits a scalar interpolation gate <math>g_t</math> in the range (0, 1). The value of <math>g</math> is used to blend between the weighting <math>w_{t-1}</math> produced by the head at the previous time-step and the weighting <math>w_t^c</math> produced by the content system at the current time-step, yielding the gated weighting <math>w_t^g</math> :<br />
<br />
<math><br />
w_t^g \leftarrow g_t w_t^c + (1-g_t) w_{t-1}<br />
</math><br />
<br />
After interpolation, each head emits a shift weighting <math>s_t</math> that defines a normalised distribution over the allowed integer shifts. Each element in this vector gives the degree by which different integer shifts are performed. For example, if shifts of -1, 0, 1 are allowed a (0, 0.3, 0.7) shift vector would denote a shift of 1 with strength 0.7 and a shift of 0 (no-shift) with strength 0.3. The actual shift is performed with a circular convolution<br />
<br />
<math>\tilde{w}_t(i) \leftarrow \sum_{j=0}^{N-1} w_t^g(j) s_t(i - j)</math><br />
<br />
where all index arithmetic is modulo N. This circular convolution can lead to blurring of the weights, so <math>\tilde{w}_t</math> is sharpened with<br />
<br />
<math>w_t(i) \leftarrow \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j<br />
\tilde{w}_t(j)^{\gamma_t}}</math><br />
<br />
where <math>_t 1</math> is an additional scalar outputted by the write head.<br />
<br />
== Controller ==<br />
<br />
The controller receives the external input and read head output and produces the addressing vectors and related values (for example shift weighting) for the read and write heads. It also produces an external output.<br />
<br />
Different types of controllers can be used. The paper discusses feed-forward and LSTM controllers. Feed-forward controllers are simpler, but more limited than LSTM controllers, since the types of operations they can perform are limited by the number of concurrent read and write heads. The LSTM controller, given its internal register-like memory, does not suffer from this limitation.<br />
<br />
= Experiments =<br />
The authors wanted to see whether a network trained to copy sequences of length up to 20 could copy a sequence of length 100 with no further training. For all of the experiments, three architectures were compared: an NTM with a feedforward controller, an NTM with an LSTM controller, and a standard LSTM network. All the tasks were supervised learning problems with binary targets; all networks had logistic sigmoid output layers and were trained with the cross-entropy objective function. Sequence prediction errors are reported in bits per sequence.<br />
<br />
= Results =<br />
<br />
The authors tested the NTM with a feed-forward and an LSTM controller against a pure LSTM on multiple tasks:<br />
<br />
* Copy Task: An input sequence has to be reproduced.<br />
* Repeat Copy Task: An input sequence has to be reproduced multiple times.<br />
* Associative Recall: After providing an input sequence the network is queried with one item of the sequence and has to produce the next.<br />
* Dynamic N-Grams: Predict the probability of the next bit being 0 or 1 given the last six bits.<br />
* Priority Sort: Sort an input sequence according to given priorities.<br />
<br />
<br />
[[File:copy_convergence.png|frame|center|Copy Task Learning Curve]]<br />
[[File:repeat_copy_convergence.png|frame|center|Repeat Copy Task Learning Curve]]<br />
[[File:recall_convergence.png|frame|center|Associative Recall Learning Curve]]<br />
[[File:ngrams_convergence.png|frame|center|Dynamic N-Grams Learning Curve]]<br />
[[File:sort_convergence.png|frame|center|Priority Sort Learning Curve]]<br />
<br />
[[File:ntm_feedforward_settings.png|frame|center|NTM with Feedforward Controller Experimental Settings]]<br />
[[File:ntm_ltsm_settings.png|frame|center|NTM with LSTM Controller Experimental Settings]]<br />
[[File:ltsm_settings.png|frame|center|LSTM Controller Experimental Settings]]<br />
<br />
In all tasks, the NTM with either a feedforward or an LSTM controller converges faster and obtains better generalization than a pure LSTM.<br />
<br />
<br />
= Discussion =<br />
* While the experimental results show great promise for the NTM architecture, the paper would benefit from a more in-depth discussion of why the NTM performs so well with either a feedforward or an LSTM controller compared to a pure LSTM.<br />
<br />
* The difference in convergence between the feedforward and LSTM controllers appears to hinge on whether the task requires the LSTM's internal memory in addition to the NTM's external memory as an effective way to store data. Otherwise, the two controllers are comparable in terms of performance.<br />
<br />
* One might be skeptical about the tuning effort involved: the paper gives the impression that the authors spent a lot of time tuning the NTM's number of heads and controller size in order to achieve the desired results for publication.<br />
<br />
* It would be interesting to know quantitatively how the NTM compares against other approaches, such as using Genetic Programming to evolve Turing machines <ref>Naidoo, Amashini, and Nelishia Pillay. "Using genetic programming for turing machine induction." Genetic Programming. Springer Berlin Heidelberg, 2008. 350-361.</ref>, whose output is a "program": since a program does not rely on learned weights, it should in theory be more robust and require far fewer parameters.<br />
<br />
=References=<br />
<references/></div>
<hr />
<div>= Neural Turing Machines =<br />
<br />
Even though recurrent neural networks (RNNs) are [https://en.wikipedia.org/wiki/Turing_completeness Turing complete] in theory, the control of logical flow and usage of external memory have been largely ignored in the machine learning literature. This might be due to the fact that the RNNs have to be wired properly to achieve the Turing completeness and this is not necessarily easy to achieve in practice. By adding an addressable memory Graves et al. try to overcome this limitation and name their approach Neural Turing Machine (NTM) as analogy to [https://en.wikipedia.org/wiki/Turing_machine Turing machines] that are finite-state machines extended with an infinite memory. Furthermore, every component of an NTM is differentiable and can, thus, be learned.<br />
<br />
== Theoretical Background == <br />
<br />
The authors state that the design of the NTM is inspired by past research spanning the disciplines of neuroscience, psychology, cognitive science and linguistics, and that the NTM can be thought of as a working memory system of the sort described in various accounts of cognitive architecture. However, the authors propose to ignore the known capacity limitations of working memory, and to introduce sophisticated gating and memory addressing operations that are typically absent in models of sort developed throughout the computational neuroscience literature. <br />
<br />
With respect to historical precedents in the cognitive science and linguistics literature, the authors situate their work in relation to a longstanding debate concerning the effectiveness of neural networks for cognitive modeling. They present their work as continuing and advancing a line of research on encoding recursively structured representations in neural networks that stemmed out of criticisms presented by Fodor and Pylyshyn in 1988 <ref name=fodor>Fodor, J. A., & Pylyshyn, Z. W. (1988). [http://www.sciantaanalytics.com/sites/default/files/fodor-pylyshyn.pdf Connectionism and cognitive architecture: A critical analysis.] Cognition, 28(1), 3-71.</ref> (though it is worth pointing out the authors give an incorrect summary of these criticisms - they state that Fodor and Pylyshyn argued that neural networks could not implement variable binding or perform tasks involving variable length structures, when in fact they argued that successful models of cognition require representations with constituent structure and processing mechanisms that strictly structure sensitive - see <ref name=fodor></ref> for details). The NTM is able to deal variable length inputs and arguably performs variable binding in the sense that the memory slots in the NTM can be treated as variables to which data is bound, but the authors do not revisit these issues in any detail after presenting the results of their simulations with the NTM.<br />
<br />
= Architecture =<br />
<br />
A Neural Turing Machine consists of a memory and a controller neural network. The controller receives input and produces output with help of the memory that is addressed with a content- and location based addressing mechanism. Figure 1 presents a high-level diagram of the NTM architecture.<br />
<br />
<center><br />
[[File:Pre_11.PNG | frame | center |Figure 1: Neural Turing Machine Architecture. During each update cycle, the controller network receives inputs from an external environment and emits outputs in response. It also reads to and writes from a memory matrix via a set of parallel read and write heads. The dashed line indicates the division between the NTM circuit and the outside world. ]]<br />
</center><br />
<br />
<br />
== Memory ==<br />
<br />
The memory at time <math>t</math> is given by an <math>N \times M</math> matrix <math>M_t</math>, where <math>N</math> is the number of memory locations and <math>M</math> the vector size at each memory location. To address memory locations for reading or writing an <math>N</math>-element vector <math>w_t</math> is used. The elements in this vector need to satisfy <math>0 \leq w_t(i) \leq 1</math> and have to sum to 1. Thus, it gives weighting of memory locations and the access to a location might be blurry.<br />
<br />
=== Reading ===<br />
<br />
Given an address <math>w_t</math> the read vector is just the weighted sum of memory locations:<br />
<br />
<math>r_t \leftarrow \sum_i w_t(i) M_t(i)</math><br />
<br />
which is clearly differentiable with respect to both the memory and the weighting.<br />
<br />
=== Writing ===<br />
<br />
The write process is split up into an erase and an add operation (inspired by the input and forget gates in LSTM). This allows the NTM to both overwrite or add to a memory location in a single time step. Otherwise it would be necessary to do a read for one of the operations first before the updated result can be written.<br />
<br />
The erase update is given by<br />
<br />
<math>\tilde{M}_t(i) \leftarrow M_{t-1}(i) [1 - w_t(i) e_t]</math><br />
<br />
with an <math>M</math>-element ''erase vector'' <math>e_t</math> with elements in the range <math>(0, 1)</math> selecting which vector elements to reset at the memory locations selected by <math>w_t</math>.<br />
<br />
Afterwords an ''add vector'' <math>a_t</math> is added according to<br />
<br />
<math>M_t(i) \leftarrow \tilde{M}_t(i) + w_t(i) a_t.</math><br />
<br />
The order in which the adds are performed by multiple heads is irrelevant. The combined erase and add operations of all the write heads produce the final content of the memory at time ''t''.<br />
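The erase-then-add write can likewise be sketched in numpy (toy values; a single write head assumed):

```python
import numpy as np

N, M = 4, 3
M_prev = np.ones((N, M))                 # M_{t-1}
w_t = np.array([0.0, 1.0, 0.0, 0.0])     # focus entirely on location 1
e_t = np.array([1.0, 0.0, 0.5])          # erase vector, elements in (0, 1)
a_t = np.array([0.0, 2.0, 0.0])          # add vector

# Erase: M~_t(i) = M_{t-1}(i) * (1 - w_t(i) e_t)
M_tilde = M_prev * (1.0 - np.outer(w_t, e_t))
# Add:   M_t(i)  = M~_t(i) + w_t(i) a_t
M_t = M_tilde + np.outer(w_t, a_t)
print(M_t[1])  # row 1: first element erased, second incremented, third halved
```

Rows with zero weight are left untouched, so a sharp weighting gives location-selective writes while still being differentiable.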
<br />
== Addressing Mechanisms ==<br />
<br />
Two methods, content-based addressing and location-based addressing, are employed to generate the read/write weightings <math>w_t</math>. Depending on the task either mechanism can be more appropriate. These two methods of addressing are summarized in the figure below.<br />
<br />
[[File:flow_diagram_addressing_mechanism.JPG | center]]<br />
<br />
=== Content-based addressing ===<br />
<br />
For content-addressing, each head (whether employed for reading or writing) first produces a length-<math>M</math> key vector <math>k_t</math> that is compared to each vector <math>M_t(i)</math> by a similarity measure <math>K[\cdot,\cdot]</math>. The content-based system produces a normalised weighting <math>w_t^c</math> based on the similarity and a positive key strength, <math>\beta_t</math>, which can amplify or attenuate the precision of the focus:<br />
<br />
<br />
<math><br />
w_t^c(i) \leftarrow \frac{\exp(\beta_t K[k_t,M_t(i)])}{\sum_{j} \exp(\beta_t K[k_t,M_t(j)])}<br />
</math><br />
<br />
In the current implementation, the similarity measure is cosine similarity:<br />
<br />
<math><br />
K[u,v] = \frac{u \cdot v}{\|u\| \, \|v\|}<br />
</math><br />
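A minimal sketch of content-based addressing under these definitions (toy memory and key; `content_weights` is an illustrative helper, not from the paper):

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

M_t = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])
k_t = np.array([1.0, 0.1])   # key close to the first memory row

def content_weights(k, mem, beta):
    # softmax over beta-scaled cosine similarities
    sims = np.array([cosine(k, row) for row in mem])
    z = np.exp(beta * sims)
    return z / z.sum()

print(content_weights(k_t, M_t, beta=1.0))    # mild focus on row 0
print(content_weights(k_t, M_t, beta=50.0))   # large beta sharpens the focus
```

Increasing <math>\beta_t</math> concentrates nearly all of the weight on the best-matching location.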
<br />
=== Location-based addressing ===<br />
<br />
The location-based addressing mechanism is designed to facilitate both simple iteration across the locations of the memory and random-access jumps. It does so by implementing a rotational shift of a weighting. Prior to rotation, each head emits a scalar interpolation gate <math>g_t</math> in the range (0, 1). The value of <math>g_t</math> is used to blend between the weighting <math>w_{t-1}</math> produced by the head at the previous time-step and the weighting <math>w_t^c</math> produced by the content system at the current time-step, yielding the gated weighting <math>w_t^g</math>:<br />
<br />
<math><br />
w_t^g \leftarrow g_t w_t^c + (1-g_t) w_{t-1}<br />
</math><br />
<br />
After interpolation, each head emits a shift weighting <math>s_t</math> that defines a normalised distribution over the allowed integer shifts. Each element in this vector gives the degree by which different integer shifts are performed. For example, if shifts of -1, 0, 1 are allowed a (0, 0.3, 0.7) shift vector would denote a shift of 1 with strength 0.7 and a shift of 0 (no-shift) with strength 0.3. The actual shift is performed with a circular convolution<br />
<br />
<math>\tilde{w}_t(i) \leftarrow \sum_{j=0}^{N-1} w_t^g(j) s_t(i - j)</math><br />
<br />
where all index arithmetic is modulo <math>N</math>. This circular convolution can lead to blurring of the weighting, so <math>\tilde{w}_t</math> is sharpened with<br />
<br />
<math>w_t(i) \leftarrow \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j<br />
\tilde{w}_t(j)^{\gamma_t}}</math><br />
<br />
where <math>\gamma_t \geq 1</math> is an additional scalar emitted by the write head.<br />
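The three location-based steps (interpolation, circular shift, sharpening) can be sketched together in numpy (toy values; `location_address` is a hypothetical helper assuming allowed shifts {-1, 0, +1}):

```python
import numpy as np

def location_address(w_prev, w_c, g, s, gamma):
    # 1) interpolate with the previous weighting
    w_g = g * w_c + (1 - g) * w_prev
    # 2) circular convolution with shift distribution s = [s_-1, s_0, s_+1];
    #    index arithmetic is modulo N
    N = len(w_g)
    w_shift = np.zeros(N)
    for i in range(N):
        for k, shift in enumerate((-1, 0, 1)):
            w_shift[i] += w_g[(i - shift) % N] * s[k]
    # 3) sharpen with gamma >= 1 to undo blurring, then renormalise
    w = w_shift ** gamma
    return w / w.sum()

w_prev = np.array([0.25, 0.25, 0.25, 0.25])
w_c = np.array([1.0, 0.0, 0.0, 0.0])
s = np.array([0.0, 0.0, 1.0])          # pure shift of +1
w = location_address(w_prev, w_c, g=1.0, s=s, gamma=10.0)
print(w)  # focus moves from location 0 to location 1
```

With a soft shift vector (e.g. (0, 0.3, 0.7)) the convolution smears the weighting across two locations, which is exactly what the sharpening step counteracts.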
<br />
== Controller ==<br />
<br />
The controller receives the external input and read head output and produces the addressing vectors and related values (for example shift weighting) for the read and write heads. It also produces an external output.<br />
<br />
Different types of controllers can be used. The paper discusses feed-forward and LSTM controllers. Feed-forward controllers are simpler, but more limited than LSTM controllers, since the types of operations they can perform are constrained by the number of concurrent read and write heads. The LSTM controller, given its internal register-like memory, does not suffer from this limitation.<br />
<br />
= Experiments =<br />
The authors wanted to see whether a network trained to copy sequences of length up to 20 could copy a sequence of length 100 with no further training. For all of the experiments, three architectures were compared: an NTM with a feedforward controller, an NTM with an LSTM controller, and a standard LSTM network. All the tasks were supervised learning problems with binary targets; all networks had logistic sigmoid output layers and were trained with the cross-entropy objective function. Sequence prediction errors are reported in bits-per-sequence.<br />
<br />
= Results =<br />
<br />
The authors tested the NTM with a feed-forward and an LSTM controller against a pure LSTM on multiple tasks:<br />
<br />
* Copy Task: An input sequence has to be reproduced.<br />
* Repeat Copy Task: An input sequence has to be reproduced multiple times.<br />
* Associative Recall: After providing an input sequence the network is queried with one item of the sequence and has to produce the next.<br />
* Dynamic N-Grams: Predict the probability of the next bit being 0 or 1 given the last six bits.<br />
* Priority Sort: Sort an input sequence according to given priorities.<br />
<br />
<br />
[[File:copy_convergence.png|frame|center|Copy Task Learning Curve]]<br />
[[File:repeat_copy_convergence.png|frame|center|Repeat Copy Task Learning Curve]]<br />
[[File:recall_convergence.png|frame|center|Associative Recall Learning Curve]]<br />
[[File:ngrams_convergence.png|frame|center|Dynamic N-Grams Learning Curve]]<br />
[[File:sort_convergence.png|frame|center|Priority Sort Learning Curve]]<br />
<br />
[[File:ntm_feedforward_settings.png|frame|center|NTM with Feedforward Controller Experimental Settings]]<br />
[[File:ntm_ltsm_settings.png|frame|center|NTM with LSTM Controller Experimental Settings]]<br />
[[File:ltsm_settings.png|frame|center|LSTM Controller Experimental Settings]]<br />
<br />
In all tasks the NTM, with either a feedforward or an LSTM controller, converges faster and generalizes better than a pure LSTM network.<br />
<br />
<br />
= Discussion =<br />
* While the experimental results show great promise for the NTM architecture, the paper would benefit from a more in-depth discussion of the experimental results, in particular why the NTM performs so well with either a feedforward or an LSTM controller compared to a pure LSTM.<br />
<br />
* The convergence difference between choosing a feedforward versus an LSTM controller for the NTM appears to hinge on whether the task is better served by the LSTM's internal memory or the NTM's external memory as a way to store data. Otherwise the two controllers perform comparably.<br />
<br />
* One can be a bit skeptical about the effort spent tuning the LSTM baseline; the paper gives the impression that the authors spent a lot of time tuning the NTM with different numbers of heads and controller sizes in order to achieve the desired results for publication.<br />
<br />
* It would be interesting to know quantitatively how the NTM compares against other algorithms, such as [https://en.wikipedia.org/wiki/Genetic_programming Genetic Programming] used to evolve Turing machines <ref>Naidoo, Amashini, and Nelishia Pillay. "Using genetic programming for turing machine induction." Genetic Programming. Springer Berlin Heidelberg, 2008. 350-361.</ref>, whose output is a "program". In theory such a program should be better because it does not rely on weights; it should be more robust and require far fewer parameters.<br />
<br />
=References=<br />
<references/></div>Alcaterihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations&diff=27269learning Phrase Representations2015-12-13T05:11:00Z<p>Alcateri: /* Experiments */ - Added some specifics to the experimental results</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, . . . , y_{T'} | x_1, . . . , x_T )</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes.<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state <math>h_t</math>. However, as shown in figure 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by<br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,y_{t-1},\mathbf{c})</math> <br/><br />
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of the model parameters and each <math>(\mathbf{y}_n,\mathbf{x}_n)</math> is an (input sequence, output sequence) pair from the training set. In our case, as the output of the decoder, starting from the input, is differentiable, the model parameters can be estimated by a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, the model can be used in two ways. One way is to use the model to generate a target sequence given an input sequence. On the other hand, the model can be used to score a given pair of input and output sequences.<br />
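As a rough illustration of this encode-and-score pattern, here is a toy numpy sketch with random, untrained weights. The transition functions are plain tanh layers rather than the gated units the paper actually uses, and the start-symbol handling (index 0) is an assumption made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 5, 8                      # vocabulary size, hidden size (toy)
E = rng.normal(0, 0.1, (V, H))   # symbol embeddings
W_enc = rng.normal(0, 0.1, (H, H))
W_dec = rng.normal(0, 0.1, (H, H))
W_out = rng.normal(0, 0.1, (H, V))

def encode(x_seq):
    h = np.zeros(H)
    for x in x_seq:                       # h_t = f(h_{t-1}, x_t)
        h = np.tanh(h @ W_enc + E[x])
    return h                              # summary c of the whole input

def log_score(x_seq, y_seq):
    c = encode(x_seq)
    h, y_prev, total = c.copy(), 0, 0.0   # y_prev=0: assumed start symbol
    for y in y_seq:                       # h_t = f(h_{t-1}, y_{t-1}, c)
        h = np.tanh(h @ W_dec + E[y_prev] + c)
        logits = h @ W_out                # g(h_t, y_{t-1}, c) via softmax
        logp = logits - np.log(np.exp(logits).sum())
        total += logp[y]
        y_prev = y
    return total                          # log p(y | x)

print(log_score([1, 2, 3], [3, 2, 1]))
```

The same `log_score` shape serves both uses described above: generation samples from the softmax instead of indexing it, while scoring sums the log-probabilities of a given pair.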
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
This paper also propose a new type of hidden node that has been inspired by LSTM but is much simpler to compute and implement. Fig. 2 shows the graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h˜. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically it can be expressed as follows (<math>\sigma</math> is the logistic sigmoid function, <math>[.]_j</math> denotes the j-th element of a vector, and <math>\odot</math> denotes elementwise multiplication):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus, allowing a more compact representation. On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember longterm information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref><br />
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks[C]//Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013: 8624-8628.<br />
</ref><br />
<br />
Because each hidden unit has separate gates, it is possible for each hidden unit to learn to capture dependencies over different lengths of time (determined by the frequency at which its reset and update gates are active).<br />
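These equations describe what is now commonly called a gated recurrent unit. A minimal numpy sketch of one update step, with toy dimensions and random untrained weights (an illustration of the update rule, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 4, 6   # input and hidden sizes (illustrative)
W_r, U_r = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))
W_z, U_z = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))
W,   U   = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h_prev, x):
    r = sigmoid(W_r @ x + U_r @ h_prev)          # reset gate
    z = sigmoid(W_z @ x + U_z @ h_prev)          # update gate
    h_tilde = np.tanh(W @ x + U @ (r * h_prev))  # candidate state
    return z * h_prev + (1.0 - z) * h_tilde      # blend old and new state

h = np.zeros(H)
for x in rng.normal(size=(3, D)):    # run over a short input sequence
    h = gru_step(h, x)
print(h.shape)  # (6,)
```

Note the convention follows the equations above: the update gate <math>z</math> multiplies the previous state, so <math>z \approx 1</math> means "keep the old state", mirroring an LSTM memory cell.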
<br />
=Scoring Phrase Pairs with RNN Encoder–Decoder =<br />
<br />
In a commonly used statistical machine translation system (SMT), the goal of the system (decoder, specifically) is to find a translation f given a source sentence e, which maximizes<br />
<br />
<math>p(\mathbf{f} | \mathbf{e})\propto p(\mathbf{e} | \mathbf{f})p(\mathbf{f})</math><br />
<br />
where the first term on the right-hand side is called the translation model and the latter the language model (see, e.g., (Koehn, 2005)). In practice, however, most SMT systems model <math>\log p(\mathbf{f} | \mathbf{e})</math> as a log-linear model with additional features and corresponding weights:<br />
<br />
<math>\log p(\mathbf{f} | \mathbf{e})=\sum_{n=1}^Nw_nf_n(\mathbf{f},\mathbf{e})+\log Z(\mathbf{e})</math><br />
<br />
where <math>f_n</math> and <math>w_n</math> are the n-th feature and weight, respectively. <math>Z(\mathbf{e})</math> is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.<br />
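A tiny numeric illustration of this scoring rule (all feature values and weights below are made up for the example):

```python
import numpy as np

# hypothetical feature values f_n(f, e), e.g. translation-model score,
# language-model score, RNN Encoder-Decoder score (all log-domain)
features = np.array([-2.3, -1.1, -4.0])
weights  = np.array([1.0, 0.6, 0.4])   # tuned to maximize BLEU on a dev set
log_Z = 0.0                            # normalizer; constant across hypotheses

log_p = weights @ features + log_Z
print(log_p)  # weighted feature sum, about -4.56
```

Since <math>Z(\mathbf{e})</math> does not depend on the candidate translation, ranking hypotheses for a given source sentence only requires comparing the weighted feature sums.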
<br />
Cho et al. propose to train the RNN Encoder–Decoder on a table of phrase pairs and use its scores as additional features in the loglinear model showed above when tuning the SMT decoder. For training the RNN-Encoder-Decoder, phrase frequency is ignored for several reasons: to reduce computation time, to ensure the model does not simply rank phrases by frequency, and because frequency information is already encoded in the features for the SMT (so it's better to not use the capacity of the RNN-Encoder-Decoder redundantly).<br />
<br />
=Alternative Models=<br />
The researchers noted a number of other potential translation models and their usability.<br />
<br />
The first model is by Schwenk and is an application of a variant of the continuous space language model to the task of machine translation. The model is essentially a feedforward neural network with a common projection for input words encoded as bag-of-words vectors. Schwenk fixed the input and output sentence lengths; for a fixed length, the neural network estimates the probability of the output sequence of words and scores potential translations. A major disadvantage is that, because the input and output lengths are fixed, the model cannot handle variable-length inputs or outputs.<br />
<br />
The model figure<ref><br />
[Schwenk2012] Holger Schwenk. 2012. Continuous<br />
space translation models for phrase-based statistical<br />
machine translation. In Martin Kay and Christian<br />
Boitet, editors, Proceedings of the 24th International<br />
Conference on Computational Linguistics<br />
(COLIN), pages 1071–1080.<br />
</ref>:<br />
<br />
[[File:CONTINUOUS.PNG]]<br />
<br />
Another model, similar to Schwenk's, is by Devlin; it also uses a feedforward neural network. Rather than estimating the probability of the entire output sequence of words as in Schwenk's model, Devlin estimates only the probability of the next word, using both a portion of the input sentence and a portion of the output sentence. It reported impressive improvements but, like Schwenk's model, it fixes the input length prior to training.<br />
<br />
Chandar et al. trained a feedforward neural network to learn a mapping from a bag-of-words representation of an input phrase to an output phrase.<ref><br />
Lauly, Stanislas, et al. "An autoencoder approach to learning bilingual word representations." Advances in Neural Information Processing Systems. 2014.<br />
</ref> This is closely related to both the proposed RNN Encoder–Decoder and the model<br />
proposed by Schwenk, except that their input representation of a phrase is a bag-of-words. A similar approach of using bag-of-words representations was proposed by Gao<ref><br />
Gao, Jianfeng, et al. "Learning semantic representations for the phrase translation model." arXiv preprint arXiv:1312.0482 (2013).<br />
</ref> as well. One important difference between the proposed RNN Encoder–Decoder and the above approaches is that the order of the words in source and target phrases is taken into account. The RNN Encoder–Decoder naturally distinguishes between sequences that have the same words but in a different order, whereas the aforementioned approaches effectively ignore order information.<br />
<br />
=Experiments =<br />
<br />
The model is evaluated on the English/French translation task of the WMT’14 workshop. In building the model Cho et al. used baseline phrase-based SMT system and a Neural Language Model(CSLM)<ref><br />
Schwenk H, Costa-Jussa M R, Fonollosa J A R. Continuous space language models for the IWSLT 2006 task[C]//IWSLT. 2006: 166-173.<br />
</ref><br />
<br />
The following model combinations were tested:<br />
# Baseline configuration<br />
# Baseline + RNN<br />
# Baseline + CSLM + RNN<br />
# Baseline + CSLM + RNN + Word penalty<br />
<br />
The results are shown in Figure 3. The RNN encoder-decoder consisted of 1000 hidden units. Rank-100 matrices were used to connect the input to the hidden unit. The "word penalty" attempts to penalize the words unknown to the neural network, which is accomplished by using the number of unknown words as a feature in the log-linear model above. <br />
<br />
<center><br />
[[File:encdec3.png |frame | center |Fig 3. BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, which penalizes the number of words unknown to the neural networks. ]]<br />
</center><br />
<br />
The best performance was achieved when both the CSLM and the phrase scores from the RNN Encoder–Decoder were used. This suggests that the contributions of the CSLM and the RNN Encoder–Decoder are not strongly correlated and that one can expect better results by improving each method independently.<br />
<br />
<br />
== Word and Phrase Representations ==<br />
<br />
As the presented model maps sentences into a continuous space vector and prior continuous space language models have been known to learn semantically meaningful embeddings, one could expect this to happen for the presented model, too. This is indeed the case. When projecting to a 2D space (with Barnes-Hut-SNE), semantically similar words are clearly clustered.<br />
<br />
[[File:Fig4.png]]<br />
<br />
Phrases are also clustered capturing both semantic and syntactic structures.<br />
<br />
[[File:Fig5.png]]<br />
<br />
= References=<br />
<references /></div>Alcaterihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=extracting_and_Composing_Robust_Features_with_Denoising_Autoencoders&diff=27268extracting and Composing Robust Features with Denoising Autoencoders2015-12-13T03:31:06Z<p>Alcateri: /* Analysis of the Denoising Autoencoder */ - Included section on Stochastic Operator Perspective</p>
<hr />
<div>= Introduction =<br />
This paper explores a new training principle for unsupervised learning<br />
of a representation based on the idea of making the learned representations<br />
robust to partial corruption of the input pattern. This approach can<br />
be used to train autoencoders, and these denoising autoencoders can be<br />
stacked to initialize deep architectures. The algorithm can be motivated<br />
from a manifold learning and information theoretic perspective or from a<br />
generative model perspective.<br />
== Motivation ==<br />
<br />
The approach is based on the use of an unsupervised<br />
training criterion to perform a layer-by-layer initialization. The procedure is as follows :<br />
Each layer is at first trained to produce a higher level (hidden) representation of the observed patterns,<br />
based on the representation it receives as input from the layer below, by<br />
optimizing a local unsupervised criterion. Each level produces a representation<br />
of the input pattern that is more abstract than the previous level’s, because it<br />
is obtained by composing more operations. This initialization yields a starting<br />
point, from which a global fine-tuning of the model’s parameters is then performed<br />
using another training criterion appropriate for the task at hand.<br />
<br />
This process gives better solutions than those obtained by random initialization.<br />
<br />
= The Denoising Autoencoder =<br />
<br />
A Denoising Autoencoder reconstructs<br />
a clean “repaired” input from a corrupted, partially destroyed one. This<br />
is done by first corrupting the initial input <math>x</math> to get a partially destroyed version<br />
<math>\tilde{x}</math> by means of a stochastic mapping. In this paper the noise is added by randomly zeroing a fixed number, <math>v_d</math>, of components and leaving the rest untouched.<br />
Thus the objective function can be described as<br />
[[File:W1.png]]<br />
<br />
The objective function minimized by<br />
stochastic gradient descent becomes: <br />
[[File:W2.png]]<br />
<br />
where the loss function is the cross-entropy of the model.<br />
The denoising autoencoder is illustrated in the figure below:<br />
<br />
[[File:W3.png]]<br />
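A minimal sketch of the denoising criterion follows. Details not fixed by this summary are assumptions: sigmoid encoder and decoder layers, tied weights, random untrained parameters, and no gradient step (only the loss computation is shown).

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 4                      # input and hidden dimensions (illustrative)
W = rng.normal(0, 0.1, (h, d))   # encoder weights (decoder uses W.T: tied, assumed)
b, c = np.zeros(h), np.zeros(d)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def corrupt(x, v_d):
    """Zero v_d randomly chosen components, leaving the rest untouched."""
    x_tilde = x.copy()
    idx = rng.choice(len(x), size=v_d, replace=False)
    x_tilde[idx] = 0.0
    return x_tilde

def reconstruct(x_tilde):
    y = sigmoid(W @ x_tilde + b)         # hidden representation
    return sigmoid(W.T @ y + c)          # reconstruction z

x = rng.integers(0, 2, size=d).astype(float)   # a binary input pattern
z = reconstruct(corrupt(x, v_d=3))
# cross-entropy between the *clean* x and the reconstruction z
loss = -(x * np.log(z) + (1 - x) * np.log(1 - z)).sum()
print(loss)
```

The key point the sketch makes concrete: the loss compares the reconstruction against the uncorrupted input, so the network must learn to undo the corruption rather than merely copy its input.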
<br />
= Layer-wise Initialization and Fine Tuning =<br />
<br />
While training the stack of denoising autoencoders, the output of the k-th layer is used as<br />
input for the (k + 1)-th, and the (k + 1)-th layer is trained after the k-th has been<br />
trained. After a few layers have been trained, the parameters are used as initialization<br />
for a network optimized with respect to a supervised training criterion.<br />
This greedy layer-wise procedure has been shown to yield significantly better<br />
local minima than random initialization of deep networks,<br />
achieving better generalization on a number of tasks.<br />
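The greedy layer-wise procedure can be sketched as follows; `train_dae_layer` is a hypothetical stand-in that returns a random untrained encoder, where the paper would actually fit each layer with the denoising criterion first.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_dae_layer(X, n_hidden):
    # stand-in for unsupervised denoising training of one layer
    W = rng.normal(0, 0.1, (X.shape[1], n_hidden))
    return lambda A: np.tanh(A @ W)     # encoder for this layer

def pretrain_stack(X, layer_sizes):
    encoders, rep = [], X
    for size in layer_sizes:            # train layer k+1 after layer k
        enc = train_dae_layer(rep, size)
        encoders.append(enc)
        rep = enc(rep)                  # higher-level input for the next layer
    return encoders                     # initialization before fine-tuning

X = rng.normal(size=(10, 8))            # toy dataset
stack = pretrain_stack(X, [6, 4, 3])
out = X
for enc in stack:
    out = enc(out)
print(out.shape)  # (10, 3)
```

After pretraining, the stacked encoders would initialize a network that is then fine-tuned end-to-end with a supervised criterion.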
<br />
= Analysis of the Denoising Autoencoder =<br />
== Manifold Learning Perspective ==<br />
<br />
<br />
The process of mapping a corrupted example to an uncorrupted one can be<br />
visualized in Figure 2, with a low-dimensional manifold <math>\mathcal{M}</math> near which the data<br />
concentrate. We learn a stochastic operator <math>p(X|\tilde{X})</math> that maps an <math>\tilde{X}</math> to an <math>X\,</math>.<br />
<br />
<br />
[[File:q4.png]]<br />
<br />
Since the corrupted points <math>\tilde{X}</math> will likely not be on <math>\mathcal{M}</math>, the learned map <math>p(X|\tilde{X})</math> is able to determine how to transform points away from <math>\mathcal{M}</math> into points on <math>\mathcal{M}</math>.<br />
<br />
The denoising autoencoder can thus be seen as a way to define and learn a<br />
manifold. The intermediate representation <math>Y = f(X)</math> can be interpreted as a<br />
coordinate system for points on the manifold (this is most clear if we force the<br />
dimension of <math>Y</math> to be smaller than the dimension of <math>X</math>). More generally, one can<br />
think of <math>Y = f(X)</math> as a representation of <math>X</math> which is well suited to capture the<br />
main variations in the data, i.e., on the manifold. When additional criteria (such<br />
as sparsity) are introduced in the learning model, one can no longer directly view<br />
<math>Y = f(X)</math> as an explicit low-dimensional coordinate system for points on the<br />
manifold, but it retains the property of capturing the main factors of variation<br />
in the data.<br />
<br />
== Stochastic Operator Perspective ==<br />
<br />
The denoising autoencoder can also be seen as corresponding to a semi-parametric model that can be sampled from. Define the joint distribution as follows: <br />
<br />
:<math>p(X, \tilde{X}) = p(\tilde{X}) p(X|\tilde{X}) = q^0(\tilde{X}) p(X|\tilde{X}) </math> <br />
<br />
from the stochastic operator <math>p(X | \tilde{X})</math>, with <math>q^0\,</math> being the empirical distribution.<br />
<br />
Using the Kullback-Leibler divergence, defined as:<br />
<br />
:<math>\mathbb{D}_{KL}(p|q) = \mathbb{E}_{p(X)} \left(\log\frac{p(X)}{q(X)}\right) </math><br />
<br />
then minimizing <math>\mathbb{D}_{KL}(q^0(X, \tilde{X}) | p(X, \tilde{X})) </math> yields the originally-formulated denoising criterion. Furthermore, as this objective is minimized, the marginals of <math>\,p</math> approach those of <math>\,q^0</math>, i.e. <math> p(X) \rightarrow q^0(X)</math>. Then, if <math>\,p</math> is expanded in the following way:<br />
<br />
:<math> p(X) = \frac{1}{n}\sum_{i=1}^n \sum_{\tilde{\mathbf{x}}} p(X|\tilde{X} = \tilde{\mathbf{x}}) q_{\mathcal{D}}(\tilde{\mathbf{x}} | \mathbf{x}_i) </math><br />
<br />
it becomes clear that the denoising autoencoder learns a semi-parametric model that can be sampled from (since <math>p(X)</math> above is easy to sample from). <br />
<br />
== Information Theoretic Perspective ==<br />
<br />
It is also possible to adopt an information theoretic perspective. The representation of the autoencoder should retain as much information as possible while at the same time certain properties, like a limited complexity, are imposed on the marginal distribution. This can be expressed as an optimization of <math>\arg\max_{\theta} \{I(X;Y) + \lambda \mathcal{J}(Y)\}</math> where <math>I(X; Y)</math> is the mutual information between an input sample <math>X</math> and the hidden representation <math>Y</math> and <math>\mathcal{J}</math> is a functional expressing the preference over the marginal. The hyper-parameter <math>\lambda</math> controls the trade-off between maximizing the mutual information and keeping the marginal simple.<br />
<br />
Note that this reasoning also applies to the basic autoencoder, but the denoising autoencoder maximizes the mutual information between <math>X</math> and <math>Y</math> while <math>Y</math> can also be a function of corrupted input.<br />
<br />
= Experiments =<br />
The input data contain different<br />
variations of the MNIST digit classification problem, with added factors of<br />
variation such as rotation (rot), addition of a background composed of random<br />
pixels (bg-rand) or made from patches extracted from a set of images (bg-img), or<br />
combinations of these factors (rot-bg-img). These variations render the problems particularly challenging for current generic learning algorithms. Each problem<br />
is divided into a training, validation, and test set (10000, 2000, 50000 examples<br />
respectively). A subset of the original MNIST problem is also included with the<br />
same example set sizes (problem basic). The benchmark also contains additional<br />
binary classification problems: discriminating between convex and non-convex<br />
shapes (convex), and between wide and long rectangles (rect, rect-img).<br />
Neural networks with 3 hidden layers initialized by stacking denoising autoencoders<br />
(SdA-3), and fine tuned on the classification tasks, were evaluated<br />
on all the problems in this benchmark. Model selection was conducted following<br />
a similar procedure as Larochelle et al. (2007). Several values of hyper<br />
parameters (destruction fraction ν, layer sizes, number of unsupervised training<br />
epochs) were tried, combined with early stopping in the fine tuning phase. For<br />
each task, the best model was selected based on its classification performance<br />
on the validation set.<br />
The results are reported in the following table.<br />
[[File:W5.png]]<br />
<br />
The filters obtained by training are shown in the figure below.<br />
<br />
<br />
[[File:Qq3.png]]<br />
<br />
= Conclusion and Future Work =<br />
<br />
The paper shows a denoising Autoencoder which was motivated by the goal of<br />
learning representations of the input that are robust to small irrelevant changes<br />
in input. Several perspectives also help to motivate it from a manifold learning<br />
perspective and from the perspective of a generative model.<br />
This principle can be used to train and stack autoencoders to initialize a<br />
deep neural network. A series of image classification experiments were performed<br />
to evaluate this new training principle. The empirical results support<br />
the following conclusions: unsupervised initialization of layers with an explicit<br />
denoising criterion helps to capture interesting structure in the input distribution.<br />
This in turn leads to intermediate representations much better suited for<br />
subsequent learning tasks such as supervised classification. The experimental<br />
results with Deep Belief Networks (whose layers are initialized as RBMs) suggest<br />
that RBMs may also encapsulate a form of robustness in the representations<br />
they learn, possibly because of their stochastic nature, which introduces noise<br />
in the representation during training.<br />
<br />
= References =<br />
<br />
Bengio, Y. (2007). Learning deep architectures for AI (Technical Report 1312).<br />
Université de Montréal, dept. IRO.<br />
<br />
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layerwise<br />
training of deep networks. Advances in Neural Information Processing<br />
Systems 19 (pp. 153–160). MIT Press.<br />
<br />
Bengio, Y., & Le Cun, Y. (2007). Scaling learning algorithms towards AI. In<br />
L. Bottou, O. Chapelle, D. DeCoste and J. Weston (Eds.), Large scale kernel<br />
machines. MIT Press.<br />
<br />
Doi, E., Balcan, D. C., & Lewicki, M. S. (2006). A theoretical analysis of<br />
robust coding over noisy overcomplete channels. In Y. Weiss, B. Schölkopf<br />
and J. Platt (Eds.), Advances in neural information processing systems 18,<br />
307–314. Cambridge, MA: MIT Press.<br />
<br />
Doi, E., & Lewicki, M. S. (2007). A theory of retinal population coding. NIPS<br />
(pp. 353–360). MIT Press.<br />
<br />
Elad, M., & Aharon, M. (2006). Image denoising via sparse and redundant<br />
representations over learned dictionaries. IEEE Transactions on Image Processing,<br />
15, 3736–3745.<br />
<br />
Gallinari, P., LeCun, Y., Thiria, S., & Fogelman-Soulie, F. (1987). Memoires<br />
associatives distribuees. Proceedings of COGNITIVA 87. Paris, La Villette<br />
<br />
Hammond, D., & Simoncelli, E. (2007). A machine learning framework for adaptive<br />
combination of signal denoising methods. 2007 International Conference<br />
on Image Processing (pp. VI: 29–32).<br />
<br />
Hinton, G. (1989). Connectionist learning procedures. Artificial Intelligence,<br />
40, 185–234.<br />
<br />
Hinton, G., & Salakhutdinov, R. (2006). Reducing the dimensionality of data<br />
with neural networks. Science, 313, 504–507.<br />
<br />
Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for<br />
deep belief nets. Neural Computation, 18, 1527–1554.<br />
<br />
Hopfield, J. (1982). Neural networks and physical systems with emergent collective<br />
computational abilities. Proceedings of the National Academy of Sciences,<br />
USA, 79.<br />
<br />
Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007).<br />
An empirical evaluation of deep architectures on problems with many factors<br />
of variation. Twenty-fourth International Conference on Machine Learning<br />
(ICML’2007).<br />
<br />
LeCun, Y. (1987). Mod`eles connexionistes de l’apprentissage. Doctoral dissertation,<br />
Universit´e de Paris VI.<br />
<br />
Lee, H., Ekanadham, C., & Ng, A. (2008). Sparse deep belief net model for visual<br />
area V2. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in<br />
neural information processing systems 20. Cambridge, MA: MIT Press.<br />
<br />
McClelland, J., Rumelhart, D., & the PDP Research Group (1986). Parallel<br />
distributed processing: Explorations in the microstructure of cognition, vol. 2.<br />
Cambridge: MIT Press.<br />
<br />
Memisevic, R. (2007). Non-linear latent factor models for revealing structure<br />
in high-dimensional data. Doctoral dissertation, Departement of Computer<br />
Science, University of Toronto, Toronto, Ontario, Canada.<br />
<br />
Ranzato, M., Boureau, Y.-L., & LeCun, Y. (2008). Sparse feature learning for<br />
deep belief networks. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.),<br />
Advances in neural information processing systems 20. Cambridge, MA: MIT<br />
Press.<br />
<br />
Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning<br />
of sparse representations with an energy-based model. Advances in Neural<br />
Information Processing Systems (NIPS 2006). MIT Press.<br />
<br />
Roth, S., & Black, M. (2005). Fields of experts: a framework for learning image<br />
priors. IEEE Conference on Computer Vision and Pattern Recognition (pp.<br />
860–867).</div>Alcaterihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=extracting_and_Composing_Robust_Features_with_Denoising_Autoencoders&diff=27267extracting and Composing Robust Features with Denoising Autoencoders2015-12-13T02:50:20Z<p>Alcateri: /* The Denoising Autoencoder */ - Cleaned up the wording and mathematical notation in this section</p>
<hr />
<div>= Introduction =<br />
This paper explores a new training principle for unsupervised learning<br />
of a representation based on the idea of making the learned representations<br />
robust to partial corruption of the input pattern. This approach can<br />
be used to train autoencoders, and these denoising autoencoders can be<br />
stacked to initialize deep architectures. The algorithm can be motivated<br />
from a manifold learning and information theoretic perspective or from a<br />
generative model perspective.<br />
== Motivation ==<br />
<br />
The approach is based on the use of an unsupervised<br />
training criterion to perform a layer-by-layer initialization. The procedure is as follows:<br />
Each layer is at first trained to produce a higher level (hidden) representation of the observed patterns,<br />
based on the representation it receives as input from the layer below, by<br />
optimizing a local unsupervised criterion. Each level produces a representation<br />
of the input pattern that is more abstract than the previous level’s, because it<br />
is obtained by composing more operations. This initialization yields a starting<br />
point, from which a global fine-tuning of the model’s parameters is then performed<br />
using another training criterion appropriate for the task at hand.<br />
<br />
This process yields better solutions than those obtained by random initialization.<br />
<br />
= The Denoising Autoencoder =<br />
<br />
A Denoising Autoencoder reconstructs<br />
a clean “repaired” input from a corrupted, partially destroyed one. This<br />
is done by first corrupting the initial input <math>x</math> to get a partially destroyed version<br />
<math>\tilde{x}</math> by means of a stochastic mapping. In this paper the noise is added by randomly zeroing a fixed number <math>\nu d</math> of components and leaving the rest untouched.<br />
Thus the objective function can be described as<br />
[[File:W1.png]]<br />
<br />
The objective function minimized by<br />
stochastic gradient descent becomes: <br />
[[File:W2.png]]<br />
<br />
where the loss function is the cross-entropy between the original input and its reconstruction.<br />
The denoising autoencoder is depicted in the figure below:<br />
<br />
[[File:W3.png]]<br />
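The corruption-and-reconstruction procedure can be sketched in NumPy. This is a minimal illustration and not the authors' implementation: the layer sizes, learning rate, and tied-weight choice are assumptions; the corruption zeroes exactly <math>\nu d</math> randomly chosen components as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def corrupt(x, nu):
    """Zero a fixed number (nu * d) of randomly chosen components of x."""
    x_tilde = x.copy()
    idx = rng.choice(x.shape[0], size=int(nu * x.shape[0]), replace=False)
    x_tilde[idx] = 0.0
    return x_tilde

class DenoisingAutoencoder:
    """Single-layer denoising autoencoder with tied weights and sigmoid units."""

    def __init__(self, d, d_hidden):
        self.W = rng.normal(0.0, 0.1, size=(d_hidden, d))  # tied weight matrix
        self.b = np.zeros(d_hidden)  # encoder bias
        self.c = np.zeros(d)         # decoder bias

    def step(self, x, nu=0.25, lr=0.1):
        """One SGD step on the cross-entropy between x and its reconstruction."""
        x_t = corrupt(x, nu)                # corrupted input x~
        y = sigmoid(self.W @ x_t + self.b)  # hidden representation
        z = sigmoid(self.W.T @ y + self.c)  # reconstruction of the *clean* x
        loss = -np.sum(x * np.log(z + 1e-12) + (1 - x) * np.log(1 - z + 1e-12))
        dz = z - x                          # gradient at decoder pre-activation
        dy = (self.W @ dz) * y * (1 - y)    # gradient at encoder pre-activation
        self.W -= lr * (np.outer(dy, x_t) + np.outer(y, dz))  # both uses of W
        self.b -= lr * dy
        self.c -= lr * dz
        return loss
```

Note that the loss is computed against the clean input <math>x</math>, not the corrupted <math>\tilde{x}</math>: the model is trained to repair the corruption, not merely to reproduce its input.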
<br />
= Layer-wise Initialization and Fine Tuning =<br />
<br />
During training, the representation produced by the denoising autoencoder of the k-th layer is used as<br />
input for the (k + 1)-th, and the (k + 1)-th layer is trained only after the k-th has been<br />
trained. After a few layers have been trained, the parameters are used as initialization<br />
for a network optimized with respect to a supervised training criterion.<br />
This greedy layer-wise procedure has been shown to yield significantly better<br />
local minima than random initialization of deep networks,<br />
achieving better generalization on a number of tasks.<br />
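The greedy layer-wise procedure can be sketched as follows. This is a hypothetical illustration, not the paper's code: `train_denoising_layer` condenses the single-layer training loop, corrupts each component independently with probability ν rather than zeroing exactly <math>\nu d</math> of them, and all sizes and rates are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_denoising_layer(X, d_hidden, nu=0.25, lr=0.1, epochs=50):
    """Fit one tied-weight denoising autoencoder on data X (rows = examples)
    and return the learned encoder."""
    n, d = X.shape
    W = rng.normal(0.0, 0.1, size=(d_hidden, d))
    b, c = np.zeros(d_hidden), np.zeros(d)
    for _ in range(epochs):
        for x in X:
            x_t = x * (rng.random(d) > nu)   # masking corruption
            y = sigmoid(W @ x_t + b)
            z = sigmoid(W.T @ y + c)
            dz = z - x                       # cross-entropy gradient
            dy = (W @ dz) * y * (1 - y)
            W -= lr * (np.outer(dy, x_t) + np.outer(y, dz))
            b -= lr * dy
            c -= lr * dz
    return lambda A, W=W, b=b: sigmoid(A @ W.T + b)

def stack_layers(X, hidden_sizes):
    """Greedy stacking: layer k+1 is trained on the codes of layer k."""
    encoders, H = [], X
    for d_h in hidden_sizes:
        enc = train_denoising_layer(H, d_h)
        encoders.append(enc)
        H = enc(H)          # representation fed to the next layer
    return encoders, H      # encoders would initialize a deep net for fine-tuning
```

After stacking, the encoder weights would initialize a deep network that is fine-tuned with a supervised criterion, as described above.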
<br />
= Analysis of the Denoising Autoencoder =<br />
== Manifold Learning Perspective ==<br />
<br />
<br />
The process of mapping a corrupted example to an uncorrupted one can be<br />
visualized in Figure 2, with a low-dimensional manifold near which the data<br />
concentrate. We learn a stochastic operator <math>p(X|\tilde{X})</math> that maps a corrupted <math>\tilde{X}</math> to an <math>X</math>.<br />
<br />
<br />
[[File:q4.png]]<br />
<br />
<br />
The denoising autoencoder can thus be seen as a way to define and learn a<br />
manifold. The intermediate representation Y = f(X) can be interpreted as a<br />
coordinate system for points on the manifold (this is most clear if we force the<br />
dimension of Y to be smaller than the dimension of X). More generally, one can<br />
think of Y = f(X) as a representation of X which is well suited to capture the<br />
main variations in the data, i.e., on the manifold. When additional criteria (such<br />
as sparsity) are introduced in the learning model, one can no longer directly view<br />
Y = f(X) as an explicit low-dimensional coordinate system for points on the<br />
manifold, but it retains the property of capturing the main factors of variation<br />
in the data.<br />
<br />
<br />
== Information Theoretic Perspective ==<br />
<br />
It is also possible to adopt an information theoretic perspective. The representation of the autoencoder should retain as much information as possible while at the same time certain properties, like a limited complexity, are imposed on the marginal distribution. This can be expressed as an optimization of <math>\arg\max_{\theta} \{I(X;Y) + \lambda \mathcal{J}(Y)\}</math> where <math>I(X; Y)</math> is the mutual information between an input sample <math>X</math> and the hidden representation <math>Y</math> and <math>\mathcal{J}</math> is a functional expressing the preference over the marginal. The hyper-parameter <math>\lambda</math> controls the trade-off between maximizing the mutual information and keeping the marginal simple.<br />
<br />
Note that this reasoning also applies to the basic autoencoder; the denoising autoencoder, however, maximizes the mutual information between <math>X</math> and <math>Y</math> even when <math>Y</math> is computed from a corrupted input.<br />
<br />
= Experiments =<br />
The input data consists of different<br />
variations of the MNIST digit classification problem, with added factors of<br />
variation such as rotation (rot), addition of a background composed of random<br />
pixels (bg-rand) or made from patches extracted from a set of images (bg-img), or<br />
combinations of these factors (rot-bg-img). These variations render the problems particularly challenging for current generic learning algorithms. Each problem<br />
is divided into a training, validation, and test set (10000, 2000, 50000 examples<br />
respectively). A subset of the original MNIST problem is also included with the<br />
same example set sizes (problem basic). The benchmark also contains additional<br />
binary classification problems: discriminating between convex and non-convex<br />
shapes (convex), and between wide and long rectangles (rect, rect-img).<br />
Neural networks with 3 hidden layers initialized by stacking denoising autoencoders<br />
(SdA-3), and fine tuned on the classification tasks, were evaluated<br />
on all the problems in this benchmark. Model selection was conducted following<br />
a similar procedure as Larochelle et al. (2007). Several values of the hyperparameters<br />
(destruction fraction <math>\nu</math>, layer sizes, number of unsupervised training<br />
epochs) were tried, combined with early stopping in the fine tuning phase. For<br />
each task, the best model was selected based on its classification performance<br />
on the validation set.<br />
The results are reported in the following table.<br />
[[File:W5.png]]<br />
<br />
The filters obtained by training are shown in the figure below.<br />
<br />
<br />
[[File:Qq3.png]]<br />
<br />
= Conclusion and Future Work =<br />
<br />
The paper presents a denoising autoencoder motivated by the goal of<br />
learning representations of the input that are robust to small, irrelevant changes<br />
in the input. The approach can also be motivated from a manifold learning<br />
perspective and from the perspective of a generative model.<br />
This principle can be used to train and stack autoencoders to initialize a<br />
deep neural network. A series of image classification experiments were performed<br />
to evaluate this new training principle. The empirical results support<br />
the following conclusions: unsupervised initialization of layers with an explicit<br />
denoising criterion helps to capture interesting structure in the input distribution.<br />
This in turn leads to intermediate representations much better suited for<br />
subsequent learning tasks such as supervised classification. The experimental<br />
results with Deep Belief Networks (whose layers are initialized as RBMs) suggest<br />
that RBMs may also encapsulate a form of robustness in the representations<br />
they learn, possibly because of their stochastic nature, which introduces noise<br />
in the representation during training.<br />
<br />
= References =<br />
<br />
Bengio, Y. (2007). Learning deep architectures for AI (Technical Report 1312).<br />
Université de Montréal, dept. IRO.<br />
<br />
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layerwise<br />
training of deep networks. Advances in Neural Information Processing<br />
Systems 19 (pp. 153–160). MIT Press.<br />
<br />
Bengio, Y., & Le Cun, Y. (2007). Scaling learning algorithms towards AI. In<br />
L. Bottou, O. Chapelle, D. DeCoste and J. Weston (Eds.), Large scale kernel<br />
machines. MIT Press.<br />
<br />
Doi, E., Balcan, D. C., & Lewicki, M. S. (2006). A theoretical analysis of<br />
robust coding over noisy overcomplete channels. In Y. Weiss, B. Schölkopf<br />
and J. Platt (Eds.), Advances in neural information processing systems 18,<br />
307–314. Cambridge, MA: MIT Press.<br />
<br />
Doi, E., & Lewicki, M. S. (2007). A theory of retinal population coding. NIPS<br />
(pp. 353–360). MIT Press.<br />
<br />
Elad, M., & Aharon, M. (2006). Image denoising via sparse and redundant<br />
representations over learned dictionaries. IEEE Transactions on Image Processing,<br />
15, 3736–3745.<br />
<br />
Gallinari, P., LeCun, Y., Thiria, S., & Fogelman-Soulie, F. (1987). Mémoires<br />
associatives distribuées. Proceedings of COGNITIVA 87. Paris, La Villette.<br />
<br />
Hammond, D., & Simoncelli, E. (2007). A machine learning framework for adaptive<br />
combination of signal denoising methods. 2007 International Conference<br />
on Image Processing (pp. VI: 29–32).<br />
<br />
Hinton, G. (1989). Connectionist learning procedures. Artificial Intelligence,<br />
40, 185–234.<br />
<br />
Hinton, G., & Salakhutdinov, R. (2006). Reducing the dimensionality of data<br />
with neural networks. Science, 313, 504–507.<br />
<br />
Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for<br />
deep belief nets. Neural Computation, 18, 1527–1554.<br />
<br />
Hopfield, J. (1982). Neural networks and physical systems with emergent collective<br />
computational abilities. Proceedings of the National Academy of Sciences,<br />
USA, 79.<br />
<br />
Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007).<br />
An empirical evaluation of deep architectures on problems with many factors<br />
of variation. Twenty-fourth International Conference on Machine Learning<br />
(ICML’2007).<br />
<br />
LeCun, Y. (1987). Modèles connexionnistes de l'apprentissage. Doctoral dissertation,<br />
Université de Paris VI.<br />
<br />
Lee, H., Ekanadham, C., & Ng, A. (2008). Sparse deep belief net model for visual<br />
area V2. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in<br />
neural information processing systems 20. Cambridge, MA: MIT Press.<br />
<br />
McClelland, J., Rumelhart, D., & the PDP Research Group (1986). Parallel<br />
distributed processing: Explorations in the microstructure of cognition, vol. 2.<br />
Cambridge: MIT Press.<br />
<br />
Memisevic, R. (2007). Non-linear latent factor models for revealing structure<br />
in high-dimensional data. Doctoral dissertation, Department of Computer<br />
Science, University of Toronto, Toronto, Ontario, Canada.<br />
<br />
Ranzato, M., Boureau, Y.-L., & LeCun, Y. (2008). Sparse feature learning for<br />
deep belief networks. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.),<br />
Advances in neural information processing systems 20. Cambridge, MA: MIT<br />
Press.<br />
<br />
Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning<br />
of sparse representations with an energy-based model. Advances in Neural<br />
Information Processing Systems (NIPS 2006). MIT Press.<br />
<br />
Roth, S., & Black, M. (2005). Fields of experts: a framework for learning image<br />
priors. IEEE Conference on Computer Vision and Pattern Recognition (pp.<br />
860–867).</div>Alcaterihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=the_loss_surfaces_of_multilayer_networks_(Choromanska_et_al.)&diff=27238the loss surfaces of multilayer networks (Choromanska et al.)2015-12-13T00:05:47Z<p>Alcateri: Adding a "Prior Work" section, along with showing the references.</p>
<hr />
<div>= Overview =<br />
<br />
The paper ''Loss Surfaces of Multilayer Networks'' by Choromanska et al. is situated in the context of determining critical points (i.e. minima, maxima, or saddle points) of loss surfaces of deep multilayer network models, such as feedforward perceptrons.<br />
<br />
The authors present a model of multilayer rectified linear units (ReLUs), and show that it may be expressed as a polynomial function of the parameter matrices in the network, with a polynomial degree equal to the number of layers. The <span>ReLu</span> units produce a piecewise, continuous polynomial, with monomials that are nonzero or zero at the boundaries between pieces. With this model, they study the distribution of critical points of the loss polynomial, providing an analysis with results from random matrix theory applied to spherical spin glasses.<br />
<br />
The 3 key findings of this work are the following:<br />
<br />
* For large-size networks, most local minima are equivalent and yield similar performance on a test set.<br />
* The probability of finding a ''bad'' local minimum (i.e. one with a large value in terms of the loss function) may be large for small-size networks, but decreases quickly with network size.<br />
* Recovering the global minimum of the loss function on the training data is not useful in practice and may lead to overfitting.<br />
<br />
Many theoretical results are reported, which will not be exhaustively covered here. However, a high-level overview of proof techniques will be given, followed by a summary of the experimental results.<br />
<br />
= Prior Work =<br />
<br />
Earlier work has shown, for high-dimensional random Gaussian error functions, that critical points with error much higher than the global minimum are very likely to be saddle points (e.g. <ref>Bray, A. J., & Dean, D. S. (2007). Statistics of critical points of Gaussian fields on large-dimensional spaces. Physical review letters, 98(15), 150201.</ref>). Furthermore, all local minima are likely to be very close in functional value to the global minimum. Dauphin et al. <ref> Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). [http://papers.nips.cc/paper/5486-identifying-and-attacking-the-saddle-point-problem-in-high-dimensional-non-convex-optimization.pdf Identifying and attacking the saddle point problem in high-dimensional non-convex optimization.] In Advances in Neural Information Processing Systems (pp. 2933-2941). </ref> empirically show that the cost functions of neural networks behave similarly to Gaussian error functions in high-dimensional spaces, but no theoretical justification is provided. This is one of the main contributions of this paper.<br />
<br />
In <ref>Auffinger, A., & Arous, G. B. (2013). Complexity of random smooth functions on the high-dimensional sphere. The Annals of Probability, 41(6), 4214-4247.</ref>, an asymptotic evaluation of the complexity of the spherical spin-glass model from condensed matter physics is provided. The authors found that the critical values with low Hamiltonian values have a layered structure that behaves like a Gaussian process. This work shows, under the assumptions listed in the overview, that the objective function used by a neural network is analogous to the Hamiltonian of the spin-glass problem. This means that they exhibit similar behaviour. This is not the first attempt at connecting the spin-glass problem with neural networks but none had attempted to optimize the neural network objective using the theory developed for the spin-glass problem. Thus, this paper is also novel in that respect. <br />
<br />
= Theoretical Analysis =<br />
<br />
Consider a simple fully-connected feed-forward deep network <math>\mathcal{N}</math> with a single output for a binary classification task. The authors use the convention that <math>(H-1)</math> denotes the number of hidden layers in the network (the input layer is the <math>0^{\text{th}}</math> layer and the output layer is the <math>H^{\text{th}}</math> layer). The input <math>X</math> is a vector with <math>d</math> elements, assumed to be random. The variable <math>n_i</math> denotes the number of units in the <math>i^{\text{th}}</math> layer (due to the network restrictions, <math>n_0 = d</math> and <math>n_H = 1</math>). Finally, <math>W_i</math> is the matrix of weights between the <math>(i - 1)^{\text{th}}</math> and <math>i^{\text{th}}</math> layers of the network and <math>\sigma(x) = \max(0,x)</math> is the activation function. For a random input <math>X</math>, the random network output <math>Y</math> is <math>Y = q\sigma(W_H^{\top}\sigma(W_{H-1}^{\top}\dots\sigma(W_1^{\top}X)\dots))</math>, where <math>q</math> is a normalization factor.<br />
<br />
The key assumption in the theoretical work is the following: for <span>ReLu</span> activation functions <math>\sigma(x)</math> for a random variable <math>x</math>, the output can be seen as being equal to <math>\delta \cdot x</math>, where <math>x</math> is a (not necessarily random) nonzero variable and <math>\delta</math> is a ''new'' random variable that is identically equal to either 0 or 1. With this in mind, the output of the network can be re-expressed as: <math>Y = q\sum_{i=1}^{n_0}X_{i}\sum_{j = 1}^{\gamma} A_{i,j}\prod_{k = 1}^{H}w_{i,j}^{(k)},</math><br />
<br />
where <math>A_{i,j}</math> is a random variable equal to 0 or 1, denoting a path <math>(i,j)</math> to be active (<math>A_{i,j} = 1</math>) or not (<math>A_{i,j} = 0</math>). In this expression, the first summation over <math>i</math> is over the elements of the network input vector, and the second summation over <math>j</math> is over all ''paths'' from <math>X_i</math> to the output. The upper index on this second summation is <math>\gamma = n_1 n_2 \dots n_H</math>, the number of possible paths. The term <math>w_{i,j}^{(k)}</math> refers to the value of the parameter matrix entry in the layer that corresponds to the hidden vector element that produced the path (i.e. the <math>k^{\text{th}}</math> segment of the path indexed with <math>(i,j)</math>); hence there are <math>H</math> factors <math>w_{i,j}^{(k)}</math> in each path's product.<br />
<br />
From this equation, it can be seen that the output of the <span>ReLu</span> network is polynomial in the weight matrix parameters, and the treatment of <math>A_{i,j}</math> as a random indicator variable allows connections to be made with spin glass models.<br />
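The path-sum identity can be checked numerically on a tiny network with <math>H = 2</math>. This is a sketch: the weights and input are random draws, and the indicators <math>A_{i,j}</math> are read off from which ReLUs fired on the forward pass.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda a: np.maximum(0.0, a)

d, n1, q = 4, 3, 1.0                    # input size, hidden size, normalizer
x = rng.normal(size=d)
W1 = rng.normal(size=(d, n1))           # input -> hidden weights
W2 = rng.normal(size=(n1, 1))           # hidden -> output weights

# Standard forward pass through the two ReLU layers.
h = relu(W1.T @ x)
y = q * relu(W2.T @ h).item()

# Path-sum form: Y = q * sum_i sum_j A_ij * X_i * prod_k w^(k)_ij,
# where A_ij = 1 iff every ReLU along path (i, j) fired.
a_hid = (W1.T @ x > 0).astype(float)            # hidden-unit indicators
a_out = 1.0 if (W2.T @ h).item() > 0 else 0.0   # output-unit indicator
y_paths = q * sum(
    a_out * a_hid[j] * x[i] * W1[i, j] * W2[j, 0]
    for i in range(d) for j in range(n1)
)

assert np.isclose(y, y_paths)
```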
<br />
The remainder of the theoretical analysis proceeds as follows:<br />
<br />
<ul><br />
<li><p>The input vector <math>X</math> and all <math>\{A_{i,j}\}</math> are assumed to be random variables, where <math>A_{i,j}</math> is a Bernoulli random variable and all input elements of <math>X</math> are independent.</p></li><br />
<li><p>One further critical assumption is the spherical constraint; all parameter weights <math>w_i</math> (elements of the parameter matrices) satisfy a spherical bound:</p><br />
<p><math>\frac{1}{\Lambda} \sum_{i=1}^{\Lambda} w_i^2 = C</math></p><br />
<p>for some <math>C > 0</math> where <math>\Lambda</math> is the number of parameters.</p></li><br />
<li><p>These assumptions allow the network output to be modeled as a ''spherical spin glass model'', which is a physical model for magnetic dipoles in ferromagnetic materials (a dipole has a magnetization state that is a binary random variable)</p></li><br />
<li><p>Using this assumption, the work by Auffinger et al. (2010) in the field of random matrices and spin glasses is then used to relate the energy states of system Hamiltonians of spin glass models to the critical points of the neural network loss function.</p></li><br />
<li><p>The analysis shows that the critical points of the loss function correspond to different energy bands in the spin glass model; as in a physical system, higher energy states are less probable; while the number of states is infinite, the probability of the system appearing in that state vanishes.</p></li><br />
<li><p>The energy barrier <math>E_{\infty}</math> stems from this analysis, and is given by</p><br />
<p><math>E_{\infty} = E_{\infty}(H) = 2\sqrt{\frac{H-1}{H}}.</math> Auffinger et al. show that critical values of the loss function must correspond to energies below <math>-\Lambda E_{\infty}</math> if their critical band index (i.e. energy index) is finite.</p></li></ul><br />
<br />
= Experiments =<br />
<br />
The numerical experiments conducted were to verify the theoretical claims of the distribution of critical points around the energy bound <math>E_{\infty}</math>, as well as to correlate the testing and training loss for different numbers of parameters <math>(\Lambda)</math> in the models.<br />
<br />
== MNIST Experiments ==<br />
<br />
<span>ReLu</span> neural networks with a single layer and increasing <math>\Lambda \in<br />
\{25,50,100,250,500 \}</math> were trained for multiclass classification on a scaled-down version of the MNIST digit dataset, where each image was downsampled to <math>10<br />
\times 10</math> pixels. For each value of <math>\Lambda</math>, 200 epochs of SGD with a decaying learning rate were used to optimize the parameters in the network. The optimization experiments were performed 1000 times with different initial values for the weight parameters drawn uniformly randomly from <math>[-1,1]</math>.<br />
<br />
= Results =<br />
<br />
To evaluate the distribution of the energy states occupied by each critical point (i.e. solution) of the loss function, the eigenvalues of the Hessian of the loss function were computed at the parameter values obtained after the optimization procedure completed. The distribution of the (normalized) index of the energy states is shown below in Fig 1. It can be seen that for all models with different numbers of parameters, the occupied energy states are the low energy bands.<br />
<center><br />
[[File:index_dist.png | frame | center |Fig 1. Distribution of normalized indices of energy states as computed from the system Hamiltonian at the final values of the parameters after the SGD optimization procedure completed.]]<br />
</center><br />
<br />
The final values of the loss function in these experiments are also shown in the histograms in Fig 2. Interestingly, the variance in the loss decreases with increasing numbers of parameters, despite the fact that the spread in the energy state (Fig. 1) increases. This shows that despite the fact that local minima are more prevalent for models with many parameters, there is no appreciable difference in the loss function at these minima: the minima are essentially all equally good in terms of minimizing the objective cost.<br />
<center><br />
[[File:loss_distribution.png | frame | center | Fig 2. Empirical distribution of the values of the loss function over the course of 1000 experiments with different numbers of parameters (Lambda). Each experimental run used a different random initialization of the parameter weights.]]<br />
</center><br />
<br />
Finally, a scatter plot of the training vs testing error for each model is shown in Fig. 3. It can be seen that the correlation between the two errors decreases as the number of parameters increases, suggesting that obtaining a global minimum would not necessarily produce better testing results (and hence still would have a sizeable generalization error).<br />
<center><br />
[[File:train_test_corr.png | frame | center |Fig 3. Scatter plots showing the correlation between training and testing error for the MNIST dataset experiments. For few parameters in the network, there is a very strong correlation between the two errors. However, for networks with many more parameters, the correlations decrease in strength, suggesting that obtaining the optimal loss (critical point) in the training phase does not improve the generalization error. ]]<br />
</center><br />
<br />
<br />
<br />
=Discussion=<br />
==Power of Deep Neural Nets from the No Free Lunch Point View==<br />
One speculative explanation for why deep neural networks have a lower probability of bad local minima comes from Woodward's <ref>Woodward, John R. "GA or GP? that is not the question." Evolutionary Computation, 2003. CEC'03. The 2003 Congress on. Vol. 2. IEEE, 2003.</ref> paper on why the No Free Lunch Theorem (NFLT) does not hold. The NFLT states that if one cannot incorporate domain-specific knowledge into a search or optimization algorithm, one cannot guarantee that it will outperform (in terms of convergence speed) any other search/optimization algorithm; this implies that there can be no universal search algorithm that is best. <br />
<br />
Woodward's argument is that whether one uses a Genetic Algorithm or Genetic Programming does not matter; what matters is the mapping from explored candidates to solutions. Consider the task of [https://en.wikipedia.org/wiki/Symbolic_regression Symbolic Regression] with two algorithms <math>P</math> and <math>Q</math>, and let <math>P_{s} = \{+ba, a, b, +aa, +bb, +ab\}</math> and <math>Q_{s} = \{+ab, +ba, a, b, +aa, +bb\}</math> be the time-ordered solutions explored by <math>P</math> and <math>Q</math>. If the problem at hand requires the solution <math>+ab</math>, then both <math>P</math> and <math>Q</math> discover it on their first try, since addition is commutative and <math>+ba</math> computes the same function as <math>+ab</math>. For any other required solution, however, <math>P</math> always outperforms <math>Q</math>.<br />
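The argument can be made concrete in a few lines of Python (a hypothetical sketch; `canonical` encodes the fact that + is commutative, so +ab and +ba denote the same function):

```python
def canonical(expr):
    """Map an expression to the function it computes: '+' is commutative,
    so '+ab' and '+ba' denote the same function."""
    return "+" + "".join(sorted(expr[1:])) if expr.startswith("+") else expr

# Time-ordered exploration sequences of the two hypothetical algorithms.
P = ["+ba", "a", "b", "+aa", "+bb", "+ab"]
Q = ["+ab", "+ba", "a", "b", "+aa", "+bb"]

def steps_to_find(order, target):
    """Number of candidates evaluated before a semantic match to target."""
    goal = canonical(target)
    return next(i for i, e in enumerate(order, 1) if canonical(e) == goal)
```

Running `steps_to_find` over every target confirms that P never needs more evaluations than Q, and is strictly faster on some targets, even though both explore the same space.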
<br />
From the above we might conclude that for deep neural networks, the larger or deeper the network, the more likely its connections can realize a function that minimizes the loss quickly (all networks here were trained for 200 epochs; in theory a single-layer MLP can approximate any function, but in practice that could take an impractically long time), thus reducing the chance of bad local minima (much as a more complex function has a better chance of fitting the data than a simpler one).<br />
<br />
= References =<br />
<references/></div>Alcaterihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_the_difficulty_of_training_recurrent_neural_networks&diff=27040on the difficulty of training recurrent neural networks2015-12-02T20:13:25Z<p>Alcateri: /* Background */</p>
<hr />
<div>= Introduction =<br />
<br />
Training Recurrent Neural Networks (RNNs) is difficult; one of the most prominent problems in training RNNs has been the vanishing and exploding gradient problem <ref name="yoshua1993">Yoshua Bengio, Paolo Frasconi, and Patrice Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pages<br />
1183–1188. IEEE, 1993.</ref>, which prevents neural networks from learning and fitting the data. In this paper the authors propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem.<br />
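The norm-clipping strategy rescales the gradient whenever its norm exceeds a threshold, leaving its direction unchanged (the threshold value itself is a hyperparameter; the one used below is an arbitrary choice):

```python
import numpy as np

def clip_gradient_norm(grad, threshold):
    """Gradient norm clipping: if ||g|| > threshold,
    rescale g to threshold * g / ||g||; otherwise leave it unchanged."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad
```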
<br />
= Background =<br />
<br />
[[Image:rnn_2.png|frame|center|400px|alt=| Recurrent Neural Network unrolled in time <ref name="pascanu">Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.</ref>]]<br />
A generic recurrent neural network has the form:<br />
<br />
<math>\mathbf{x}_{t} = F(\mathbf{x}_{t - 1}, \mathbf{u}_{t}, \theta)</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{x}_{t}</math> is the state at time <math>t</math></span><br />
* <span><math>\mathbf{u}_{t}</math> is the input at time <math>t</math></span><br />
* <span><math>\theta\,</math> are the parameters</span><br />
* <span><math>F()\,</math> is the function that represents a neuron</span><br />
<br />
In the theoretical sections the authors make use of a specific parameterization:<br />
<br />
<math>\mathbf{x}_{t} = \mathbf{W}_{rec} \sigma(\mathbf{x}_{t - 1}) + \mathbf{W}_{in} \mathbf{u}_{t} + b</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{W}_{rec}</math> is the RNN weights matrix</span><br />
* <span><math>\sigma()\,</math> is an element wise function</span><br />
* <span><math>b\,</math> is the bias</span><br />
* <span><math>\mathbf{W}_{in}</math> is the input weights matrix</span><br />
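This parameterization can be written out directly (a minimal sketch; the dimensions and the choice of tanh for <math>\sigma</math> are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

def rnn_forward(U, x0, W_rec, W_in, b, sigma=np.tanh):
    """Iterate x_t = W_rec sigma(x_{t-1}) + W_in u_t + b over the inputs U
    (one row per time step) and return the stacked states."""
    states, x = [], x0
    for u in U:
        x = W_rec @ sigma(x) + W_in @ u + b
        states.append(x)
    return np.stack(states)
```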
<br />
The following are the gradient equations for the Back-Propagation Through Time (BPTT) algorithm; the authors rewrote the equations in order to highlight the exploding gradients problem:<br />
<br />
<math>\frac{\partial \varepsilon}{\partial \theta} = <br />
\sum_{1 \leq t \leq T} \frac{\partial \varepsilon_t}{\partial \theta}</math><br />
<br />
<math>\frac{\partial \varepsilon_{t}}{\partial \theta} = <br />
\sum_{1 \leq k \leq t} <br />
\left(<br />
\frac{\partial \varepsilon_{t}}{\partial x_{t}}<br />
\frac{\partial x_{t}}{\partial x_{k}}<br />
\frac{\partial^{+} x_{k}}{\partial \theta}<br />
\right)</math><br />
<br />
<math>\frac{\partial x_{t}}{\partial x_{k}} =<br />
\prod_{k < i \leq t} \frac{\partial x_{i}}{\partial x_{i - 1}} =<br />
\prod_{k < i \leq t} <br />
\mathbf{W}^{T}_{rec} \textit{diag}(\sigma^{\prime}(x_{i - 1}))</math><br />
<br />
Where:<br />
<br />
* <span><math>\varepsilon_{t}</math> is the error obtained from output at time <math>t</math></span><br />
* <span><math>\frac{\partial^{+} \mathbf{x}_{k}}{\partial \theta}</math> is the immediate partial derivative of state <math>\mathbf{x}_{k}</math></span>. For the parametrization above, <math>\frac{\partial^+ \mathbf{x}_k}{\partial \mathbf{W}_{rec}} = \sigma(\mathbf{x}_{k-1})</math>.<br />
<br />
The authors of this paper also distinguish between ''long-term'' and ''short-term'' contributions to the gradient with respect to <math>\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k}</math>. The contribution is ''long-term'' if <math>k \ll t</math>, and ''short-term'' otherwise.<br />
<br />
== Exploding and Vanishing Gradients ==<br />
=== The Mechanics ===<br />
<br />
It is known that <math> |\sigma^{\prime}(x)| </math> is bounded; let <math>\gamma \in \mathbb{R}</math> be such that <math>\left|\left|diag(\sigma^{\prime}(x_k))\right|\right| \leq \gamma</math> for all <math>k</math>.<br />
<br />
The paper first proves that <math> \lambda_1 < \frac{1}{\gamma} </math>, where <math> \lambda_1 </math> is the largest singular value of <math> \mathbf{W}_{rec} </math>, is a sufficient condition for the vanishing gradient problem to occur. The Jacobian matrix <math> \frac{\partial x_{k+1}}{\partial x_k} </math> is given by <math> \mathbf{W}_{rec}^{T}diag(\sigma^{\prime}(x_k)) </math>, and the 2-norm of this Jacobian is bounded by the product of the norms of the two matrices. This leads to <math> \forall k, \left|\left|\frac{\partial{x_{k+1}}}{\partial x_k}\right|\right| \leq ||\mathbf{W}_{rec}^T||\,||diag(\sigma^{\prime}(x_k))|| < \frac{1}{\gamma}\gamma = 1</math><br />
<br />
Let <math>\eta \in \mathbb{R}</math> be such that <math>\forall k, ||\frac{\partial {x_{k+1}}}{\partial x_k}|| \leq \eta < 1</math>. By induction over <math>i</math>, we can show that <math>||\frac{\partial \varepsilon_t}{\partial x_t}(\prod_{i=k}^{t-1}{\frac{\partial x_{i+1}}{\partial x_i}})|| \leq \eta^{t-k}||\frac{\partial \varepsilon_t}{\partial x_t}||</math>. Since <math> \eta < 1 </math>, this bound shrinks exponentially fast as <math> t-k </math> grows, so the long-term contributions to the gradient vanish.<br />
<br />
Inverting this argument shows that a largest singular value <math>\lambda_1 </math> greater than <math> \frac{1}{\gamma}</math> is a ''necessary'' condition for exploding gradients (otherwise the long-term components would vanish instead of exploding).<br />
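Both regimes can be checked numerically in the simplest possible setting: a one-unit RNN with <math>\sigma = \tanh</math> (so <math>\gamma = 1</math> and <math>\lambda_1 = |w_{rec}|</math>), no input and no bias. The sketch below (toy values, not from the paper) multiplies the one-dimensional Jacobians <math>w_{rec}\,\sigma^{\prime}(x_{i-1})</math> along a sequence; for the exploding case the state is started at the unstable fixed point <math>x = 0</math>, where <math>\sigma^{\prime} = 1</math>:<br />

```python
import math

def jacobian_product(w_rec, x0, steps):
    """prod over i of w_rec * sigma'(x_{i-1}) for the one-unit RNN x_t = w_rec * tanh(x_{t-1})."""
    x, prod = x0, 1.0
    for _ in range(steps):
        prod *= w_rec * (1.0 - math.tanh(x) ** 2)  # dx_t / dx_{t-1}
        x = w_rec * math.tanh(x)
    return prod

# lambda_1 = |w_rec| < 1/gamma = 1: the product of Jacobians vanishes.
vanishing = jacobian_product(w_rec=0.9, x0=0.5, steps=50)
# lambda_1 > 1: pinned at the unstable fixed point x = 0 (sigma' = 1),
# every factor equals w_rec and the product grows as 1.2^50.
exploding = jacobian_product(w_rec=1.2, x0=0.0, steps=50)
```

With <math>|w_{rec}| < 1</math> the product decays like <math>\eta^{t-k}</math>, matching the bound above; held at the origin with <math>w_{rec} = 1.2</math>, it grows geometrically.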
<br />
=== From a dynamical systems perspective ===<br />
<br />
Drawing on a dynamical systems perspective similar to <ref name="yoshua1993"></ref><ref name="doya1993">Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on Neural Networks, 1:75–80, 1993.</ref>, dynamical systems theory states that as “<math>\theta</math> changes the asymptotic behaviour changes smoothly almost everywhere except for certain crucial points where drastic changes occur” <ref name="pascanu"></ref>; crossing these bifurcation boundaries has the potential to cause gradients to explode <ref name="doya1993"></ref>.<br />
<br />
The authors of this paper argue, however, that crossing a bifurcation boundary does not guarantee a sudden change in the gradients if the model state is not in the basin of an attractor. If, on the other hand, the model is in the basin of an attractor, crossing the boundary between basins will cause the gradients to explode.<br />
<br />
[[Image:dynamic_perspective.png|frame|center|400px|alt=|Bifurcation diagram of Single Hidden Unit RNN <ref name="pascanu"></ref>]]<br />
<br />
The figure above depicts this argument. The x-axis is the bias parameter <math>b</math> and the y-axis is the asymptotic state <math>x_{\infty}</math>; the plotted line traces the movement of the final point attractor <math>x_{\infty}</math> as <math>b</math> changes. The figure shows the presence of two attractors, one emerging at <math>b_1</math> and another disappearing at <math>b_2</math>. The boundary between the two basins of attraction is denoted by the dashed line. The filled blue circles illustrate Doya’s (1993) original hypothesis of exploding gradients, where a small change in <math>\theta</math> '''could''' cause <math>x</math> to change suddenly, whereas the unfilled green circles represent Pascanu’s (2013) extension of Doya’s hypothesis: if the model is in the boundary region at time <math>0</math>, a small change in <math>\theta</math> results in a sudden large change in <math>x_{t}</math>.<br />
<br />
=== From a geometric perspective ===<br />
<br />
Aside from the dynamical systems perspective on exploding and vanishing gradients, the authors also considered a geometric perspective, using a simple one-hidden-unit RNN.<br />
<br />
[[Image:geometric_perspective.png|frame|center|400px|alt=|Error Loss surface of single hidden unit RNN <ref name="pascanu"></ref>]]<br />
<br />
Reusing the RNN state equation the authors defined:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
By assuming no input, with <math>b = 0</math> and initial state <math>x_{0}</math>, the equation simplifies to:<br />
<br />
<math>x_{t} = W_{rec}^{t} x_{0}</math><br />
<br />
Differentiating the above equation to first and second order with respect to the (here scalar) recurrent weight gives:<br />
<br />
<math>\frac{\partial x_{t}}{\partial W_{rec}} = t W_{rec}^{t - 1} x_{0}</math><br />
<br />
<math>\frac{\partial^{2} x_{t}}{\partial W_{rec}^{2}} = t (t - 1) W_{rec}^{t - 2} x_{0}</math><br />
<br />
This implies that if the first-order derivative explodes, so does the second-order derivative. Consequently, when Stochastic Gradient Descent (SGD) approaches such a wall in the error surface and attempts to step into it, it is deflected away, possibly hindering the learning process (see the figure above).<br />
<br />
= Dealing with Exploding and Vanishing Gradient =<br />
<br />
== Related Work ==<br />
<br />
* <span>'''L1/L2 Regularization''': Helps with exploding gradients, but limits the RNN to a single point attractor at the origin, and prevents the model from learning generative models or exhibiting long-term memory traces.</span><br />
* <span>'''Teacher Forcing''': Proposed by <ref name="doya1993"></ref>, this forces the model across a bifurcation boundary when it does not exhibit the desired asymptotic behaviour. It assumes the user knows what the target behaviour looks like, or how to initialize the model so as to reduce exploding gradients.</span><br />
* <span>'''LSTM''': The Long Short-Term Memory architecture <ref name="graves2009">Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jurgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009.</ref><ref name="Hochreiter">Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.</ref> is an attempt to deal with the vanishing gradient problem by using a linear unit that feeds back to itself with a weight of <math>1</math>. This solution, however, does not address the exploding gradient problem.</span><br />
* <span>'''Hessian-Free optimizer with structural damping''': Proposed by <ref>Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.</ref> to address the vanishing and exploding gradient problems. <ref name="pascanu"></ref> reasons that this approach helps with the vanishing gradient problem because the high dimensionality of the space gives rise to a high probability that the long-term components are orthogonal to the short-term components; for the exploding gradient, the curvature of the error surface is taken into account.</span><br />
* <span>'''Echo State Networks''': <ref>Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.</ref> avoid the exploding and vanishing gradient problems by not learning the input and recurrent weights at all; they are instead drawn from hand-crafted distributions that prevent information from getting lost, since the spectral radius of the recurrent weight matrix is usually kept smaller than 1.</span><br />
<br />
== Scaling Down the Gradients ==<br />
<br />
[[Image:gradient_clipping.png|frame|center|400px]]<br />
<br />
The intuition behind this gradient clipping algorithm is simple: compute the norm of the gradients; if it is larger than a set threshold, rescale the gradients by the constant given by the threshold divided by the gradient norm. <ref name="pascanu"></ref> suggests setting the threshold between half and ten times the average gradient norm.<br />
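In code, the clipping step amounts to the following (a minimal sketch operating on a flat list of gradient components; the threshold value is arbitrary):<br />

```python
import math

def clip_gradients(grads, threshold):
    """If ||g|| > threshold, rescale g by threshold / ||g||; otherwise leave it alone."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > threshold:
        scale = threshold / norm
        grads = [g * scale for g in grads]
    return grads

clipped = clip_gradients([3.0, 4.0], threshold=1.0)    # norm 5 -> rescaled to norm 1
untouched = clip_gradients([0.3, 0.4], threshold=1.0)  # norm 0.5, left unchanged
```

Note that clipping rescales the whole gradient vector, preserving its direction, rather than clipping each component independently.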
<br />
== Vanishing Gradient Regularization ==<br />
<br />
The vanishing gradient regularizer is as follows:<br />
<br />
<math>\Omega <br />
= \sum_{k} \Omega_{k} <br />
= \sum_{k} <br />
\left( <br />
\frac{<br />
\| <br />
\frac{\partial \varepsilon}{\partial x_{k + 1}} <br />
\frac{\partial x_{k + 1}}{\partial x_{k}}<br />
\|<br />
}<br />
{<br />
\|<br />
\frac{\partial \varepsilon}{\partial x_{k + 1}}<br />
\| <br />
} - 1<br />
\right)^{2}</math><br />
<br />
The vanishing gradient problem occurs when, at some time <math>t</math>, the inputs <math>u</math> are irrelevant and noisy and the network starts to learn to ignore them; this is undesirable, as taken to the extreme the model ends up not learning anything. The authors note that the sensitivity to the inputs <math>u_{k} \dots u_{t}</math> can be increased by increasing the norm of <math>\frac{\partial x_t}{\partial x_{k}}</math>. Achieving this through the error term alone would require the error to remain large, which would prevent the model from converging; the authors therefore argue that a regularizer is the more natural choice. The regularizer is a soft constraint that forces the Jacobian matrices <math>\frac{\partial x_{k + 1}}{\partial x_{k}}</math> to preserve norm in the direction of the error <math>\frac{\partial \varepsilon}{\partial x_{k + 1}}</math>.<br />
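As a sanity check, the regularizer can be evaluated for a one-unit RNN, where every quantity is a scalar, the error direction cancels in the ratio, and <math>\Omega_{k} = (|w_{rec}\,\sigma^{\prime}(x_{k})| - 1)^{2}</math> (a toy sketch with arbitrary weights and inputs, not an experiment from the paper):<br />

```python
import math

def omega_regularizer(w_rec, inputs, x0=0.0):
    """Omega for the one-unit RNN x_t = w_rec * tanh(x_{t-1}) + u_t.

    In the scalar case the error direction cancels in the ratio, so
    Omega_k = (|w_rec * sigma'(x_k)| - 1)^2."""
    xs = [x0]
    for u in inputs:
        xs.append(w_rec * math.tanh(xs[-1]) + u)
    omega = 0.0
    for k in range(len(xs) - 1):
        jac = w_rec * (1.0 - math.tanh(xs[k]) ** 2)  # dx_{k+1} / dx_k
        omega += (abs(jac) - 1.0) ** 2
    return omega

# Norm-preserving Jacobians (|jac| = 1) incur no penalty; shrinking ones do.
no_penalty = omega_regularizer(w_rec=1.0, inputs=[0.0, 0.0, 0.0])
penalized = omega_regularizer(w_rec=0.5, inputs=[0.2, -0.1])
```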
<br />
= Experimental Results =<br />
<br />
== The Temporal Order Problem ==<br />
<br />
The authors use the temporal order problem as the prototypical pathological problem for validating the clipping and regularization schemes devised. The temporal order problem involves generating a long sequence of discrete symbols in which an <math>A</math> or a <math>B</math> symbol is placed once near the beginning and once near the middle of the sequence. The task is to correctly classify the order of the <math>A</math>/<math>B</math> pair at the end of the sequence.<br />
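A data generator for this task might look as follows (a sketch only: the distractor alphabet, the placement windows, and the sequence length are assumptions, since the summary does not fix them):<br />

```python
import random

def temporal_order_example(length, rng=random):
    """One sequence for the temporal order problem: distractor symbols everywhere,
    with an 'A' or 'B' inserted near the beginning and near the middle; the label
    is the ordered pair of the two inserted symbols."""
    seq = [rng.choice("cdef") for _ in range(length)]   # distractor alphabet (assumed)
    first_pos = rng.randrange(0, length // 10 + 1)      # placement windows (assumed)
    mid_pos = rng.randrange(length // 2, length // 2 + length // 10 + 1)
    first, second = rng.choice("AB"), rng.choice("AB")
    seq[first_pos], seq[mid_pos] = first, second
    return "".join(seq), first + second                 # label in {AA, AB, BA, BB}

seq, label = temporal_order_example(20)
```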
<br />
Three different RNN initializations were used for the experiment:<br />
<br />
* <span>'''sigmoid unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
* <span>'''basic tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.1)</math></span><br />
* <span>'''smart tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
<br />
For each of the three networks, three different optimizer configurations were used:<br />
<br />
* <span>'''MSGD''': Mini-batch Stochastic Gradient Descent</span><br />
* <span>'''MSGD-C''': MSGD with Gradient Clipping</span><br />
* <span>'''MSGD-CR''': MSGD-C with Regularization</span><br />
<br />
Additional model parameters include:<br />
<br />
* <span>Spectral Radius of 0.95</span><br />
* <span><math>b = b_{out} = 0</math></span><br />
* <span>50 hidden neurons</span><br />
* <span>constant learning rate of 0.01</span><br />
* <span>clipping threshold of 1 (only for MSGD-C and MSGD-CR)</span><br />
* <span>regularization weight of 4 (MSGD-CR)</span><br />
<br />
The experiment was performed 5 times. From the figure below we can observe the importance of gradient clipping and the regularizer: in all cases the combination of the two methods yielded the best results, regardless of which unit type was used. Furthermore, this experiment provides empirical evidence that exploding gradients correlate with tasks that require long memory traces: as the sequence length of the problem increases, clipping and regularization become more important.<br />
<br />
[[Image:experimental_results.png|frame|center|400px]]<br />
<br />
== Other Pathological Problems ==<br />
<br />
The authors repeated the other pathological problems from <ref name="Hochreiter"></ref>. The results are listed below:<br />
<br />
[[Image:experimental_results_2.png|image]]<br />
<br />
The authors did not discuss the experimental results in detail here.<br />
<br />
= Summary =<br />
<br />
The paper explores two different perspectives, a dynamical systems approach and a geometric approach, in explaining the exploding and vanishing gradient problems in training RNNs. The authors devise methods to mitigate the corresponding problems by introducing gradient clipping and a vanishing-gradient regularizer; their experimental results show that, in every experiment except the one on the Penn Treebank dataset, clipping together with the regularizer bested the previous state of the art for RNNs.</div>Alcaterihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_the_difficulty_of_training_recurrent_neural_networks&diff=27039on the difficulty of training recurrent neural networks2015-12-02T20:01:33Z<p>Alcateri: /* Background */</p>
Alcaterihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26843learning Fast Approximations of Sparse Coding2015-11-23T17:40:13Z<p>Alcateri: /* Coordinate Descent */ Added CoD algorithm and comment about similarity to ISTA</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the desire to produce accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which utilizes these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \in \mathbb{R}^n </math>, we seek a new representation <math> Z \in \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \in \mathbb{R}^{n \times m} </math>, the matrix of normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction that most reduces the reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> \, L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = \mbox{sign}(V_i)\max(|V_i| - \theta_i, 0) </math>, where <math> \theta \in \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
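<br />
The update (**) can be sketched as follows (a hypothetical NumPy implementation for a single input, not the authors' code; the dictionary is assumed to have normalized columns):<br />

```python
import numpy as np

def ista(X, W_d, alpha, n_iter=100):
    """Sketch of ISTA: iterate Z <- h_theta(W_e X + S Z) from Z = 0."""
    m = W_d.shape[1]
    L = np.linalg.norm(W_d.T @ W_d, 2)       # upper bound on eigenvalues of W_d^T W_d
    W_e = W_d.T / L                          # filter matrix
    S = np.eye(m) - (W_d.T @ W_d) / L        # mutual-inhibition matrix
    theta = alpha / L                        # typical threshold setting
    B = W_e @ X
    Z = np.zeros(m)
    for _ in range(n_iter):
        V = B + S @ Z                        # = W_e X + S Z
        Z = np.sign(V) * np.maximum(np.abs(V) - theta, 0.0)   # shrinkage h_theta
    return Z
```

For an orthonormal dictionary the inhibition matrix vanishes and a single step already recovers the soft-thresholded code.<br />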
<br />
=== Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda \left(h_{\theta}(Z^{(k-1)}) - h_{\theta}(Z^{(k - 2)})\right) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term reflects the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated in each iteration. Coordinate Descent (CoD) adopts this approach, and as a result yields a superior approximation to the parallel ISTA methods in the same order of time. Prior to this work, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
The CoD algorithm is presented below:<br />
<br />
<blockquote><br />
<math>\textbf{function} \, \textbf{CoD}\left(X, Z, W_d, S, \alpha\right)</math><br />
: <math>\textbf{Require:} \,S = I - W_d^T W_d</math><br />
: <math>\textbf{Initialize:} \,Z = 0; B = W_d^TX</math><br />
: <math> \textbf{repeat}</math><br />
:: <math>\bar{Z} = h_{\alpha}\left(B\right)</math><br />
:: <math> \,k = \mbox{ index of largest component of} \left|Z - \bar{Z}\right|</math><br />
:: <math> \forall j \in \left[1, m\right]: B_j = B_j + S_{jk}\left(\bar{Z}_k - Z_k\right)</math><br />
:: <math> Z_k = \bar{Z}_k</math><br />
: <math>\textbf{until}\,\text{change in}\,Z\,\text{is below a threshold}</math> <br />
: <math> Z = h_{\alpha}\left(B\right)</math><br />
<math> \textbf{end} \, \textbf{function} </math><br />
</blockquote><br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. This algorithm has a similar feedback concept to ISTA, but it can be expressed as a linear feedback operation with a very sparse matrix (since only one component is updated at a time). Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
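<br />
The pseudocode translates almost line-for-line into code. A minimal sketch (hypothetical NumPy code; the convergence test is replaced by an exact fixed-point check for simplicity):<br />

```python
import numpy as np

def cod(X, W_d, alpha, n_iter=50):
    """Sketch of Coordinate Descent (CoD) for sparse code approximation."""
    m = W_d.shape[1]
    S = np.eye(m) - W_d.T @ W_d
    B = W_d.T @ X
    Z = np.zeros(m)
    shrink = lambda v: np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)
    for _ in range(n_iter):
        Z_bar = shrink(B)
        k = np.argmax(np.abs(Z - Z_bar))   # component with the largest change
        if Z[k] == Z_bar[k]:
            break                          # no component changes: converged
        B += S[:, k] * (Z_bar[k] - Z[k])   # B_j += S_{jk}(Z_bar_k - Z_k) for all j
        Z[k] = Z_bar[k]
    return shrink(B)
```

Because only one coordinate changes per iteration, each feedback update touches a single column of <math> \, S </math>, which is the sparse-feedback property noted above.<br />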
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \in \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straightforward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration must be given to the activation function. The authors consider three such candidates: a double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
Before understanding the rationale behind this approach, we must recognize a few relevant values which are inherently fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms of parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach proposes to learn <math> \, \theta </math>, <math> \, W_e </math>, and <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times ''T''. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
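<br />
A LISTA encoder's forward pass is just the ISTA recursion with the learned parameters substituted in. A minimal sketch of the inference step (hypothetical NumPy code; training of <math> \, W_e </math>, <math> \, S </math>, and <math> \, \theta </math> by back-propagation through time is omitted):<br />

```python
import numpy as np

def lista_encode(X, W_e, S, theta, T=3):
    """Sketch of a LISTA encoder: T unrolled ISTA steps with learned parameters.

    W_e, S, theta are taken as given here; in LISTA they are fit by SGD on
    (input, optimal code) pairs rather than derived from the dictionary."""
    shrink = lambda v: np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
    B = W_e @ X          # computed once per input
    Z = shrink(B)
    for _ in range(T - 1):
        Z = shrink(B + S @ Z)   # one unrolled "time step" of the recurrent network
    return Z
```

Note that <math> \, S </math> is shared across the <math> T </math> unrolled steps, which is why the gradient computation requires back-propagation through time.<br />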
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogously to the procedure for LISTA, except for the technicality that sub-gradients are propagated, resulting from the fact that we search for the component inducing the largest update in the code ''Z''.<br />
<br />
The algorithm for LCoD can be summarized as follows:<br />
<br />
[[File:Q12.png]]<br />
<br />
= Empirical Performance =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used in assessing whether improved error-rates in code-prediction yields superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>, and the squared-error loss for the distance between the estimate and the optimal code, as computed by Coordinate Descent. <br />
<br />
Figure 1 suggests that, for a small number of iterations, LISTA notably out-performs FISTA here. Furthermore, 18 iterations of FISTA are required to achieve the error-rate produced by 1 iteration of LISTA when ''m'' = 100, and 35 iterations when ''m'' = 400. This leads the authors to conclude that LISTA is ~20 times faster than FISTA. <br />
<br />
<center><br />
[[File:LISTA2.png |frame | center |Figure 1: Comparison of LISTA and FISTA. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400 ]]<br />
</center><br />
<br />
Figure 2 indicates that, for a moderately-large number of iterations, LCoD has superior accuracy to Coordinate Descent. When the number of iterations is larger, Coordinate Descent out-performs LCoD, but this can be reversed by initializing matrices with their LCoD values prior to training. <br />
<br />
<center><br />
[[File:LCOD.png |frame | center |Figure 2: Comparison of LCoD and Coordinate Descent. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400. Open Circles indicate that matrices have been initialized with their LCoD values prior to training ]]<br />
</center><br />
<br />
== MNIST Digits ==<br />
<br />
Here, the authors examined whether these coding approximations could be effectively applied in classification tasks. Evaluation was conducted using both the entire 28x28-pixel images, to create 784-dimensional codes, and extracted 16x16-pixel patches, for codes with 256 components. In addition to the finding that increasing the number of iterations reduced classification error across all procedures, it was observed that Learned Coordinate Descent approached the optimal error rate after only 10 iterations. <br />
<br />
A complete feature vector consisted of 25 such vectors concatenated, extracted from all 16 × 16 patches shifted by 3 pixels on the input. The features were extracted for all digits using CoD with exact inference, CoD with a fixed number of iterations, and LCoD. Additionally, a version of CoD (denoted CoD') used inference with a fixed number of iterations during training of the filters, and used the same number of iterations during test (the same complexity as LCoD). A logistic regression classifier was trained on the features thereby obtained.<br />
<br />
Classification errors on the test set are shown in the following figures. While the error rate decreases with the number of iterations for all methods, the error rate of LCoD with 10 iterations is very close to the optimal (differences in error rates of less than 0.1% are insignificant on MNIST).<br />
<br />
[[File:T1.png]]<br />
<br />
MNIST results with 784-D sparse codes<br />
<br />
MNIST results with 25 256-D sparse codes extracted from 16 × 16 patches every 3 pixels<br />
<br />
<br />
[[File:T2.png]]<br />
<br />
</div>Alcaterihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=show,_Attend_and_Tell:_Neural_Image_Caption_Generation_with_Visual_Attention&diff=26693show, Attend and Tell: Neural Image Caption Generation with Visual Attention2015-11-20T04:21:38Z<p>Alcateri: Adding Related Work section</p>
<hr />
<div>= Introduction =<br />
<br />
This paper<ref><br />
Xu, Kelvin, et al. [http://arxiv.org/pdf/1502.03044v2.pdf "Show, attend and tell: Neural image caption generation with visual attention."] arXiv preprint arXiv:1502.03044 (2015).<br />
</ref> introduces an attention-based model that automatically learns to describe the content of images. It is able to focus on salient parts of the image while generating the corresponding word in the output sentence. A visualization is provided showing which part of the image was attended to in order to generate each specific word in the output. This can be used to get a sense of what is going on in the model and is especially useful for understanding the kinds of mistakes it makes. The model is tested on three datasets: Flickr8k, Flickr30k, and MS COCO.<br />
<br />
= Motivation =<br />
Caption generation, the task of compressing huge amounts of salient visual information into descriptive language, was recently improved by the combination of convolutional neural networks and recurrent neural networks. Using representations from the top layer of a convolutional net, which distill the information in an image down to the most salient objects, can lead to losing information that could be useful for richer, more descriptive captions. Retaining this information using a more low-level representation was the motivation for the current work.<br />
<br />
= Contributions = <br />
<br />
* Two attention-based image caption generators using a common framework. A "soft" deterministic attention mechanism and a "hard" stochastic mechanism.<br />
* Show how to gain insight and interpret results of this framework by visualizing "where" and "what" the attention focused on.<br />
* Quantitatively validate the usefulness of attention in caption generation with state of the art performance on three datasets (Flickr8k, Flickr30k, and MS COCO)<br />
<br />
= Related Work =<br />
<br />
Many methods proposed for caption generation are based on recurrent neural networks, inspired by successful sequence-to-sequence training with neural networks for machine translation.<ref>Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). [http://arxiv.org/pdf/1406.1078.pdf Learning phrase representations using rnn encoder-decoder for statistical machine translation.] arXiv preprint arXiv:1406.1078.</ref> Image caption generation fits the translation framework well because it translates an image into a sentence. <br />
<br />
The first attempt at using neural networks for this task was a multinomial log-bilinear model.<ref>Kiros, Ryan, Ruslan Salakhutdinov, and Rich Zemel. [http://machinelearning.wustl.edu/mlpapers/paper_files/icml2014c2_kiros14.pdf Multimodal neural language models.] Proceedings of the 31st International Conference on Machine Learning (ICML-14). 2014.</ref> This work was augmented to allow a natural way of performing ranking and generation.<ref>Kiros, R., Salakhutdinov, R., & Zemel, R. S. (2014). [http://arxiv.org/pdf/1411.2539.pdf Unifying visual-semantic embeddings with multimodal neural language models.] arXiv preprint arXiv:1411.2539.</ref> Others began to replace the feedforward neural network with a recurrent one<ref>Mao, J., Xu, W., Yang, Y., Wang, J., & Yuille, A. (2014). [http://arxiv.org/pdf/1412.6632.pdf Deep captioning with multimodal recurrent neural networks (m-rnn).] arXiv preprint arXiv:1412.6632.</ref>, and some began incorporating LSTM RNNs.<ref>Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2014). [http://arxiv.org/pdf/1411.4555.pdf? Show and tell: A neural image caption generator.] arXiv preprint arXiv:1411.4555.</ref> These works represent images as a single feature vector from the top layer of a pre-trained convolutional network. In contrast, Karpathy & Li<ref>Karpathy, A., & Fei-Fei, L. (2014). [http://arxiv.org/pdf/1412.2306.pdf Deep visual-semantic alignments for generating image descriptions.] arXiv preprint arXiv:1412.2306.</ref> use a bidirectional RNN to learn a joint embedding space for ranking and generation. This paper differs from these approaches in that it does not explicitly use object detectors but instead learns latent alignments from scratch.<br />
<br />
= Model =<br />
<br />
The model takes in a single image and generates a caption of arbitrary length. The caption is a sequence of [http://stackoverflow.com/questions/17469835/one-hot-encoding-for-machine-learning one-hot encoded words] (binary vector) from a given vocabulary.<br />
<br />
[[File:AttentionOneHotEncoding.png]]<br />
<br />
[[File:AttentionNetwork.png]]<br />
<br />
== Encoder: Convolutional Features ==<br />
<br />
Feature vectors are extracted from a convolutional neural network to use as input for the attention mechanism. The extractor produces ''L'' D-dimensional vectors, each corresponding to a part of the image.<br />
<br />
[[File:AttentionAnnotationVectors.png]]<br />
<br />
Unlike previous work, features are extracted from a lower convolutional layer instead of a fully connected layer. This allows the feature vectors to have a correspondence with portions of the 2D image.<br />
<br />
== Decoder: Long Short-Term Memory Network ==<br />
<br />
[[File:AttentionLSTM.png]]<br />
<br />
The purpose of the LSTM is to output a sequence of 1-of-K encodings represented as:<br />
<br />
<math>y={y_1,\dots,y_C},y_i\in\mathbb{R}^K</math>, where C is the length of the caption and K is the vocabulary size<br />
<br />
To generate this sequence of outputs, a set of feature vectors was extracted from the image using a convolutional neural network and represented as:<br />
<br />
<math>a={a_1,\dots,a_L},a_i\in\mathbb{R}^D</math>, where D is the dimension of the feature vector extracted by the convolutional neural network<br />
<br />
Let <math>T_{s,t} : \mathbb{R}^s \to \mathbb{R}^t </math> be a simple affine transformation, i.e. <math>\,Wx + b</math> for some projection weight matrix W and some bias vector b learned as parameters in the LSTM.<br />
<br />
The equations for the LSTM can then be simplified as:<br />
<br />
<math>\begin{pmatrix}i_t\\f_t\\o_t\\g_t\end{pmatrix}=\begin{pmatrix}\sigma\\\sigma\\\sigma\\\tanh\end{pmatrix}T_{D+m+n,n}\begin{pmatrix}Ey_{t-1}\\h_{t-1}\\\hat z_{t}\end{pmatrix}</math><br />
<br />
<math>c_t=f_t\odot c_{t-1} + i_t \odot g_t</math><br />
<br />
<math>h_t=o_t \odot \tanh(c_t)</math><br />
<br />
where <math>\,i_t,f_t,o_t,g_t,c_t,h_t</math> correspond to the values and gate labels in the diagram. Additionally, <math>\,\sigma</math> is the logistic sigmoid function and both it and <math>\,\tanh</math> are applied element-wise in the first equation.<br />
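<br />
One step of these equations can be sketched as follows (hypothetical NumPy code; here the affine map <math> T </math> is represented by a single weight matrix producing all four stacked gate pre-activations):<br />

```python
import numpy as np

def lstm_step(Ey_prev, h_prev, z_hat, c_prev, W, b):
    """Sketch of one decoder LSTM step: affine map of the concatenated inputs,
    split into the i, f, o, g gates, then the cell and hidden-state updates."""
    x = np.concatenate([Ey_prev, h_prev, z_hat])   # (E y_{t-1}, h_{t-1}, z_hat_t)
    u = W @ x + b                                  # stacked pre-activations, length 4n
    n = h_prev.shape[0]
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    i, f, o = sig(u[:n]), sig(u[n:2 * n]), sig(u[2 * n:3 * n])
    g = np.tanh(u[3 * n:])
    c = f * c_prev + i * g                         # c_t = f_t . c_{t-1} + i_t . g_t
    h = o * np.tanh(c)                             # h_t = o_t . tanh(c_t)
    return h, c
```

The weight matrix has shape <math> 4n \times (D+m+n) </math>, one block of <math> n </math> rows per gate.<br />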
<br />
<br />
At each time step, the LSTM outputs the relative probability of every single word in the vocabulary given a context vector, the previous hidden state and the previously generated word. This is done through additional feedforward layers between the LSTM and the output layer, known as a deep output layer setup, that take the state of the LSTM <math>\,h_t</math> and apply additional transformations to get the relative probability:<br />
<br />
<math>p(y_t \mid a,y_1^{t-1})\propto \exp(L_o(Ey_{t-1}+L_hh_t+L_z\hat z_t))</math><br />
<br />
where <math>L_o\in\mathbb{R}^{K \times m},L_h\in\mathbb{R}^{m \times n},L_z\in\mathbb{R}^{m \times D},E\in\mathbb{R}^{m \times K}</math> are randomly initialized parameters that are learned through training the LSTM. This series of matrix and vector multiplications then results in a vector of dimension K, where each element represents the relative probability of the word indexed by that element being next in the sequence of outputs.<br />
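<br />
A sketch of this output layer (hypothetical NumPy code; the relative probabilities are normalized with a softmax, and all parameter matrices are assumed given rather than learned):<br />

```python
import numpy as np

def word_probs(y_prev, h_t, z_hat, E, L_o, L_h, L_z):
    """Sketch of the deep output layer: scores L_o(E y_{t-1} + L_h h_t + L_z z_hat),
    turned into a length-K vector of word probabilities."""
    scores = L_o @ (E @ y_prev + L_h @ h_t + L_z @ z_hat)
    e = np.exp(scores - scores.max())   # numerically stable softmax
    return e / e.sum()
```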
<br />
<br />
<math>\hat{z}</math> is the context vector which is a function of the feature vectors <math>a={a_1,\dots,a_L}</math> and the attention model as discussed in the next section.<br />
<br />
== Attention: Two Variants ==<br />
<br />
The attention algorithm is one of the arguments that influences the state of the LSTM. There are two variants of the attention algorithm used: stochastic "hard" and deterministic "soft" attention. The visual differences between the two can be seen in the "Properties" section.<br />
<br />
Stochastic "hard" attention means learning to maximize the context vector <math>\hat{z}</math> from a combination of a one-hot encoded variable <math>s_{t,i}</math> and the extracted features <math>a_{i}</math>. This is called "hard" attention because a hard choice is made at each feature; however, it is stochastic since <math>s_{t,i}</math> is drawn from a multinoulli distribution. In this approach the location variable <math>s_t</math> represents where the model decides to focus attention when generating the <math>t^{th}</math> word. [http://cs.brown.edu/courses/cs195-5/spring2012/lectures/2012-01-31_probabilityDecisions.pdf (see page 11 of this link for an explanation of the distribution)].<br />
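<br />
A sketch of the sampling step in hard attention (hypothetical NumPy code; in the model the multinoulli parameters come from the attention network, whereas here they are passed in directly):<br />

```python
import numpy as np

def hard_attention_context(a, alpha, rng):
    """Sketch of stochastic 'hard' attention: sample a single location from a
    multinoulli distribution with parameters alpha and return that annotation."""
    i = rng.choice(len(alpha), p=alpha)   # one-hot choice of location s_t
    return a[i]                           # a has shape (L, D)
```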
<br />
Learning stochastic attention requires sampling the attention location <math>s_t</math> each time; instead, we can take the expectation of the context vector <math>\hat z_t</math> directly and formulate a deterministic attention model by computing a soft attention weighted annotation vector<ref name=BaD><br />
Bahdanau, Dzmitry, ''et al'' [http://arxiv.org/pdf/1409.0473.pdf"Neural machine translation by jointly learning to align and translate."] in arXiv, (2014).<br />
</ref>. Deterministic soft attention means learning by maximizing the expectation of the context vector. It is deterministic since <math>s_{t,i}</math> is not sampled from a distribution, and it is soft since the whole distribution is optimized rather than the individual choices.<br />
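<br />
The deterministic variant replaces sampling with an expectation over locations, which can be sketched as follows (hypothetical NumPy code; the attention scores would come from a small network conditioned on the previous hidden state):<br />

```python
import numpy as np

def soft_attention_context(a, scores):
    """Sketch of the deterministic 'soft' context vector: the expected annotation
    z_hat = sum_i alpha_i a_i, with weights alpha = softmax(scores)."""
    e = np.exp(scores - scores.max())   # numerically stable softmax
    alpha = e / e.sum()
    return alpha @ a                    # a has shape (L, D); result has shape (D,)
```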
<br />
The actual optimization methods for both of these attention methods are outside the scope of this summary.<br />
<br />
== Properties ==<br />
<br />
"Where" the network looks next depends on the sequence of words that has already been generated.<br />
<br />
The attention framework learns latent alignments from scratch instead of explicitly using object detectors. This allows the model to go beyond "objectness" and learn to attend to abstract concepts.<br />
<br />
[[File:AttentionHighlights.png]]<br />
<br />
== Training ==<br />
<br />
Each mini-batch used in training contained captions with similar length. This is because the implementation requires time proportional to the longest length sentence per update, so having all of the sentences in each update have similar length improved the convergence speed dramatically.<br />
<br />
Two regularization techniques were used: dropout and early stopping on BLEU score. Since BLEU is the more commonly reported metric, it is used on the validation set for model selection.<br />
<br />
The MS COCO dataset has more than 5 reference sentences for some of the images, while the Flickr datasets have exactly 5. For consistency, the reference sentences for all images in the MS COCO dataset were truncated to 5. There was also some basic tokenization applied to the MS COCO dataset to be consistent with the tokenization in the Flickr datasets.<br />
<br />
On the largest dataset (MS COCO), the attention model took less than 3 days to train on an NVIDIA Titan Black GPU.<br />
<br />
= Results =<br />
<br />
Results are reported with the [https://en.wikipedia.org/wiki/BLEU BLEU] and [https://en.wikipedia.org/wiki/METEOR METEOR] metrics. BLEU is one of the most common metrics for translation tasks, but due to some criticism of the metric, another is used as well. Both of these metrics are designed for evaluating machine translation, which is typically from one language to another. Caption generation can be thought of as analogous to translation, where the image is a sentence in the original 'language' and the caption is its translation to English (or another language, but in this case the captions are only in English). <br />
<br />
[[File:AttentionResults.png]]<br />
<br />
[[File:AttentionGettingThingsRight.png]]<br />
<br />
[[File:AttentionGettingThingsWrong.png]]<br />
<br />
=References=<br />
<references /></div>Alcaterihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=the_Manifold_Tangent_Classifier&diff=26567the Manifold Tangent Classifier2015-11-19T01:43:29Z<p>Alcateri: /* Discussion */</p>
<hr />
<div>== Introduction ==<br />
<br />
The goal in many machine learning problems is to extract information from data with minimal prior knowledge.<ref name = "main"> Rifai, S., Dauphin, Y. N., Vincent, P., Bengio, Y., & Muller, X. (2011). [http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2011_1240.pdf The manifold tangent classifier.] In Advances in Neural Information Processing Systems (pp. 2294-2302). </ref> These algorithms are designed to work on numerous problems which they may not be specifically tailored towards; thus, domain-specific knowledge is generally not incorporated into the models. However, some generic "prior" hypotheses are considered to aid in the general task of learning, and three very common ones are presented below:<br />
<br />
# The '''semi-supervised learning hypothesis''': This states that knowledge of the input distribution <math>p\left(x\right)</math> can aid in learning the output distribution <math>p\left(y|x\right)</math> .<ref>Lasserre, J., Bishop, C. M., & Minka, T. P. (2006, June). [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1640745 Principled hybrids of generative and discriminative models.] In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on (Vol. 1, pp. 87-94). IEEE.</ref> This hypothesis lends credence to not only the theory of strict semi-supervised learning, but also unsupervised pretraining as a method of feature extraction.<ref> Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). [http://www.mitpressjournals.org/doi/pdf/10.1162/neco.2006.18.7.1527 A fast learning algorithm for deep belief nets.] Neural computation, 18(7), 1527-1554.</ref><ref>Erhan, D., Bengio, Y., Courville, A., Manzagol, P. A., Vincent, P., & Bengio, S. (2010). [http://delivery.acm.org/10.1145/1760000/1756025/p625-erhan.pdf?ip=129.97.89.222&id=1756025&acc=PUBLIC&key=FD0067F557510FFB%2E9219CF56F73DCF78%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=561475515&CFTOKEN=96787671&__acm__=1447710319_1ea806f74c2b3b6959e97d9d0e03d533 Why does unsupervised pre-training help deep learning?.] The Journal of Machine Learning Research, 11, 625-660.</ref><br />
# The '''unsupervised manifold hypothesis''': This states that real-world data presented in high-dimensional spaces is likely to concentrate around a low-dimensional sub-manifold.<ref>Cayton, L. (2005). [http://www.vis.lbl.gov/~romano/mlgroup/papers/manifold-learning.pdf Algorithms for manifold learning.] Univ. of California at San Diego Tech. Rep, 1-17.</ref><br />
# The '''manifold hypothesis for classification''': This states that points of different classes are likely to concentrate along different sub-manifolds, separated by low-density regions of the input space.<ref name = "main"></ref><br />
<br />
The recently-proposed Contractive Auto-Encoder (CAE) algorithm has shown success in the task of unsupervised feature extraction,<ref name = "CAE">Rifai, S., Vincent, P., Muller, X., Glorot, X., & Bengio, Y. (2011). [http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf Contractive auto-encoders: Explicit invariance during feature extraction.] In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 833-840).</ref> and its successful application in pre-training Deep Neural Networks (DNNs) illustrates the merits of adopting '''Hypothesis 1'''. The CAE also yields a mostly contractive mapping that is locally sensitive to only a few input directions, which implies that it models a lower-dimensional manifold (exploiting '''Hypothesis 2'''), since the directions of sensitivity lie in the tangent space of the manifold. <br />
<br />
This paper extends the previous work by exploiting the tangent-space information in light of '''Hypothesis 3''': it extracts basis vectors for the local tangent space around each training point from the parameters of the CAE. Earlier supervised classification algorithms that exploit tangent directions as domain-specific prior knowledge can then be applied to the tangent spaces generated by the CAE to fine-tune the overall classification network. This approach integrates all three of the above hypotheses and produced record-breaking results (as of 2011) on image classification.<br />
<br />
== Contractive Auto-Encoders (CAE) and Tangent Classification ==<br />
<br />
The problem is to find a non-linear feature extractor for a dataset <math>\mathcal{D} = \{x_1, \ldots, x_n\}</math>, where <math>x_i \in \mathbb{R}^d</math> are i.i.d. samples from an unknown distribution <math> p\left(x\right)</math>.<br />
<br />
=== Traditional Auto-Encoders === <br />
<br />
A traditional auto-encoder learns an '''encoder''' function <math>h: \mathbb{R}^d \rightarrow \mathbb{R}^{d_h}</math> along with a '''decoder''' function <math>g: \mathbb{R}^{d_h} \rightarrow \mathbb{R}^d</math>, so that the reconstruction is <math>r = g\left(h\left(x\right)\right) </math>. <math>h\,</math> maps input <math>x\,</math> to the hidden representation space, and <math>g\,</math> reconstructs <math>x\,</math> from that representation. Letting <math>L\left(x,g\left(h\left(x\right)\right)\right)</math> denote the reconstruction error, the objective function minimized to learn the parameters <math>\theta\,</math> of the encoder/decoder is as follows:<br />
<br />
:<math> \mathcal{J}_{AE}\left(\theta\right) = \sum_{x\in\mathcal{D}}L\left(x,g\left(h\left(x\right)\right)\right) </math><br />
<br />
The form of the '''encoder''' is <math>h\left(x\right) = s\left(Wx + b_h\right)</math>, where <math>s\left(z\right) = \frac{1}{1 + e^{-z}}</math> is the element-wise logistic sigmoid. <math>W \in \mathbb{R}^{d_h \times d} </math> and <math>b_h \in \mathbb{R}^{d_h}</math> are the parameters (weight matrix and bias vector, respectively). The form of the '''decoder''' is <math>r = g\left(h\left(x\right)\right) = s_2\left(W^Th\left(x\right)+b_r\right)</math>, where <math>\,s_2 = s</math> or the identity. The weight matrix <math>W^T\,</math> is shared with the encoder, with the only new parameter being the bias vector <math>b_r \in \mathbb{R}^d</math>.<br />
<br />
The '''loss function''' can either be the squared error <math>L\left(x,r\right) = \|x - r\|^2</math> or the Bernoulli cross-entropy, given by: <br />
<br />
:<math> L\left(x, r\right) = -\sum_{i=1}^d \left[x_i \mbox{log}\left(r_i\right) + \left(1 - x_i\right)\mbox{log}\left(1 - r_i\right)\right]</math><br />
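<br />
A minimal NumPy sketch may make the notation concrete; the dimensions, random initialization, and toy input below are illustrative assumptions, not values from the paper:<br />
<br />
```python
import numpy as np

# Toy tied-weight auto-encoder: h(x) = s(W x + b_h), r = s(W^T h(x) + b_r).
# All sizes and the random input are illustrative, not taken from the paper.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, d_h = 8, 4                       # input / hidden dimensionality
W = rng.normal(0.0, 0.1, (d_h, d))  # weight matrix shared by encoder and decoder
b_h = np.zeros(d_h)                 # encoder bias
b_r = np.zeros(d)                   # decoder bias

def encode(x):
    return sigmoid(W @ x + b_h)

def decode(h):
    return sigmoid(W.T @ h + b_r)

x = rng.random(d)                   # one training point in [0, 1]^d
r = decode(encode(x))               # reconstruction
squared_error = np.sum((x - r) ** 2)
cross_entropy = -np.sum(x * np.log(r) + (1 - x) * np.log(1 - r))
```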
<br />
=== First- and Higher-Order Contractive Auto-Encoders ===<br />
<br />
==== Additional Penalty on Jacobian ==== <br />
<br />
The Contractive Auto-Encoder (CAE), proposed by Rifai et al.<ref name = "CAE"></ref>, encourages robustness of <math>h\left(x\right)</math> to small variations in <math>x</math> by penalizing the Frobenius norm of the encoder's Jacobian <math>J\left(x\right) = \frac{\partial h}{\partial x}\left(x\right)</math>. The new objective function to be minimized is:<br />
<br />
:<math> \mathcal{J}_{CAE}\left(\theta\right) = \sum_{x\in\mathcal{D}}L\left(x,g\left(h\left(x\right)\right)\right) + \lambda\|J\left(x\right)\|_F^2 </math><br />
<br />
where <math>\lambda</math> is a non-negative regularization parameter. We can compute the <math>j^{th}</math> row of the Jacobian of the sigmoidal encoder quite easily using the <math>j^{th}</math> row of <math>W</math>:<br />
<br />
:<math> J\left(x\right)_j = \frac{\partial h_j\left(x\right)}{\partial x} = h_j\left(x\right)\left(1 - h_j\left(x\right)\right)W_j</math><br />
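<br />
This closed form can be verified numerically; the sketch below (with illustrative sizes) builds the analytic Jacobian row by row and checks it against central finite differences:<br />
<br />
```python
import numpy as np

# Closed-form Jacobian of a sigmoid encoder, J(x)_j = h_j(x)(1 - h_j(x)) W_j,
# checked against central finite differences. Sizes are illustrative.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
d, d_h = 6, 3
W = rng.normal(size=(d_h, d))
b_h = rng.normal(size=d_h)

def h(x):
    return sigmoid(W @ x + b_h)

x = rng.normal(size=d)
hx = h(x)
J = (hx * (1.0 - hx))[:, None] * W   # row j is h_j(x)(1 - h_j(x)) W_j
frob_penalty = np.sum(J ** 2)        # the ||J(x)||_F^2 term of the CAE objective

# numerical check of the analytic Jacobian
eps = 1e-6
J_num = np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                  for e in np.eye(d)]).T
```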
<br />
==== Additional Penalty on Hessian ====<br />
<br />
It is also possible to penalize higher-order derivatives by approximating the Hessian (explicit computation of the Hessian is costly). It is sufficient to penalize the difference between <math>J\left(x\right)</math> and <math>J\left(x + \varepsilon\right)</math> where <math>\,\varepsilon </math> is small, as this represents the rate of change of the Jacobian. This yields the "CAE+H" variant, with objective function as follows:<br />
<br />
:<math> \mathcal{J}_{CAE+H}\left(\theta\right) = \mathcal{J}_{CAE}\left(\theta\right) + \gamma\sum_{x \in \mathcal{D}}\mathbb{E}_{\varepsilon\sim\mathcal{N}\left(0,\sigma^2I\right)} \left[\|J\left(x\right) - J\left(x + \varepsilon\right)\|^2\right] </math><br />
<br />
The expectation above, in practice, is taken over stochastic samples of the noise variable <math>\varepsilon\,</math> at each stochastic gradient descent step. <math>\gamma\,</math> is another regularization parameter. This formulation will be the one used within this paper.<br />
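<br />
The stochastic approximation of this expectation can be sketched as follows; the encoder, <math>\sigma\,</math>, and the number of noise samples are illustrative choices:<br />
<br />
```python
import numpy as np

# Monte-Carlo version of the CAE+H penalty: average ||J(x) - J(x + eps)||^2
# over a few Gaussian corruptions eps ~ N(0, sigma^2 I). The encoder, sigma,
# and sample count are illustrative assumptions.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
d, d_h = 5, 3
sigma, n_samples = 0.1, 4
W = rng.normal(size=(d_h, d))
b_h = rng.normal(size=d_h)

def jacobian(x):
    hx = sigmoid(W @ x + b_h)
    return (hx * (1.0 - hx))[:, None] * W

x = rng.normal(size=d)
Jx = jacobian(x)
hessian_penalty = np.mean(
    [np.sum((Jx - jacobian(x + sigma * rng.normal(size=d))) ** 2)
     for _ in range(n_samples)])
```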
<br />
=== Characterizing the Tangent Bundle Captured by a CAE ===<br />
<br />
Although the regularization term encourages insensitivity of <math>h(x)</math> in all input space directions, the pressure to form an accurate reconstruction counters this somewhat, and the result is that <math>h(x)</math> is only sensitive to the few input directions necessary to distinguish close-by training points.<ref name = "CAE"></ref> Geometrically, the interpretation is that these directions span the local tangent space of the underlying manifold that characterizes the input data. <br />
<br />
==== Geometric Terms ====<br />
<br />
* '''Tangent Bundle''': The tangent bundle of a smooth manifold is the manifold together with the set of tangent planes at all of its points.<br />
* '''Chart''': A local Euclidean coordinate system attached to a tangent plane. Each tangent plane has its own chart.<br />
* '''Atlas''': A collection of local charts.<br />
<br />
==== Conditions for Feature Mapping to Define an Atlas on a Manifold ====<br />
<br />
To obtain a proper atlas of charts, <math>h</math> must be a local diffeomorphism (locally smooth and invertible). Since the sigmoidal mapping is smooth, <math>\,h</math> is guaranteed to be smooth. To determine injectivity of <math>h\,</math>, consider the following, <math>\forall x_i, x_j \in \mathcal{D}</math>:<br />
<br />
:<math><br />
\begin{align}<br />
h(x_i) = h(x_j) &\Leftrightarrow s\left(Wx_i + b_h\right) = s\left(Wx_j + b_h\right) \\<br />
& \Leftrightarrow Wx_i + b_h = Wx_j + b_h \mbox{, since } s \mbox{ is invertible} \\<br />
& \Leftrightarrow W\Delta_{ij} = 0 \mbox{, where } \Delta_{ij} = x_i - x_j<br />
\end{align}<br />
</math><br />
<br />
Thus, as long as the rows <math>W_k\,</math> of <math>W\,</math> span every difference vector, i.e. <math>\forall i,j \,\,\exists \alpha \in \mathbb{R}^{d_h} | \Delta_{ij} = \sum_{k=1}^{d_h}\alpha_k W_k</math>, the injectivity of <math>h\left(x\right)</math> on <math>\mathcal{D}</math> is preserved (since <math>W\Delta_{ij} = 0\,</math> would then imply <math>\Delta_{ij} = 0\,</math> above). Furthermore, if we restrict the codomain of <math>\,h</math> to <math>h\left(\mathcal{D}\right) \subset \left(0,1\right)^{d_h}</math>, the set of values actually obtained by applying <math>h\,</math> to the training set <math>\mathcal{D}</math>, then <math>\,h</math> is surjective by definition. Therefore, <math>\,h</math> is a bijection between <math>\mathcal{D}</math> and <math>h\left(\mathcal{D}\right)</math>, meaning that <math>h\,</math> is a local diffeomorphism around each point in the training set.<br />
<br />
==== Generating an Atlas from a Learned Feature Mapping ====<br />
<br />
We now need to determine how to generate local charts around each <math>x \in \mathcal{D}</math>. Since <math>h</math> must be sensitive to changes between <math>x_i</math> and one of its neighbours <math>x_j</math>, but insensitive to other changes, we expect this to be encoded in the spectrum of the Jacobian <math>J\left(x\right) = \frac{\partial h}{\partial x}\left(x\right)</math>. Thus, we define a local chart around <math>x</math> using the singular value decomposition of <math>\,J^T(x) = U(x)S(x)V^T(x)</math>. The tangent plane <math>\mathcal{H}_x</math> at <math>\,x</math> is then given by the span of the set of principal singular vectors <math>\mathcal{B}_x</math>, as long as the associated singular value is above a given small <math>\varepsilon\,</math>:<br />
<br />
:<math>\mathcal{B}_x = \{U_{:,k}(x) | S_{k,k}(x) > \varepsilon\} \mbox{ and } \mathcal{H}_x = \{x + v | v \in \mbox{span}\left(\mathcal{B}_x\right)\} </math><br />
<br />
where <math>U_{:,k}(x)\,</math> is the <math>k^{th}</math> column of <math>U\left(x\right)</math>. <br />
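<br />
A short sketch of this extraction step (the stand-in Jacobian and the threshold choice below are illustrative assumptions):<br />
<br />
```python
import numpy as np

# Extracting the chart basis B_x: take the SVD of J^T(x) and keep the columns
# of U whose singular value exceeds a small threshold. The stand-in Jacobian
# and the threshold choice are illustrative.

rng = np.random.default_rng(3)
d, d_h = 10, 4
J = rng.normal(size=(d_h, d))            # stand-in for dh/dx at some point x

U, S, Vt = np.linalg.svd(J.T, full_matrices=False)   # J^T = U S V^T
eps = 0.5 * S.max()                       # keep only the leading directions
B_x = U[:, S > eps]                       # orthonormal basis of the plane H_x
```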
<br />
Then, we can define an atlas <math>\mathcal{A}</math> captured by <math>h\,</math>, based on the local linear approximation around each example (here <math>\mathcal{M}_x</math> denotes a neighbourhood of <math>x</math> on the manifold):<br />
<br />
:<math> \mathcal{A} = \{\left(\mathcal{M}_x, \phi_x\right) | x\in\mathcal{D}, \phi_x\left(\tilde{x}\right) = \mathcal{B}_x\left(x - \tilde{x}\right)\}</math><br />
<br />
=== Exploiting Learned Directions for Classification ===<br />
<br />
We would like to use the local charts defined above as additional information for the task of classification. In doing so, we will adopt the '''manifold hypothesis for classification'''.<br />
<br />
==== CAE-Based Tangent Distance ====<br />
<br />
We start by defining the '''tangent distance''' between two points <math>x</math> and <math>y</math> as the distance between their respective tangent planes <math>\mathcal{H}_x, \mathcal{H}_y</math> defined above, where the distance between the planes is:<br />
<br />
:<math> d\left(\mathcal{H}_x,\mathcal{H}_y\right) = \mbox{inf}\{\|z - w\|^2\,\, | \left(z,w\right) \in \mathcal{H}_x \times \mathcal{H}_y\}</math><br />
<br />
Finding this distance is a convex problem that reduces to solving a system of linear equations.<ref>Simard, P., LeCun, Y., & Denker, J. S. (1993). [http://papers.nips.cc/paper/656-efficient-pattern-recognition-using-a-new-transformation-distance.pdf Efficient pattern recognition using a new transformation distance.] In Advances in neural information processing systems (pp. 50-58).</ref> Minimizing the distance in this way allows <math>x, y \in \mathcal{D}</math> to move along their associated tangent spaces, so the distance is evaluated where the two planes come closest. A nearest-neighbour classifier can then be built on this distance.<br />
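<br />
The minimization can be written as an ordinary linear least-squares problem, as the following sketch shows; the points and tangent bases are illustrative:<br />
<br />
```python
import numpy as np

# Tangent distance as a linear least-squares problem: with z = x + B_x a and
# w = y + B_y b, minimise ||(x - y) + [B_x, -B_y] c||^2 over c = (a, b).
# Points and tangent bases are illustrative.

rng = np.random.default_rng(4)
d, k = 6, 2
x, y = rng.normal(size=d), rng.normal(size=d)
B_x = np.linalg.qr(rng.normal(size=(d, k)))[0]   # orthonormal basis of H_x
B_y = np.linalg.qr(rng.normal(size=(d, k)))[0]   # orthonormal basis of H_y

M = np.hstack([B_x, -B_y])                       # z - w = (x - y) + M @ c
c, *_ = np.linalg.lstsq(M, y - x, rcond=None)    # least-squares solution
tangent_dist = np.sum((x - y + M @ c) ** 2)      # d(H_x, H_y)
plain_dist = np.sum((x - y) ** 2)                # tangent_dist <= plain_dist
```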
<br />
==== CAE-Based Tangent Propagation ====<br />
<br />
Nearest-neighbour techniques work in theory, but are often impractical for large-scale datasets: the cost of classifying a test point grows linearly with the number of training points. Neural networks, however, classify test points quickly once trained. We would like the output <math>o</math> of the classifier to be insensitive to variations in the directions of the local chart around <math>x</math>. To this end, we add the following penalty to the objective function of the (supervised) network:<br />
<br />
:<math> \Omega\left(x\right) = \sum_{u \in \mathcal{B}_x} \left|\left| \frac{\partial o}{\partial x}\left(x\right) u \right|\right|^2 </math><br />
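<br />
The penalty can be sketched with directional derivatives estimated by finite differences; the toy two-layer classifier below is an illustrative stand-in, not the paper's architecture:<br />
<br />
```python
import numpy as np

# The tangent-propagation penalty Omega(x): squared norms of the directional
# derivatives of the classifier output o along each u in B_x, estimated here
# with central finite differences. The toy two-layer classifier is a stand-in.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
d, d_h, n_classes = 6, 4, 3
W1 = rng.normal(size=(d_h, d))
W2 = rng.normal(size=(n_classes, d_h))

def o(x):                                        # classifier output
    return sigmoid(W2 @ sigmoid(W1 @ x))

x = rng.normal(size=d)
tangents = np.linalg.qr(rng.normal(size=(d, 2)))[0].T   # rows: vectors of B_x

eps = 1e-5
omega = sum(np.sum(((o(x + eps * u) - o(x - eps * u)) / (2 * eps)) ** 2)
            for u in tangents)
```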
<br />
=== The Manifold Tangent Classifier (MTC) ===<br />
<br />
Finally, we are able to put all of the results together into a full algorithm for training a network. The steps follow below:<br />
<br />
# Train (unsupervised) a stack of <math>K\,</math> CAE+H layers as in section 2.2.2. Each layer is trained on the representation learned by the previous layer.<br />
# For each <math>x_i \in \mathcal{D}</math>, compute the Jacobian of the last layer representation <math>J^{(K)}(x_i) = \frac{\partial h^{(K)}}{\partial x}\left(x_i\right)</math> and its SVD. Note that <math>J^{(K)}\,</math> is the product of the Jacobians of each encoder. Store the leading <math>d_M\,</math> singular vectors in <math>\mathcal{B}_{x_i}</math>.<br />
# After the <math>K\,</math> CAE+H layers, add a sigmoidal output layer with a node for each class. Train the entire network for supervised classification, adding in the propagation penalty in 2.4.2. Note that for each <math>x_i, \mathcal{B}_{x_i}</math> contains the set of tangent vectors to use.<br />
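<br />
Step 2's chain-rule structure can be sketched in miniature for a two-layer stack; the layer sizes and weights below are illustrative:<br />
<br />
```python
import numpy as np

# Step 2 in miniature: the Jacobian of a stacked encoder is the chain-rule
# product of per-layer Jacobians, and B_x holds the leading d_M singular
# vectors of its transpose. Layer sizes and weights are illustrative.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(6)
d, d1, d2 = 8, 5, 3
W1 = rng.normal(size=(d1, d))
W2 = rng.normal(size=(d2, d1))

x = rng.normal(size=d)
h1 = sigmoid(W1 @ x)
h2 = sigmoid(W2 @ h1)
J1 = (h1 * (1.0 - h1))[:, None] * W1   # dh1/dx
J2 = (h2 * (1.0 - h2))[:, None] * W2   # dh2/dh1
J = J2 @ J1                            # dh2/dx = J2 J1 (chain rule)

d_M = 2
U, S, _ = np.linalg.svd(J.T, full_matrices=False)
B_x = U[:, :d_M]                       # leading d_M tangent directions
```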
<br />
== Results ==<br />
<br />
=== Datasets Considered ===<br />
<br />
The MTC was tested on the following datasets:<br />
<br />
*'''MNIST''': 28 by 28 images of handwritten digits; the goal is to predict the digit contained in the image.<br />
*'''Reuters Corpus Volume I''': Contains 800,000 real-world news stories; a bag-of-words representation was built from the 2,000 most frequent words computed over the whole dataset.<br />
*'''CIFAR-10''': Dataset of 60,000 32 by 32 RGB real-world images.<br />
*'''Forest Cover Type''': Large-scale database of cartographic variables for prediction of forest cover types.<br />
<br />
=== Method ===<br />
<br />
To investigate the improvements made by CAE-learned tangents, the following method is employed: Optimal hyper-parameters (e.g. <math>\gamma, \lambda\,,</math> etc.) were selected by cross-validation on a validation set disjoint from the training set. The quality of the features extracted by the CAE is evaluated by initializing a standard multi-layer perceptron network with the same parameters as the trained CAE and fine-tuning it by backpropagation on the supervised task.<br />
<br />
=== Visualization of Learned Tangents === <br />
<br />
Figure 1 visualizes the tangents learned by CAE. The example is on the left, and 8 tangents are shown to the right. On the MNIST dataset, the tangents are small geometric transformations. For CIFAR-10, the tangents appear to be parts of the image. For Reuters, the tangents correspond to addition/removal of similar words, with the positive terms in green and the negative terms in red. We see that the tangents do not seem to change the class of the example (e.g. the tangents of the above "0" in MNIST all resemble zeroes).<br />
<br />
[[File:Figure_1_MTC.png|frame|center|Fig. 1: Tangents Extracted by CAE]]<br />
<br />
=== MTC in Semi-Supervised Setting ===<br />
<br />
The MTC method was evaluated on the MNIST dataset in a semi-supervised setting: the unsupervised feature extractor is trained on the full training set, while the supervised classifier is trained only on a restricted labeled subset. Table 1 reports results for a single-layer perceptron initialized with CAE+H pretraining (abbreviated CAE) and for the same classifier with tangent propagation added (i.e. MTC). The performance is compared to methods that do not exploit the semi-supervised learning hypothesis (Support Vector Machines (SVM), Neural Networks (NN), Convolutional Neural Networks (CNN)); those methods perform poorly relative to MTC, especially when labeled examples are scarce. <br />
<br />
{| class="wikitable"<br />
|+Table 1: Semi-Supervised classification error on MNIST test set<br />
|-<br />
|'''# Labeled'''<br />
|'''NN'''<br />
|'''SVM'''<br />
|'''CNN'''<br />
|'''CAE'''<br />
|'''MTC'''<br />
|-<br />
|100<br />
|25.81<br />
|23.44<br />
|22.98<br />
|13.47<br />
|'''12.03'''<br />
|-<br />
|600<br />
|11.44<br />
|8.85<br />
|7.68<br />
|6.3<br />
|'''5.13'''<br />
|-<br />
|1000<br />
|10.7<br />
|7.77<br />
|6.45<br />
|4.77<br />
|'''3.64'''<br />
|-<br />
|3000<br />
|6.04<br />
|4.21<br />
|3.35<br />
|3.22<br />
|'''2.57''' <br />
|}<br />
<br />
=== MTC in Full Classification Problems ===<br />
<br />
We consider using MTC to classify using the full MNIST dataset (i.e. the fully supervised problem), and compare with other methods. The CAE used for tangent discovery is a two-layer deep network with 2000 units per-layer pretrained with the CAE+H objective. The MTC uses the same stack of CAEs trained with tangent propagation, using <math>d_M = 15\,</math> tangents. The MTC produces state-of-the-art results, achieving a 0.81% error on the test set (as opposed to the previous state-of-the-art result of 0.95% error, achieved by Deep Boltzmann Machines). Table 2 summarizes this result. Note that MTC also beats out CNN, which utilizes prior knowledge about vision using convolutions and pooling.<br />
<br />
{| class="wikitable"<br />
|+Table 2: Class. error on MNIST Test Set with full Training Set<br />
|-<br />
|K-NN<br />
|NN<br />
|SVM<br />
|CAE<br />
|DBM<br />
|CNN<br />
|MTC<br />
|-<br />
|3.09%<br />
|1.60%<br />
|1.40%<br />
|1.04%<br />
|0.95%<br />
|0.95%<br />
|'''0.81'''%<br />
|}<br />
<br />
A 4-layer MTC was trained on the Forest CoverType dataset. The MTC produces the best performance on this classification task, beating out the previous best method which used a mixture of non-linear SVMs (denoted as distributed SVM).<br />
<br />
{| class="wikitable"<br />
|+Table 3: Class. error on Forest Data<br />
|-<br />
|SVM<br />
|Distributed SVM<br />
|MTC<br />
|-<br />
|4.11%<br />
|3.46%<br />
|'''3.13'''%<br />
|}<br />
<br />
== Conclusion ==<br />
<br />
This paper combines three common generic prior hypotheses into a unified approach. It uses a semi-supervised manifold method to construct local charts around training points in the data, and then uses the tangents generated by these local charts to help separate different classes. The tangents that are extracted appear to be meaningful decompositions of the training examples. When the tangents are combined with the classifier, state-of-the-art results are obtained on classification problems in a variety of domains.<br />
<br />
== Discussion ==<br />
<br />
* I thought about how it could be possible to use an element-wise rectified linear unit <math>R\left(x\right) = \mbox{max}\left(0,x\right)</math> in place of the sigmoidal function for encoding, as this type of functional form has seen success in other deep learning methods. However, I believe that this type of functional form would preclude <math>h</math> from being diffeomorphic, as the <math>x</math>-values that are negative could not possibly be reconstructed. Thus, the sigmoidal form should likely be retained, although it would be interesting to see how other invertible non-linearities would perform (e.g. hyperbolic tangent).<br />
<br />
* It would be interesting to investigate applying the method of tangent extraction to other unsupervised methods, and then create a classifier based on these tangents in the same way that it is done in this paper. Further work could be done to apply this approach to clustering algorithms, kernel PCA, E-M, etc. This is more of a suggestion than a concrete idea, however.<br />
<br />
* It is not exactly clear to me how a <math>h</math> could ever define a true diffeomorphism, since <math>h: \mathbb{R}^{d} \mapsto \mathbb{R}^{d_h}</math>, where <math>d \ne d_h</math>, in general. Normally, if <math>d > d_h</math>, we would not expect such a map <math>h</math> to possibly be injective, since the cardinality of the domain is higher than that of the codomain. However, they may be able to "manufacture" the injectivity of <math>h</math> using the fact that <math>\mathcal{D}</math> is a discrete set of points. I'm not sure that this approach defines a continuous manifold, but I'm also not sure if that really matters in this case.<br />
<br />
== Bibliography ==<br />
<references /></div>Alcaterihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=the_Manifold_Tangent_Classifier&diff=26551the Manifold Tangent Classifier2015-11-18T23:16:33Z<p>Alcateri: Created page with "== Introduction == The goal in many machine learning problems is to extract information from data with minimal prior knowledge<ref name = "main"> Rifai, S., Dauphin, Y. N., Vinc..."</p>
<hr />
<div>== Introduction ==<br />
<br />
The goal in many machine learning problems is to extract information from data with minimal prior knowledge<ref name = "main"> Rifai, S., Dauphin, Y. N., Vincent, P., Bengio, Y., & Muller, X. (2011). [http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2011_1240.pdf The manifold tangent classifier.] In Advances in Neural Information Processing Systems (pp. 2294-2302). </ref> These algorithms are designed to work on numerous problems which they may not be specifically tailored towards, thus domain-specific knowledge is generally not incorporated into the models. However, some generic "prior" hypotheses are considered to aid in the general task of learning, and three very common ones are presented below:<br />
<br />
# The '''semi-supervised learning hypothesis''': This states that knowledge of the input distribution <math>p\left(x\right)</math> can aid in learning the output distribution <math>p\left(y|x\right)</math> .<ref>Lasserre, J., Bishop, C. M., & Minka, T. P. (2006, June). [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1640745 Principled hybrids of generative and discriminative models.] In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on (Vol. 1, pp. 87-94). IEEE.</ref> This hypothesis lends credence to not only the theory of strict semi-supervised learning, but also unsupervised pretraining as a method of feature extraction.<ref> Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). [http://www.mitpressjournals.org/doi/pdf/10.1162/neco.2006.18.7.1527 A fast learning algorithm for deep belief nets.] Neural computation, 18(7), 1527-1554.</ref><ref>Erhan, D., Bengio, Y., Courville, A., Manzagol, P. A., Vincent, P., & Bengio, S. (2010). [http://delivery.acm.org/10.1145/1760000/1756025/p625-erhan.pdf?ip=129.97.89.222&id=1756025&acc=PUBLIC&key=FD0067F557510FFB%2E9219CF56F73DCF78%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=561475515&CFTOKEN=96787671&__acm__=1447710319_1ea806f74c2b3b6959e97d9d0e03d533 Why does unsupervised pre-training help deep learning?.] The Journal of Machine Learning Research, 11, 625-660.</ref><br />
# The '''unsupervised manifold hypothesis''': This states that real-world data presented in high-dimensional spaces is likely to concentrate around a low-dimensional sub-manifold.<ref>Cayton, L. (2005). [http://www.vis.lbl.gov/~romano/mlgroup/papers/manifold-learning.pdf Algorithms for manifold learning.] Univ. of California at San Diego Tech. Rep, 1-17.</ref><br />
# The '''manifold hypothesis for classification''': This states that points of different classes are likely to concentrate along different sub-manifolds, separated by low-density regions of the input space.<ref name = "main"></ref><br />
<br />
The recently-proposed Contractive Auto-Encoder (CAE) algorithm has shown success in the task of unsupervised feature extraction,<ref name = "CAE">Rifai, S., Vincent, P., Muller, X., Glorot, X., & Bengio, Y. (2011). [http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf Contractive auto-encoders: Explicit invariance during feature extraction.] In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 833-840).</ref> with its successful application in pre-training of Deep Neural Networks (DNN) an illustration of the merits of adopting '''Hypothesis 1'''. CAE also yields a mostly contractive mapping that is locally only sensitive to a few input directions, which implies that it models a lower-dimensional manifold (exploiting '''Hypothesis 2''') since the directions of sensitivity are in the tangent space of the manifold. <br />
<br />
This paper furthers the previous work by using the information about the tangent spaces by considering '''Hypothesis 3''': it extracts basis vectors for the local tangent space around each training point from the parameters of the CAE. Then, older supervised classification algorithms that exploit tangent directions as domain-specific prior knowledge can be used on the tangent spaces generated by CAE for fine-tuning the overall classification network. This approach seamlessly integrates all three of the above hypotheses and produces record-breaking results (for 2011) on image classification.<br />
<br />
== Contractive Auto-Encoders (CAE) and Tangent Classification ==<br />
<br />
The problem is to find a non-linear feature extractor for a dataset <math>\mathcal{D} = \{x_1, \ldots, x_n\}</math>, where <math>x_i \in \mathbb{R}^d</math> are i.i.d. samples from an unknown distribution <math> p\left(x\right)</math>.<br />
<br />
=== Traditional Auto-Encoders === <br />
<br />
A traditional auto-encoder learns an '''encoder''' function <math>h: \mathbb{R}^d \rightarrow \mathbb{R}^{d_h}</math> along with a '''decoder''' function <math>g: \mathbb{R}^{d_h} \rightarrow \mathbb{R}</math>, represented as <math>r = g\left(h\left(x\right)\right) </math>. <math>h\,</math> maps input <math>x\,</math> to the hidden input space, and <math>g\,</math> reconstructs <math>x\,</math>. When <math>L\left(x,g\left(h\left(x\right)\right)\right)</math> denotes the average reconstruction error, the objective function being optimized to learn the parameters <math>\theta\,</math> of the encoder/decoder is as follows:<br />
<br />
:<math> \mathcal{J}_{AE}\left(\theta\right) = \sum_{x\in\mathcal{D}}L\left(x,g\left(h\left(x\right)\right)\right) </math><br />
<br />
The form of the '''encoder''' is <math>h\left(x\right) = s\left(Wx + b_h\right)</math>, where <math>s\left(z\right) = \frac{1}{1 + e^{-z}}</math> is the element-wise logistic sigmoid. <math>W \in \mathbb{R}^{d_h \times d} </math> and <math>b_h \in \mathbb{R}^{d_h}</math> are the parameters (weight matrix and bias vector, respectively). The form of the '''decoder''' is <math>r = g\left(h\left(x\right)\right) = s_2\left(W^Th\left(x\right)+b_r\right)</math>, where <math>\,s_2 = s</math> or the identity. The weight matrix <math>W^T\,</math> is shared with the encoder, with the only new parameter being the bias vector <math>b_r \in \mathbb{R}^d</math>.<br />
<br />
The '''loss function''' can either be the squared error <math>L\left(x,r\right) = \|x - r\|^2</math> or the Bernoulli cross-entropy, given by: <br />
<br />
:<math> L\left(x, r\right) = -\sum_{i=1}^d \left[x_i \mbox{log}\left(r_i\right) + \left(1 - x_i\right)\mbox{log}\left(1 - r_i\right)\right]</math><br />
<br />
=== First- and Higher-Order Contractive Auto-Encoders ===<br />
<br />
==== Additional Penalty on Jacobian ==== <br />
<br />
The Contractive Auto-Encoder (CAE), proposed by Rifai et al.<ref name = "CAE"></ref>, encourages robustness of <math>h\left(x\right)</math> to small variations in <math>x</math> by penalizing the Frobenius norm of the encoder's Jacobian <math>J\left(x\right) = \frac{\partial h}{\partial x}\left(x\right)</math>. The new objective function to be minimized is:<br />
<br />
:<math> \mathcal{J}_{CAE}\left(\theta\right) = \sum_{x\in\mathcal{D}}L\left(x,g\left(h\left(x\right)\right)\right) + \lambda\|J\left(x\right)\|_F^2 </math><br />
<br />
where <math>\lambda</math> is a non-negative regularization parameter. We can compute the <math>j^{th}</math> row of the Jacobian of the sigmoidal encoder quite easily using the <math>j^{th}</math> row of <math>W</math>:<br />
<br />
:<math> J\left(x\right)_j = \frac{\partial h_j\left(x\right)}{\partial x} = h_j\left(x\right)\left(1 - h_j\left(x\right)\right)W_j</math><br />
<br />
==== Additional Penalty on Hessian ====<br />
<br />
It is also possible to penalize higher-order derivatives by approximating the Hessian (explicit computation of the Hessian is costly). It is sufficient to penalize the difference between <math>J\left(x\right)</math> and <math>J\left(x + \varepsilon\right)</math> where <math>\,\varepsilon </math> is small, as this represents the rate of change of the Jacobian. This yields the "CAE+H" variant, with objective function as follows:<br />
<br />
:<math> \mathcal{J}_{CAE+H}\left(\theta\right) = \mathcal{J}_{CAE}\left(\theta\right) + \gamma\sum_{x \in \mathcal{D}}\mathbb{E}_{\varepsilon\sim\mathcal{N}\left(0,\sigma^2I\right)} \left[\|J\left(x\right) - J\left(x + \varepsilon\right)\|^2\right] </math><br />
<br />
The expectation above, in practice, is taken over stochastic samples of the noise variable <math>\varepsilon\,</math> at each stochastic gradient descent step. <math>\gamma\,</math> is another regularization parameter. This formulation will be the one used within this paper.<br />
<br />
=== Characterizing the Tangent Bundle Captured by a CAE ===<br />
<br />
Although the regularization term encourages insensitivity of <math>h(x)</math> in all input space directions, the pressure to form an accurate reconstruction counters this somewhat, and the result is that <math>h(x)</math> is only sensitive to a few input directions necessary to distinguish close-by training points.<ref name = "CAE"></ref> Geometrically, the interpretation is that these directions span the local tangent space of the underlying manifold the characterizes the input data. <br />
<br />
==== Geometric Terms ====<br />
<br />
* '''Tangent Bundle''': The tangent bundle of a smooth manifold is the manifold along with the set of tangent planes taken at all points in it.<br />
* '''Chart''': A local Euclidean coordinate system equipped to a tangent plane. Each tangent plane has its own chart.<br />
* '''Atlas''': A collection of local charts.<br />
<br />
==== Conditions for Feature Mapping to Define an Atlas on a Manifold ====<br />
<br />
To obtain a proper atlas of charts, <math>h</math> must be a local diffeomorphism (locally smooth and invertible). Since the sigmoidal mapping is smooth, <math>\,h</math> is guaranteed to be smooth. To determine injectivity of <math>h\,</math>, consider the following, <math>\forall x_i, x_j \in \mathcal{D}</math>:<br />
<br />
:<math><br />
\begin{align}<br />
h(x_i) = h(x_j) &\Leftrightarrow s\left(Wx_i + b_h\right) = s\left(Wx_j + b_h\right) \\<br />
& \Leftrightarrow Wx_i + b_h = Wx_j + b_h \mbox{, since } s \mbox{ is invertible} \\<br />
& \Leftrightarrow W\Delta_{ij} = 0 \mbox{, where } \Delta_{ij} = x_i - x_j<br />
\end{align}<br />
</math><br />
<br />
Thus, as long as <math>W\,</math> forms a basis spanned by its rows <math>W_k\,</math> such that <math>\forall i,j \,\,\exists \alpha \in \mathbb{R}^{d_h} | \Delta_{ij} = \sum_{k=1}^{d_h}\alpha_k W_k</math>, then the injectivity of <math>h\left(x\right)</math> will be preserved (as this would imply <math>\Delta_{ij} = 0\,</math> above). Furthermore, if we limit the domain of <math>\,h</math> to <math>h\left(\mathcal{D}\right) \subset \left(0,1\right)^{d_h}</math>, containing only the values obtainable by <math>h\,</math> applied to the training set <math>\mathcal{D}</math>, then <math>\,h</math> is surjective by definition. Therefore, <math>\,h</math> will be bijective between <math>h\,</math> and <math>h\left(\mathcal{D}\right)</math>, meaning that <math>h\,</math> will be a local diffeomorphism around each point in the training set.<br />
<br />
==== Generating an Atlas from a Learned Feature Mapping ====<br />
<br />
We now need to determine how to generate local charts around each <math>x \in \mathcal{D}</math>. Since <math>h</math> must be sensitive to changes between <math>x_i</math> and one of its neighbours <math>x_j</math>, but insensitive to other changes, we expect this to be encoded in the spectrum of the Jacobian <math>J\left(x\right) = \frac{\partial h}{\partial x}\left(x\right)</math>. Thus, we define a local chart around <math>x</math> using the singular value decomposition of <math>\,J^T(x) = U(x)S(x)V^T(x)</math>. The tangent plane <math>\mathcal{H}_x</math> at <math>\,x</math> is then given by the span of the set of principal singular vectors <math>\mathcal{B}_x</math>, as long as the associated singular value is above a given small <math>\varepsilon\,</math>:<br />
<br />
:<math>\mathcal{B}_x = \{U_{:,k}(x) | S_{k,k}(x) > \varepsilon\} \mbox{ and } \mathcal{H}_x = \{x + v | v \in \mbox{span}\left(\mathcal{B}_x\right)\} </math><br />
<br />
where <math>U_{:,k}(x)\,</math> is the <math>k^{th}</math> column of <math>U\left(x\right)</math>. <br />
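To make the chart construction concrete, the following NumPy sketch (illustrative, not the authors' code) forms the analytic Jacobian of a one-layer sigmoid encoder <math>h(x) = s(Wx + b_h)</math>, takes the SVD of <math>J^T(x)</math>, and keeps the singular directions whose singular values exceed <math>\varepsilon</math>:<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_chart_basis(x, W, b_h, eps=1e-2):
    """Tangent basis B_x from the SVD of the encoder Jacobian J(x)^T.

    For h(x) = s(Wx + b_h), J(x) = diag(h * (1 - h)) @ W, so
    J(x)^T = W^T @ diag(h * (1 - h)).  Columns of U whose singular
    value exceeds eps span the estimated tangent plane at x.
    """
    h = sigmoid(W @ x + b_h)
    J_T = W.T * (h * (1.0 - h))        # shape (d, d_h): J(x)^T, no numeric differentiation needed
    U, S, Vt = np.linalg.svd(J_T, full_matrices=False)
    return U[:, S > eps]               # B_x: one column per retained direction

rng = np.random.default_rng(0)
d, d_h = 10, 4
W = rng.normal(size=(d_h, d))
b_h = rng.normal(size=d_h)
x = rng.normal(size=d)
B_x = local_chart_basis(x, W, b_h)
print(B_x.shape)                       # (10, k) with k <= 4 directions retained
```

The columns of <code>B_x</code> are orthonormal by construction, since they are columns of the orthogonal factor <math>U</math>.<br />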
<br />
Then, we can define an atlas <math>\mathcal{A}</math> captured by <math>h\,</math>, based on the local linear approximation around each example:<br />
<br />
:<math> \mathcal{A} = \{\left(\mathcal{M}_x, \phi_x\right) | x\in\mathcal{D}, \phi_x\left(\tilde{x}\right) = \mathcal{B}_x\left(x - \tilde{x}\right)\}</math><br />
<br />
=== Exploiting Learned Directions for Classification ===<br />
<br />
We would like to use the local charts defined above as additional information for the task of classification. In doing so, we will adopt the '''manifold hypothesis for classification'''.<br />
<br />
==== CAE-Based Tangent Distance ====<br />
<br />
We start by defining the '''tangent distance''' between two points as the distance between their respective tangent hyperplanes <math>\mathcal{H}_x, \mathcal{H}_y</math> defined above, where the distance between hyperplanes is:<br />
<br />
:<math> d\left(\mathcal{H}_x,\mathcal{H}_y\right) = \mbox{inf}\{\|z - w\|^2\,\, | \left(z,w\right) \in \mathcal{H}_x \times \mathcal{H}_y\}</math><br />
<br />
Finding this distance is a convex problem that reduces to solving a system of linear equations.<ref>Simard, P., LeCun, Y., & Denker, J. S. (1993). [http://papers.nips.cc/paper/656-efficient-pattern-recognition-using-a-new-transformation-distance.pdf Efficient pattern recognition using a new transformation distance.] In Advances in neural information processing systems (pp. 50-58).</ref> Minimizing the distance in this way allows <math>x, y \in \mathcal{D}</math> to move along their associated tangent spaces, so that the distance is evaluated where the planes through <math>x</math> and <math>y</math> are closest. A nearest-neighbour classifier can then be built on this distance.<br />
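Assuming the tangent bases <code>Bx</code> and <code>By</code> have already been extracted, this infimum can be computed with a single least-squares solve, as in the following sketch:<br />

```python
import numpy as np

def tangent_distance(x, Bx, y, By):
    """Squared distance between affine tangent planes H_x and H_y.

    Points on the planes are z = x + Bx @ a and w = y + By @ b, so we
    minimize ||(x - y) + M @ c||^2 over c = (a, b), with M = [Bx, -By].
    """
    M = np.hstack([Bx, -By])
    c, *_ = np.linalg.lstsq(M, y - x, rcond=None)   # least-squares solve
    r = (x - y) + M @ c                             # residual at the optimum
    return float(r @ r)

# Two parallel lines in the plane, unit distance apart:
x, y = np.array([0.0, 0.0]), np.array([0.0, 1.0])
B = np.array([[1.0], [0.0]])          # both lines run along the x-axis
print(tangent_distance(x, B, y, B))   # -> 1.0
```

For parallel planes the points cannot be brought closer by sliding along the tangents, so the residual is the plain inter-plane distance.<br />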
<br />
==== CAE-Based Tangent Propagation ====<br />
<br />
Nearest-neighbour techniques work in theory, but are often impractical for large-scale datasets: the cost of classifying a test point grows linearly with the number of training points. Neural networks, however, can classify test points quickly once they are trained. We would like the output <math>o</math> of the classifier to be insensitive to variations along the directions of the local chart around <math>x</math>. To this end, we add the following penalty to the objective function of the (supervised) network:<br />
<br />
:<math> \Omega\left(x\right) = \sum_{u \in \mathcal{B}_x} \left|\left| \frac{\partial o}{\partial x}\left(x\right) u \right|\right|^2 </math><br />
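Evaluating this penalty requires the directional derivative of the network output along each tangent vector. A rough finite-difference sketch follows (a real implementation would use analytic Jacobian-vector products; the function names here are illustrative):<br />

```python
import numpy as np

def tangent_penalty(f, x, Bx, h=1e-5):
    """Omega(x) = sum over tangent vectors u of ||(do/dx)(x) u||^2.

    The Jacobian-vector product is approximated by central differences,
    so f can be any black-box classifier output.
    """
    total = 0.0
    for u in Bx.T:                                 # one tangent vector per column
        jvp = (f(x + h * u) - f(x - h * u)) / (2 * h)
        total += float(jvp @ jvp)
    return total

# f ignores x[0], so the first tangent direction contributes nothing:
f = lambda x: np.array([x[1], x[1] ** 2])
Bx = np.eye(2)
x = np.array([0.5, 2.0])
print(tangent_penalty(f, x, Bx))   # -> approximately 17.0 (= 0 + 1^2 + 4^2)
```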
<br />
=== The Manifold Tangent Classifier (MTC) ===<br />
<br />
Finally, we are able to put all of the results together into a full algorithm for training a network. The steps follow below:<br />
<br />
# Train (unsupervised) a stack of <math>K\,</math> CAE+H layers as in section 2.2.2. Each layer is trained on the representation learned by the previous layer.<br />
# For each <math>x_i \in \mathcal{D}</math>, compute the Jacobian of the last layer representation <math>J^{(K)}(x_i) = \frac{\partial h^{(K)}}{\partial x}\left(x_i\right)</math> and its SVD. Note that <math>J^{(K)}\,</math> is the product of the Jacobians of each encoder. Store the leading <math>d_M\,</math> singular vectors in <math>\mathcal{B}_{x_i}</math>.<br />
# After the <math>K\,</math> CAE+H layers, add a sigmoidal output layer with a node for each class. Train the entire network for supervised classification, adding in the propagation penalty in 2.4.2. Note that for each <math>x_i, \mathcal{B}_{x_i}</math> contains the set of tangent vectors to use.<br />
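The observation in step 2 that <math>J^{(K)}</math> is the product of the per-layer Jacobians can be sketched as follows, assuming sigmoid encoders with illustrative weights <code>W</code> and biases <code>b</code>:<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stacked_jacobian(x, layers):
    """Jacobian of the top-layer representation h^(K) with respect to x.

    Each sigmoid layer contributes diag(h * (1 - h)) @ W, and the chain
    rule multiplies these factors along the stack.
    """
    J = np.eye(len(x))
    h = x
    for W, b in layers:
        h = sigmoid(W @ h + b)
        J = ((h * (1.0 - h))[:, None] * W) @ J   # layer Jacobian times running product
    return J

rng = np.random.default_rng(1)
layers = [(rng.normal(size=(6, 8)), rng.normal(size=6)),
          (rng.normal(size=(4, 6)), rng.normal(size=4))]
x = rng.normal(size=8)
print(stacked_jacobian(x, layers).shape)   # (4, 8): last-layer units vs. input dims
```

The SVD of this matrix's transpose then yields the tangent vectors <math>\mathcal{B}_{x_i}</math>, exactly as in the single-layer case.<br />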
<br />
== Results ==<br />
<br />
=== Datasets Considered ===<br />
<br />
The MTC was tested on the following datasets:<br />
<br />
*'''MNIST''': Set of 28 by 28 images of handwritten digits; the goal is to predict the digit contained in each image.<br />
*'''Reuters Corpus Volume I''': Contains 800,000 real-world news stories. Used the 2000 most frequent words calculated on the whole dataset to create a bag-of-words representation.<br />
*'''CIFAR-10''': Dataset of 60,000 32 by 32 RGB real-world images. <br />
*'''Forest Cover Type''': Large-scale database of cartographic variables for prediction of forest cover types.<br />
<br />
=== Method ===<br />
<br />
To investigate the improvements made by CAE-learned tangents, the following method is employed: optimal hyper-parameters (e.g. <math>\gamma, \lambda\,,</math> etc.) were selected by cross-validation on a validation set disjoint from the training set. The quality of the features extracted by the CAE is evaluated by initializing a standard multi-layer perceptron network with the same parameters as the trained CAE and fine-tuning it by backpropagation on the supervised task.<br />
<br />
=== Visualization of Learned Tangents === <br />
<br />
Figure 1 visualizes the tangents learned by CAE. The example is on the left, and 8 tangents are shown to the right. On the MNIST dataset, the tangents are small geometric transformations. For CIFAR-10, the tangents appear to be parts of the image. For Reuters, the tangents correspond to addition/removal of similar words, with the positive terms in green and the negative terms in red. We see that the tangents do not seem to change the class of the example (e.g. the tangents of the above "0" in MNIST all resemble zeroes).<br />
<br />
[[File:Figure_1_MTC.png|frame|center|Fig. 1: Tangents Extracted by CAE]]<br />
<br />
=== MTC in Semi-Supervised Setting ===<br />
<br />
The MTC method was evaluated on the MNIST dataset in a semi-supervised setting: the unsupervised feature extractor is trained on the full training set, and the supervised classifier is trained on only a restricted label set. Table 1 shows the results for a single-layer perceptron initialized with CAE+H pretraining (abbreviated CAE), and for the same classifier with tangent propagation added (i.e. MTC). The performance is compared to other methods that do not exploit the semi-supervised learning hypothesis (Support Vector Machines (SVM), Neural Networks (NN), Convolutional Neural Networks (CNN)); those methods perform poorly relative to MTC, especially when labeled data is scarce. <br />
<br />
{| class="wikitable"<br />
|+Table 1: Semi-Supervised classification error on MNIST test set<br />
|-<br />
|'''# Labeled'''<br />
|'''NN'''<br />
|'''SVM'''<br />
|'''CNN'''<br />
|'''CAE'''<br />
|'''MTC'''<br />
|-<br />
|100<br />
|25.81<br />
|23.44<br />
|22.98<br />
|13.47<br />
|'''12.03'''<br />
|-<br />
|600<br />
|11.44<br />
|8.85<br />
|7.68<br />
|6.3<br />
|'''5.13'''<br />
|-<br />
|1000<br />
|10.7<br />
|7.77<br />
|6.45<br />
|4.77<br />
|'''3.64'''<br />
|-<br />
|3000<br />
|6.04<br />
|4.21<br />
|3.35<br />
|3.22<br />
|'''2.57''' <br />
|}<br />
<br />
=== MTC in Full Classification Problems ===<br />
<br />
We consider using MTC to classify the full MNIST dataset (i.e. the fully supervised problem), and compare with other methods. The CAE used for tangent discovery is a two-layer deep network with 2000 units per layer, pretrained with the CAE+H objective. The MTC uses the same stack of CAEs trained with tangent propagation, using <math>d_M = 15\,</math> tangents. The MTC produces state-of-the-art results, achieving a 0.81% error on the test set (as opposed to the previous state-of-the-art result of 0.95% error, achieved by Deep Boltzmann Machines). Table 2 summarizes this result. Note that MTC also beats out CNN, which utilizes prior knowledge about vision via convolutions and pooling.<br />
<br />
{| class="wikitable"<br />
|+Table 2: Class. error on MNIST Test Set with full Training Set<br />
|-<br />
|K-NN<br />
|NN<br />
|SVM<br />
|CAE<br />
|DBM<br />
|CNN<br />
|MTC<br />
|-<br />
|3.09%<br />
|1.60%<br />
|1.40%<br />
|1.04%<br />
|0.95%<br />
|0.95%<br />
|'''0.81'''%<br />
|}<br />
<br />
A 4-layer MTC was trained on the Forest CoverType dataset. The MTC produces the best performance on this classification task, beating out the previous best method which used a mixture of non-linear SVMs (denoted as distributed SVM).<br />
<br />
{| class="wikitable"<br />
|+Table 3: Class. error on Forest Data<br />
|-<br />
|SVM<br />
|Distributed SVM<br />
|MTC<br />
|-<br />
|4.11%<br />
|3.46%<br />
|'''3.13'''%<br />
|}<br />
<br />
== Conclusion ==<br />
<br />
This paper combines three common generic prior hypotheses in a unified manner. It uses a semi-supervised manifold approach to examine local charts around training points in the data, and then uses the tangents generated by these local charts to compare different classes. The tangents that are generated appear to be meaningful decompositions of the training examples. When the tangents are combined with the classifier, state-of-the-art results are obtained on classification problems in a variety of domains.<br />
<br />
== Discussion ==<br />
<br />
* I thought about how it could be possible to use an element-wise rectified linear unit <math>R\left(x\right) = \mbox{max}\left(0,x\right)</math> in place of the sigmoidal function for encoding, as this type of functional form has seen success in other deep learning methods. However, I believe that this type of functional form would preclude <math>h</math> from being diffeomorphic, as the <math>x</math>-values that are negative could not possibly be reconstructed. Thus, the sigmoidal form should likely be retained, although it would be interesting to see how other invertible non-linearities would perform (e.g. hyperbolic tangent).<br />
<br />
* It would be interesting to investigate applying the method of tangent extraction to other unsupervised methods, and then create a classifier based on these tangents in the same way that it is done in this paper. Further work could be done to apply this approach to clustering algorithms, kernel PCA, E-M, etc. This is more of a suggestion than a concrete idea, however.<br />
<br />
== Bibliography ==<br />
<references /></div>
<hr />
<div>Visualization of the learned tangents of CAE</div>
<hr />
<div> <br />
=[https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/listofpapers1.pdf List of Papers]=<br />
<br />
= Record your contributions [https://docs.google.com/spreadsheets/d/1A_0ej3S6ns3bBMwWLS4pwA6zDLz_0Ivwujj-d1Gr9eo/edit?usp=sharing here:]=<br />
<br />
Use the following notations:<br />
<br />
S: You have written a summary on the paper<br />
<br />
T: You had technical contribution on a paper (excluding the paper that you present from set A or critique from set B)<br />
<br />
E: You had editorial contribution on a paper (excluding the paper that you present from set A or critique from set B)<br />
<br />
[http://goo.gl/forms/RASFRZXoxJ Your feedback on presentations]<br />
<br />
<br />
=Set A=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="400pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Oct 16 || pascal poupart || || Guest Lecturer||||<br />
|-<br />
|Oct 16 ||pascal poupart || ||Guest Lecturer ||||<br />
|-<br />
|Oct 23 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 23 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 23 ||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Oct 23 || Deepak Rishi || || Parsing natural scenes and natural language with recursive neural networks || [http://www-nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf Paper] || [[Parsing natural scenes and natural language with recursive neural networks | Summary]]<br />
|-<br />
|Oct 30 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 30 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 30 ||Rui Qiao || ||Going deeper with convolutions || [http://arxiv.org/pdf/1409.4842v1.pdf Paper]|| [[GoingDeeperWithConvolutions|Summary]]<br />
|-<br />
|Oct 30 ||Amirreza Lashkari|| 21 ||Overfeat: integrated recognition, localization and detection using convolutional networks. || [http://arxiv.org/pdf/1312.6229v4.pdf Paper]|| [[Overfeat: integrated recognition, localization and detection using convolutional networks|Summary]]<br />
|-<br />
|Makeup Class (TBA) || Peter Blouw|| ||Memory Networks.|| [http://arxiv.org/abs/1410.3916 Paper]|| [[Memory Networks|Summary]]<br />
|-<br />
|Nov 6 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Nov 6 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Nov 6 || Anthony Caterini ||56 || Human-level control through deep reinforcement learning ||[http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf Paper]|| [[Human-level control through deep reinforcement learning|Summary]]<br />
|-<br />
|Nov 6 || Sean Aubin || ||Learning Hierarchical Features for Scene Labeling ||[http://yann.lecun.com/exdb/publis/pdf/farabet-pami-13.pdf Paper]||[[Learning Hierarchical Features for Scene Labeling|Summary]]<br />
|-<br />
|Nov 13|| Mike Hynes || 12 ||Speech recognition with deep recurrent neural networks || [http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf Paper] || [[Graves et al., Speech recognition with deep recurrent neural networks|Summary]]<br />
|-<br />
|Nov 13 || Tim Tse || || Question Answering with Subgraph Embeddings || [http://arxiv.org/pdf/1406.3676v3.pdf Paper] || [[Question Answering with Subgraph Embeddings | Summary ]]<br />
|-<br />
|Nov 13 || Maysum Panju || ||Neural machine translation by jointly learning to align and translate ||[http://arxiv.org/pdf/1409.0473v6.pdf Paper] || [[Neural Machine Translation: Jointly Learning to Align and Translate|Summary]]<br />
|-<br />
|Nov 13 || Abdullah Rashwan || || Deep neural networks for acoustic modeling in speech recognition. ||[http://research.microsoft.com/pubs/171498/HintonDengYuEtAl-SPM2012.pdf paper]|| [[Deep neural networks for acoustic modeling in speech recognition| Summary]]<br />
|-<br />
|Nov 20 || Valerie Platsko || ||Natural language processing (almost) from scratch. ||[http://arxiv.org/pdf/1103.0398.pdf Paper]|| [[Natural language processing (almost) from scratch. | Summary]]<br />
|-<br />
|Nov 20 || Brent Komer || ||Show, Attend and Tell: Neural Image Caption Generation with Visual Attention || [http://arxiv.org/pdf/1502.03044v2.pdf Paper]||[[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention|Summary]]<br />
|-<br />
|Nov 20 || Luyao Ruan || || Dropout: A Simple Way to Prevent Neural Networks from Overfitting || [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf Paper]|| [[dropout | Summary]]<br />
|-<br />
|Nov 20 || Ali Mahdipour || || The human splicing code reveals new insights into the genetic determinants of disease ||[https://www.sciencemag.org/content/347/6218/1254806.full.pdf Paper] || [[Genetics | Summary]]<br />
|-<br />
|Nov 27 ||Mahmood Gohari || ||Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships ||[http://pubs.acs.org/doi/abs/10.1021/ci500747n.pdf Paper]||<br />
|-<br />
|Nov 27 || Derek Latremouille || ||The Wake-Sleep Algorithm for Unsupervised Neural Networks || [http://www.gatsby.ucl.ac.uk/~dayan/papers/hdfn95.pdf Paper] ||<br />
|-<br />
|Nov 27 ||Xinran Liu || ||ImageNet Classification with Deep Convolutional Neural Networks ||[http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Paper]||[[ImageNet Classification with Deep Convolutional Neural Networks|Summary]]<br />
|-<br />
|Nov 27 ||Ali Sarhadi|| ||Strategies for Training Large Scale Neural Network Language Models||||<br />
|-<br />
|Dec 4 || Chris Choi || || On the difficulty of training recurrent neural networks || [http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf Paper] || [[On the difficulty of training recurrent neural networks | Summary]]<br />
|-<br />
|Dec 4 || Fatemeh Karimi || ||MULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION||[http://arxiv.org/pdf/1412.7755v2.pdf Paper]||<br />
|-<br />
|Dec 4 || Jan Gosmann || || On the Number of Linear Regions of Deep Neural Networks || [http://arxiv.org/abs/1402.1869 Paper] || [[On the Number of Linear Regions of Deep Neural Networks | Summary]]<br />
|-<br />
|Dec 4 || Dylan Drover || || Towards AI-complete question answering: a set of prerequisite toy tasks || [http://arxiv.org/pdf/1502.05698.pdf Paper] ||<br />
|-<br />
|}<br />
|}<br />
<br />
=Set B=<br />
<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="400pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Anthony Caterini ||15 ||The Manifold Tangent Classifier ||[http://papers.nips.cc/paper/4409-the-manifold-tangent-classifier.pdf Paper]|| [[The Manifold Tangent Classifier|Summary]]<br />
|-<br />
|Jan Gosmann || || Neural Turing machines || [http://arxiv.org/abs/1410.5401 Paper] || [[Neural Turing Machines|Summary]]<br />
|-<br />
|Brent Komer || || Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers || [http://arxiv.org/pdf/1202.2160v2.pdf Paper] || [[Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers Machines|Summary]]<br />
|-<br />
|Sean Aubin || || Deep Sparse Rectifier Neural Networks || [http://jmlr.csail.mit.edu/proceedings/papers/v15/glorot11a/glorot11a.pdf Paper] || [[Deep Sparse Rectifier Neural Networks|Summary]]<br />
|-<br />
|Peter Blouw|| || Generating text with recurrent neural networks || [http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf Paper] ||<br />
|-<br />
|Tim Tse|| || From Machine Learning to Machine Reasoning || [http://research.microsoft.com/pubs/206768/mlj-2013.pdf Paper] || [[From Machine Learning to Machine Reasoning | Summary ]]<br />
|-<br />
|Rui Qiao|| || Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation || [http://arxiv.org/pdf/1406.1078v3.pdf Paper] || [[Learning Phrase Representations|Summary]]<br />
|-<br />
|Fatemeh Karimi|| 23 || Very Deep Convolutional Networks for Large-Scale Image Recognition || [http://arxiv.org/pdf/1409.1556.pdf Paper] || [[Very Deep Convoloutional Networks for Large-Scale Image Recognition|Summary]]<br />
|-<br />
|Amirreza Lashkari|| 43 || Distributed Representations of Words and Phrases and their Compositionality || [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf Paper] || [[Distributed Representations of Words and Phrases and their Compositionality|Summary]]<br />
|-<br />
|Xinran Liu|| 19 || Joint training of a convolutional network and a graphical model for human pose estimation || [http://papers.nips.cc/paper/5573-joint-training-of-a-convolutional-network-and-a-graphical-model-for-human-pose-estimation.pdf Paper] || [[Joint training of a convolutional network and a graphical model for human pose estimation|Summary]]<br />
|-<br />
|Chris Choi|| || Learning Long-Range Vision for Autonomous Off-Road Driving || [http://yann.lecun.com/exdb/publis/pdf/hadsell-jfr-09.pdf Paper] || [[Learning Long-Range Vision for Autonomous Off-Road Driving|Summary]]<br />
|-<br />
|Luyao Ruan|| || Deep Learning of the tissue-regulated splicing code || [http://bioinformatics.oxfordjournals.org/content/30/12/i121.full.pdf+html Paper] || [[Deep Learning of the tissue-regulated splicing code| Summary]]<br />
|-<br />
|Abdullah Rashwan|| || Deep Convolutional Neural Networks For LVCSR || [http://www.cs.toronto.edu/~asamir/papers/icassp13_cnn.pdf paper] || [[Deep Convolutional Neural Networks For LVCSR| Summary]]<br />
|-<br />
|Mahmood Gohari||37 || On using very large target vocabulary for neural machine translation || [http://arxiv.org/pdf/1412.2007v2.pdf paper] || [[On using very large target vocabulary for neural machine translation| Summary]]<br />
|-<br />
|Valerie Platsko|| || Learning Convolutional Feature Hierarchies for Visual Recognition || [http://papers.nips.cc/paper/4133-learning-convolutional-feature-hierarchies-for-visual-recognition Paper] || [[Learning Convolutional Feature Hierarchies for Visual Recognition | Summary]]<br />
|-<br />
|Derek Latremouille|| || Learning fast approximations of sparse coding || [http://yann.lecun.com/exdb/publis/pdf/gregor-icml-10.pdf Paper] || [[Learning fast approximations of sparse coding | Summary]]<br />
|-<br />
|Ri Wang|| || Continuous space language models || [https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenSemester2_2009_10/sdarticle.pdf Paper] || [[Continuous space language models | Summary]]<br />
|}<br />
|}</div>
<hr />
<div>= Introduction =<br />
Classical speech recognition systems use hidden Markov models (HMMs) to model the temporal variations and Gaussian mixture models (GMMs) to determine the likelihood of each state of each HMM given an acoustic observation. The speech signal in these classical systems is represented by a series of Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) extracted from overlapping short windows of the raw speech signal. Although GMMs are quite flexible and fairly easy to train using the Expectation Maximization (EM) algorithm, they are inefficient at modelling data that lie close to a nonlinear manifold, which is the case for speech data. Deep Neural Networks (DNNs) do not suffer from this shortcoming, so they can learn much better models than GMMs. Over the past few years, training DNNs has become practical thanks to advances in machine learning and computer hardware, making it possible to replace GMMs with DNNs in speech recognition systems. DNNs have been shown to outperform GMMs in both small and large vocabulary speech recognition tasks.<br />
<br />
= Training Deep Neural Networks =<br />
DNNs are feed-forward neural networks with multiple hidden layers. The last layer is a softmax layer that gives the class probabilities. The weights of a DNN are learned using the backpropagation algorithm; it was found empirically that computing the gradient over small random mini-batches is more efficient. To avoid overfitting, early stopping is used: training is halted when the accuracy on a validation set starts to decrease. Pretraining is essential when the amount of training data is small. Restricted Boltzmann Machines (RBMs) are used for pretraining, except for the first layer, which uses a Gaussian-Bernoulli RBM (GRBM) since the input is real-valued.<br />
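A generic early-stopping loop over mini-batch epochs (a schematic sketch, not the paper's training code; <code>step</code> and <code>accuracy</code> are hypothetical callbacks) looks like:<br />

```python
import numpy as np

def train_with_early_stopping(step, accuracy, n_epochs=100, patience=3):
    """Schematic mini-batch training loop with early stopping.

    `step()` performs one epoch of mini-batch gradient updates and
    `accuracy()` evaluates the current model on a held-out validation set;
    training halts once validation accuracy stops improving.
    """
    best, bad_epochs = -np.inf, 0
    for epoch in range(n_epochs):
        step()
        acc = accuracy()
        if acc > best:
            best, bad_epochs = acc, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # validation accuracy keeps decreasing
                break
    return best

# Toy check: accuracy rises, then falls - training stops near the peak
accs = iter([0.5, 0.6, 0.7, 0.65, 0.64, 0.63, 0.9])
print(train_with_early_stopping(lambda: None, lambda: next(accs)))   # -> 0.7
```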
<br />
== Generative Pretraining ==<br />
<br />
We would like to create a method which uses information in the training set to build multiple layers of nonlinear feature detectors. For this, the "generative pretraining" method is proposed. The concept is as follows: a feature detector that successfully models the structure in the input data, as opposed to one that distinguishes between classes, is the desired result. Thus, we learn one layer of features at a time, and then send these learned features into the next stage as training data. This stacked model structure can create features which are much more useful than raw data, and can help against overfitting.<br />
<br />
The generative model chosen can be either a directed or undirected graph, with undirected being the choice in this paper. An undirected model is chosen because inference is easy as long as each hidden layer only contains connections to other layers, and no connections to itself. A Restricted Boltzmann Machine (RBM) is chosen in this case. <br />
<br />
=== Learning Procedure for RBMs === <br />
<br />
The energy function for the RBM is given by:<br />
<br />
<math> E\left(\mathbf{v}, \mathbf{h}; \mathbf{W}\right) = - \sum_{i \in visible}a_iv_i - \sum_{j \in hidden}b_j h_j - \sum_{i, j} v_i h_j w_{ij} </math>, where<br />
<br />
* <math>\mathbf{v}</math> is the vector of visible units, with components <math>v_i</math> and associated biases <math>a_i</math><br />
* <math>\mathbf{h}</math> is the vector of hidden units, with components <math>h_j</math> and associated biases <math>b_j</math><br />
* <math>\mathbf{W} </math> is the weight matrix between the visible units and hidden units, with components <math>w_{ij}</math><br />
<br />
Then, the joint distribution function is given by:<br />
<br />
<math> p\left(\mathbf{v}, \mathbf{h}; \mathbf{W} \right) = \frac{1}{Z} \mbox{ exp}\left[-E\left(\mathbf{v},\mathbf{h};\mathbf{W}\right)\right] </math><br />
<br />
where <math>Z</math> is a normalization factor.<br />
<br />
Using the law of total probability, we can obtain <math>p\left(\mathbf{v}\right) = \frac{1}{Z} \sum_{\mathbf{h}}\mbox{exp}\left[-E\left(\mathbf{v}, \mathbf{h}\right)\right] </math>. Now, we can obtain the derivative of the log probability of a training set with respect to a weight as: <math> \frac{1}{N} \sum_{n=1}^N \frac{\partial \mbox{ log } p\left(\mathbf{v}^n\right)}{\partial w_{ij}} = <v_ih_j>_{data} - <v_i h_j>_{model}</math>, where <math> < > </math> denotes expectation.<br />
<br />
We can easily obtain an unbiased sample of <math><v_i h_j>_{data}</math> since the conditional probabilities are as follows:<br />
<br />
<math> p\left(h_j = 1 | \mathbf{v}\right) = \mbox{logistic}\left(b_j + \sum_{i} v_i w_{ij}\right) </math><br />
<br />
<math> p\left(v_i = 1 | \mathbf{h}\right) = \mbox{logistic}\left(a_i + \sum_{j} h_j w_{ij}\right) </math><br />
<br />
Obtaining an unbiased sample of <math><v_i h_j>_{model}</math> is much more difficult, though. Alternating Gibbs sampling is the ideal choice, but it can be slow. A faster procedure called "Contrastive Divergence" (CD) is used here instead; it is similar to Gibbs sampling but terminates after only one full step of alternating Gibbs sampling. Even though CD only crudely approximates the gradient, it seems to perform well in practice. Also, since we are only pretraining the model, additional Gibbs sampling steps are not necessary, and the randomness introduced by using CD may further help prevent overfitting.<br />
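A minimal CD-1 update for a small binary-binary RBM could look like the sketch below (the learning rate and sampling details are illustrative, not the paper's exact settings):<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, a, b, lr=0.1, rng=np.random.default_rng(0)):
    """One Contrastive Divergence (CD-1) parameter update for a binary RBM.

    A single alternating Gibbs step v0 -> h0 -> v1 -> h1 approximates
    the model expectation <v_i h_j>_model in the gradient.
    """
    p_h0 = sigmoid(b + v0 @ W)                  # p(h = 1 | v0)
    h0 = (rng.random(p_h0.shape) < p_h0) * 1.0  # sample hidden units
    p_v1 = sigmoid(a + h0 @ W.T)                # reconstruction p(v = 1 | h0)
    v1 = (rng.random(p_v1.shape) < p_v1) * 1.0
    p_h1 = sigmoid(b + v1 @ W)                  # probabilities suffice for the last step
    dW = np.outer(v0, p_h0) - np.outer(v1, p_h1)
    return W + lr * dW, a + lr * (v0 - v1), b + lr * (p_h0 - p_h1)

rng = np.random.default_rng(0)
W = 0.01 * rng.normal(size=(6, 3))              # 6 visible, 3 hidden units
a, b = np.zeros(6), np.zeros(3)
v0 = (rng.random(6) < 0.5) * 1.0                # one binary training vector
W, a, b = cd1_update(v0, W, a, b, rng=rng)
print(W.shape)   # (6, 3)
```

The positive term uses the data-clamped statistics and the negative term the one-step reconstruction, matching the gradient expression above.<br />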
<br />
= Interfacing a DNN with an HMM =<br />
<br />
The HMM framework requires the observation likelihoods <math>p(AcousticInput|HMMstate)</math> for running the forward-backward algorithm or for computing a Viterbi alignment. DNNs output the posteriors <math>p(HMMstate|AcousticInput)</math>, which can be converted to a scaled version of the likelihood by dividing them by <math>p(HMMstate)</math>, where <math>p(HMMstate)</math> is the frequency of the HMM state in the training data. The conversion from posteriors to likelihoods is important when the training labels are highly unbalanced.<br />
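The conversion itself is an element-wise division, usually done in the log domain to avoid underflow during decoding. A small sketch with made-up numbers:<br />

```python
import numpy as np

# DNN posteriors p(state | acoustic input) for 4 HMM states ...
posteriors = np.array([0.70, 0.15, 0.10, 0.05])
# ... and state priors p(state) estimated from training-alignment frequencies
priors = np.array([0.40, 0.30, 0.20, 0.10])

# Scaled log-likelihoods: log p(input | state) + const = log posterior - log prior
scaled_log_lik = np.log(posteriors) - np.log(priors)
print(np.round(scaled_log_lik, 3))   # state 0 scores above its prior; the rest below
```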
<br />
= Phonetic Classification and Recognition on TIMIT =<br />
TIMIT is an acoustic-phonetic continuous speech corpus that has been widely used as a benchmark data set for speech recognition systems. DNN-HMM systems outperformed the classical GMM-HMM systems. The first successful attempt at building a DNN-HMM speech recognition system was published in 2009 by Mohamed et al.<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref>, who reported a significant improvement in accuracy over the state-of-the-art GMM-HMM systems at that time. It was found that the structure of the DNN (i.e. number of hidden layers, and number of hidden units per layer) has little effect on the accuracy, which made it possible to focus more on learning the metaparameters of the DNN. Details of the learning rates, stopping criteria, momentum, L2 weight penalties and minibatch size, pretraining, and fine-tuning can be found in<br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref>.<br />
<br />
== Using Filter-Bank Features ==<br />
MFCC features are commonly used in GMM-HMM systems because they provide uncorrelated features, which is important to avoid using full-covariance GMMs. Some acoustic information is lost by using MFCCs. DNNs, on the other hand, can work with correlated features, which opened the door to using [http://en.wikipedia.org/wiki/Filter_bank filter-bank] features. It was found that using filter-bank features with DNNs improved the accuracy by 1.7% <ref name=tuning_fb_DBN></ref>.<br />
<br />
== Fine-Tuning DNNs To Optimize Mutual Information ==<br />
In the experiments mentioned earlier in this section, the systems were tuned to optimize the per-frame cross entropy, or the log posterior probability <math>p(l_t|v_t)</math>, where <math>l_t</math> is the label at time <math>t</math>, and <math>v_t</math> is the feature vector at the same time step. The transition probabilities and the language models were tuned independently using the HMM framework. The DNN can instead be tuned to optimize the conditional probability <math>p(l_{1:T}|v_{1:T})</math>; this is done for the softmax layer only, fixing the parameters of the hidden layers <math>h</math>.<br />
<br />
<math>p(l_{1:T}|v_{1:T}) = p(l_{1:T}|h_{1:T}) = \frac{\exp(\sum_{t=1}^T\gamma_{ij} \phi_{ij}(l_{t-1},l_t) + \sum_{t=1}^T\sum_{d=1}^D \lambda_{l_t,d} h_{td})}{Z(h_{1:T})}</math><br />
<br />
where <math>\phi_{i,j}(l_{t-1},l_t)</math> is the transition feature, which takes a value of one if <math>l_{t-1} = i</math> and <math>l_{t} = j</math> and zero otherwise, <math>\gamma_{ij}</math> is the parameter associated with the transition feature, and <math>\lambda</math> are the weights of the softmax layer. <math>\gamma,\lambda</math> are tuned using gradient descent, and the experiments show that fine-tuning DNNs to optimize the mutual information improved the accuracy by 5% relative on the TIMIT data set<br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref>.<br />
<br />
== Convolutional DNNs for Phone Classification and Recognition ==<br />
<br />
Convolutional DNNs were introduced in 2009 and they were applied to various audio tasks including TIMIT dataset <ref><br />
H. Lee, P. Pham, Y. Largman, and A. Ng, “Unsupervised feature learning for<br />
audio classification using convolutional deep belief networks,” in Advances in<br />
Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J.<br />
Lafferty, C. K. I. Williams, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2009,<br />
pp. 1096–1104.<br />
</ref>. In that work, convolutional DNNs were applied along the temporal dimension in order to extract the same features at different times. Since temporal variations are already handled by the HMM, Abdel-Hamid et al. proposed applying convolution along the frequency axis instead <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>. Weight-sharing and max-pooling were restricted to nearby frequencies because acoustic features at distant frequencies are very different. They achieved a phone error rate of 20% on the TIMIT dataset, the lowest reported at that time.<br />
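A minimal sketch of convolution along the frequency axis with weight-sharing and max-pooling over neighbouring bands (the function name, kernel shapes, and pooling width are illustrative assumptions, not the authors' exact architecture):

```python
def conv_pool_freq(frame, kernels, pool=2):
    """Convolve one filter-bank frame along the frequency axis, then max-pool.

    frame   : list of filter-bank energies for a single time frame
    kernels : list of 1-D kernels shared across nearby frequency bands
    pool    : max-pooling width over adjacent convolution outputs
    """
    fmaps = []
    for k in kernels:
        width = len(k)
        # Slide the shared kernel over the frequency bands.
        conv = [sum(k[i] * frame[f + i] for i in range(width))
                for f in range(len(frame) - width + 1)]
        # Max-pooling over small frequency neighbourhoods gives some
        # invariance to formant shifts without mixing distant bands.
        pooled = [max(conv[p:p + pool]) for p in range(0, len(conv), pool)]
        fmaps.append(pooled)
    return fmaps
```

Restricting pooling to adjacent bands (rather than pooling globally) reflects the observation above that acoustic features at distant frequencies behave very differently.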
<br />
== DNNs and GMMs ==<br />
As shown in the following table, monophone DNNs can outperform the best triphone GMM-HMMs by 1.7%. The reasons behind this success may include the following:<br />
# DNNs are an instance of a product of experts, in which each parameter is constrained by a large fraction of the data, while GMMs are a sum of experts, in which each parameter applies to only a small fraction of the data.<br />
# GMMs assume that each datapoint is generated by a single component, which makes them inefficient at modelling multiple simultaneous events. DNNs are flexible enough to model multiple simultaneous events.<br />
# GMMs are restricted to uncorrelated features, while DNNs can work with correlated features. This allows DNNs to use features such as filter-banks, and also to analyse a larger window of the signal at each time step.<br />
{| class="wikitable"<br />
|+ Comparisons among the reported speaker-independent (SI) phone error rate (PER) results on the TIMIT core test set with 192 sentences.<br />
|-<br />
! Method<br />
! PER<br />
|-<br />
| CD-HMM <ref name=cdhmm><br />
Y. Hifny and S. Renals, “Speech recognition using augmented conditional<br />
random fields,” IEEE Trans. Audio Speech Lang. Processing, vol. 17, no. 2, pp.<br />
354–365, 2009.<br />
</ref><br />
| 27.3%<br />
|-<br />
| Augmented Conditional Random Fields <ref name=cdhmm></ref><br />
| 26.6%<br />
|-<br />
| Randomly Initialized Recurrent Neural Nets <ref name=rirnn><br />
A. Robinson, “An application to recurrent nets to phone probability estimation,”<br />
IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.<br />
</ref><br />
| 26.1%<br />
|-<br />
| Bayesian Triphone GMM-HMM <ref name=btgmmhmm><br />
J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone<br />
models,” in Proc. ICASSP, 1998, pp. 409–412.<br />
</ref><br />
| 25.6%<br />
|-<br />
| Monophone HTMs <ref name=mhtms><br />
L. Deng and D. Yu, “Use of differential cepstra as acoustic features in hidden<br />
trajectory modelling for phonetic recognition,” in Proc. ICASSP, 2007, pp.<br />
445–448.<br />
</ref><br />
| 24.8%<br />
|-<br />
| Heterogeneous Classifiers <ref name=hclass><br />
A. Halberstadt and J. Glass, “Heterogeneous measurements and multiple<br />
classifiers for speech recognition,” in Proc. ICSLP, 1998.<br />
</ref><br />
| 24.4%<br />
|-<br />
| Monophone Randomly Initialized DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 23.4%<br />
|-<br />
| Monophone DBN-DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 22.4%<br />
|-<br />
| Monophone DBN-DNNs with MMI Training <ref name=finetuningDNN></ref><br />
| 22.1%<br />
|-<br />
| Triphone GMM-HMMs DT W/BMMI <ref name=tgmmhmmbmmi><br />
T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky,<br />
“Exemplar-based sparse representation features: From TIMIT to LVCSR,” IEEE<br />
Trans. Audio Speech Lang. Processing, vol. 19, no. 8, pp. 2598–2613, Nov. 2011.<br />
</ref><br />
| 21.7%<br />
|-<br />
| Monophone DBN-DNNs on FBank (Eight Layers) <ref name=tuning_fb_DBN></ref><br />
| 20.7%<br />
|-<br />
| Monophone MCRBM-DBN-DNNs On FBank (Five Layers) <ref name=mmcrbmdbndnn><br />
G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, “Phone recognition<br />
with the mean-covariance restricted Boltzmann machine,” in Advances in Neural<br />
Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-<br />
Taylor, R.S. Zemel, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2010, pp.<br />
469–477.<br />
</ref><br />
| 20.5%<br />
|-<br />
| Monophone Convolutional DNNs on FBank (Three Layers) <ref name=cdnnfb><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref><br />
| 20.0%<br />
|}<br />
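The product-of-experts vs. sum-of-experts distinction listed above can be made concrete with one-dimensional Gaussians (a toy illustration of the general point, not an example from the paper): a mixture places mass at each component's mean, while an unnormalized product is sharp and peaks only where all experts agree.

```python
import math

def gauss(x, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_density(x, components):
    """Sum of experts: each datapoint is explained by ONE component.

    components : list of (weight, mean, stddev) triples
    """
    return sum(w * gauss(x, mu, s) for w, mu, s in components)

def product_density_unnorm(x, experts):
    """Product of experts (unnormalized): EVERY expert constrains every datapoint.

    experts : list of (mean, stddev) pairs
    """
    p = 1.0
    for mu, s in experts:
        p *= gauss(x, mu, s)
    return p
```

With experts at means 0 and 4, the mixture is bimodal with a dip at the midpoint, whereas the product concentrates all its mass at the midpoint where both experts assign moderate density.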
<br />
<br />
<br />
= DNN for Large-Vocabulary Speech Recognition =<br />
<br />
The success of the DNN-HMM system on the TIMIT data set opened the door to trying the same technique on larger data sets. It was found that using context-dependent HMM states is essential for large data sets. DNN-HMM systems were tested on five different large-vocabulary tasks and outperformed GMM-HMM systems on every one.<br />
<br />
== Bing-Voice-Search Speech Recognition Task == <br />
The Bing-Voice-Search data set consists of 24 hours of speech with different sources of acoustic variation such as noise, music, side-speech, accents, sloppy pronunciations, interruptions, and differences between mobile phones. The DNN-HMM system was based on the DNN that worked well for TIMIT: it contained five hidden layers of 2048 units each. A window of 11 frames was used to classify the middle frame into the corresponding HMM state, and tri-phone states were used instead of monophones. The DNN-HMM system achieved a sentence accuracy of 69% compared to 63.8% for the GMM-HMM system. The accuracy of the DNN-HMM system was further improved by increasing the data size to 48 hours, achieving a sentence accuracy of 71.7% <ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref>.<br />
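The 11-frame input window used above (and in several of the systems below) can be sketched as follows; handling the edges by repeating the boundary frame is our own assumption, since the text does not specify a padding scheme.

```python
def context_windows(frames, context=5):
    """Stack each frame with its +-context neighbours (11 frames for context=5).

    frames : list of per-frame feature vectors
    Returns one concatenated feature vector per frame, the DNN's input for
    classifying that (middle) frame into an HMM state.
    """
    T = len(frames)
    windows = []
    for t in range(T):
        win = []
        for off in range(-context, context + 1):
            # Clamp indices at the edges: repeat the first/last frame.
            idx = min(max(t + off, 0), T - 1)
            win.extend(frames[idx])
        windows.append(win)
    return windows
```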
== Switchboard Speech Recognition Task ==<br />
This data set contains over 300 hours of speech, and the test set is 6.3 hours of speech. The same DNN-HMM system developed for the Bing data set was applied to this data set. The DNN used contains seven hidden layers of 2048 units each. A trigram language model, trained on a 2000-hour speech corpus, was used.<br />
This data set is publicly available, which allows rigorous comparisons among different techniques. As shown in the table below, the DNN reduced the word error rate from 27.4% to 18.5% <ref name=switchboard><br />
F. Seide, G. Li, and D. Yu, “Conversational speech transcription using<br />
context-dependent deep neural networks,” in Proc. Interspeech, 2011, pp.<br />
437–440.<br />
</ref><br />
. The DNN system performed as well as a GMM-HMM system that combined several speaker-adaptive multipass systems and used nearly seven times as much acoustic training data.<br />
{| class="wikitable"<br />
|+ Comparison of five different DBN-DNN acoustic models with two strong, discriminatively trained GMM-HMM baseline systems. "40 Mix" means a mixture of 40 Gaussians per HMM state. Word error rates in % are shown for two separate test sets, HUB5'00-SWB and RT03S-FSH.<br />
|-<br />
! Technique<br />
! HUB5'00-SWB<br />
! RT03S-FSH<br />
|-<br />
| GMM, 40 MIX DT 309H SI<br />
| 23.6<br />
| 27.4<br />
|-<br />
| NN 1 HIDDEN-LAYER x 4,634 UNITS<br />
| 26.0<br />
| 29.4<br />
|-<br />
| + 2 x 5 NEIGHBORING FRAMES<br />
| 22.4<br />
| 25.7<br />
|-<br />
| DBN-DNN 7 HIDDEN LAYERS x 2,048 UNITS<br />
| 17.1<br />
| 19.6<br />
|-<br />
| + UPDATED STATE ALIGNMENT<br />
| 16.4<br />
| 18.6<br />
|-<br />
| + SPARSIFICATION<br />
| 16.1<br />
| 18.5<br />
|-<br />
| GMM 72 MIX DT 2000H SA<br />
| 17.1<br />
| 18.6<br />
|}<br />
<br />
== Google Voice Input Speech Recognition Task ==<br />
The data set contains search-engine queries, short messages, events, and user actions from mobile devices. A well-trained GMM-HMM system was used to align 5870 hours of speech for the DNN-HMM system. The DNN used four hidden layers with 2560 hidden units per layer. A window of 11 frames was used to classify the middle frame, and 40 log filter-bank features were used to represent each frame. The network was fine-tuned to maximize the mutual information. The DNN was sparsified by setting weights below a certain magnitude threshold to zero. To further improve the accuracy, the DNN-HMM and GMM-HMM models were combined using the segmental conditional random field framework <ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>. Using the DNN-HMM reduced the word error rate by 23% relative, achieving a word error rate of 12.3%. Combining both models (GMM and DNN) further reduced the word error rate to 11.8%.<br />
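The sparsification mentioned above amounts to simple magnitude thresholding of a trained weight matrix; a hedged sketch (the function name and return convention are ours):

```python
def sparsify(weights, threshold):
    """Zero out weights whose magnitude falls below `threshold`.

    weights : weight matrix as a list of rows
    Returns the sparsified matrix and the fraction of weights removed.
    """
    removed, total, out = 0, 0, []
    for row in weights:
        new_row = []
        for w in row:
            total += 1
            if abs(w) < threshold:
                removed += 1
                new_row.append(0.0)  # small weights contribute little; drop them
            else:
                new_row.append(w)
        out.append(new_row)
    return out, removed / total
```

In practice the zeroed weights let the matrix be stored and multiplied in a sparse format, cutting decoding cost with little or no loss in accuracy.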
== Youtube Speech Recognition Task ==<br />
The goal of this task is to transcribe YouTube data. This type of data does not have a strong language model to constrain the interpretation of the speech, so a strong acoustic model is essential. A well-trained GMM-HMM system was used to align 1400 hours of data for the DNN model. A decision-tree clustering algorithm was used to cluster the HMM states into 17552 context-dependent tri-phone states. Feature-space maximum likelihood linear regression (fMLLR)-transformed features were used. Due to the large number of HMM states, only four hidden layers were used, to save computational resources for the large softmax layer. After training, sequence-level fine-tuning was performed. Also, the DNN-HMM and GMM-HMM models were combined to improve the accuracy. The DNN-HMM system reduced the word error rate by 4.7% absolute, fine-tuning reduced it by a further 0.5%, and combining both models gave another reduction of 0.9%.<br />
== English Broadcast News Speech Recognition Task ==<br />
<br />
A GMM-HMM baseline system was used to align 50 hours of speech from the 1996 and 1997 English Broadcast News Speech Corpora. Speaker-adapted (SAT) and discriminatively trained (DT) features were used to train the DNN. The network consists of six hidden layers of 1024 units each, with 2220 context-dependent triphone HMM states. A window of 9 frames was used to classify the middle frame. Fine-tuning to optimize the mutual information was done after training. The DNN-HMM system achieved a word error rate of 17.5% compared to 18.8% for the best GMM-HMM system <ref name=broadcastDNN><br />
T. N. Sainath, B. Kingsbury, and B. Ramabhadran, “Improvements in using<br />
deep belief networks for large vocabulary continuous speech recognition,”<br />
Speech and Language Algorithm Group, IBM, Yorktown Heights, NY, Tech.<br />
Rep. UTML TR 2010-003, Feb. 2011.<br />
</ref>.<br />
<br />
== Summary for the Main Results for DNN Acoustic Models on Large Data Sets ==<br />
<br />
The following table summarizes the performance of DNN-HMMs compared to GMM-HMMs across the five tasks; DNN-HMMs achieve better accuracy on every task where a direct comparison is available.<br />
{| class="wikitable"<br />
|+ A comparison of the percentage WERs using DNN-HMMs and GMM-HMMs on five different large vocabulary tasks.<br />
|-<br />
! Task<br />
! Hours of training data<br />
! DNN-HMM<br />
! GMM-HMM<br />
! GMM-HMM using larger training data<br />
|-<br />
| SWITCHBOARD (TEST SET 1)<br />
| 309<br />
| 18.5<br />
| 27.4<br />
| 18.6 (2000h)<br />
|-<br />
| SWITCHBOARD (TEST SET 2)<br />
| 309<br />
| 16.1<br />
| 23.6<br />
| 17.1 (2000h)<br />
|-<br />
| ENGLISH BROADCAST NEWS<br />
| 50<br />
| 17.5<br />
| 18.8<br />
| <br />
|-<br />
| BING VOICE SEARCH (sentence error rates)<br />
| 24<br />
| 30.4<br />
| 36.2<br />
| <br />
|-<br />
| GOOGLE VOICE INPUT<br />
| 5870<br />
| 12.3<br />
| <br />
| 16.0 (>> 5870h)<br />
|-<br />
| YouTube<br />
| 1400<br />
| 47.6<br />
| 52.3<br />
| <br />
|}<br />
<br />
= Alternative Pretraining Methods for DNNs =<br />
Pretraining DNNs was reported to improve results on both TIMIT and the large-vocabulary tasks. There, the pretraining was done generatively using a stack of RBMs. An alternative is discriminative pretraining: we start from a shallow network whose weights are trained discriminatively; another hidden layer is then added between the last hidden layer and the softmax layer, the weights of the newly added layer are again learned discriminatively, and so on. Finally, backpropagation fine-tuning is applied to the whole network. This way of training was reported to match the results achieved by generative pretraining <ref name=discrDNN><br />
F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent<br />
deep neural networks for conversational speech transcription,” in Proc.<br />
IEEE ASRU, 2011, pp. 24–29.<br />
</ref>.<br />
<br />
= Conclusions and Discussions =<br />
GMMs have been widely used for acoustic modelling; they are easy to train and quite flexible. Since 2009, DNNs have been proposed to replace GMMs and have proven superior in many speech recognition tasks. Pretraining DNNs is essential when the amount of data is small: it reduces overfitting and the convergence time. Fine-tuning the network to optimize the mutual information can further improve the results. The authors think there is still much to be done in pretraining, fine-tuning, and the choice of hidden units to further increase the performance of DNNs.<br />
<br />
This paper summarizes recent research by three research groups in the area of speech recognition. They have shown that DNNs are superior to GMMs in both small- and large-dataset speech recognition tasks. The authors claim that the reason for this superiority is that speech data lie on or near a nonlinear manifold; I am not sure there is any scientific or empirical proof of this claim.<br />
This paper is an excellent source for anyone interested in deep learning for speech recognition. The authors assume that readers are familiar with the HMM framework for speech recognition, without which the paper is difficult to follow. Also, the DNN area is moving fast, so some information in the paper is no longer up to date.<br />
<br />
= References =<br />
<references /></div>Alcaterihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_neural_networks_for_acoustic_modeling_in_speech_recognition&diff=26135deep neural networks for acoustic modeling in speech recognition2015-11-12T14:25:37Z<p>Alcateri: /* Generative Pretraining */ - Introduction of Generative Pretraining</p>
<hr />
<div>= Introduction =<br />
Classical speech recognition systems use hidden Markov models (HMMs) to model the temporal variations and Gaussian mixture models (GMMs) to determine the likelihood of each state of each HMM given an acoustic observation. The speech signal in the classical systems are represented by a series of Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) extracted from overlapping short windows of the raw speech signal. Although GMMs are quite flexible and are fairly easy to train using the Expectation Maximization (EM) algorithm, they are inefficient when modelling data that lie close to a nonlinear manifold which is the case for the speech data. Deep Neural Networks (DNNs) don't suffer from the same shortcoming, hence they can learn much better models than GMMs. Over the past few years, training DNNs has become possible thanks to the advancements in machine learning and computer hardware, which makes it possible to replace GMMs with DNNs in the speech recognition systems. DNNs are proved to outperform GMMs in both small and large vocabulary speech recognition tasks.<br />
<br />
= Training Deep Neural Networks =<br />
DNNs are feed-forward neural networks with multiple hidden layers. The last layer is a softmax layer that gives the class probabilities. The weights of a DNN are learned with the backpropagation algorithm; it was found empirically that computing the gradient on small random mini-batches is more efficient. To avoid overfitting, early stopping is used: training stops when the accuracy on a validation set starts to decrease. Pretraining is essential when the amount of training data is small. Restricted Boltzmann Machines (RBMs) are used for pretraining, except for the first layer, which uses a Gaussian-Bernoulli RBM (GRBM) since the input is real-valued.<br />
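The early-stopping rule described above can be sketched generically (the callable interface and the patience parameter are our own assumptions; the text describes the simplest variant, which stops as soon as validation accuracy drops):

```python
def train_with_early_stopping(run_epoch, val_error, max_epochs=100, patience=3):
    """Early stopping: halt when the validation error stops improving.

    run_epoch : callable running one epoch of mini-batch updates
    val_error : callable returning the current held-out validation error
    patience  : number of non-improving epochs tolerated before stopping
    Returns the best validation error seen and the number of epochs run.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch in range(max_epochs):
        run_epoch()
        err = val_error()
        if err < best:
            best, bad_epochs = err, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # validation error has stopped improving
    return best, epoch + 1
```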
<br />
== Generative Pretraining ==<br />
<br />
We would like a method that uses the information in the training set to build multiple layers of nonlinear feature detectors. For this, the "generative pretraining" method is proposed. The idea is as follows: the desired result is a feature detector that successfully models the structure in the input data, as opposed to one that merely distinguishes between classes. Thus, we learn one layer of features at a time and then feed the learned features into the next stage as training data. This stacked structure can create features that are much more useful than the raw data and can help guard against overfitting.<br />
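The building block of this layer-wise procedure is an RBM trained with contrastive divergence. A minimal CD-1 update for a binary RBM is sketched below (variable names and the list-of-lists layout are our own; the paper's first layer would use a Gaussian-Bernoulli variant instead). After one RBM is trained, its hidden probabilities become the "data" for training the next RBM in the stack.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cd1_update(W, b_vis, b_hid, v0, lr=0.1, rng=random):
    """One contrastive-divergence (CD-1) step for a binary RBM.

    W[j][i] is the weight between hidden unit j and visible unit i;
    b_vis, b_hid are visible/hidden biases; v0 is one training vector.
    """
    n_hid, n_vis = len(W), len(W[0])
    # Up pass: hidden-unit probabilities given the data, plus a binary sample.
    ph0 = [sigmoid(b_hid[j] + sum(W[j][i] * v0[i] for i in range(n_vis)))
           for j in range(n_hid)]
    h0 = [1.0 if rng.random() < p else 0.0 for p in ph0]
    # One step of Gibbs sampling gives the "reconstruction" statistics.
    pv1 = [sigmoid(b_vis[i] + sum(W[j][i] * h0[j] for j in range(n_hid)))
           for i in range(n_vis)]
    ph1 = [sigmoid(b_hid[j] + sum(W[j][i] * pv1[i] for i in range(n_vis)))
           for j in range(n_hid)]
    # Move toward the data statistics and away from the reconstruction's.
    for j in range(n_hid):
        for i in range(n_vis):
            W[j][i] += lr * (ph0[j] * v0[i] - ph1[j] * pv1[i])
    for i in range(n_vis):
        b_vis[i] += lr * (v0[i] - pv1[i])
    for j in range(n_hid):
        b_hid[j] += lr * (ph0[j] - ph1[j])
```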
<br />
= Interfacing a DNN with an HMM =<br />
<br />
The HMM framework requires the likelihoods of the observations, <math>p(AcousticInput|HMMstate)</math>, for running the forward-backward algorithm or for computing a Viterbi alignment. DNNs output the posteriors <math>p(HMMstate|AcousticInput)</math>, which can be converted to a scaled version of the likelihoods by dividing them by <math>p(HMMstate)</math>, where <math>p(HMMstate)</math> is the frequency of each HMM state in the training data. This conversion from posteriors to likelihoods is important when the training labels are highly unbalanced.<br />
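The conversion described above is a one-liner in log space, since dividing by the state prior becomes a subtraction (a sketch; the prior floor is our own guard against zero-frequency states):

```python
import math

def posteriors_to_scaled_loglik(log_posteriors, state_priors, floor=1e-8):
    """Convert DNN state posteriors to scaled log-likelihoods.

    log p(input|state) - log p(input) = log p(state|input) - log p(state),
    and the constant log p(input) does not affect Viterbi alignment.
    state_priors : relative frequency of each HMM state in the training labels
    """
    return [lp - math.log(max(prior, floor))
            for lp, prior in zip(log_posteriors, state_priors)]
```

Subtracting the log prior boosts rare states relative to frequent ones, which is why this step matters when the training labels are highly unbalanced.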
<br />
= Phonetic Classification and Recognition on TIMIT =<br />
TIMIT is an acoustic-phonetic continuous speech corpus that has been widely used as a benchmark data set for speech recognition systems. DNN-HMM systems outperformed the classical GMM-HMM systems on it. The first successful attempt at building a DNN-HMM speech recognition system was published in 2009 by Mohamed et al.<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref>, who reported a significant improvement in accuracy over the state-of-the-art GMM-HMM systems of the time. It was found that the structure of the DNN (i.e., the number of hidden layers and the number of hidden units per layer) has little effect on the accuracy, which made it possible to focus on learning the metaparameters of the DNN. Details of the learning rates, stopping criteria, momentum, L2 weight penalties, minibatch size, pretraining, and fine-tuning can be found in<br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref>.<br />
<br />
Pretraining DNNs was reported to improve the results on TIMIT and large data sets tasks. The pretraining was done generatively using a stack of RBMs. Another way for pretraining is discriminative pretraining. In discriminative pretraining, we start from a shallow network and the weights are trained discriminatively. After that, another hidden layer is added between the last hidden layer and the softmax layer, then the weights for the new added layer is again discrimintively learned and so on. Finally backpropagation fine-tuning for the whole network is applied. This way of training was reported to achieve the same results achieved by generative pretraining <ref name=discrDNN><br />
F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent<br />
deep neural networks for conversational speech transcription,” in Proc.<br />
IEEE ASRU, 2011, pp. 24–29.<br />
</ref>.<br />
<br />
= Conclusions and Discussions =<br />
GMMs have been used widely for acoustic modelling, they are easy to train and quite flexible. Since 2009, DNNs were proposed to replace GMMs, they have been proven to be superior to GMMs in many speech recognition tasks. DNNs pretraining is essential when the amount of data is small, it reduces overfitting and the convergence time. Fine-tuning the network to optimize the mutual information can improve the results. The authors think that there are yet many things that can be done in pretraining, fine-tuning, and using different types of hidden units to further increase the performance of DNNs.<br />
<br />
This paper summarizes the recent research that has been done by three research groups in the area of speech recognition. They have shown that DNNs are superior to GMMs in both small and large dataset speech recognition tasks. The authors claimed that the reason for such superiority is that the speech data lies on a manifold which I am not sure if there is any scientific/empirical proof for such claim.<br />
This paper is an excellent source if someone is interested in deep learning in speech recognition. The author assume that the readers are familiar with speech recognition HMM framework, otherwise it will be difficult to follow. Also, the DNN area is moving fast, and the information in the paper is not up to date.<br />
<br />
= References =<br />
<references /></div>Alcaterihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_neural_networks_for_acoustic_modeling_in_speech_recognition&diff=26128deep neural networks for acoustic modeling in speech recognition2015-11-12T03:24:24Z<p>Alcateri: </p>
<hr />
<div>= Introduction =<br />
Classical speech recognition systems use hidden Markov models (HMMs) to model the temporal variations of speech and Gaussian mixture models (GMMs) to determine the likelihood of each HMM state given an acoustic observation. In these systems, the speech signal is represented by a series of Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) extracted from overlapping short windows of the raw signal. Although GMMs are quite flexible and fairly easy to train with the Expectation-Maximization (EM) algorithm, they are inefficient at modelling data that lie close to a nonlinear manifold, as speech data do. Deep neural networks (DNNs) do not suffer from this shortcoming and can therefore learn much better models than GMMs. Over the past few years, advances in machine learning and computer hardware have made it practical to train DNNs, and hence to replace GMMs with DNNs in speech recognition systems. DNNs have been shown to outperform GMMs on both small- and large-vocabulary speech recognition tasks.<br />
<br />
= Training Deep Neural Networks =<br />
DNNs are feed-forward neural networks with multiple hidden layers; the last layer is a softmax layer that gives the class probabilities. The weights of a DNN are learned with the backpropagation algorithm, and it was found empirically that computing the gradient on small random mini-batches is more efficient. To avoid overfitting, early stopping is used: training stops when the accuracy on a validation set starts to decrease. Pretraining is essential when the amount of training data is small. Restricted Boltzmann machines (RBMs) are used for pretraining, except that the first layer uses a Gaussian-Bernoulli RBM (GRBM) since the input is real-valued.<br />
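The paragraph above mentions RBM pretraining without detail. As a rough sketch (not the authors' code), a single contrastive-divergence (CD-1) update for a binary RBM could look like the following; the layer sizes and learning rate are illustrative assumptions, and the Gaussian-visible first layer (GRBM) would need a different visible-unit update:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b_v, b_h, v0, lr=0.05):
    """One CD-1 step for a binary RBM (illustrative, not the paper's code)."""
    p_h0 = sigmoid(v0 @ W + b_h)                 # hidden probs given the data
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    p_v1 = sigmoid(h0 @ W.T + b_v)               # one-step reconstruction
    p_h1 = sigmoid(p_v1 @ W + b_h)
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n  # positive minus negative stats
    b_v += lr * (v0 - p_v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_v, b_h

# toy usage: 8 visible units, 4 hidden units, a batch of binary vectors
W = rng.normal(0, 0.01, (8, 4))
b_v, b_h = np.zeros(8), np.zeros(4)
batch = (rng.random((16, 8)) < 0.5).astype(float)
W, b_v, b_h = cd1_update(W, b_v, b_h, batch)
```

Stacking such RBMs greedily, with each trained layer's hidden activities feeding the next RBM, is the generative pretraining scheme the article refers to.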
<br />
== Generative Pretraining == <br />
<br />
= Interfacing a DNN with an HMM =<br />
<br />
The HMM requires the observation likelihoods <math>p(AcousticInput|HMMstate)</math> to run the forward-backward algorithm or to compute a Viterbi alignment. A DNN outputs the posteriors <math>p(HMMstate|AcousticInput)</math>, which can be converted to a scaled version of the likelihoods by dividing them by <math>p(HMMstate)</math>, the frequency of each HMM state in the training data. This conversion from posteriors to likelihoods is important when the training labels are highly unbalanced.<br />
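The posterior-to-likelihood conversion can be sketched in a few lines; the function name and the small floor added before the logarithm are my own choices, not from the paper:

```python
import numpy as np

def scaled_log_likelihoods(posteriors, state_counts):
    """Convert DNN state posteriors to scaled likelihoods for HMM decoding.

    posteriors:   (n_frames, n_states) softmax outputs p(state | acoustics)
    state_counts: (n_states,) state occupancy counts from the training alignment
    """
    priors = state_counts / state_counts.sum()      # p(state)
    # p(acoustics | state) is proportional to p(state | acoustics) / p(state);
    # work in the log domain to avoid underflow during decoding
    return np.log(posteriors + 1e-20) - np.log(priors)
```

Dividing by the priors is what compensates for unbalanced training labels: a state that is frequent in the alignment has its posterior discounted accordingly.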
<br />
= Phonetic Classification and Recognition on TIMIT =<br />
TIMIT is an acoustic-phonetic continuous speech corpus that has been widely used as a benchmark data set for speech recognition systems, and DNN-HMM systems have outperformed the classical GMM-HMM systems on it. The first successful attempt at building a DNN-HMM speech recognition system was published in 2009 by Mohamed et al.<ref name=firstDBN><br />
A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.<br />
</ref>, who reported a significant improvement in accuracy over the state-of-the-art systems of the time. It was found that the structure of the DNN (i.e., the number of hidden layers and the number of hidden units per layer) has little effect on the accuracy, which made it possible to focus on learning the metaparameters of the DNN. Details of the learning rates, stopping criteria, momentum, L2 weight penalties, minibatch size, pretraining, and fine-tuning can be found in<br />
<ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief<br />
networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22,<br />
Jan. 2012.</ref>.<br />
<br />
== Using Filter-Bank Features ==<br />
MFCC features are commonly used in GMM-HMM systems because they are decorrelated, which makes it possible to avoid full-covariance GMMs. However, some acoustic information is lost in computing MFCCs. DNNs, on the other hand, can work with correlated features, which opened the door to using [http://en.wikipedia.org/wiki/Filter_bank filter-bank] features. It was found that using filter-bank features with DNNs improved the accuracy by 1.7% <ref name=tuning_fb_DBN></ref>.<br />
<br />
== Fine-Tuning DNNs To Optimize Mutual Information ==<br />
In the experiments mentioned earlier in this section, the systems were tuned to optimize the per-frame cross entropy, i.e., the log posterior probability <math>p(l_t|v_t)</math>, where <math>l_t</math> is the label at time <math>t</math> and <math>v_t</math> is the feature vector at the same time step. The transition probabilities and the language models were tuned independently within the HMM framework. The DNN can instead be tuned to optimize the conditional probability <math>p(l_{1:T}|v_{1:T})</math>; this is done for the softmax layer only, with the parameters of the hidden layers <math>h</math> held fixed.<br />
<br />
<math>p(l_{1:T}|v_{1:T}) = p(l_{1:T}|h_{1:T}) = \frac{\exp(\sum_{t=1}^T\gamma_{ij} \phi_{ij}(l_{t-1},l_t) + \sum_{t=1}^T\sum_{d=1}^D \lambda_{l_t,d} h_{td})}{Z(h_{1:T})}</math><br />
<br />
where <math>\phi_{ij}(l_{t-1},l_t)</math> is the transition feature, which takes the value one if <math>l_{t-1} = i</math> and <math>l_{t} = j</math> and zero otherwise, <math>\gamma_{ij}</math> is the parameter associated with that transition feature, and <math>\lambda</math> are the weights of the softmax layer. <math>\gamma</math> and <math>\lambda</math> are tuned using gradient descent, and the experiments show that fine-tuning DNNs to optimize the mutual information improved the accuracy by 5% relative on the TIMIT data set<br />
<ref name=finetuningDNN><br />
A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training<br />
of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp.<br />
2846–2849.<br />
</ref>.<br />
<br />
== Convolutional DNNs for Phone Classification and Recognition ==<br />
<br />
Convolutional DNNs were introduced in 2009 and they were applied to various audio tasks including TIMIT dataset <ref><br />
H. Lee, P. Pham, Y. Largman, and A. Ng, “Unsupervised feature learning for<br />
audio classification using convolutional deep belief networks,” in Advances in<br />
Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J.<br />
Lafferty, C. K. I. Williams, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2009,<br />
pp. 1096–1104.<br />
</ref>. In that work, convolution was applied along the temporal dimension so that the same features could be extracted at different times. Since temporal variations are already handled by the HMM, Abdel-Hamid et al. proposed applying convolution along the frequency axis instead <ref name=convDNN><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref>. Weight-sharing and max-pooling were restricted to nearby frequencies, because acoustic features at widely separated frequencies are very different. This model achieved a phone error rate of 20% on the TIMIT data set, the lowest reported at that time.<br />
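As a hedged illustration of convolution along the frequency axis, the sketch below applies small filters across the bands of one filter-bank frame and max-pools adjacent positions. For simplicity it shares each filter across all frequencies, whereas the model described limits weight-sharing to nearby bands; the filter width and pool size are arbitrary choices:

```python
import numpy as np

def freq_conv_maxpool(fbank_frame, filters, pool=3):
    """Convolve along the frequency axis, then max-pool (illustrative sketch).

    fbank_frame: (n_bands,) filter-bank energies for one frame, e.g. 40 bands
    filters:     (n_filters, width) local filters replicated along frequency
    """
    n_filt, width = filters.shape
    # correlate each filter with the frame along frequency ("valid" positions)
    conv = np.stack([np.convolve(fbank_frame, f[::-1], mode="valid")
                     for f in filters])
    act = np.maximum(conv, 0.0)                    # rectified nonlinearity
    n_pool = act.shape[1] // pool
    # max over each group of `pool` adjacent frequency positions
    return act[:, :n_pool * pool].reshape(n_filt, n_pool, pool).max(axis=2)

frame = np.random.default_rng(1).random(40)        # one 40-band frame
feats = freq_conv_maxpool(frame, np.ones((8, 5)))  # 8 filters of width 5
```

Pooling over nearby frequency positions gives the model some invariance to small spectral shifts, e.g. between speakers with different vocal tract lengths.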
<br />
== DNNs and GMMs ==<br />
As the following table shows, monophone DNNs can outperform the best triphone GMM-HMMs by 1.7%. The reasons behind this success may be the following:<br />
# DNNs are instances of products of experts, in which each parameter is constrained by a large fraction of the data, while GMMs are sums of experts, in which each parameter applies to only a small fraction of the data.<br />
# A GMM assumes that each datapoint is generated by a single component, which makes it inefficient at modeling multiple simultaneous events; DNNs are flexible enough to model multiple simultaneous events.<br />
# GMMs are restricted to uncorrelated features, while DNNs can work with correlated features. This allows DNNs to use correlated features such as filter-banks, and also to analyse a larger window of the signal at each timestep.<br />
{| class="wikitable"<br />
|+ Comparisons among the reported speaker-independent (SI) phone error rate (PER) results on the TIMIT core test set with 192 sentences.<br />
|-<br />
! Method<br />
! PER<br />
|-<br />
| CD-HMM <ref name=cdhmm><br />
Y. Hifny and S. Renals, “Speech recognition using augmented conditional<br />
random fields,” IEEE Trans. Audio Speech Lang. Processing, vol. 17, no. 2, pp.<br />
354–365, 2009.<br />
</ref><br />
| 27.3%<br />
|-<br />
| Augmented Conditional Random Fields <ref name=cdhmm></ref><br />
| 26.6%<br />
|-<br />
| Randomly Initialized Recurrent Neural Nets <ref name=rirnn><br />
A. Robinson, “An application of recurrent nets to phone probability estimation,”<br />
IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.<br />
</ref><br />
| 26.1%<br />
|-<br />
| Bayesian Triphone GMM-HMM <ref name=btgmmhmm><br />
J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone<br />
models,” in Proc. ICASSP, 1998, pp. 409–412.<br />
</ref><br />
| 25.6%<br />
|-<br />
| Monophone HTMs <ref name=mhtms><br />
L. Deng and D. Yu, “Use of differential cepstra as acoustic features in hidden<br />
trajectory modelling for phonetic recognition,” in Proc. ICASSP, 2007, pp.<br />
445–448.<br />
</ref><br />
| 24.8%<br />
|-<br />
| Heterogeneous Classifiers <ref name=hclass><br />
A. Halberstadt and J. Glass, “Heterogeneous measurements and multiple<br />
classifiers for speech recognition,” in Proc. ICSLP, 1998.<br />
</ref><br />
| 24.4%<br />
|-<br />
| Monophone Randomly Initialized DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 23.4%<br />
|-<br />
| Monophone DBN-DNNs (Six Layers) <ref name=tuning_fb_DBN></ref><br />
| 22.4%<br />
|-<br />
| Monophone DBN-DNNs with MMI Training <ref name=finetuningDNN></ref><br />
| 22.1%<br />
|-<br />
| Triphone GMM-HMMs DT W/BMMI <ref name=tgmmhmmbmmi><br />
T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky,<br />
“Exemplar-based sparse representation features: From TIMIT to LVCSR,” IEEE<br />
Trans. Audio Speech Lang. Processing, vol. 19, no. 8, pp. 2598–2613, Nov. 2011.<br />
</ref><br />
| 21.7%<br />
|-<br />
| Monophone DBN-DNNs on FBank (Eight Layers) <ref name=tuning_fb_DBN></ref><br />
| 20.7%<br />
|-<br />
| Monophone MCRBM-DBN-DNNs On FBank (Five Layers) <ref name=mmcrbmdbndnn><br />
G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, “Phone recognition<br />
with the mean-covariance restricted Boltzmann machine,” in Advances in Neural<br />
Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-<br />
Taylor, R.S. Zemel, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2010, pp.<br />
469–477.<br />
</ref><br />
| 20.5%<br />
|-<br />
| Monophone Convolutional DNNs On FBank (Three Layers) <ref name=cdnnfb><br />
O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional<br />
neural networks concepts to hybrid NN-HMM model for speech recognition,”<br />
in Proc. ICASSP, 2012, pp. 4277–4280.<br />
</ref><br />
| 20.0%<br />
|}<br />
<br />
<br />
<br />
= DNN for Large-Vocabulary Speech Recognition =<br />
<br />
The success of DNN-HMM systems on the TIMIT data set opened the door to trying the same technique on larger data sets. It was found that using context-dependent HMM states is essential for large data sets. DNN-HMM systems were tested on five different large tasks and outperformed GMM-HMM systems on every one.<br />
<br />
== Bing-Voice-Search Speech Recognition Task == <br />
Bing-Voice-Search is a 24-hr speech data set with many sources of acoustic variation, such as noise, music, side-speech, accents, sloppy pronunciation, interruptions, and differences between mobile phones. The DNN-HMM system was based on the DNN that worked well for TIMIT; it contained 5 hidden layers of 2048 units each. A window of 11 frames was used to classify the middle frame into the corresponding HMM state, and tri-phone states were used instead of monophones. The DNN-HMM system achieved a sentence accuracy of 69% compared to 63.8% for the GMM-HMM system. The accuracy of the DNN-HMM system was further improved, to 71.7%, by increasing the amount of training data to 48 hrs <ref name=bing><br />
G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep<br />
neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio<br />
Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.<br />
</ref>.<br />
== Switchboard Speech Recognition Task ==<br />
This data set contains over 300 hrs of speech, and the test set is 6.3 hrs. The DNN-HMM system developed for the Bing data set was applied here as well; the DNN used contains 7 hidden layers of 2048 units each. A trigram language model, trained on a 2000-hr corpus, was used.<br />
This data set is publicly available, which allows rigorous comparisons among different techniques. As shown in the table below, the DNN reduced the word error rate from 27.4% to 18.5% <ref name=switchboard><br />
F. Seide, G. Li, and D. Yu, “Conversational speech transcription using<br />
context-dependent deep neural networks,” in Proc. Interspeech, 2011, pp.<br />
437–440.<br />
</ref><br />
. The DNN system performed as well as a GMM system that combines several speaker-adaptive multipass systems and uses nearly 7 times as much acoustic training data as the DNN system.<br />
{| class="wikitable"<br />
|+ Comparison of five different DBN-DNN acoustic models with two strong, discriminatively trained GMM-HMM baseline systems. "40 mix" means a mixture of 40 Gaussians per HMM state. Word error rates in % are shown for two separate test sets, HUB5'00-SWB and RT03S-FSH.<br />
|-<br />
! Technique<br />
! HUB5'00-SWB<br />
! RT03S-FSH<br />
|-<br />
| GMM, 40 MIX DT 309H SI<br />
| 23.6<br />
| 27.4<br />
|-<br />
| NN 1 HIDDEN-LAYER x 4,634 UNITS<br />
| 26.0<br />
| 29.4<br />
|-<br />
| + 2 x 5 NEIGHBORING FRAMES<br />
| 22.4<br />
| 25.7<br />
|-<br />
| DBN-DNN 7 HIDDEN LAYERS x 2,048 UNITS<br />
| 17.1<br />
| 19.6<br />
|-<br />
| + UPDATED STATE ALIGNMENT<br />
| 16.4<br />
| 18.6<br />
|-<br />
| + SPARSIFICATION<br />
| 16.1<br />
| 18.5<br />
|-<br />
| GMM 72 MIX DT 2000H SA<br />
| 17.1<br />
| 18.6<br />
|}<br />
<br />
== Google Voice Input Speech Recognition Task ==<br />
The data set contains search engine queries, short messages, events, and user actions from mobile devices. A well-trained GMM-HMM system was used to align 5870 hrs of speech for the DNN-HMM system. The DNN used has 4 hidden layers with 2560 hidden units per layer. A window of 11 frames was used to classify the middle frame, and 40 log filter-bank features were used to represent each frame. The network was fine-tuned to maximize the mutual information. The DNN was sparsified by setting weights below a certain magnitude threshold to zero. To further improve the accuracy, the DNN-HMM and GMM-HMM models were combined using the segmental conditional random field framework <ref name=scrf><br />
N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained<br />
deep neural networks to large vocabulary speech recognition,” submitted<br />
for publication.<br />
</ref>. The DNN-HMM reduced the word error rate by 23% relative, achieving a word error rate of 12.3%; combining both models (GMM and DNN) further reduced the word error rate to 11.8%.<br />
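The sparsification step described above amounts to magnitude thresholding; a minimal sketch, where the threshold value is an arbitrary tuning choice:

```python
import numpy as np

def sparsify(weights, threshold):
    """Zero every weight whose magnitude falls below the threshold."""
    return np.where(np.abs(weights) < threshold, 0.0, weights)

# toy weight matrix: small-magnitude entries are pruned to zero
W = np.array([[0.5, -0.03],
              [0.02, -0.8]])
W_sparse = sparsify(W, 0.1)
```

Pruning small weights shrinks the model and speeds up decoding with little effect on accuracy, since near-zero weights contribute little to the activations.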
== YouTube Speech Recognition Task ==<br />
The goal of this task is to transcribe YouTube data. For this type of data there is no strong language model to constrain the interpretation of the speech, so a strong acoustic model is essential. A well-trained GMM-HMM system was used to align 1400 hrs of data for the DNN model. A decision-tree clustering algorithm was used to cluster the HMM states into 17552 context-dependent tri-phone states. Feature-space maximum likelihood linear regression (fMLLR)-transformed features were used. Due to the large number of HMM states, only 4 hidden layers were used, to save computational resources for the large softmax layer. After training, sequence-level fine-tuning was performed, and the DNN-HMM and GMM-HMM models were again combined to improve the accuracy. The DNN-HMM system reduced the word error rate by 4.7% absolute, fine-tuning reduced it by a further 0.5%, and combining both models gave another reduction of 0.9%.<br />
== English Broadcast News Speech Recognition Task ==<br />
<br />
A GMM-HMM baseline system was used to align 50 hrs of speech from the 1996 and 1997 English Broadcast News Speech Corpora. SAT and DT features were used to train the DNN. The network consists of six hidden layers of 1024 units each, and the HMM has 2220 triphone states. A window of 9 frames was used to classify the middle frame. After training, fine-tuning was done to optimize the mutual information. The DNN-HMM system achieved a word error rate of 17.5% compared to 18.8% for the best GMM-HMM system <ref name=broadcastDNN><br />
T. N. Sainath, B. Kingsbury, and B. Ramabhadran, “Improvements in using<br />
deep belief networks for large vocabulary continuous speech recognition,”<br />
Speech and Language Algorithm Group, IBM, Yorktown Heights, NY, Tech.<br />
Rep. UTML TR 2010-003, Feb. 2011.<br />
</ref>.<br />
<br />
== Summary for the Main Results for DNN Acoustic Models on Large Data Sets ==<br />
<br />
The following table summarizes the performance of DNN-HMMs compared to GMM-HMMs; the DNNs are clearly superior to the GMMs in accuracy.<br />
{| class="wikitable"<br />
|+ A comparison of the percentage WERs using DNN-HMMs and GMM-HMMs on five different large vocabulary tasks.<br />
|-<br />
! Task<br />
! Hours of training data<br />
! DNN-HMM<br />
! GMM-HMM<br />
! GMM-HMM using larger training data<br />
|-<br />
| SWITCHBOARD (TEST SET 1)<br />
| 309<br />
| 18.5<br />
| 27.4<br />
| 18.6 (2000h)<br />
|-<br />
| SWITCHBOARD (TEST SET 2)<br />
| 309<br />
| 16.1<br />
| 23.6<br />
| 17.1 (2000h)<br />
|-<br />
| ENGLISH BROADCAST NEWS<br />
| 50<br />
| 17.5<br />
| 18.8<br />
| <br />
|-<br />
| BING VOICE SEARCH (sentence error rates)<br />
| 24<br />
| 30.4<br />
| 36.2<br />
| <br />
|-<br />
| GOOGLE VOICE INPUT<br />
| 5870<br />
| 12.3<br />
| <br />
| 16.0 (>> 5870h)<br />
|-<br />
| YouTube<br />
| 1400<br />
| 47.6<br />
| 52.3<br />
| <br />
|}<br />
<br />
= Alternative Pretraining Methods for DNNs =<br />
Pretraining DNNs was reported to improve the results on both the TIMIT and the large-data-set tasks. That pretraining was done generatively, using a stack of RBMs. An alternative is discriminative pretraining: starting from a shallow network, the weights are trained discriminatively; another hidden layer is then added between the last hidden layer and the softmax layer, the weights of the newly added layer are again learned discriminatively, and so on. Finally, backpropagation fine-tuning is applied to the whole network. This way of training was reported to achieve the same results as generative pretraining <ref name=discrDNN><br />
F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent<br />
deep neural networks for conversational speech transcription,” in Proc.<br />
IEEE ASRU, 2011, pp. 24–29.<br />
</ref>.<br />
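A toy sketch of this greedy growing procedure follows. For brevity it trains only the new softmax layer at each stage, whereas the method described discriminatively trains the whole current stack by backpropagation; all sizes, rates, and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(hidden_Ws, W_soft, x):
    """Run the current stack: ReLU hidden layers, then a softmax layer."""
    h = x
    for W in hidden_Ws:
        h = np.maximum(h @ W, 0.0)
    logits = h @ W_soft
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return h, e / e.sum(axis=1, keepdims=True)

def train_softmax(hidden_Ws, W_soft, X, Y, lr=0.1, steps=200):
    # discriminative step (simplified: only the softmax weights are updated)
    for _ in range(steps):
        h, p = forward(hidden_Ws, W_soft, X)
        W_soft -= lr * h.T @ (p - Y) / len(X)   # cross-entropy gradient
    return W_soft

# toy data: 64 examples, 10 features, 3 classes
X = rng.random((64, 10))
Y = np.eye(3)[rng.integers(0, 3, 64)]

hidden_Ws, width, in_dim = [], 16, X.shape[1]
for _ in range(3):                               # grow to three hidden layers
    hidden_Ws.append(rng.normal(0, 0.1, (in_dim, width)))
    in_dim = width
    W_soft = rng.normal(0, 0.1, (width, 3))      # fresh softmax on top
    W_soft = train_softmax(hidden_Ws, W_soft, X, Y)
```

The point of the procedure is that each new layer starts from features already shaped by a discriminative objective, rather than from a purely generative model.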
<br />
= Conclusions and Discussions =<br />
GMMs have been used widely for acoustic modelling; they are easy to train and quite flexible. Since 2009, DNNs have been proposed to replace GMMs and have proven superior in many speech recognition tasks. Pretraining of DNNs is essential when the amount of data is small; it reduces both overfitting and the convergence time. Fine-tuning the network to optimize the mutual information can improve the results further. The authors believe that much can still be done in pretraining, fine-tuning, and the use of different types of hidden units to further increase the performance of DNNs.<br />
<br />
This paper summarizes recent research by three research groups in the area of speech recognition. They show that DNNs are superior to GMMs on both small and large speech recognition data sets. The authors claim that the reason for this superiority is that speech data lie on a manifold, a claim for which I am not sure there is any scientific or empirical proof.<br />
This paper is an excellent source for anyone interested in deep learning for speech recognition. The authors assume that readers are familiar with the HMM framework for speech recognition; otherwise, it will be difficult to follow. Also, the DNN area is moving fast, and the information in the paper is no longer up to date.<br />
<br />
= References =<br />
<references /></div>
human-level control through deep reinforcement learning (2015-10-30, Alcateri)
<hr />
<div>== Introduction ==<br />
<br />
Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. [http://www.gatsby.ucl.ac.uk/~dayan/papers/dw01.pdf "Reinforcement Learning."] Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.<br />
<br />
When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations of high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. The human brain is adept at this type of learning, using dopamine-based systems in the neurons with a structure similar to reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. [http://www.gatsby.ucl.ac.uk/~dayan/papers/sdm97.pdf A neural substrate of prediction and reward.] Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have only been able to replicate the success of humans on fully observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. [http://www.aaai.org/Papers/Symposia/Fall/1993/FS-93-02/FS93-02-003.pdf TD-Gammon, a self-teaching backgammon program, achieves master's play.] AAAI Technical Report (1993)</ref>.<br />
<br />
In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et. al. [http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf "Human-level control through deep reinforcement learning."] Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.<br />
<br />
== Problem Description ==<br />
<br />
The goal of this research is to create a framework that would be able to excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN network is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network will be used to approximate the so-called optimal action-value function <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right]</math><br />
<br />
which is the maximum cumulative sum of rewards <math>r_t\,</math> discounted by <math>\,\gamma</math> at each timestep <math>t\,</math>. This sum can be achieved from a policy <math>\pi = P\left(a|s\right)</math> after making an observation <math>\,s</math> and taking an action <math>\,a</math> <ref name = "main"></ref>. <br />
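The discounted sum inside this expectation can be computed with a single backward pass over a reward sequence; a minimal sketch:

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward: r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g   # Horner-style accumulation from the end
    return g

# with gamma = 0.5: 1 + 0.5*1 + 0.25*1 = 1.75
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

The discount <math>\gamma < 1</math> keeps the sum finite over long episodes and weights near-term rewards more heavily than distant ones.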
<br />
=== Instability of Neural Networks as Function Estimate ===<br />
<br />
Unfortunately, current methods which use deep networks to estimate <math>Q\,</math>suffer from instability or divergence for the following reasons: <br />
<br />
# Correlation within sequence of observations<br />
# Small updates to <math>Q\,</math>can significantly change the policy, and thus the data distribution<br />
# The action values <math>Q\,</math>are correlated with the target values <math>\,y = r_t + \gamma \max_{a'}Q(s', a')</math><br />
<br />
=== Overcoming Instability ===<br />
<br />
One method of overcoming this instability is through the use of a biologically-inspired mechanism called experience replay. This involves storing <math>e_t = \left(s_t, a_t, r_t, s_{t+1}\right)</math> - known as the "experiences" - at each time step in a dataset <math>D_t = \left(e_1, e_2, \ldots, e_t\right)</math>. Then, during each learning update, minibatch updates are performed, where points are sampled uniformly from <math>\,D_t</math>. In practice, only <math>N</math> experiences are stored, where <math>N</math> is some large, finite number (e.g. <math>N=10^6</math>). Incorporating experience replay into the algorithm removes the correlation between observations and smooths out the data distribution, since it takes an average over the past <math>N</math> states. This makes it much more unlikely that instability or divergence will occur.<br />
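A minimal experience-replay store, assuming nothing beyond the description above (the class and method names are my own):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of experiences e_t = (s_t, a_t, r_t, s_{t+1})."""
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)   # oldest experiences drop out
    def add(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))
    def sample(self, batch_size):
        # uniform sampling decorrelates consecutive observations
        return random.sample(self.memory, batch_size)

# toy usage with scalar "states": capacity N = 100, 150 steps of experience
buf = ReplayBuffer(capacity=100)
for t in range(150):
    buf.add(t, 0, 1.0, t + 1)
batch = buf.sample(32)
```

Because the deque has a fixed maximum length, only the most recent <math>N</math> experiences are kept, exactly as described above.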
<br />
Another method used to combat instability is to use a separate network for generating the targets <math>y_i</math> as opposed to the same network. This is implemented by cloning the network <math>\,Q</math> every <math>\,C</math> iterations, and using this static, cloned network to generate the next <math>\,C</math> target values. This additional modification reduces correlation between the target values and the action values, providing more stability to the network. In this paper, <math>\,C = 10^4</math>.<br />
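Cloning the network every <math>C</math> iterations can be sketched as follows, with plain parameter dictionaries standing in for the real network (names are illustrative):

```python
import copy

class TargetQ:
    """Frozen copy of the Q-network parameters, re-cloned every C updates."""
    def __init__(self, online_params, C):
        self.C, self.updates = C, 0
        self.params = copy.deepcopy(online_params)
    def after_update(self, online_params):
        self.updates += 1
        if self.updates % self.C == 0:         # clone every C iterations
            self.params = copy.deepcopy(online_params)

# toy usage: the target lags the online network until the C-th update
online = {"w": [0.0]}
target = TargetQ(online, C=3)
for step in range(3):
    online["w"][0] += 1.0                      # stand-in for a gradient update
    target.after_update(online)
```

Between clones the target parameters are static, so the targets <math>y_i</math> do not chase the action values they are used to train.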
<br />
=== Data & Preprocessing ===<br />
<br />
The data used for this experiment are initially frame-by-frame pixel data from the Atari 2600 emulator, along with the game score (and, in applicable games, the number of lives). The frames are 210 x 160 colour images, so some preprocessing is performed to simplify the data. The first step in encoding a single frame is to take, for each pixel, the maximum of each colour value over the frame being encoded and the previous frame <ref name = "main"></ref>. This removes flickering between frames, since some images are only shown on every even or odd frame. Then the image is converted to greyscale and downsampled to 84 x 84. This process is applied to the <math>m</math> most recent frames (here <math>m=4</math>), and these 84 x 84 x 4 images are the inputs to the network.<br />
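A sketch of this preprocessing under stated assumptions: nearest-neighbour resizing and standard luminance weights, neither of which is specified in the source:

```python
import numpy as np

def preprocess(frame, prev_frame):
    """Encode one 210 x 160 RGB Atari frame as an 84 x 84 greyscale image.

    Taking the pixelwise max with the previous frame removes sprite flicker.
    The resize is nearest-neighbour for simplicity; the exact interpolation
    method is an implementation choice.
    """
    m = np.maximum(frame, prev_frame).astype(float)
    grey = m @ np.array([0.299, 0.587, 0.114])       # luminance weights
    rows = np.linspace(0, m.shape[0] - 1, 84).astype(int)
    cols = np.linspace(0, m.shape[1] - 1, 84).astype(int)
    return grey[np.ix_(rows, cols)]

# toy usage: a black frame followed by a white frame
f0 = np.zeros((210, 160, 3), dtype=np.uint8)
f1 = np.full((210, 160, 3), 255, dtype=np.uint8)
img = preprocess(f1, f0)
```

Stacking the last four such images along a channel axis yields the 84 x 84 x 4 network input described above.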
<br />
== Model Architecture ==<br />
<br />
The framework for building an agent that can learn from environmental inputs is an iterative reward-based system. The agent will perform actions and constantly re-assess how its actions have affected its environment. To model this type of behaviour, the agent attempts to build up a model that relates actions to rewards over time. The underlying model relating actions to rewards, <math>Q</math>, is estimated by a deep convolutional network, and is updated at every step in time.<br />
<br />
The structure of the network itself is as follows. There are separate output units for each possible action, and the only input to the network is the state representation. The outputs are the predicted Q-values for each action performed on the input state. The first hidden layer convolves 32 8x8 filters with stride 4, then applies a rectified nonlinear function <ref>Jarrett K. et. al. [http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf What is the best multi-stage architecture for object recognition?] Proc. IEEE Int. Conf. Comput. Vis. 2146-2153 (2009)</ref>. In the next hidden layer, 64 4x4 filters with stride 2 are convolved and again followed by rectified non-linearity. The next layer is the final convolutional layer, with 64 3x3 filters of stride 1, followed by the rectifier. The final hidden layer in the network is fully-connected, with 512 rectifying units. The output layer is a fully-connected linear layer with a single output for each valid action; the number of valid actions ranged from 4 to 18 depending on the game <ref name = "main"></ref>.<br />
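The spatial sizes implied by these layers can be checked with the standard output-size formula for a convolution without padding (assuming, as the 84 x 84 input suggests, that no padding is used):<br />

```python
def conv_out(size, kernel, stride):
    # Output width/height of a valid (unpadded) convolution.
    return (size - kernel) // stride + 1

h = conv_out(84, 8, 4)   # first conv layer: 32 filters -> 20 x 20 maps
h = conv_out(h, 4, 2)    # second conv layer: 64 filters -> 9 x 9 maps
h = conv_out(h, 3, 1)    # third conv layer: 64 filters -> 7 x 7 maps
flat = h * h * 64        # flattened input to the 512-unit dense layer
```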
<br />
[[File:Network_Architecture.JPG | center]]<br />
<br />
== Training ==<br />
<br />
=== Framework and Additional Setup Details === <br />
<br />
Forty-nine Atari games were considered as experiments. A unique DQN is trained for each game, but the same structure, algorithm, and global parameters (e.g. <math>C</math> or <math>m</math> as above, among others) were used throughout. The values of the global parameters were selected by performing an informal search on a small subset of the 49 games. The goal is to use minimal prior knowledge and perform end-to-end training of these models based on game experience.<br />
<br />
The reward structure for games was slightly changed, clipping negative rewards at -1 and positive rewards at 1, since the scale of scores varies widely from game to game. This could be problematic, since the agent may not properly prioritize higher-scoring actions, but it also helps stabilize the network and allows the same setup to generalize across games. However, the game itself is not otherwise changed. <br />
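The clipping step itself is a one-liner:<br />

```python
def clip_reward(r):
    # Clip rewards into [-1, 1] so every game shares the same reward scale.
    return max(-1.0, min(1.0, r))

clipped = [clip_reward(r) for r in [250.0, 0.0, -75.0, 0.5]]
```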
<br />
There is also a frame-skipping technique employed, in which the agent selects a new action only on every <math>k^{th}</math> frame (<math>k=4</math> here). This allows the agent to play roughly <math>k</math> times more games with the same amount of computation, as the network does not have to be trained on the skipped frames. Arguably, this also creates a more realistic experience for the agent, as human players could not change their actions on every single frame. <br />
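A sketch of the frame-skip loop, where <code>select_action</code> and <code>emulator_step</code> are hypothetical stand-ins for the policy and the emulator:<br />

```python
k = 4  # decide every k-th frame

def select_action(frame):
    return frame % 2        # dummy policy for illustration

def emulator_step(action):
    return action           # dummy emulator returning the next "frame"

decisions = 0
frame = 0
for t in range(60):         # 60 frames of play
    if t % k == 0:          # new decision only on every k-th frame
        action = select_action(frame)
        decisions += 1
    frame = emulator_step(action)   # the last action carries over otherwise
```

Over 60 frames, only 60 / k = 15 forward passes through the policy are needed.<br />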
<br />
The agents are trained on 50 million frames of game play, which amounts to about 38 days of real-time game experience. The RMSProp <ref>Hinton, Geoffrey et al. [http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf Overview of Minibatch Gradient Descent.] University of Toronto.</ref> algorithm, which performs stochastic gradient descent in small batches, is used to train the network.<br />
<br />
=== Algorithm Background ===<br />
<br />
At each step in time, the agent selects an action <math>a_t\,</math> from the set of legal game actions <math>\mathbb{A}</math>. The agent observes an image <math>x_t \in \mathbb{R}^d</math> from the emulator, along with a reward <math>r_t\,</math>. It is impossible to fully understand the current game situation from a single screen, so a sequence of actions and observations <math>s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t</math> is the input state. <br />
<br />
Recall that if we define <math>R_t = \sum_{t'=t}^T \gamma^{t'-t}r_{t'}</math>, where <math>\gamma\,</math> is the discount factor and <math>\,T</math> is the step in which the game terminates, then <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[R_t| s_t=s, a_t=a, \pi\right]</math><br />
<br />
is the optimal action-value function. <br />
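The discounted return <math>R_t</math> can be computed directly from its definition; a small numeric check with made-up rewards:<br />

```python
def discounted_return(rewards, gamma):
    # R_t = sum over t' >= t of gamma^(t'-t) * r_{t'}
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# Rewards (1, 0, 2) with gamma = 0.5:  1 + 0.5*0 + 0.25*2 = 1.5
R = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```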
<br />
==== The Bellman Equation in the Loss Framework ====<br />
<br />
The optimal action-value function obeys the Bellman Equation:<br />
<br />
:<math>Q^*\left(s,a\right) = \mathop{\mathbb{E}_{s'}}\left[r + \gamma \max_{a'}Q^*\left(s',a'\right) | s, a\right] </math><br />
<br />
The intuition behind this identity is as follows: if the optimal value <math>Q^*(s',a')\,</math> at the next time step was known for all possible actions <math>a'</math>, then the optimal strategy is to select the action <math>a'</math> maximizing the expected value above <ref name = "main"> </ref>. Using the Bellman Equation as an iterative update formula is impractical, however, since the action-value function is estimated separately for each sequence and cannot generalize.<br />
<br />
It is necessary, in practice, to operate with an approximation of the action-value function. When a neural network with weights <math>\,\theta</math> is used, it is referred to as a Q-Network. A Q-Network is trained by adjusting <math>\,\theta_t</math> to reduce the mean-squared error in the Bellman Equation. The new target values for training are given by <math>y = r + \gamma\max_{a'} Q\left(s', a'; \theta_t^-\right)</math>, where <math>\theta_t^-\,</math> are the parameters from some previous iteration.<br />
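Numerically, one such target and squared error look like this, with a tiny tabular array of hypothetical values standing in for the network (<code>q_target</code> plays the role of <math>\theta_t^-</math>):<br />

```python
import numpy as np

gamma = 0.99
q_target = np.array([[1.0, 3.0],      # Q(s', a'; theta^-) for 2 states
                     [0.5, 0.2]])     # and 2 actions

r, s_next = 1.0, 0
y = r + gamma * q_target[s_next].max()   # target: 1 + 0.99 * 3 = 3.97
q_estimate = 2.0                         # current Q(s, a; theta_t)
loss = (y - q_estimate) ** 2             # squared Bellman error
```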
<br />
=== The Full Algorithm === <br />
<br />
Now, with all the background behind us, the full Deep Q-Learning with Experience Replay algorithm is presented below:<br />
<br />
<br />
[[File:QLearning_Alg.JPG]]<br />
<br />
<br />
Some notes about the algorithm:<br />
* Replay memory is used to implement the experience replay technique described above<br />
* An episode is one game<br />
* Correlations between target values and the action function <math>Q\,</math> are mitigated by using <math>\hat{Q}</math> for the target values<br />
** Only updated every <math>\,C</math> steps<br />
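The pieces above fit together as in this miniature, tabular rendition of the main loop on a toy 3-state chain (our own construction for illustration, not the Atari setup): epsilon-greedy action selection, replay storage, uniform minibatch sampling, and a target table <code>Q_hat</code> refreshed every <math>\,C</math> steps.<br />

```python
import random

random.seed(0)
gamma, eps, alpha, C = 0.9, 0.3, 0.5, 20
Q = {(s, a): 0.0 for s in range(3) for a in range(2)}
Q_hat = dict(Q)                          # the cloned target network
memory = []

def step_env(s, a):
    # Toy dynamics: action 1 moves right; reaching state 2 pays reward 1.
    s2 = min(s + a, 2)
    return s2, 1.0 if s2 == 2 else 0.0

s = 0
for t in range(1, 401):
    # Epsilon-greedy action selection.
    if random.random() < eps:
        a = random.randrange(2)
    else:
        a = max((0, 1), key=lambda act: Q[(s, act)])
    s2, r = step_env(s, a)
    memory.append((s, a, r, s2))         # store experience e_t
    # Uniform minibatch update from replay memory, bootstrapping on Q_hat.
    for bs, ba, br, bs2 in random.sample(memory, min(8, len(memory))):
        y = br + gamma * max(Q_hat[(bs2, 0)], Q_hat[(bs2, 1)])
        Q[(bs, ba)] += alpha * (y - Q[(bs, ba)])
    if t % C == 0:
        Q_hat = dict(Q)                  # clone Q every C steps
    s = 0 if s2 == 2 else s2             # episode restarts at the goal

best_action_at_start = max((0, 1), key=lambda act: Q[(0, act)])
```

After training, the greedy policy at the start state prefers the rewarding action.<br />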
<br />
== Results ==<br />
<br />
=== Evaluation Procedure === <br />
<br />
The trained networks played each game 30 times, up to 5 minutes at a time. The random agent, which is the baseline comparison, chooses a random action every 6 frames (10 Hz). The human player uses the same emulator as the agents, and played under controlled conditions (most notably without sound). The human performance is the average reward from around 20 episodes of the game lasting up to 5 minutes, after 2 hours of practice playing each game. The human performance is set to be 100%, and the random agent has performance set to 0%.<br />
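This anchoring implies the usual linear normalization of raw scores; the numbers below are hypothetical, chosen only to illustrate the formula:<br />

```python
def normalized_score(agent, random_agent, human):
    # 0 % at the random agent's score, 100 % at the human's score.
    return 100.0 * (agent - random_agent) / (human - random_agent)

# An agent scoring 90 where random play scores 10 and human play 50
# lands at 200 % of human performance.
pct = normalized_score(agent=90.0, random_agent=10.0, human=50.0)
```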
<br />
=== Raw Score Results ===<br />
<br />
The DQN agent outperforms the best existing reinforcement learning methods on 43 of the games without incorporating prior knowledge about Atari games. Furthermore, the agent scores at least 75% of the human score on more than half of the games. Also, DQN performs well across a number of types of games. However, games which involve extended planning strategies still pose a major problem to DQN (e.g. Montezuma's Revenge). These results are visualized in the figure below: <br />
<br />
[[File:Performance.JPG ]]<br />
<br />
=== Results with model components removed ===<br />
<br />
Two important advances in this paper were presented: experience replay and creating a separate network to evaluate the targets. To visualize the impact of these advances, the network was trained with and without both of these concepts, and evaluated on its performance in each case. The results are shown in the table below, in percentage form as above:<br />
<br />
{| class="wikitable"<br />
|-<br />
! Game<br />
! With Replay and Target Q<br />
! With Replay, Without Target Q<br />
! Without Replay, With Target Q<br />
! Without Replay, Without Target Q<br />
|-<br />
| Breakout<br />
| 316.8<br />
| 240.7<br />
| 10.2<br />
| 3.2<br />
|-<br />
| Enduro<br />
| 1006.3<br />
| 831.4<br />
| 141.9<br />
| 29.1<br />
|- <br />
| River Raid<br />
| 7446.6<br />
| 4102.8<br />
| 2867.7<br />
| 1453.0<br />
|-<br />
| Seaquest<br />
| 2894.4<br />
| 822.6<br />
| 1003.0<br />
| 275.8<br />
|-<br />
| Space Invaders<br />
| 1088.9<br />
| 826.3<br />
| 373.2<br />
| 302.0<br />
|}<br />
<br />
Clearly, experience replay and maintaining a secondary network for computing target values are both important. From these results, experience replay appears to provide the larger benefit on its own, with Seaquest being the only exception.<br />
<br />
== Conclusion == <br />
<br />
The framework presented has demonstrated the ability to learn how to play Atari games, given minimal prior knowledge of the game and very basic inputs. Using reinforcement learning with the Q-network architecture was more effective than previous similar attempts, since experience replay and a separate target network were utilized in training. These two modifications removed correlations between sequential inputs, which improved stability in the network. Future work could improve the experience replay algorithm: instead of sampling uniformly from the replay memory, the sampling could be biased towards high-reward events. This may reintroduce some instability, but it is certainly worth investigating.<br />
<br />
== Bibliography ==<br />
<references /></div>
<hr />
<div>== Introduction ==<br />
<br />
Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. [http://www.gatsby.ucl.ac.uk/~dayan/papers/dw01.pdf "Reinforcement Learning."] Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.<br />
<br />
When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations from high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. The human brain is adept at this type of learning, using systems involving dopamine in the neurons with a similar structure to reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. [http://www.gatsby.ucl.ac.uk/~dayan/papers/sdm97.pdf A neural substrate of prediction and reward.] Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have only been able to replicate the success of humans on fully-observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. [http://www.aaai.org/Papers/Symposia/Fall/1993/FS-93-02/FS93-02-003.pdf TD-Gammon, a self-teaching backgammon program, achieves master's play.] AAAI Technical Report (1993)</ref>.<br />
<br />
In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et. al. [http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf "Human-level control through deep reinforcement learning."] Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.<br />
<br />
== Problem Description ==<br />
<br />
The goal of this research is to create a framework that would be able to excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN network is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network will be used to approximate the so-called optimal action-value function <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right]</math><br />
<br />
which is the maximum cumulative sum of rewards <math>r_t\,</math> discounted by <math>\,\gamma</math> at each timestep <math>t\,</math>. This sum can be achieved from a policy <math>\pi = P\left(a|s\right)</math> after making an observation <math>\,s</math> and taking an action <math>\,a</math> <ref name = "main"></ref>. <br />
<br />
=== Instability of Neural Networks as Function Estimate ===<br />
<br />
Unfortunately, current methods which use deep networks to estimate <math>Q\,</math> suffer from instability or divergence for the following reasons: <br />
<br />
# Correlation within sequence of observations<br />
# Small updates to <math>Q\,</math> can significantly change the policy, and thus the data distribution<br />
# The action values <math>Q\,</math> are correlated with the target values <math>\,y = r_t + \gamma \max_{a'}Q(s', a')</math><br />
<br />
=== Overcoming Instability ===<br />
<br />
One method of overcoming this instability is through the use of a biologically-inspired mechanism called experience replay. This involves storing <math>e_t = \left(s_t, a_t, r_t, s_{t+1}\right)</math> - known as the "experiences" - at each time step in a dataset <math>D_t = \left(e_1, e_2, \ldots, e_t\right)</math>. Then, during each learning update, minibatch updates are performed, where points are sampled uniformly from <math>\,D_t</math>. In practice, only the <math>N</math> most recent experiences are stored, where <math>N</math> is some large, finite number (e.g. <math>N=10^6</math>). Incorporating experience replay into the algorithm removes the correlation between observations and smooths out the data distribution, since it takes an average over the past <math>N</math> states. This makes instability or divergence much less likely.<br />
<br />
Another method used to combat instability is to use a separate network for generating the targets <math>y_i</math> as opposed to the same network. This is implemented by cloning the network <math>\,Q</math> every <math>\,C</math> iterations, and using this static, cloned network to generate the next <math>\,C</math> target values. This additional modification reduces correlation between the target values and the action values, providing more stability to the network. In this paper, <math>\,C = 10^4</math>.<br />
<br />
=== Data & Preprocessing ===<br />
<br />
The data used for this experiment is initially frame-by-frame pixel data from the Atari 2600 emulator, along with the game score (and the number of lives in applicable games). The frames are 210 x 160 colour images, so some preprocessing is performed to simplify the data. The first step in encoding a single frame is to take, for each pixel colour value, the maximum over the frame being encoded and the previous frame <ref name = "main"></ref>. This removes flickering between frames, as some objects are only drawn on every even or odd frame. Then, the image is converted to greyscale and downsampled to 84 x 84. This process is applied to the <math>m</math> most recent frames (here <math>m=4</math>), and these 84 x 84 x 4 images are the inputs to the network. <br />
<br />
== Model Architecture ==<br />
<br />
The framework for building an agent that can learn from environmental inputs is an iterative reward-based system. The agent will perform actions and constantly re-assess how its actions have affected its environment. To model this type of behaviour, the agent attempts to build up a model that relates actions to rewards over time. The underlying model relating actions to rewards, <math>Q</math>, is estimated by a deep convolutional network, and is updated at every step in time.<br />
<br />
The structure of the network itself is as follows. There are separate output units for each possible action, and the only input to the network is the state representation. The outputs are the predicted Q-values for each action performed on the input state. The first hidden layer convolves 32 8x8 filters with stride 4, then applies a rectified nonlinear function <ref>Jarrett K. et. al. [http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf What is the best multi-stage architecture for object recognition?] Proc. IEEE Int. Conf. Comput. Vis. 2146-2153 (2009)</ref>. The next hidden layer convolves 64 4x4 filters with stride 2, again followed by a rectified non-linearity. The final convolutional layer has 64 3x3 filters with stride 1, followed by the rectifier. The final hidden layer is fully-connected, with 512 rectifying units. The output layer is a fully-connected linear layer with a single output for each valid action; the number of valid actions ranged from 4 to 18 depending on the game <ref name = "main"></ref>.<br />
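The spatial dimensions implied by these layer descriptions can be checked with a short calculation (assuming "valid" convolutions with no padding):<br />

```python
def conv_out(size, kernel, stride):
    """Output width of a 'valid' convolution along one spatial dimension."""
    return (size - kernel) // stride + 1

# Layer-by-layer spatial sizes for the 84x84x4 input described above
h = conv_out(84, 8, 4)   # first conv layer: 32 8x8 filters, stride 4 -> 20
h = conv_out(h, 4, 2)    # second conv layer: 64 4x4 filters, stride 2 -> 9
h = conv_out(h, 3, 1)    # third conv layer: 64 3x3 filters, stride 1 -> 7
flat = h * h * 64        # features fed into the 512-unit fully-connected layer
```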
<br />
[[File:Network_Architecture.JPG | center]]<br />
<br />
== Training ==<br />
<br />
=== Framework and Additional Setup Details === <br />
<br />
Forty-nine Atari games were used as experiments. A separate DQN is trained for each game, but the same structure, algorithm, and global parameters (e.g. <math>C</math> or <math>m</math> as above, among others) were used throughout. The values of the global parameters were selected by performing an informal search on a small subset of the 49 games. The goal is to use minimal prior knowledge and perform end-to-end training of these models based only on game experience.<br />
<br />
The reward structure of the games was slightly changed: since the scale of scores varies greatly from game to game, positive rewards were clipped at 1 and negative rewards at -1. This could be problematic, since the agent may not properly prioritize higher-scoring actions, but it helps stabilize the network and allows the same settings to be used across games. The games themselves are not otherwise changed. <br />
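The clipping itself is a one-line transformation (sketch):<br />

```python
def clip_reward(r):
    """Clip positive rewards at 1 and negative rewards at -1; 0 stays 0."""
    return max(-1.0, min(1.0, r))
```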
<br />
A frame-skipping technique is also employed, in which the agent only selects an action every <math>k^{th}</math> frame (here <math>k=4</math>), with its last action repeated on the skipped frames. This allows the agent to play roughly <math>k</math> times more games in the same amount of computation, as the network does not have to process the skipped frames. Arguably, this also creates a more realistic setting for the agent, as human players cannot change their actions on every single frame. <br />
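Frame skipping can be sketched as an action-repeat wrapper (illustrative; env_step here is a hypothetical stand-in for the emulator's single-frame step function):<br />

```python
def step_with_skip(env_step, action, k=4):
    """Repeat the chosen action for k frames, accumulating the reward, so the
    network only has to act on every k-th frame."""
    total_reward, frame, done = 0.0, None, False
    for _ in range(k):
        frame, reward, done = env_step(action)  # one emulator frame
        total_reward += reward
        if done:
            break
    return frame, total_reward, done
```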
<br />
The agents are trained on 50 million frames of game play, which corresponds to about 38 days of game experience. The RMSProp <ref>Hinton, Geoffrey et al. [http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf Overview of Minibatch Gradient Descent.] University of Toronto.</ref> algorithm, a variant of stochastic gradient descent that operates on small minibatches, is used to train the network.<br />
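A generic RMSProp update has the following form (a sketch of the technique described in the cited lecture notes; the hyperparameter values here are illustrative, not the paper's):<br />

```python
import numpy as np

def rmsprop_update(w, grad, cache, lr=0.001, decay=0.95, eps=1e-8):
    """One RMSProp step: scale the gradient by a running RMS of past gradients."""
    cache = decay * cache + (1 - decay) * grad ** 2   # running average of squared gradients
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache
```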
<br />
=== Algorithm Background ===<br />
<br />
At each step in time, the agent selects an action <math>a_t\,</math> from the set of legal game actions <math>\mathbb{A}</math>. The agent observes an image <math>x_t \in \mathbb{R}^d</math> from the emulator, along with a reward <math>r_t\,</math>. It is impossible to fully understand the current game situation from a single screen, so a sequence of actions and observations <math>s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t</math> is the input state. <br />
<br />
Recall that if we define <math>R_t = \sum_{t'=t}^T \gamma^{t'-t}r_{t'}</math>, where <math>\gamma\,</math> is the discount factor and <math>\,T</math> is the step at which the game terminates, then <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[R_t| s_t=s, a_t=a, \pi\right]</math><br />
<br />
is the optimal action-value function. <br />
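As a concrete check of this definition, the discounted return can be computed directly from a list of remaining rewards:<br />

```python
def discounted_return(rewards, gamma=0.99):
    """R_t = sum over t' >= t of gamma^(t'-t) * r_{t'} for the remaining rewards."""
    # e.g. rewards [1, 1, 1] with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75
    return sum(gamma ** i * r for i, r in enumerate(rewards))
```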
<br />
==== The Bellman Equation in the Loss Framework ====<br />
<br />
The optimal action-value function obeys the Bellman Equation:<br />
<br />
:<math>Q^*\left(s,a\right) = \mathop{\mathbb{E}_{s'}}\left[r + \gamma \max_{a'}Q^*\left(s',a'\right) | s, a\right] </math><br />
<br />
The intuition behind this identity is as follows: if the optimal value <math>Q^*(s',a')\,</math> at the next time step were known for all possible actions <math>a'\,</math>, then the optimal strategy is to select the action <math>a'</math> maximizing the expected value above <ref name = "main"> </ref>. Using the Bellman Equation as an iterative update formula is impractical, however, since the action-value function would be estimated separately for each sequence and could not generalize.<br />
<br />
It is necessary, in practice, to operate with an approximation of the action-value function. When a neural network with weights <math>\,\theta</math> is used, it is referred to as a Q-Network. A Q-Network is trained by adjusting <math>\,\theta_t</math> to reduce the mean-squared error in the Bellman Equation. The new target values for training are given by <math>y = r + \gamma\max_{a'} Q\left(s', a'; \theta_t^-\right)</math>, where <math>\theta_t^-\,</math> are the parameters from some previous iteration.<br />
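The target construction and the squared Bellman error can be sketched in vectorized form (a simplified illustration; terminal states, where the target reduces to <math>r</math>, are handled explicitly):<br />

```python
import numpy as np

def td_targets(rewards, q_next, gamma=0.99, terminal=None):
    """y = r + gamma * max_a' Q(s', a'; theta^-), with y = r at terminal states.
    q_next holds the frozen target network's Q-values for each next state."""
    terminal = np.zeros(len(rewards), bool) if terminal is None else np.asarray(terminal)
    return np.where(terminal, rewards, rewards + gamma * q_next.max(axis=1))

def mse_loss(q_pred, y):
    """Mean-squared Bellman error, minimized with respect to the online weights."""
    return float(np.mean((q_pred - y) ** 2))
```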
<br />
=== The Full Algorithm === <br />
<br />
Now, with all the background behind us, the full Deep Q-Learning with Experience Replay algorithm is presented below:<br />
<br />
<br />
[[File:QLearning_Alg.JPG]]<br />
<br />
<br />
Some notes about the algorithm:<br />
* Replay memory is used to implement the experience replay technique described above<br />
* An episode is one game<br />
* Correlations between the target values and the action-value function <math>Q\,</math> are mitigated by using <math>\hat{Q}</math> for the target values<br />
** Only updated every <math>\,C</math> steps<br />
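The overall structure above can be sketched as a training loop (a skeleton only: q_values, train_step, and sync_target are hypothetical stand-ins for the network's forward pass, one minibatch gradient step, and the cloning of <math>Q</math> into <math>\hat{Q}</math>):<br />

```python
import random

def dqn_train(env_reset, env_step, q_values, train_step, sync_target,
              n_actions, episodes=100, C=10_000, batch_size=32,
              epsilon=0.1, memory_capacity=10**6):
    """Skeleton of Deep Q-Learning with Experience Replay."""
    memory = []
    steps = 0
    for _ in range(episodes):                       # one episode = one game
        s, done = env_reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                q = q_values(s)
                a = max(range(n_actions), key=lambda i: q[i])
            s_next, r, done = env_step(a)
            memory.append((s, a, r, s_next, done))  # store the experience
            if len(memory) > memory_capacity:
                memory.pop(0)                       # keep only the most recent N
            if len(memory) >= batch_size:
                train_step(random.sample(memory, batch_size))  # uniform minibatch
            steps += 1
            if steps % C == 0:
                sync_target()                       # refresh Q_hat every C steps
            s = s_next
```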
<br />
== Results ==<br />
<br />
=== Evaluation Procedure === <br />
<br />
The trained networks played each game 30 times, for up to 5 minutes at a time. The random agent, which serves as the baseline comparison, chooses a random action every 6 frames (10 Hz). The human player used the same emulator as the agents and played under controlled conditions (most notably without sound). The human performance is the average reward from around 20 episodes of up to 5 minutes each, after 2 hours of practice on each game. Human performance is set to 100%, and the random agent's performance to 0%.<br />
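Fixing these two reference points implies a linear rescaling of raw scores (sketch):<br />

```python
def normalized_score(agent_score, random_score, human_score):
    """Map a raw score to the scale where random play = 0% and the human = 100%."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)
```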
<br />
=== Raw Score Results ===<br />
<br />
The DQN agent outperforms the best existing reinforcement learning methods on 43 of the 49 games without incorporating any prior knowledge about Atari games. Furthermore, the agent scores at least 75% of the human score on more than half of the games, and it performs well across many different genres of game. However, games requiring extended planning strategies still pose a major problem for DQN (e.g. Montezuma's Revenge). These results are visualized in the figure below: <br />
<br />
[[File:Performance.JPG ]]<br />
<br />
=== Results with model components removed ===<br />
<br />
Two important advances were presented in this paper: experience replay and a separate network for evaluating the targets. To measure the impact of these advances, the network was trained with and without each of these components, and its performance was evaluated in each of the four cases. The results are shown in the table below, in percentage form as above:<br />
<br />
{| class="wikitable"<br />
|-<br />
! Game<br />
! With Replay and Target Q<br />
! With Replay, Without Target Q<br />
! Without Replay, With Target Q<br />
! Without Replay, Without Target Q<br />
|-<br />
| Breakout<br />
| 316.8<br />
| 240.7<br />
| 10.2<br />
| 3.2<br />
|-<br />
| Enduro<br />
| 1006.3<br />
| 831.4<br />
| 141.9<br />
| 29.1<br />
|- <br />
| River Raid<br />
| 7446.6<br />
| 4102.8<br />
| 2867.7<br />
| 1453.0<br />
|-<br />
| Seaquest<br />
| 2894.4<br />
| 822.6<br />
| 1003.0<br />
| 275.8<br />
|-<br />
| Space Invaders<br />
| 1088.9<br />
| 826.3<br />
| 373.2<br />
| 302.0<br />
|}<br />
<br />
Clearly, experience replay and maintaining a secondary network for computing target values are both important. From these results, experience replay appears to be the more important of the two on its own, except in Seaquest, where the target network alone gives the larger improvement.<br />
<br />
== Conclusion == <br />
<br />
The framework presented has demonstrated the ability to learn to play Atari games, given minimal prior knowledge of the game and very basic inputs. Reinforcement learning with the Q-network architecture was more effective than previous similar attempts because experience replay and a separate target network were used in training. These two modifications removed correlations between sequential inputs, which improved the stability of the network. Future work could improve the experience replay algorithm: instead of sampling uniformly from the replay memory, sampling could be biased towards high-reward events. Although this may reintroduce some instability, it is certainly worth investigating.<br />
<br />
== Bibliography ==<br />
<references /></div>
human-level control through deep reinforcement learning, revision of 2015-10-30T15:05:16Z by Alcateri (http://wiki.math.uwaterloo.ca/statwiki/index.php?title=human-level_control_through_deep_reinforcement_learning&diff=25674)<p>Alcateri: /* Introduction */</p>
<hr />
<div>== Introduction ==<br />
<br />
Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. [http://www.gatsby.ucl.ac.uk/~dayan/papers/dw01.pdf "Reinforcement Learning."] Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.<br />
<br />
When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations of high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. The human brain is adept at this type of learning, using systems involving dopamine in the neurons with a structure similar to reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. [http://www.gatsby.ucl.ac.uk/~dayan/papers/sdm97.pdf A neural substrate of prediction and reward.] Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have been unable to replicate this human ability, as they have only performed well on fully-observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. [http://www.aaai.org/Papers/Symposia/Fall/1993/FS-93-02/FS93-02-003.pdf TD-Gammon, a self-teaching backgammon program, achieves master's play.] AAAI Technical Report (1993)</ref>.<br />
<br />
In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et. al. [http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf "Human-level control through deep reinforcement learning."] Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.<br />
<br />
== Problem Description ==<br />
<br />
The goal of this research is to create a framework that would be able to excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN network is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network will be used to approximate the so-called optimal action-value function <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right]</math><br />
<br />
which is the maximum expected sum of rewards <math>r_t\,</math> discounted by <math>\,\gamma</math> at each timestep <math>t\,</math>, achievable by any behaviour policy <math>\pi = P\left(a|s\right)</math>, after making an observation <math>\,s</math> and taking an action <math>\,a</math> <ref name = "main"></ref>. <br />
<br />
=== Instability of Neural Networks as Function Estimate ===<br />
<br />
Unfortunately, current methods which use deep networks to estimate <math>Q\,</math> suffer from instability or divergence for the following reasons: <br />
<br />
# Correlation within the sequence of observations<br />
# Small updates to <math>Q\,</math> can significantly change the policy, and thus the data distribution<br />
# The action values <math>Q\,</math> are correlated with the target values <math>\,y = r_t + \gamma \max_{a'}Q(s', a')</math><br />
<br />
== Bibliography ==<br />
<references /></div>
human-level control through deep reinforcement learning, revision of 2015-10-30T15:04:26Z by Alcateri (http://wiki.math.uwaterloo.ca/statwiki/index.php?title=human-level_control_through_deep_reinforcement_learning&diff=25673)<p>Alcateri: /* Introduction */</p>
<hr />
<div>== Introduction ==<br />
<br />
Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. [http://www.gatsby.ucl.ac.uk/~dayan/papers/dw01.pdf "Reinforcement Learning."] Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.<br />
<br />
When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations from high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. Thus, the models need to be capable of dealing with large volumes of data. The human brain is adept at this type of learning, using systems involving dopamine in the neurons with a similar structure to reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. [http://www.gatsby.ucl.ac.uk/~dayan/papers/sdm97.pdf A neural substrate of prediction and reward.] Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have only been able replicate the success of humans, as they have only performed well on fully-observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. [http://www.aaai.org/Papers/Symposia/Fall/1993/FS-93-02/FS93-02-003.pdf TD-Gammon, a self-teaching backgammon program, achieves master's play.] AAAI Techinical Report (1993)</ref>.<br />
<br />
In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et. al. [http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf "Human-level control through deep reinforcement learning."] Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.<br />
<br />
== Problem Description ==<br />
<br />
The goal of this research is to create a framework that would be able to excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN network is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network will be used to approximate the so-called optimal action-value function <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right]</math><br />
<br />
which is the maximum cumulative sum of rewards <math>r_t\,</math> discounted by <math>\,\gamma</math> at each timestep <math>t\,</math>. This sum can be achieved from a policy <math>\pi = P\left(a|s\right)</math> after making an observation <math>\,s</math> and taking an action <math>\,a</math> <ref name = "main"></ref>. <br />
<br />
=== Instability of Neural Networks as Function Estimate ===<br />
<br />
Unfortunately, current methods which use deep networks to estimate <math>Q\,</math>suffer from instability or divergence for the following reasons: <br />
<br />
# Correlation within sequence of observations<br />
# Small updates to <math>Q\,</math>can significantly change the policy, and thus the data distribution<br />
# The action values <math>Q\,</math>are correlated with the target values <math>\,y = r_t + \gamma \max_{a'}Q(s', a')</math><br />
<br />
=== Overcoming Instability ===<br />
<br />
One method of overcoming this instability is through the use of a biologically-inspired mechanism called experience replay. This involves storing <math>e_t = \left(s_t, a_t, r_t, s_{t+1}\right)</math> - known as the "experiences" - at each time step in a dataset <math>D_t = \left(e_1, e_2, \ldots, e_t\right)</math>. Then, during each learning update, minibatch updates are performed, where points are sampled uniformly from <math>\,D_t</math>. In practice, only <math>N</math> experiences are stored, where <math>N</math> is some large, finite number (e.g. <math>N=10^6</math>). Incorporating experience replay into the algorithm removes the correlation between observations and smooths out the data distribution, since it takes an average over the past <math>N</math> states. This makes it much more unlikely that instability or divergence will occur.<br />
<br />
Another method used to combat instability is to use a separate network for generating the targets <math>y_i</math> as opposed to the same network. This is implemented by cloning the network <math>\,Q</math> every <math>\,C</math> iterations, and using this static, cloned network to generate the next <math>\,C</math> target values. This additional modification reduces correlation between the target values and the action values, providing more stability to the network. In this paper, <math>\,C = 10^4</math>.<br />
<br />
=== Data & Preprocessing ===<br />
<br />
The data used for this experiment is initially frame-by-frame pixel data from the Atari 2600 emulator, along with the game score (and the number of lives in applicable games). The frames are 210 x 160 images with colour, so some preprocessing is performed to simplify the data. The first step to encode a single frame is to take the maximum value for each pixel colour value over the frame being encoded and the previous frame <ref name = "main"></ref>. This removes flickering between frames, as sometimes images are only shown on every even or odd frame. Then, the image is converted to greyscale and downsampled to 84 x 84. This process is applied to the <math>m</math> most recent frames (here <math>m=4</math>), and these 84x84x4 images are the inputs to the network. <br />
<br />
== Model Architecture ==<br />
<br />
The framework for building an agent that can learn from environmental inputs is an iterative reward-based system. The agent will perform actions and constantly re-assess how its actions have affected its environment. To model this type of behaviour, the agent attempts to build up a model that relates actions to rewards over time. The underlying model relating actions to rewards, <math>Q</math>, is estimated by a deep convolutional network, and is updated at every step in time.<br />
<br />
The structure of the network itself is as follows. There are separate output units for each possible action, and the only input to the network is the state representation. The outputs are the predicted Q-values for each action performed on the input state. The first hidden layer convolves 32 8x8 filters with stride 4, then applies a rectified nonlinear function <ref>Jarrett K. et. al. [http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf What is the best multi-stage architecture for object recognition?] Proc. IEEE Int. Conf. Comput. Vis. 2146-2153 (2009)</ref>. In the next hidden layer, 64 4x4 filters with stride 2 are convolved and again followed by rectified non-linearity. The next layer is the final convolutional layer, with 64 3x3 filters of stride 1, followed by the rectifier. The final hidden layer in the network is fully-connected, with 512 rectifying units. The output layer is a fully-connected linear layer with a single output for each valid action, the number of which ranged from 4 to 18 depending on the game <ref name = "main"></ref>.<br />
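The spatial dimensions flowing through the convolutional stack can be checked with the standard valid-convolution formula (a sketch; the paper does not state padding explicitly, but these kernel sizes and strides are consistent with no padding):

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a valid convolution: floor((size - kernel) / stride) + 1."""
    return (size - kernel) // stride + 1

h = conv_out(84, 8, 4)   # first conv layer:  20 x 20 x 32
h = conv_out(h, 4, 2)    # second conv layer:  9 x 9 x 64
h = conv_out(h, 3, 1)    # third conv layer:   7 x 7 x 64
flat = h * h * 64        # 3136 units feed the 512-unit fully-connected layer
```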
<br />
[[File:Network_Architecture.JPG | center]]<br />
<br />
== Training ==<br />
<br />
=== Framework and Additional Setup Details === <br />
<br />
Forty-nine Atari games were considered as experiments. A unique DQN is trained for each game, but the same structure, algorithm, and global parameters (e.g. <math>C</math> or <math>m</math> as above, among others) were used throughout. The values of the global parameters were selected by performing an informal search on a small subset of the 49 games. The goal is to use minimal prior knowledge and perform end-to-end training of these models based on game experience.<br />
<br />
The reward structure for games was slightly changed, clipping negative rewards at -1 and positive rewards at 1, since the scale of scores varies from game to game. This could be problematic since the agent may not properly prioritize higher-scoring actions, but it also helps stabilize the network and allows the same learning rate to generalize across games. The game itself is not otherwise changed. <br />
<br />
There is also a frame-skipping technique employed, in which the agent selects an action only on every <math>k^{th}</math> frame and its last action is repeated on the skipped frames (<math>k=4</math> here). This allows the agent to play roughly <math>k</math> times more games, as the network does not have to be trained on the skipped frames. Arguably, this also creates a more realistic experience for the agent, as human players cannot change their actions on every single frame. <br />
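The action-repeat loop, combined with the reward clipping described above, can be sketched as follows. Here `env_step` is a hypothetical emulator hook returning a frame and a raw score change; clipping per skipped frame is an illustrative choice, not a detail specified by the paper:

```python
def act_with_skip(env_step, action, k=4):
    """Repeat `action` for k frames, accumulating the clipped reward."""
    total = 0.0
    frame = None
    for _ in range(k):
        frame, raw = env_step(action)
        total += max(-1.0, min(1.0, raw))  # clip each frame's score change to [-1, 1]
    return frame, total
```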
<br />
The agents are trained on 50 million frames of game play, which amounts to about 38 days of game time. The RMSProp <ref>Hinton, Geoffrey et al. [http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf Overview of Minibatch Gradient Descent.] University of Toronto.</ref> algorithm, a variant of minibatch stochastic gradient descent that adapts a separate learning rate for each parameter, is used to train the network.<br />
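A single RMSProp step divides each gradient by a running root-mean-square of its own history. A sketch on a flat list of parameters (the hyperparameter values here are illustrative defaults, not necessarily those used in the paper):

```python
def rmsprop_update(w, grad, cache, lr=0.00025, decay=0.95, eps=0.01):
    """One RMSProp step; `cache` carries the running mean of squared gradients."""
    new_w, new_cache = [], []
    for wi, gi, ci in zip(w, grad, cache):
        ci = decay * ci + (1 - decay) * gi * gi  # update squared-gradient average
        new_cache.append(ci)
        new_w.append(wi - lr * gi / (ci ** 0.5 + eps))  # per-parameter step size
    return new_w, new_cache
```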
<br />
=== Algorithm Background ===<br />
<br />
At each step in time, the agent selects an action <math>a_t\,</math> from the set of legal game actions <math>\mathbb{A}</math>. The agent observes an image <math>x_t \in \mathbb{R}^d</math> from the emulator, along with a reward <math>r_t\,</math>. It is impossible to fully understand the current game situation from a single screen, so a sequence of actions and observations <math>s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t</math> is the input state. <br />
<br />
Recall that if we define <math>R_t = \sum_{t'=t}^T \gamma^{t'-t}r_{t'}</math>, where <math>\gamma\,</math> is the discount factor and <math>\,T</math> is the step in which the game terminates, then <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[R_t| s_t=s, a_t=a, \pi\right]</math><br />
<br />
is the optimal action-value function. <br />
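As a concrete check of the definition, the discounted return from the start of an episode can be computed directly (the value of gamma here is illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """R_t for t = 0: the sum of gamma^(t' - t) * r_{t'} over the episode."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))
```

With <math>\gamma < 1</math>, rewards far in the future contribute less to <math>R_t</math> than immediate ones, which is what makes the infinite-horizon sum well behaved.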
<br />
==== The Bellman Equation in the Loss Framework ====<br />
<br />
The optimal action-value function obeys the Bellman Equation:<br />
<br />
:<math>Q^*\left(s,a\right) = \mathop{\mathbb{E}_{s'}}\left[r + \gamma \max_{a'}Q^*\left(s',a'\right) | s, a\right] </math><br />
<br />
The intuition behind this identity is as follows: if the optimal value <math>Q^*(s',a')\,</math> at the next time step was known for all possible actions <math>a'\,</math>, then the optimal strategy is to select the action <math>a'</math> maximizing the expected value above <ref name = "main"> </ref>. Using the Bellman Equation as an iterative update formula is impractical, however, since the action-value function is estimated separately for each sequence and cannot generalize.<br />
<br />
It is necessary, in practice, to operate with an approximation of the action-value function. When a neural network with weights <math>\,\theta</math> is used, it is referred to as a Q-Network. A Q-Network is trained by adjusting <math>\,\theta_t</math> to reduce the mean-squared error in the Bellman Equation. The new target values for training are given by <math>y = r + \gamma\max_{a'} Q\left(s', a'; \theta_t^-\right)</math>, where <math>\theta_t^-\,</math> are the parameters from some previous iteration.<br />
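The target computation at the heart of this loss can be written directly; treating terminal states specially (so that <math>y = r</math> at the end of an episode) matches the full algorithm presented below. `next_q_values` stands for the target network's outputs for <math>s'</math>:

```python
def td_target(reward, next_q_values, gamma=0.99, terminal=False):
    """y = r + gamma * max_a' Q(s', a'; theta^-); just y = r at episode end."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)

def squared_bellman_error(q_value, target):
    """The per-sample loss (y - Q(s, a; theta))^2 minimized during training."""
    return (target - q_value) ** 2
```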
<br />
=== The Full Algorithm === <br />
<br />
Now, with all the background behind us, the full Deep Q-Learning with Experience Replay algorithm is presented below:<br />
<br />
<br />
[[File:QLearning_Alg.JPG]]<br />
<br />
<br />
Some notes about the algorithm:<br />
* Replay memory is used to implement the experience replay technique described above<br />
* An episode is one game<br />
* Correlations between target values and the action function <math>Q\,</math> are mitigated by using <math>\hat{Q}</math> for the target values<br />
** Only updated every <math>\,C</math> steps<br />
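The notes above can be drawn together into a runnable skeleton. Every hook here (`env_reset`, `env_step`, `q`, `q_hat_update`, `gradient_step`) is a caller-supplied stand-in for the emulator and the networks, so this sketch mirrors only the control flow of the algorithm, not any particular implementation:

```python
import random

def deep_q_learning(env_reset, env_step, q, q_hat_update, gradient_step,
                    episodes=1, T=100, epsilon=0.1, batch_size=32, C=10):
    """Control-flow skeleton of Deep Q-Learning with Experience Replay."""
    memory = []  # replay memory D (unbounded in this sketch)
    step = 0
    for _ in range(episodes):           # one episode = one game
        s = env_reset()
        for _ in range(T):
            # Epsilon-greedy action selection from the online network Q.
            if random.random() < epsilon:
                a = random.choice(range(len(q(s))))
            else:
                a = max(range(len(q(s))), key=lambda i: q(s)[i])
            s2, r, done = env_step(a)
            memory.append((s, a, r, s2, done))               # store experience
            batch = random.sample(memory, min(batch_size, len(memory)))
            gradient_step(batch)        # minimize (y - Q)^2 on the minibatch
            step += 1
            if step % C == 0:
                q_hat_update()          # clone Q into the target network Q-hat
            s = s2
            if done:
                break
```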
<br />
== Results ==<br />
<br />
=== Evaluation Procedure === <br />
<br />
The trained networks played each game 30 times, up to 5 minutes at a time. The random agent, which is the baseline comparison, chooses a random action every 6 frames (10 Hz). The human player uses the same emulator as the agents, and played under controlled conditions (most notably without sound). The human performance is the average reward from around 20 episodes of the game lasting up to 5 minutes, after 2 hours of practice playing each game. The human performance is set to be 100%, and the random agent has performance set to 0%.<br />
<br />
=== Raw Score Results ===<br />
<br />
The DQN agent outperforms the best existing reinforcement learning methods on 43 of the games without incorporating prior knowledge about Atari games. Furthermore, the agent scores at least 75% of the human score on more than half of the games. Also, DQN performs well across a number of types of games. However, games which involve extended planning strategies still pose a major problem to DQN (e.g. Montezuma's Revenge). These results are visualized in the figure below: <br />
<br />
[[File:Performance.JPG ]]<br />
<br />
=== Results with model components removed ===<br />
<br />
Two important advances were presented in this paper: experience replay and a separate network for generating the targets. To measure the impact of each, the network was trained with every combination of these two components enabled or disabled, and evaluated on its performance in each case. The results are shown in the table below:<br />
<br />
{| class="wikitable"<br />
|-<br />
! Game<br />
! With Replay and Target Q<br />
! With Replay, Without Target Q<br />
! Without Replay, With Target Q<br />
! Without Replay, Without Target Q<br />
|-<br />
| Breakout<br />
| 316.8<br />
| 240.7<br />
| 10.2<br />
| 3.2<br />
|-<br />
| Enduro<br />
| 1006.3<br />
| 831.4<br />
| 141.9<br />
| 29.1<br />
|- <br />
| River Raid<br />
| 7446.6<br />
| 4102.8<br />
| 2867.7<br />
| 1453.0<br />
|-<br />
| Seaquest<br />
| 2894.4<br />
| 822.6<br />
| 1003.0<br />
| 275.8<br />
|-<br />
| Space Invaders<br />
| 1088.9<br />
| 826.3<br />
| 373.2<br />
| 302.0<br />
|}<br />
<br />
Clearly, experience replay and maintaining a secondary network for computing target values are both important. From these results, experience replay appears to be the more important of the two on its own, with Seaquest being the exception.<br />
<br />
== Conclusion == <br />
<br />
The framework presented has demonstrated the ability to learn how to play Atari games, given minimal prior knowledge of the game and very basic inputs. Using reinforcement learning with the Q-network architecture was more effective than previous similar attempts, since experience replay and a separate target network were utilized in training. These two modifications removed correlations between sequential inputs, which improved stability in the network. Future work should be undertaken to improve the experience replay algorithm: instead of sampling uniformly from the replay memory, sampling could be biased towards high-reward events. This may reintroduce some instability to the network, but it is worth investigating.<br />
<br />
== Bibliography ==<br />
<references /></div>
<hr />
<div>== Introduction ==<br />
<br />
Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. [http://www.gatsby.ucl.ac.uk/~dayan/papers/dw01.pdf "Reinforcement Learning."] Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a desirable reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.<br />
<br />
When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations from high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. Thus, the models need to be capable of dealing with large volumes of data. The human brain is adept at this type of learning, using systems involving dopamine in the neurons with a similar structure to reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. [http://www.gatsby.ucl.ac.uk/~dayan/papers/sdm97.pdf A neural substrate of prediction and reward.] Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have only been able replicate the success of humans, as they have only performed well on fully-observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. [http://www.aaai.org/Papers/Symposia/Fall/1993/FS-93-02/FS93-02-003.pdf TD-Gammon, a self-teaching backgammon program, achieves master's play.] AAAI Techinical Report (1993)</ref>.<br />
<br />
In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et. al. [http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf "Human-level control through deep reinforcement learning."] Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.<br />
<br />
== Problem Description ==<br />
<br />
The goal of this research is to create a framework that would be able to excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN network is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network will be used to approximate the so-called optimal action-value function <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right]</math><br />
<br />
which is the maximum cumulative sum of rewards <math>r_t\,</math> discounted by <math>\,\gamma</math> at each timestep <math>t\,</math>. This sum can be achieved from a policy <math>\pi = P\left(a|s\right)</math> after making an observation <math>\,s</math> and taking an action <math>\,a</math> <ref name = "main"></ref>. <br />
<br />
=== Instability of Neural Networks as Function Estimate ===<br />
<br />
Unfortunately, current methods which use deep networks to estimate <math>Q\,</math>suffer from instability or divergence for the following reasons: <br />
<br />
# Correlation within sequence of observations<br />
# Small updates to <math>Q\,</math>can significantly change the policy, and thus the data distribution<br />
# The action values <math>Q\,</math>are correlated with the target values <math>\,y = r_t + \gamma \max_{a'}Q(s', a')</math><br />
<br />
=== Overcoming Instability ===<br />
<br />
One method of overcoming this instability is through the use of a biologically-inspired mechanism called experience replay. This involves storing <math>e_t = \left(s_t, a_t, r_t, s_{t+1}\right)</math> - known as the "experiences" - at each time step in a dataset <math>D_t = \left(e_1, e_2, \ldots, e_t\right)</math>. Then, during each learning update, minibatch updates are performed, where points are sampled uniformly from <math>\,D_t</math>. In practice, only <math>N</math> experiences are stored, where <math>N</math> is some large, finite number (e.g. <math>N=10^6</math>). Incorporating experience replay into the algorithm removes the correlation between observations and smooths out the data distribution, since it takes an average over the past <math>N</math> states. This makes it much more unlikely that instability or divergence will occur.<br />
<br />
Another method used to combat instability is to use a separate network for generating the targets <math>y_i</math> as opposed to the same network. This is implemented by cloning the network <math>\,Q</math> every <math>\,C</math> iterations, and using this static, cloned network to generate the next <math>\,C</math> target values. This additional modification reduces correlation between the target values and the action values, providing more stability to the network. In this paper, <math>\,C = 10^4</math>.<br />
<br />
=== Data & Preprocessing ===<br />
<br />
The data used for this experiment is initially frame-by-frame pixel data from the Atari 2600 emulator, along with the game score (and the number of lives in applicable games). The frames are 210 x 160 images with colour, so some preprocessing is performed to simplify the data. The first step to encode a single frame is to take the maximum value for each pixel colour value over the frame being encoded and the previous frame <ref name = "main"></ref>. This removes flickering between frames, as sometimes images are only shown on every even or odd frame. Then, the image is converted to greyscale and downsampled to 84 x 84. This process is applied to the <math>m</math> most recent frames (here <math>m=4</math>), and these 84x84x4 images are the inputs to the network. <br />
<br />
== Model Architecture ==<br />
<br />
The framework for building an agent that can learn from environmental inputs is an iterative reward-based system. The agent will perform actions and constantly re-assess how its actions have affected its environment. To model this type of behaviour, the agent attempts to build up a model that relates actions to rewards over time. The underlying model relating actions to rewards, <math>Q</math>, is estimated by a deep convolutional network, and is updated at every step in time.<br />
<br />
The structure of the network itself is as follows. There are separate output units for each possible action, and the only input to the network is the state representation. The outputs are the predicted Q-values for each action performed on the input state. The first hidden layer convolves 32 8x8 filters with stride 4, then applies a rectified nonlinear function <ref>Jarrett K. et. al. [http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf What is the best multi-stage architecture for object recognition?] Proc. IEEE Int. Conf. Comput. Vis. 2146-2153 (2009)</ref>. In the next hidden layer, 64 4x4 filters with stride 2 are convolved and again followed by a rectified non-linearity. The next layer is the final convolutional layer, with 64 3x3 filters of stride 1, followed by the rectifier. The final hidden layer in the network is fully-connected, with 512 rectifying units. The output layer is a fully-connected linear layer with a single output for each valid action; the number of valid actions ranged from 4 to 18 across the games considered<ref name = "main"></ref>.<br />
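The spatial sizes of these layers can be checked with the standard formula for an unpadded convolution, out = (in - filter)/stride + 1:<br />

```python
def conv_out(size, filter_size, stride):
    """Output spatial size of a valid (unpadded) convolution."""
    return (size - filter_size) // stride + 1

s = 84                  # input is 84 x 84 x 4
s = conv_out(s, 8, 4)   # conv1: 32 filters -> 20 x 20 x 32
s = conv_out(s, 4, 2)   # conv2: 64 filters -> 9 x 9 x 64
s = conv_out(s, 3, 1)   # conv3: 64 filters -> 7 x 7 x 64
flat = s * s * 64       # 3136 inputs feed the fully-connected layer of 512 units
```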
<br />
[[File:Network_Architecture.JPG | center]]<br />
<br />
== Training ==<br />
<br />
=== Framework and Additional Setup Details === <br />
<br />
Forty-nine Atari games were considered as experiments. A unique DQN is trained for each game, but the same structure, algorithm, and global parameters (e.g. <math>C</math> or <math>m</math> as above, among others) were used throughout. The values of the global parameters were selected by performing an informal search on a small subset of the 49 games. The goal is to use minimal prior knowledge and perform end-to-end training of these models based on game experience.<br />
<br />
The reward structure for games was slightly changed: since raw scores vary widely from game to game, negative rewards were clipped at -1 and positive rewards at 1. This could be problematic, since the agent cannot distinguish higher-scoring actions from lower-scoring ones, but it also helps stabilize the network and allows the same settings to generalize across games. The game itself is not otherwise changed. <br />
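The clipping described above amounts to a one-line transformation of the per-step score change:<br />

```python
def clip_reward(r):
    """Clip a raw score change into [-1, 1], as described above."""
    return max(-1.0, min(1.0, r))

clip_reward(250.0)   # a large score gain counts the same as a small one
clip_reward(-30.0)   # likewise for losses
```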
<br />
There is also a frame-skipping technique employed, in which the agent only performs an action every <math>k^{th}</math> frame to allow the agent to play <math>k</math> times more games, as the network does not have to be trained on the skipped frames (<math>k=4</math> here). Furthermore, I believe this creates a more realistic experience for the agent, as human players would not be able to change their own actions every single frame. <br />
<br />
The agents are trained on 50 million frames of game play, which corresponds to about 38 days of game experience. The RMSProp <ref>Hinton, Geoffrey et al.[http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf Overview of Minibatch Gradient Descent.] University of Toronto.</ref> algorithm, which performs stochastic gradient descent in small batches, is used to train the network.<br />
<br />
=== Algorithm Background ===<br />
<br />
At each step in time, the agent selects an action <math>a_t\,</math> from the set of legal game actions <math>\mathbb{A}</math>. The agent observes an image <math>x_t \in \mathbb{R}^d</math> from the emulator, along with a reward <math>r_t\,</math>. It is impossible to fully understand the current game situation from a single screen, so a sequence of actions and observations <math>s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t</math> is the input state. <br />
<br />
Recall that if we define <math>R_t = \sum_{t'=t}^T \gamma^{t'-t}r_{t'}</math>, where <math>\gamma\,</math> is the discount factor and <math>\,T</math> is the time step at which the game terminates, then <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[R_t| s_t=s, a_t=a, \pi\right]</math><br />
<br />
is the optimal action-value function. <br />
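The discounted return <math>R_t</math> can be computed in a single backward pass over the reward sequence (the discount value below is purely illustrative):<br />

```python
def discounted_return(rewards, gamma):
    """R_t = sum over t' >= t of gamma^(t'-t) * r_{t'}, computed backwards."""
    R = 0.0
    for r in reversed(rewards):
        # Each step folds in one more power of gamma: R <- r + gamma * R
        R = r + gamma * R
    return R

discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```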
<br />
==== The Bellman Equation in the Loss Framework ====<br />
<br />
The optimal action-value function obeys the Bellman Equation:<br />
<br />
:<math>Q^*\left(s,a\right) = \mathop{\mathbb{E}_{s'}}\left[r + \gamma \max_{a'}Q^*\left(s',a'\right) | s, a\right] </math><br />
<br />
The intuition behind this identity is as follows: if the optimal value <math>Q^*(s',a')\,</math> at the next time step was known for all possible actions <math>a'\,</math>, then the optimal strategy is to select the action <math>a'</math> maximizing the expected value above <ref name = "main"> </ref>. Using the Bellman Equation as an iterative update formula is impractical, however, since the action-value function is estimated separately for each sequence and cannot generalize.<br />
<br />
It is necessary, in practice, to operate with an approximation of the action-value function. When a neural network with weights <math>\,\theta</math> is used, it is referred to as a Q-Network. A Q-Network is trained by adjusting <math>\,\theta_t</math> to reduce the mean-squared error in the Bellman Equation. The new target values for training are given by <math>y = r + \gamma\max_{a'} Q\left(s', a'; \theta_t^-\right)</math>, where <math>\theta_t^-\,</math> are the parameters from some previous iteration.<br />
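For a single transition, the target and the squared Bellman error take the following form (a minimal sketch; the function names are illustrative, and Q-values are passed in as plain lists):<br />

```python
def td_target(reward, next_q_values, gamma, terminal=False):
    """y = r + gamma * max_a' Q(s', a'; theta^-); just y = r at episode end."""
    if terminal:
        return reward
    # next_q_values holds Q(s', a'; theta^-) for every action a',
    # evaluated with the frozen parameters theta^-
    return reward + gamma * max(next_q_values)

def squared_bellman_error(q_sa, y):
    """The loss term (y - Q(s, a; theta))^2 minimised over theta."""
    return (y - q_sa) ** 2

y = td_target(1.0, next_q_values=[0.0, 2.0, 1.0], gamma=0.5)  # 1 + 0.5 * 2 = 2.0
loss = squared_bellman_error(1.5, y)                          # (2.0 - 1.5)^2 = 0.25
```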
<br />
=== The Full Algorithm === <br />
<br />
Now, with all the background behind us, the full Deep Q-Learning with Experience Replay algorithm is presented below:<br />
<br />
<br />
[[File:QLearning_Alg.JPG]]<br />
<br />
<br />
Some notes about the algorithm:<br />
* Replay memory is used to implement the experience replay technique described above<br />
* An episode is one game<br />
* Correlations between target values and the action-value function <math>Q\,</math> are mitigated by using a clone <math>\hat{Q}</math> for the target values<br />
** Only updated every <math>\,C</math> steps<br />
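The control flow of the algorithm can be sketched on a toy problem. The Q-table and two-state environment below are stand-ins for the convolutional network and the Atari emulator, and all constants are illustrative, but the loop structure (epsilon-greedy action choice, replay storage, uniform minibatch sampling, and periodic target cloning) mirrors the algorithm above:<br />

```python
import random

def run_dqn_sketch(episodes=200, C=50, epsilon=0.1, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    # Tabular "network": Q[(state, action)]; action 1 always pays reward 1
    Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
    Q_target = dict(Q)            # the cloned network hat-Q
    memory, steps = [], 0         # replay memory D
    for _ in range(episodes):
        s = 0
        for _ in range(10):       # fixed-length episode
            # Epsilon-greedy action selection
            if rng.random() < epsilon:
                a = rng.choice((0, 1))
            else:
                a = max((0, 1), key=lambda act: Q[(s, act)])
            s_next, r = a, (1.0 if a == 1 else 0.0)
            memory.append((s, a, r, s_next))   # store the experience
            # Minibatch update from uniformly sampled experiences
            for (ms, ma, mr, ms_next) in rng.sample(memory, min(4, len(memory))):
                y = mr + gamma * max(Q_target[(ms_next, b)] for b in (0, 1))
                Q[(ms, ma)] += alpha * (y - Q[(ms, ma)])  # gradient-step stand-in
            steps += 1
            if steps % C == 0:    # clone Q into hat-Q every C steps
                Q_target = dict(Q)
            s = s_next
    return Q

Q = run_dqn_sketch()  # the agent should learn to prefer action 1 in both states
```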
<br />
== Results ==<br />
<br />
=== Evaluation Procedure === <br />
<br />
The trained networks played each game 30 times, up to 5 minutes at a time. The random agent, which is the baseline comparison, chooses a random action every 6 frames (10 Hz). The human player uses the same emulator as the agents, and played under controlled conditions (most notably without sound). The human performance is the average reward from around 20 episodes of the game lasting up to 5 minutes, after 2 hours of practice playing each game. The human performance is set to be 100%, and the random agent has performance set to 0%.<br />
<br />
=== Raw Score Results ===<br />
<br />
The DQN agent outperforms the best existing reinforcement learning methods on 43 of the games without incorporating prior knowledge about Atari games. Furthermore, the agent scores at least 75% of the human score on more than half of the games. Also, DQN performs well across a number of types of games. However, games which involve extended planning strategies still pose a major problem to DQN (e.g. Montezuma's Revenge). These results are visualized in the figure below: <br />
<br />
[[File:Performance.JPG | center]]<br />
<br />
=== Results with model components removed ===<br />
<br />
Two important advances were presented in this paper: experience replay and a separate network for evaluating the targets. To quantify the impact of these advances, the network was trained with and without each of these components, and its performance was evaluated in each case. The results are shown in the table below, in the same form as above:<br />
<br />
{| class="wikitable"<br />
|-<br />
! Game<br />
! With Replay and Target Q<br />
! With Replay, Without Target Q<br />
! Without Replay, With Target Q<br />
! Without Replay, Without Target Q<br />
|-<br />
| Breakout<br />
| 316.8<br />
| 240.7<br />
| 10.2<br />
| 3.2<br />
|-<br />
| Enduro<br />
| 1006.3<br />
| 831.4<br />
| 141.9<br />
| 29.1<br />
|- <br />
| River Raid<br />
| 7446.6<br />
| 4102.8<br />
| 2867.7<br />
| 1453.0<br />
|-<br />
| Seaquest<br />
| 2894.4<br />
| 822.6<br />
| 1003.0<br />
| 275.8<br />
|-<br />
| Space Invaders<br />
| 1088.9<br />
| 826.3<br />
| 373.2<br />
| 302.0<br />
|}<br />
<br />
Clearly, experience replay and maintaining a secondary network for computing target values are both important. From these results, experience replay appears to be the more important of the two on its own, except in Seaquest, where the target network alone helps more.<br />
<br />
== Bibliography ==<br />
<references /></div>
<hr />
<div>== Introduction ==<br />
<br />
Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. [http://www.gatsby.ucl.ac.uk/~dayan/papers/dw01.pdf "Reinforcement Learning."] Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a desirable reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.<br />
<br />
When creating artificial systems based on machine learning, however, we run into the problem of deriving efficient representations from high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. Thus, the models need to be capable of dealing with large volumes of data. The human brain is adept at this type of learning, using dopamine-driven neural systems with a structure similar to reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. [http://www.gatsby.ucl.ac.uk/~dayan/papers/sdm97.pdf A neural substrate of prediction and reward.] Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have been unable to replicate this kind of human success, as they have only performed well on fully-observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. [http://www.aaai.org/Papers/Symposia/Fall/1993/FS-93-02/FS93-02-003.pdf TD-Gammon, a self-teaching backgammon program, achieves master's play.] AAAI Technical Report (1993)</ref>.<br />
<br />
In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et. al. [http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf "Human-level control through deep reinforcement learning."] Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.<br />
<br />
== Problem Description ==<br />
<br />
The goal of this research is to create a single framework that can excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network will be used to approximate the so-called optimal action-value function <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right]</math><br />
<br />
which is the maximum cumulative sum of rewards <math>r_t\,</math> discounted by <math>\,\gamma</math> at each timestep <math>t\,</math>. This sum can be achieved from a policy <math>\pi = P\left(a|s\right)</math> after making an observation <math>\,s</math> and taking an action <math>\,a</math> <ref name = "main"></ref>. <br />
<br />
=== Instability of Neural Networks as Function Estimate ===<br />
<br />
Unfortunately, current methods which use deep networks to estimate <math>Q\,</math> suffer from instability or divergence for the following reasons: <br />
<br />
# Correlation within the sequence of observations<br />
# Small updates to <math>Q\,</math> can significantly change the policy, and thus the data distribution<br />
# The action values <math>Q\,</math> are correlated with the target values <math>\,y = r_t + \gamma \max_{a'}Q(s', a')</math><br />
<br />
== Bibliography ==<br />
<references /></div>
<hr />
<div>== Introduction ==<br />
<br />
Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. [http://www.gatsby.ucl.ac.uk/~dayan/papers/dw01.pdf "Reinforcement Learning."] Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a desirable reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.<br />
<br />
When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations from high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. Thus, the models need to be capable of dealing with large volumes of data. The human brain is adept at this type of learning, using systems involving dopamine in the neurons with a similar structure to reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. [http://www.gatsby.ucl.ac.uk/~dayan/papers/sdm97.pdf A neural substrate of prediction and reward.] Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have only been able replicate the success of humans, as they have only performed well on fully-observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. [http://www.aaai.org/Papers/Symposia/Fall/1993/FS-93-02/FS93-02-003.pdf TD-Gammon, a self-teaching backgammon program, achieves master's play.] AAAI Techinical Report (1993)</ref>.<br />
<br />
In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et. al. [http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf "Human-level control through deep reinforcement learning."] Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.<br />
<br />
== Problem Description ==<br />
<br />
The goal of this research is to create a framework that would be able to excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN network is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network will be used to approximate the so-called optimal action-value function <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right]</math><br />
<br />
which is the maximum cumulative sum of rewards <math>r_t\,</math> discounted by <math>\,\gamma</math> at each timestep <math>t\,</math>. This sum can be achieved from a policy <math>\pi = P\left(a|s\right)</math> after making an observation <math>\,s</math> and taking an action <math>\,a</math> <ref name = "main"></ref>. <br />
<br />
=== Instability of Neural Networks as Function Estimate ===<br />
<br />
Unfortunately, current methods which use deep networks to estimate <math>Q\,</math>suffer from instability or divergence for the following reasons: <br />
<br />
# Correlation within sequence of observations<br />
# Small updates to <math>Q\,</math>can significantly change the policy, and thus the data distribution<br />
# The action values <math>Q\,</math>are correlated with the target values <math>\,y = r_t + \gamma \max_{a'}Q(s', a')</math><br />
<br />
=== Overcoming Instability ===<br />
<br />
One method of overcoming this instability is through the use of a biologically-inspired mechanism called experience replay. This involves storing <math>e_t = \left(s_t, a_t, r_t, s_{t+1}\right)</math> - known as the "experiences" - at each time step in a dataset <math>D_t = \left(e_1, e_2, \ldots, e_t\right)</math>. Then, during each learning update, minibatch updates are performed, where points are sampled uniformly from <math>\,D_t</math>. In practice, only <math>N</math> experiences are stored, where <math>N</math> is some large, finite number (e.g. <math>N=10^6</math>). Incorporating experience replay into the algorithm removes the correlation between observations and smooths out the data distribution, since it takes an average over the past <math>N</math> states. This makes it much more unlikely that instability or divergence will occur.<br />
<br />
Another method used to combat instability is to use a separate network for generating the targets <math>y_i</math> as opposed to the same network. This is implemented by cloning the network <math>\,Q</math> every <math>\,C</math> iterations, and using this static, cloned network to generate the next <math>\,C</math> target values. This additional modification reduces correlation between the target values and the action values, providing more stability to the network. In this paper, <math>\,C = 10^4</math>.<br />
<br />
=== Data & Preprocessing ===<br />
<br />
The data used for this experiment is initially frame-by-frame pixel data from the Atari 2600 emulator, along with the game score (and the number of lives in applicable games). The frames are 210 x 160 images with colour, so some preprocessing is performed to simplify the data. The first step to encode a single frame is to take the maximum value for each pixel colour value over the frame being encoded and the previous frame <ref name = "main"></ref>. This removes flickering between frames, as sometimes images are only shown on every even or odd frame. Then, the image is converted to greyscale and downsampled to 84 x 84. This process is applied to the <math>m</math> most recent frames (here <math>m=4</math>), and these 84x84x4 images are the inputs to the network. <br />
<br />
== Model Architecture ==<br />
<br />
The framework for building an agent that can learn from environmental inputs is an iterative reward-based system. The agent will perform actions and constantly re-assess how its actions have affected its environment. To model this type of behaviour, the agent attempts to build up a model that relates actions to rewards over time. The underlying model relating actions to rewards, <math>Q</math>, is estimated by a deep convolutional network, and is updated at every step in time.<br />
<br />
The structure of the network itself is as follows. There are separate output units for each possible action, and the only input to the network is the state representation. The outputs are the predicted Q-values for each action performed on the input state. The first hidden layer convolves 32 8x8 filters with stride 4, then applies a rectified nonlinear function <ref>Jarrett K. et. al. [http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf What is the best multi-stage architecture for object recognition?] Proc. IEEE Int. Conf. Comput. Vis. 2146-2153 (2009)</ref>. In the next hidden layer, 64 4x4 filters with stride 2 are convolved and again followed by rectified non-linearity. The next layer is the final convolutional layer, with 64 3x3 filters of stride 1, followed by the rectifier. The final hidden layer in the network is fully-connected, with 512 rectifying units. The output layer is a fully-connected linear layer with a single output for each valid action, of which there ranged from 4 to 18 in any particular game<ref name = "main"></ref>.<br />
<br />
[[File:Network_Architecture.JPG | center]]<br />
<br />
== Training ==<br />
<br />
=== Framework and Additional Setup Details === <br />
<br />
Forty-nine Atari games were used as experiments. A separate DQN is trained for each game, but the same structure, algorithm, and global parameters (e.g. <math>C</math> or <math>m</math> as above, among others) were used throughout. The values of the global parameters were selected by performing an informal search on a small subset of the 49 games. The goal is to use minimal prior knowledge and to train these models end-to-end from game experience.<br />
<br />
Since the scale of scores varies greatly from game to game, the reward structure was slightly changed: negative rewards are clipped at -1 and positive rewards at 1. Clipping could be problematic, since the agent cannot differentiate between higher- and lower-scoring actions, but it helps stabilize the network and allows the same settings to generalize across games. The games themselves are not otherwise changed. <br />
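The clipping rule is simple to state in code. In this sketch the per-step reward is assumed to be the change in game score:<br />

```python
def clipped_reward(score_delta):
    """Clip the change in game score to the interval [-1, 1]."""
    return max(-1.0, min(1.0, float(score_delta)))
```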
<br />
There is also a frame-skipping technique employed, in which the agent selects an action only on every <math>k^{th}</math> frame (<math>k=4</math> here). Since the network does not have to be run on the skipped frames, the agent can experience roughly <math>k</math> times as much play for the same computation. Arguably, this also creates a more realistic experience for the agent, as human players cannot change their actions on every single frame. <br />
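Frame skipping is typically implemented as a thin wrapper around the emulator step. In the sketch below, `emulator_step` is a hypothetical callable mapping an action to `(frame, reward, done)`, and rewards earned on skipped frames are accumulated:<br />

```python
def step_with_skip(emulator_step, action, k=4):
    """Repeat `action` for k emulator frames, summing the per-frame rewards."""
    total_reward, frame, done = 0.0, None, False
    for _ in range(k):
        frame, reward, done = emulator_step(action)
        total_reward += reward
        if done:                 # stop early if the game ends mid-skip
            break
    return frame, total_reward, done
```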
<br />
The agents are trained on 50 million frames of game play, corresponding to about 38 days of game experience. The RMSProp <ref>Hinton, Geoffrey et al. [http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf Overview of Minibatch Gradient Descent.] University of Toronto.</ref> algorithm, a variant of stochastic gradient descent that operates on small minibatches, is used to train the network.<br />
<br />
=== Algorithm Background ===<br />
<br />
At each step in time, the agent selects an action <math>a_t\,</math> from the set of legal game actions <math>\mathbb{A}</math>. The agent observes an image <math>x_t \in \mathbb{R}^d</math> from the emulator, along with a reward <math>r_t\,</math>. It is impossible to fully understand the current game situation from a single screen, so a sequence of actions and observations <math>s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t</math> is the input state. <br />
<br />
Recall that if we define <math>R_t = \sum_{t'=t}^T \gamma^{t'-t}r_{t'}</math>, where <math>\gamma\,</math> is the discount factor and <math>\,T</math> is the step at which the game terminates, then <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[R_t| s_t=s, a_t=a, \pi\right]</math><br />
<br />
is the optimal action-value function. <br />
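As a concrete check of the definition of <math>R_t</math>, the discounted return can be accumulated backwards over an episode's reward sequence:<br />

```python
def discounted_return(rewards, gamma):
    """R_t = sum over t' >= t of gamma^(t'-t) * r_{t'}, computed backwards."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R
```

For example, rewards `[1, 1, 1]` with `gamma = 0.5` give `1 + 0.5 + 0.25 = 1.75`.<br />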
<br />
==== The Bellman Equation in the Loss Framework ====<br />
<br />
The optimal action-value function obeys the Bellman Equation:<br />
<br />
:<math>Q^*\left(s,a\right) = \mathop{\mathbb{E}_{s'}}\left[r + \gamma \max_{a'}Q^*\left(s',a'\right) | s, a\right] </math><br />
<br />
The intuition behind this identity is as follows: if the optimal value <math>Q^*(s',a')\,</math> at the next time step was known for all possible actions <math>a'\,</math>, then the optimal strategy is to select the action <math>a'</math> maximizing the expected value above <ref name = "main"> </ref>. Using the Bellman Equation as an iterative update formula is impractical, however, since the action-value function is estimated separately for each sequence and cannot generalize.<br />
<br />
It is necessary, in practice, to operate with an approximation of the action-value function. When a neural network with weights <math>\,\theta</math> is used, it is referred to as a Q-Network. A Q-Network is trained by adjusting <math>\,\theta_t</math> to reduce the mean-squared error in the Bellman Equation. The new target values for training are given by <math>y = r + \gamma\max_{a'} Q\left(s', a'; \theta_t^-\right)</math>, where <math>\theta_t^-\,</math> are the parameters from some previous iteration.<br />
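For a minibatch of transitions, the target computation is a few lines of numpy. The sketch below assumes `next_q` holds the frozen network's action values <math>Q\left(s', \cdot\,; \theta_t^-\right)</math> for each transition, and that the target for a terminal transition is just the reward:<br />

```python
import numpy as np

def td_targets(rewards, next_q, terminal, gamma=0.99):
    """y = r + gamma * max_a' Q(s', a'; theta^-), or just r at episode end."""
    best_next = next_q.max(axis=1)                  # max over actions a'
    return rewards + gamma * best_next * (1.0 - terminal)
```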
<br />
=== The Full Algorithm === <br />
<br />
With this background in place, the full Deep Q-Learning with Experience Replay algorithm is presented below:<br />
<br />
<br />
[[File:QLearning_Alg.JPG]]<br />
<br />
<br />
Some notes about the algorithm:<br />
* Replay memory is used to implement the experience replay technique described above<br />
* An episode is one game<br />
* Correlations between the target values and the action-value function <math>Q\,</math> are mitigated by using a separate network <math>\hat{Q}</math> to generate the targets<br />
** <math>\hat{Q}</math> is only updated every <math>\,C</math> steps<br />
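The pieces above fit together in a loop along the following lines. Everything in this sketch is schematic: `reset`, `emulator_step`, `q_values`, `sgd_step`, and `sync_target` are hypothetical callables standing in for the emulator, the online network, its training step, and the target-network clone.<br />

```python
import random
from collections import deque

def deep_q_learning(reset, emulator_step, q_values, sgd_step, sync_target,
                    n_steps=500, batch_size=32, C=100, N=10_000, epsilon=0.1):
    """Skeleton of deep Q-learning with experience replay."""
    D = deque(maxlen=N)                       # replay memory, capacity N
    s = reset()
    for t in range(n_steps):
        # Epsilon-greedy action selection from the online network.
        qs = q_values(s)
        a = random.randrange(len(qs)) if random.random() < epsilon \
            else max(range(len(qs)), key=qs.__getitem__)
        s_next, r, done = emulator_step(a)
        D.append((s, a, r, s_next, done))     # store the transition
        if len(D) >= batch_size:
            sgd_step(random.sample(D, batch_size))  # uniform minibatch replay
        if t % C == 0:
            sync_target()                     # clone online weights into Q-hat
        s = reset() if done else s_next
    return D
```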
<br />
== Results ==<br />
<br />
=== Evaluation Procedure === <br />
<br />
The trained networks played each game 30 times, for up to 5 minutes at a time. The baseline comparison is a random agent, which chooses a random action every 6 frames (10 Hz). The human player used the same emulator as the agents and played under controlled conditions (most notably, without sound). Human performance is the average reward from around 20 episodes of up to 5 minutes each, after 2 hours of practice on each game. Human performance is set to 100%, and the random agent's performance to 0%.<br />
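The normalization implied here can be written explicitly: 0% corresponds to random play and 100% to human-level play, with scores above 100% indicating super-human performance.<br />

```python
def human_normalized_score(agent, random_score, human):
    """Map a raw game score onto the random (0%) to human (100%) scale."""
    return 100.0 * (agent - random_score) / (human - random_score)
```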
<br />
=== Raw Score Results ===<br />
<br />
Without incorporating any prior knowledge about Atari games, the DQN agent outperforms the best existing reinforcement learning methods on 43 of the 49 games. Furthermore, the agent scores at least 75% of the human score on more than half of the games, and it performs well across many different types of games. However, games that require extended planning strategies still pose a major problem for DQN (e.g. Montezuma's Revenge). These results are visualized in the figure below: <br />
<br />
[[File:Performance.JPG | center]]<br />
<br />
=== Results with model components removed ===<br />
<br />
This paper presented two important advances: experience replay and a separate network for evaluating the targets. To measure the impact of each, the network was trained with every combination of these components included or removed, and its performance was evaluated in each case. The results (average game scores) are shown in the table below:<br />
<br />
{| class="wikitable"<br />
|-<br />
! Game<br />
! With Replay and Target Q<br />
! With Replay, Without Target Q<br />
! Without Replay, With Target Q<br />
! Without Replay, Without Target Q<br />
|-<br />
| Breakout<br />
| 316.8<br />
| 240.7<br />
| 10.2<br />
| 3.2<br />
|-<br />
| Enduro<br />
| 1006.3<br />
| 831.4<br />
| 141.9<br />
| 29.1<br />
|- <br />
| River Raid<br />
| 7446.6<br />
| 4102.8<br />
| 2867.7<br />
| 1453.0<br />
|-<br />
| Seaquest<br />
| 2894.4<br />
| 822.6<br />
| 1003.0<br />
| 275.8<br />
|-<br />
| Space Invaders<br />
| 1088.9<br />
| 826.3<br />
| 373.2<br />
| 302.0<br />
|}<br />
<br />
== Bibliography ==<br />
<references /></div>Alcaterihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=human-level_control_through_deep_reinforcement_learning&diff=25667human-level control through deep reinforcement learning2015-10-30T12:48:39Z<p>Alcateri: </p>
<hr />
<div>== Introduction ==<br />
<br />
Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. [http://www.gatsby.ucl.ac.uk/~dayan/papers/dw01.pdf "Reinforcement Learning."] Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a desirable reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.<br />
<br />
When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations from high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. Thus, the models need to be capable of dealing with large volumes of data. The human brain is adept at this type of learning, using systems involving dopamine in the neurons with a similar structure to reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. [http://www.gatsby.ucl.ac.uk/~dayan/papers/sdm97.pdf A neural substrate of prediction and reward.] Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have only been able replicate the success of humans, as they have only performed well on fully-observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. [http://www.aaai.org/Papers/Symposia/Fall/1993/FS-93-02/FS93-02-003.pdf TD-Gammon, a self-teaching backgammon program, achieves master's play.] AAAI Techinical Report (1993)</ref>.<br />
<br />
In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et. al. [http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf "Human-level control through deep reinforcement learning."] Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.<br />
<br />
== Problem Description ==<br />
<br />
The goal of this research is to create a framework that would be able to excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN network is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network will be used to approximate the so-called optimal action-value function <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right]</math><br />
<br />
which is the maximum cumulative sum of rewards <math>r_t\,</math> discounted by <math>\,\gamma</math> at each timestep <math>t\,</math>. This sum can be achieved from a policy <math>\pi = P\left(a|s\right)</math> after making an observation <math>\,s</math> and taking an action <math>\,a</math> <ref name = "main"></ref>. <br />
<br />
=== Instability of Neural Networks as Function Estimate ===<br />
<br />
Unfortunately, current methods which use deep networks to estimate <math>Q\,</math>suffer from instability or divergence for the following reasons: <br />
<br />
# Correlation within sequence of observations<br />
# Small updates to <math>Q\,</math>can significantly change the policy, and thus the data distribution<br />
# The action values <math>Q\,</math>are correlated with the target values <math>\,y = r_t + \gamma \max_{a'}Q(s', a')</math><br />
<br />
=== Overcoming Instability ===<br />
<br />
One method of overcoming this instability is through the use of a biologically-inspired mechanism called experience replay. This involves storing <math>e_t = \left(s_t, a_t, r_t, s_{t+1}\right)</math> - known as the "experiences" - at each time step in a dataset <math>D_t = \left(e_1, e_2, \ldots, e_t\right)</math>. Then, during each learning update, minibatch updates are performed, where points are sampled uniformly from <math>\,D_t</math>. In practice, only <math>N</math> experiences are stored, where <math>N</math> is some large, finite number (e.g. <math>N=10^6</math>). Incorporating experience replay into the algorithm removes the correlation between observations and smooths out the data distribution, since it takes an average over the past <math>N</math> states. This makes it much more unlikely that instability or divergence will occur.<br />
<br />
Another method used to combat instability is to use a separate network for generating the targets <math>y_i</math> as opposed to the same network. This is implemented by cloning the network <math>\,Q</math> every <math>\,C</math> iterations, and using this static, cloned network to generate the next <math>\,C</math> target values. This additional modification reduces correlation between the target values and the action values, providing more stability to the network. In this paper, <math>\,C = 10^4</math>.<br />
<br />
=== Data & Preprocessing ===<br />
<br />
The data used for this experiment is initially frame-by-frame pixel data from the Atari 2600 emulator, along with the game score (and the number of lives in applicable games). The frames are 210 x 160 images with colour, so some preprocessing is performed to simplify the data. The first step to encode a single frame is to take the maximum value for each pixel colour value over the frame being encoded and the previous frame <ref name = "main"></ref>. This removes flickering between frames, as sometimes images are only shown on every even or odd frame. Then, the image is converted to greyscale and downsampled to 84 x 84. This process is applied to the <math>m</math> most recent frames (here <math>m=4</math>), and these 84x84x4 images are the inputs to the network. <br />
<br />
== Model Architecture ==<br />
<br />
The framework for building an agent that can learn from environmental inputs is an iterative reward-based system. The agent will perform actions and constantly re-assess how its actions have affected its environment. To model this type of behaviour, the agent attempts to build up a model that relates actions to rewards over time. The underlying model relating actions to rewards, <math>Q</math>, is estimated by a deep convolutional network, and is updated at every step in time.<br />
<br />
The structure of the network itself is as follows. There are separate output units for each possible action, and the only input to the network is the state representation. The outputs are the predicted Q-values for each action performed on the input state. The first hidden layer convolves 32 8x8 filters with stride 4, then applies a rectified nonlinear function <ref>Jarrett K. et. al. [http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf What is the best multi-stage architecture for object recognition?] Proc. IEEE Int. Conf. Comput. Vis. 2146-2153 (2009)</ref>. In the next hidden layer, 64 4x4 filters with stride 2 are convolved and again followed by rectified non-linearity. The next layer is the final convolutional layer, with 64 3x3 filters of stride 1, followed by the rectifier. The final hidden layer in the network is fully-connected, with 512 rectifying units. The output layer is a fully-connected linear layer with a single output for each valid action, of which there ranged from 4 to 18 in any particular game<ref name = "main"></ref>.<br />
<br />
[[File:Network_Architecture.JPG | center]]<br />
<br />
== Training ==<br />
<br />
=== Framework and Additional Setup Details === <br />
<br />
Forty-nine Atari games were considered as experiments. A unique DQN is trained for each game, but the same structure, algorithm, and global parameters (e.g. <math>C</math> or <math>m</math> as above, among others) were used throughout. The value of the global parameters was selected by performing an informal search on a small subset of the 49 games. The goal is to use minimal prior knowledge and perform end-to-end training of these models based on game experience.<br />
<br />
The reward structure for games was slightly changed, clipping negative rewards at -1 and positive rewards at 1, since score varies from game to game. This could be problematic since the agent may not properly prioritize higher-scoring actions, but it also helps stabilize the network and allows it to generalize to more games. However, the game itself is not otherwise changed. <br />
<br />
There is also a frame-skipping technique employed, in which the agent only performs an action every <math>k^{th}</math> frame to allow the agent to play <math>k</math> times more games, as the network does not have to be trained on the skipped frames (<math>k=4</math> here). Furthermore, I believe this creates a more realistic experience for the agent, as human players would not be able to change their own actions every single frame. <br />
<br />
The agents are trained on 50 million frames of game play, which corresponds to about 38 days of game experience. The RMSProp <ref>Hinton, Geoffrey et al. [http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf Overview of Minibatch Gradient Descent.] University of Toronto.</ref> algorithm, which performs stochastic gradient descent in small minibatches, is used to train the network.<br />
<br />
=== Algorithm Background ===<br />
<br />
At each step in time, the agent selects an action <math>a_t\,</math> from the set of legal game actions <math>\mathbb{A}</math>. The agent observes an image <math>x_t \in \mathbb{R}^d</math> from the emulator, along with a reward <math>r_t\,</math>. It is impossible to fully understand the current game situation from a single screen, so a sequence of actions and observations <math>s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t</math> is the input state. <br />
<br />
Recall that if we define <math>R_t = \sum_{t'=t}^T \gamma^{t'-t}r_{t'}</math>, where <math>\gamma\,</math> is the discount factor and <math>\,T</math> is the step at which the game terminates, then <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[R_t| s_t=s, a_t=a, \pi\right]</math><br />
<br />
is the optimal action-value function. <br />
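For a finite episode, this discounted return is straightforward to compute directly from a reward sequence; the following is a minimal illustration:<br />

```python
def discounted_return(rewards, gamma, t=0):
    """R_t = sum over t' = t..T of gamma^(t'-t) * r_{t'} for a finite episode."""
    return sum(gamma ** i * r for i, r in enumerate(rewards[t:]))

print(discounted_return([1.0, 0.0, 1.0], gamma=0.5))  # 1 + 0 + 0.25 = 1.25
```
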
<br />
==== The Bellman Equation in the Loss Framework ====<br />
<br />
The optimal action-value function obeys the Bellman Equation:<br />
<br />
:<math>Q^*\left(s,a\right) = \mathop{\mathbb{E}_{s'}}\left[r + \gamma \max_{a'}Q^*\left(s',a'\right) | s, a\right] </math><br />
<br />
The intuition behind this identity is as follows: if the optimal value <math>Q^*(s',a')\,</math> at the next time step was known for all possible actions <math>a'\,</math>, then the optimal strategy is to select the action <math>a'</math> maximizing the expected value above <ref name = "main"> </ref>. Using the Bellman Equation as an iterative update formula is impractical, however, since the action-value function is estimated separately for each sequence and cannot generalize.<br />
<br />
It is necessary, in practice, to operate with an approximation of the action-value function. When a neural network with weights <math>\,\theta</math> is used, it is referred to as a Q-Network. A Q-Network is trained by adjusting <math>\,\theta_t</math> to reduce the mean-squared error in the Bellman Equation. The new target values for training are given by <math>y = r + \gamma\max_{a'} Q\left(s', a'; \theta_t^-\right)</math>, where <math>\theta_t^-\,</math> are the parameters from some previous iteration.<br />
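To make the target construction concrete, the sketch below tabulates <math>Q</math> in a dictionary rather than a neural network; <code>q_target</code> stands in for the frozen parameters <math>\theta_t^-</math>, and all names and values here are illustrative:<br />

```python
def td_target(r, s_next, q_target, actions, gamma=0.99, terminal=False):
    """Target value y = r + gamma * max_a' Q(s', a'; theta^-) for one transition.
    `q_target` plays the role of the network with parameters from a previous iteration."""
    if terminal:
        return r  # no future reward after a terminal state
    return r + gamma * max(q_target[(s_next, a)] for a in actions)

q_frozen = {("s1", "left"): 1.0, ("s1", "right"): 3.0}
y = td_target(r=0.5, s_next="s1", q_target=q_frozen, actions=["left", "right"])
print(y)  # 0.5 + 0.99 * 3.0 = 3.47
```

The current network's parameters <math>\theta_t</math> would then be adjusted to reduce the squared error between <math>Q(s, a; \theta_t)</math> and this fixed target.<br />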
<br />
=== The Full Algorithm === <br />
<br />
Now, with all the background behind us, the full Deep Q-Learning with Experience Replay algorithm is presented below:<br />
<br />
<br />
[[File:QLearning_Alg.JPG]]<br />
<br />
<br />
Some notes about the algorithm:<br />
* Replay memory is used to implement the experience replay technique described above<br />
* An episode is one game<br />
* Correlations between target values and the action-value function <math>Q\,</math> are mitigated by using <math>\hat{Q}</math> for the target values<br />
** Only updated every <math>\,C</math> steps<br />
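The replay memory used by the algorithm is simple to sketch. The capacity and transition format below are illustrative (the paper stores the last <math>N</math> experiences and samples minibatches uniformly at random):<br />

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions; sampling
    uniformly from it breaks up correlations between consecutive observations."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences fall out automatically

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=1000)
for t in range(5):
    memory.add((t, 0, 0.0, t + 1, False))
batch = memory.sample(3)
print(len(batch))  # 3
```
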
<br />
== Results ==<br />
<br />
=== Evaluation Procedure === <br />
<br />
The trained networks played each game 30 times, for up to 5 minutes at a time. The random agent, which serves as the baseline comparison, chooses a random action every 6 frames (10 Hz). The human player used the same emulator as the agents and played under controlled conditions (most notably, without sound). Human performance is the average reward from around 20 episodes of up to 5 minutes each, after 2 hours of practice on each game. Human performance is set to 100%, and the random agent's performance to 0%.<br />
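This normalization is a linear rescaling of the raw scores. The one-line version below is the natural reading of the procedure described above, not code from the paper:<br />

```python
def normalized_score(agent, random_agent, human):
    """Rescale raw scores so the random agent maps to 0% and the human tester to 100%."""
    return 100.0 * (agent - random_agent) / (human - random_agent)

print(normalized_score(agent=75.0, random_agent=50.0, human=100.0))  # 50.0
```
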
<br />
=== Raw Score Results ===<br />
<br />
The DQN agent outperforms the best existing reinforcement learning methods on 43 of the 49 games, without incorporating any prior knowledge about Atari games. Furthermore, the agent scores at least 75% of the human score on more than half of the games, and DQN performs well across many different types of games. However, games that require extended planning strategies still pose a major problem for DQN (e.g. Montezuma's Revenge). These results are visualized in the figure below: <br />
<br />
[[File:Performance.JPG | center]]<br />
<br />
=== Results with model components removed ===<br />
<br />
This paper presented two important advances: experience replay and the use of a separate network to generate the targets. To assess the impact of each, the network was trained with and without each of these components, and its performance was evaluated in each case. The resulting game scores are shown in the table below:<br />
<br />
{| class="wikitable"<br />
|-<br />
! Game<br />
! With Replay, With Target Q<br />
! With Replay, Without Target Q<br />
! Without Replay, With Target Q<br />
! Without Replay, Without Target Q<br />
|-<br />
| Breakout<br />
| 316.8<br />
| 240.7<br />
| 10.2<br />
| 3.2<br />
|-<br />
| Enduro<br />
| 1006.3<br />
| 831.4<br />
| 141.9<br />
| 29.1<br />
|-<br />
| River Raid<br />
| 7446.6<br />
| 4102.8<br />
| 2867.7<br />
| 1453.0<br />
|-<br />
| Seaquest<br />
| 2894.4<br />
| 822.6<br />
| 1003.0<br />
| 275.8<br />
|-<br />
| Space Invaders<br />
| 1088.9<br />
| 826.3<br />
| 373.2<br />
| 302.0<br />
|}<br />
<br />
<br />
<br />
== Bibliography ==<br />
<references /></div>
<hr />
<div>== Introduction ==<br />
<br />
Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. [http://www.gatsby.ucl.ac.uk/~dayan/papers/dw01.pdf "Reinforcement Learning."] Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a desirable reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.<br />
<br />
When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations from high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. Thus, the models need to be capable of dealing with large volumes of data. The human brain is adept at this type of learning, using systems involving dopamine in the neurons with a similar structure to reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. [http://www.gatsby.ucl.ac.uk/~dayan/papers/sdm97.pdf A neural substrate of prediction and reward.] Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have only been able replicate the success of humans, as they have only performed well on fully-observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. [http://www.aaai.org/Papers/Symposia/Fall/1993/FS-93-02/FS93-02-003.pdf TD-Gammon, a self-teaching backgammon program, achieves master's play.] AAAI Techinical Report (1993)</ref>.<br />
<br />
In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et. al. [http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf "Human-level control through deep reinforcement learning."] Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.<br />
<br />
== Problem Description ==<br />
<br />
The goal of this research is to create a framework that would be able to excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN network is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network will be used to approximate the so-called optimal action-value function <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right]</math><br />
<br />
which is the maximum cumulative sum of rewards <math>r_t\,</math> discounted by <math>\,\gamma</math> at each timestep <math>t\,</math>. This sum can be achieved from a policy <math>\pi = P\left(a|s\right)</math> after making an observation <math>\,s</math> and taking an action <math>\,a</math> <ref name = "main"></ref>. <br />
<br />
=== Instability of Neural Networks as Function Estimate ===<br />
<br />
Unfortunately, current methods which use deep networks to estimate <math>Q\,</math>suffer from instability or divergence for the following reasons: <br />
<br />
# Correlation within sequence of observations<br />
# Small updates to <math>Q\,</math>can significantly change the policy, and thus the data distribution<br />
# The action values <math>Q\,</math>are correlated with the target values <math>\,y = r_t + \gamma \max_{a'}Q(s', a')</math><br />
<br />
=== Overcoming Instability ===<br />
<br />
One method of overcoming this instability is through the use of a biologically-inspired mechanism called experience replay. This involves storing <math>e_t = \left(s_t, a_t, r_t, s_{t+1}\right)</math> - known as the "experiences" - at each time step in a dataset <math>D_t = \left(e_1, e_2, \ldots, e_t\right)</math>. Then, during each learning update, minibatch updates are performed, where points are sampled uniformly from <math>\,D_t</math>. In practice, only <math>N</math> experiences are stored, where <math>N</math> is some large, finite number (e.g. <math>N=10^6</math>). Incorporating experience replay into the algorithm removes the correlation between observations and smooths out the data distribution, since it takes an average over the past <math>N</math> states. This makes it much more unlikely that instability or divergence will occur.<br />
<br />
Another method used to combat instability is to use a separate network for generating the targets <math>y_i</math> as opposed to the same network. This is implemented by cloning the network <math>\,Q</math> every <math>\,C</math> iterations, and using this static, cloned network to generate the next <math>\,C</math> target values. This additional modification reduces correlation between the target values and the action values, providing more stability to the network. In this paper, <math>\,C = 10^4</math>.<br />
<br />
=== Data & Preprocessing ===<br />
<br />
The data used for this experiment is initially frame-by-frame pixel data from the Atari 2600 emulator, along with the game score (and the number of lives in applicable games). The frames are 210 x 160 images with colour, so some preprocessing is performed to simplify the data. The first step to encode a single frame is to take the maximum value for each pixel colour value over the frame being encoded and the previous frame <ref name = "main"></ref>. This removes flickering between frames, as sometimes images are only shown on every even or odd frame. Then, the image is converted to greyscale and downsampled to 84 x 84. This process is applied to the <math>m</math> most recent frames (here <math>m=4</math>), and these 84x84x4 images are the inputs to the network. <br />
<br />
== Model Architecture ==<br />
<br />
The framework for building an agent that can learn from environmental inputs is an iterative reward-based system. The agent will perform actions and constantly re-assess how its actions have affected its environment. To model this type of behaviour, the agent attempts to build up a model that relates actions to rewards over time. The underlying model relating actions to rewards, <math>Q</math>, is estimated by a deep convolutional network, and is updated at every step in time.<br />
<br />
The structure of the network itself is as follows. There are separate output units for each possible action, and the only input to the network is the state representation. The outputs are the predicted Q-values for each action performed on the input state. The first hidden layer convolves 32 8x8 filters with stride 4, then applies a rectified nonlinear function <ref>Jarrett K. et. al. [http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf What is the best multi-stage architecture for object recognition?] Proc. IEEE Int. Conf. Comput. Vis. 2146-2153 (2009)</ref>. In the next hidden layer, 64 4x4 filters with stride 2 are convolved and again followed by rectified non-linearity. The next layer is the final convolutional layer, with 64 3x3 filters of stride 1, followed by the rectifier. The final hidden layer in the network is fully-connected, with 512 rectifying units. The output layer is a fully-connected linear layer with a single output for each valid action, of which there ranged from 4 to 18 in any particular game<ref name = "main"></ref>.<br />
<br />
[[File:Network_Architecture.JPG | center]]<br />
<br />
== Training ==<br />
<br />
=== Framework and Additional Setup Details === <br />
<br />
Forty-nine Atari games were considered as experiments. A unique DQN is trained for each game, but the same structure, algorithm, and global parameters (e.g. <math>C</math> or <math>m</math> as above, among others) were used throughout. The value of the global parameters was selected by performing an informal search on a small subset of the 49 games. The goal is to use minimal prior knowledge and perform end-to-end training of these models based on game experience.<br />
<br />
The reward structure for games was slightly changed, clipping negative rewards at -1 and positive rewards at 1, since score varies from game to game. This could be problematic since the agent may not properly prioritize higher-scoring actions, but it also helps stabilize the network and allows it to generalize to more games. However, the game itself is not otherwise changed. <br />
<br />
There is also a frame-skipping technique employed, in which the agent only performs an action every <math>k^{th}</math> frame to allow the agent to play <math>k</math> times more games, as the network does not have to be trained on the skipped frames (<math>k=4</math> here). Furthermore, I believe this creates a more realistic experience for the agent, as human players would not be able to change their own actions every single frame. <br />
<br />
The agents are trained on 50 million frames of game play, which is about 38 days. The RMSProp <ref>Hinton, Geoffrey et al.[http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf Overview of Minibatch Gradient Descent.] University of Toronto.</ref> algorithm, which performs stochastic gradient descent in small batches, is used to train the network.<br />
<br />
=== Algorithm Background ===<br />
<br />
At each step in time, the agent selects an action <math>a_t\,</math> from the set of legal game actions <math>\mathbb{A}</math>. The agent observes an image <math>x_t \in \mathbb{R}^d</math> from the emulator, along with a reward <math>r_t\,</math>. It is impossible to fully understand the current game situation from a single screen, so a sequence of actions and observations <math>s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t</math> is the input state. <br />
<br />
Recall that if we define <math>R_t = \sum_{t'=t}^T \gamma^{t'-t}r_t</math>, where <math>\gamma\,</math> is the discount factor and <math>\,T</math> is the step in which the game terminates, then <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[R_t| s_t=s, a_t=a, \pi\right]</math><br />
<br />
is the optimal action-value function. <br />
<br />
==== The Bellman Equation in the Loss Framework ====<br />
<br />
The optimal action-value function obeys the Bellman Equation:<br />
<br />
:<math>Q^*\left(s,a\right) = \mathop{\mathbb{E}_{s'}}\left[r + \gamma \max_{a'}Q^*\left(s',a'\right) | s, a\right] </math><br />
<br />
The intuition behind this identity is as follows: if the optimal value <math>Q^*(s',a')\,</math> at the next time step was known for all possible actions <math>a'\,</math>, then the optimal strategy is to select the action <math>a'</math> maximizing the expected value above <ref name = "main"> </ref>. Using the Bellman Equation as an iterative update formula is impractical, however, since the action-value function is estimated separately for each sequence and cannot generalize.<br />
<br />
It is necessary, in practice, to operate with an approximation of the action-value function. When a neural network with weights <math>\,\theta</math> is used, it is referred to as a Q-Network. A Q-Network is trained by adjusting <math>\,\theta_t</math> to reduce the mean-squared error in the Bellman Equation. The new target values for training are given by <math>y = r + \gamma\max_{a'} Q\left(s', a'; \theta_t^-\right)</math>, where <math>\theta_t^-\,</math> are the parameters from some previous iteration.<br />
<br />
=== The Full Algorithm === <br />
<br />
Now, with all the background behind us, the full Deep Q-Learning with Experience Replay algorithm is presented below:<br />
<br />
<br />
[[File:QLearning_Alg.JPG]]<br />
<br />
<br />
Some notes about the algorithm:<br />
* Replay memory is used to implement the experience replay technique described above<br />
* An episode is one game<br />
* Correlations between target values and the action function <math>Q\,</math>are mitigated by using <math>\hat{Q}</math> for the target values<br />
** Only updated every <math>\,C</math> steps<br />
<br />
== Results ==<br />
<br />
=== Evaluation Procedure === <br />
<br />
The trained networks played each game 30 times, up to 5 minutes at a time. The random agent, which is the baseline comparison, chooses a random action every 6 frames (10 Hz). The human player uses the same emulator as the agents, and played under controlled conditions (most notably without sound). The human performance is the average reward from around 20 episodes of the game lasting up to 5 minutes, after 2 hours of practice playing each game. The human performance is set to be 100%, and the random agent has performance set to 0%.<br />
<br />
<br />
The DQN agent outperforms the best existing reinforcement learning methods on 43 of the games without incorporating prior knowledge about Atari games. Furthermore, the agent scores at least 75% of the human score on more than half of the games. Also, DQN performs well across a number of types of games. However, games which involve extended planning strategies still pose a major problem to DQN (e.g. Montezuma's Revenge). These results are visualized in the figure below: <br />
<br />
[[File:Performance.JPG | center]]<br />
<br />
<br />
<br />
<br />
== Bibliography ==<br />
<references /></div>Alcaterihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Performance.JPG&diff=25653File:Performance.JPG2015-10-30T04:00:23Z<p>Alcateri: Performance of DQN vs. Human Tester and Existing Results</p>
<hr />
<div>Performance of DQN vs. Human Tester and Existing Results</div>Alcaterihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=human-level_control_through_deep_reinforcement_learning&diff=25652human-level control through deep reinforcement learning2015-10-30T03:59:51Z<p>Alcateri: </p>
<hr />
<div>== Introduction ==<br />
<br />
Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. [http://www.gatsby.ucl.ac.uk/~dayan/papers/dw01.pdf "Reinforcement Learning."] Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a desirable reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.<br />
<br />
When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations from high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. Thus, the models need to be capable of dealing with large volumes of data. The human brain is adept at this type of learning, using systems involving dopamine in the neurons with a similar structure to reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. [http://www.gatsby.ucl.ac.uk/~dayan/papers/sdm97.pdf A neural substrate of prediction and reward.] Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have only been able replicate the success of humans, as they have only performed well on fully-observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. [http://www.aaai.org/Papers/Symposia/Fall/1993/FS-93-02/FS93-02-003.pdf TD-Gammon, a self-teaching backgammon program, achieves master's play.] AAAI Techinical Report (1993)</ref>.<br />
<br />
In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et. al. [http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf "Human-level control through deep reinforcement learning."] Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.<br />
<br />
== Problem Description ==<br />
<br />
The goal of this research is to create a framework that would be able to excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN network is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network will be used to approximate the so-called optimal action-value function <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right]</math><br />
<br />
which is the maximum cumulative sum of rewards <math>r_t\,</math> discounted by <math>\,\gamma</math> at each timestep <math>t\,</math>. This sum can be achieved from a policy <math>\pi = P\left(a|s\right)</math> after making an observation <math>\,s</math> and taking an action <math>\,a</math> <ref name = "main"></ref>. <br />
<br />
=== Instability of Neural Networks as Function Estimate ===<br />
<br />
Unfortunately, current methods which use deep networks to estimate <math>Q\,</math>suffer from instability or divergence for the following reasons: <br />
<br />
# Correlation within sequence of observations<br />
# Small updates to <math>Q\,</math>can significantly change the policy, and thus the data distribution<br />
# The action values <math>Q\,</math>are correlated with the target values <math>\,y = r_t + \gamma \max_{a'}Q(s', a')</math><br />
<br />
=== Overcoming Instability ===<br />
<br />
One method of overcoming this instability is through the use of a biologically-inspired mechanism called experience replay. This involves storing <math>e_t = \left(s_t, a_t, r_t, s_{t+1}\right)</math> - known as the "experiences" - at each time step in a dataset <math>D_t = \left(e_1, e_2, \ldots, e_t\right)</math>. Then, during each learning update, minibatch updates are performed, where points are sampled uniformly from <math>\,D_t</math>. In practice, only <math>N</math> experiences are stored, where <math>N</math> is some large, finite number (e.g. <math>N=10^6</math>). Incorporating experience replay into the algorithm removes the correlation between observations and smooths out the data distribution, since it takes an average over the past <math>N</math> states. This makes it much more unlikely that instability or divergence will occur.<br />
<br />
Another method used to combat instability is to use a separate network for generating the targets <math>y_i</math> as opposed to the same network. This is implemented by cloning the network <math>\,Q</math> every <math>\,C</math> iterations, and using this static, cloned network to generate the next <math>\,C</math> target values. This additional modification reduces correlation between the target values and the action values, providing more stability to the network. In this paper, <math>\,C = 10^4</math>.<br />
<br />
=== Data & Preprocessing ===<br />
<br />
The data used for this experiment is initially frame-by-frame pixel data from the Atari 2600 emulator, along with the game score (and the number of lives in applicable games). The frames are 210 x 160 images with colour, so some preprocessing is performed to simplify the data. The first step to encode a single frame is to take the maximum value for each pixel colour value over the frame being encoded and the previous frame <ref name = "main"></ref>. This removes flickering between frames, as sometimes images are only shown on every even or odd frame. Then, the image is converted to greyscale and downsampled to 84 x 84. This process is applied to the <math>m</math> most recent frames (here <math>m=4</math>), and these 84x84x4 images are the inputs to the network. <br />
<br />
== Model Architecture ==<br />
<br />
The framework for building an agent that can learn from environmental inputs is an iterative reward-based system. The agent will perform actions and constantly re-assess how its actions have affected its environment. To model this type of behaviour, the agent attempts to build up a model that relates actions to rewards over time. The underlying model relating actions to rewards, <math>Q</math>, is estimated by a deep convolutional network, and is updated at every step in time.<br />
<br />
The structure of the network itself is as follows. There are separate output units for each possible action, and the only input to the network is the state representation. The outputs are the predicted Q-values for each action performed on the input state. The first hidden layer convolves 32 8x8 filters with stride 4, then applies a rectified nonlinear function <ref>Jarrett K. et. al. [http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf What is the best multi-stage architecture for object recognition?] Proc. IEEE Int. Conf. Comput. Vis. 2146-2153 (2009)</ref>. In the next hidden layer, 64 4x4 filters with stride 2 are convolved and again followed by rectified non-linearity. The next layer is the final convolutional layer, with 64 3x3 filters of stride 1, followed by the rectifier. The final hidden layer in the network is fully-connected, with 512 rectifying units. The output layer is a fully-connected linear layer with a single output for each valid action, of which there ranged from 4 to 18 in any particular game<ref name = "main"></ref>.<br />
<br />
[[File:Network_Architecture.JPG | center]]<br />
<br />
== Training ==<br />
<br />
=== Framework and Additional Setup Details === <br />
<br />
Forty-nine Atari games were considered as experiments. A unique DQN is trained for each game, but the same structure, algorithm, and global parameters (e.g. <math>C</math> or <math>m</math> as above, among others) were used throughout. The value of the global parameters was selected by performing an informal search on a small subset of the 49 games. The goal is to use minimal prior knowledge and perform end-to-end training of these models based on game experience.<br />
<br />
The reward structure for games was slightly changed, clipping negative rewards at -1 and positive rewards at 1, since score varies from game to game. This could be problematic since the agent may not properly prioritize higher-scoring actions, but it also helps stabilize the network and allows it to generalize to more games. However, the game itself is not otherwise changed. <br />
<br />
There is also a frame-skipping technique employed, in which the agent only performs an action every <math>k^{th}</math> frame to allow the agent to play <math>k</math> times more games, as the network does not have to be trained on the skipped frames (<math>k=4</math> here). Furthermore, I believe this creates a more realistic experience for the agent, as human players would not be able to change their own actions every single frame. <br />
<br />
The agents are trained on 50 million frames of game play, which is about 38 days. The RMSProp <ref>Hinton, Geoffrey et al.[http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf Overview of Minibatch Gradient Descent.] University of Toronto.</ref> algorithm, which performs stochastic gradient descent in small batches, is used to train the network.<br />
<br />
=== Algorithm Background ===<br />
<br />
At each step in time, the agent selects an action <math>a_t\,</math> from the set of legal game actions <math>\mathbb{A}</math>. The agent observes an image <math>x_t \in \mathbb{R}^d</math> from the emulator, along with a reward <math>r_t\,</math>. It is impossible to fully understand the current game situation from a single screen, so a sequence of actions and observations <math>s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t</math> is the input state. <br />
<br />
Recall that if we define <math>R_t = \sum_{t'=t}^T \gamma^{t'-t}r_t</math>, where <math>\gamma\,</math> is the discount factor and <math>\,T</math> is the step in which the game terminates, then <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[R_t| s_t=s, a_t=a, \pi\right]</math><br />
<br />
is the optimal action-value function. <br />
<br />
==== The Bellman Equation in the Loss Framework ====<br />
<br />
The optimal action-value function obeys the Bellman Equation:<br />
<br />
:<math>Q^*\left(s,a\right) = \mathop{\mathbb{E}_{s'}}\left[r + \gamma \max_{a'}Q^*\left(s',a'\right) | s, a\right] </math><br />
<br />
The intuition behind this identity is as follows: if the optimal value <math>Q^*(s',a')\,</math> at the next time step was known for all possible actions <math>a'\,</math>, then the optimal strategy is to select the action <math>a'</math> maximizing the expected value above <ref name = "main"> </ref>. Using the Bellman Equation as an iterative update formula is impractical, however, since the action-value function is estimated separately for each sequence and cannot generalize.<br />
<br />
It is necessary, in practice, to operate with an approximation of the action-value function. When a neural network with weights <math>\,\theta</math> is used, it is referred to as a Q-Network. A Q-Network is trained by adjusting <math>\,\theta_t</math> to reduce the mean-squared error in the Bellman Equation. The new target values for training are given by <math>y = r + \gamma\max_{a'} Q\left(s', a'; \theta_t^-\right)</math>, where <math>\theta_t^-\,</math> are the parameters from some previous iteration.<br />
<br />
=== The Full Algorithm === <br />
<br />
Now, with all the background behind us, the full Deep Q-Learning with Experience Replay algorithm is presented below:<br />
<br />
<br />
[[File:QLearning_Alg.JPG]]<br />
<br />
<br />
Some notes about the algorithm:<br />
* Replay memory is used to implement the experience replay technique described above<br />
* An episode is one game<br />
* Correlations between the target values and the action-value function <math>Q\,</math> are mitigated by using a separate network <math>\hat{Q}</math> to generate the target values<br />
** <math>\hat{Q}</math> is only updated every <math>\,C</math> steps<br />
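The loop can also be exercised end-to-end on a toy problem. In the sketch below, a tabular Q-function on a two-state chain stands in for both the deep network and the Atari emulator; the environment, episode structure, and every hyperparameter are made up for illustration, not taken from the paper:<br />

```python
import random
from collections import deque, defaultdict

def train(n_steps=2000, batch_size=8, capacity=500, C=100, gamma=0.9,
          eps=0.2, lr=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)             # Q[(state, action)]; stands in for the network
    Q_hat = dict(Q)                    # frozen clone used to compute targets
    replay = deque(maxlen=capacity)    # replay memory holds at most N experiences

    def step_env(s, a):                # toy chain: action 1 in state 1 pays off
        r = 1.0 if (s == 1 and a == 1) else 0.0
        s_next = min(s + a, 1)
        done = (s == 1)                # any action taken in state 1 ends the episode
        return s_next, r, done

    s = 0
    for t in range(n_steps):
        # epsilon-greedy action selection
        a = rng.randrange(2) if rng.random() < eps else max((0, 1), key=lambda x: Q[(s, x)])
        s_next, r, done = step_env(s, a)
        replay.append((s, a, r, s_next, done))      # store the experience e_t
        s = 0 if done else s_next
        if len(replay) >= batch_size:
            for (si, ai, ri, sni, di) in rng.sample(list(replay), batch_size):
                y = ri if di else ri + gamma * max(Q_hat.get((sni, x), 0.0) for x in (0, 1))
                Q[(si, ai)] += lr * (y - Q[(si, ai)])   # step on the squared error
        if t % C == 0:
            Q_hat = dict(Q)            # refresh the clone every C steps
    return Q
```

Running this drives <math>Q(1,1)</math> toward 1 and <math>Q(0,1)</math> toward <math>\gamma = 0.9</math>, the fixed points for this chain.<br />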
<br />
== Results ==<br />
<br />
=== Evaluation Procedure === <br />
<br />
The trained networks played each game 30 times, for up to 5 minutes at a time. The random agent, which serves as the baseline for comparison, chooses a random action every 6 frames (10 Hz). The human player used the same emulator as the agents, and played under controlled conditions (most notably without sound). Human performance is the average reward from around 20 episodes of up to 5 minutes each, after 2 hours of practice on each game. Human performance is set to 100%, and the random agent's performance to 0%.<br />
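Anchoring the two baselines this way amounts to a linear normalization of each raw game score; a sketch with made-up scores:<br />

```python
# Hypothetical illustration of the score scale described above: the human
# baseline maps to 100% and the random baseline to 0%. All scores are made up.
def normalized_score(agent, random_baseline, human):
    return 100.0 * (agent - random_baseline) / (human - random_baseline)

print(normalized_score(agent=450.0, random_baseline=100.0, human=500.0))  # 87.5
```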
<br />
<br />
The DQN agent outperforms the best existing reinforcement learning methods on 43 of the 49 games without incorporating any prior knowledge about Atari games. Furthermore, the agent scores at least 75% of the human score on more than half of the games. These results are visualized in the figure below: <br />
<br />
<br />
<br />
<br />
<br />
== Bibliography ==<br />
<references /></div>Alcaterihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=human-level_control_through_deep_reinforcement_learning&diff=25640human-level control through deep reinforcement learning2015-10-30T02:36:10Z<p>Alcateri: /* Training */</p>
<hr />
<div>== Introduction ==<br />
<br />
Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. [http://www.gatsby.ucl.ac.uk/~dayan/papers/dw01.pdf "Reinforcement Learning."] Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a desirable reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.<br />
<br />
When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations from high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. Thus, the models need to be capable of dealing with large volumes of data. The human brain is adept at this type of learning, using systems involving dopamine in the neurons with a structure similar to reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. [http://www.gatsby.ucl.ac.uk/~dayan/papers/sdm97.pdf A neural substrate of prediction and reward.] Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have only been able to replicate this success in restricted settings, performing well only on fully-observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. [http://www.aaai.org/Papers/Symposia/Fall/1993/FS-93-02/FS93-02-003.pdf TD-Gammon, a self-teaching backgammon program, achieves master's play.] AAAI Technical Report (1993)</ref>.<br />
<br />
In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et. al. [http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf "Human-level control through deep reinforcement learning."] Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.<br />
<br />
== Problem Description ==<br />
<br />
The goal of this research is to create a framework that would be able to excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN network is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network will be used to approximate the so-called optimal action-value function <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right]</math><br />
<br />
which is the maximum cumulative sum of rewards <math>r_t\,</math> discounted by <math>\,\gamma</math> at each timestep <math>t\,</math>. This sum can be achieved from a policy <math>\pi = P\left(a|s\right)</math> after making an observation <math>\,s</math> and taking an action <math>\,a</math> <ref name = "main"></ref>. <br />
<br />
=== Instability of Neural Networks as Function Estimate ===<br />
<br />
Unfortunately, current methods which use deep networks to estimate <math>Q\,</math> suffer from instability or divergence for the following reasons: <br />
<br />
# Correlations within the sequence of observations<br />
# Small updates to <math>Q\,</math> can significantly change the policy, and thus the data distribution<br />
# The action values <math>Q\,</math> are correlated with the target values <math>\,y = r_t + \gamma \max_{a'}Q(s', a')</math><br />
<br />
=== Overcoming Instability ===<br />
<br />
One method of overcoming this instability is through the use of a biologically-inspired mechanism called experience replay. This involves storing <math>e_t = \left(s_t, a_t, r_t, s_{t+1}\right)</math> - known as the "experiences" - at each time step in a dataset <math>D_t = \left(e_1, e_2, \ldots, e_t\right)</math>. Then, during each learning update, minibatch updates are performed, where points are sampled uniformly from <math>\,D_t</math>. In practice, only the <math>N</math> most recent experiences are stored, where <math>N</math> is some large, finite number (e.g. <math>N=10^6</math>). Incorporating experience replay into the algorithm removes the correlation between observations and smooths out the data distribution, since it takes an average over the past <math>N</math> states. This makes instability and divergence much less likely to occur.<br />
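Concretely, experience replay can be sketched as a fixed-capacity buffer with uniform minibatch sampling (a minimal Python sketch under illustrative sizes, not the authors' implementation):<br />

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores the N most recent experiences e_t = (s_t, a_t, r_t, s_{t+1})."""

    def __init__(self, capacity=10**6):
        self.buffer = deque(maxlen=capacity)  # oldest experiences drop out automatically

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive observations.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(250):                  # more experiences than the buffer holds
    buf.add(t, t % 4, 1.0, t + 1)
batch = buf.sample(32)
```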
<br />
Another method used to combat instability is to use a separate network for generating the targets <math>y_i</math> as opposed to the same network. This is implemented by cloning the network <math>\,Q</math> every <math>\,C</math> iterations, and using this static, cloned network to generate the next <math>\,C</math> target values. This additional modification reduces correlation between the target values and the action values, providing more stability to the network. In this paper, <math>\,C = 10^4</math>.<br />
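The periodic cloning can be sketched as follows, with a dictionary standing in for the network weights and a deliberately small <math>C</math> (illustrative values only; the paper uses <math>C = 10^4</math>):<br />

```python
C = 3  # clone interval, kept small here for illustration

q_params = {"w": 0}             # stand-in for the Q-network weights
target_params = dict(q_params)  # static clone used to generate the targets

history = []
for step in range(1, 10):
    q_params["w"] += 1                 # Q is updated at every step
    if step % C == 0:
        target_params = dict(q_params) # clone Q every C iterations
    history.append(target_params["w"]) # targets stay frozen between clones
```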
<br />
=== Data & Preprocessing ===<br />
<br />
The data used for this experiment is initially frame-by-frame pixel data from the Atari 2600 emulator, along with the game score (and the number of lives in applicable games). The frames are 210 x 160 colour images, so some preprocessing is performed to simplify the data. The first step in encoding a single frame is to take the maximum value for each pixel colour channel over the frame being encoded and the previous frame <ref name = "main"></ref>. This removes flickering between frames, as some sprites are only drawn on every even or odd frame. Then, the image is converted to greyscale and downsampled to 84 x 84. This process is applied to the <math>m</math> most recent frames (here <math>m=4</math>), and these 84 x 84 x 4 stacks are the inputs to the network. <br />
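The per-frame encoding can be sketched on toy data (a pure-Python sketch with illustrative sizes; the real pipeline operates on 210 x 160 RGB frames, and the luminance weights below are an assumption, as the paper only says "greyscale"):<br />

```python
def encode(frame, prev_frame):
    """Pixelwise max over two consecutive RGB frames, then luminance greyscale."""
    maxed = [tuple(max(a, b) for a, b in zip(p, q))  # removes sprite flicker
             for p, q in zip(frame, prev_frame)]
    # Standard luminance weights (an illustrative choice).
    return [0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in maxed]

def downsample(gray, factor):
    """Keep every factor-th pixel (the real pipeline rescales to 84 x 84)."""
    return gray[::factor]

# Two tiny 4-pixel "frames" as lists of RGB tuples.
frame_a = [(255, 0, 0), (0, 0, 0), (0, 255, 0), (0, 0, 0)]
frame_b = [(0, 0, 0), (0, 0, 255), (0, 0, 0), (255, 255, 255)]
encoded = encode(frame_a, frame_b)
small = downsample(encoded, 2)
```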
<br />
== Model Architecture ==<br />
<br />
The framework for building an agent that can learn from environmental inputs is an iterative reward-based system. The agent will perform actions and constantly re-assess how its actions have affected its environment. To model this type of behaviour, the agent attempts to build up a model that relates actions to rewards over time. The underlying model relating actions to rewards, <math>Q</math>, is estimated by a deep convolutional network, and is updated at every step in time.<br />
<br />
The structure of the network itself is as follows. There are separate output units for each possible action, and the only input to the network is the state representation. The outputs are the predicted Q-values for each action performed on the input state. The first hidden layer convolves 32 8x8 filters with stride 4, then applies a rectified nonlinear function <ref>Jarrett K. et. al. [http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf What is the best multi-stage architecture for object recognition?] Proc. IEEE Int. Conf. Comput. Vis. 2146-2153 (2009)</ref>. The next hidden layer convolves 64 4x4 filters with stride 2, again followed by a rectified nonlinearity. The final convolutional layer has 64 3x3 filters with stride 1, followed by the rectifier. The final hidden layer is fully-connected, with 512 rectifying units. The output layer is a fully-connected linear layer with a single output for each valid action; the number of valid actions ranged from 4 to 18 across games <ref name = "main"></ref>.<br />
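These layer sizes can be checked with the standard output-size formula for a valid (no-padding) convolution, floor((W - K)/S) + 1 (the no-padding assumption is mine, but it reproduces the reported dimensions):<br />

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a valid (no-padding) convolution."""
    return (size - kernel) // stride + 1

size = 84  # preprocessed inputs are 84 x 84 x 4
layers = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]  # (kernel, stride, filters) per layer
sizes = []
for kernel, stride, _ in layers:
    size = conv_out(size, kernel, stride)
    sizes.append(size)

flat = size * size * layers[-1][2]  # units feeding the 512-unit dense layer
```

Tracing 84 through the three layers gives 20, 9, and 7, so the dense layer sees 7 x 7 x 64 = 3136 inputs.<br />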
<br />
[[File:Network_Architecture.JPG | center]]<br />
<br />
== Training ==<br />
<br />
=== Framework and Additional Setup Details === <br />
<br />
Forty-nine Atari games were considered as experiments. A unique DQN is trained for each game, but the same structure, algorithm, and global parameters (e.g. <math>C</math> or <math>m</math> as above, among others) were used throughout. The values of the global parameters were selected by performing an informal search on a small subset of the 49 games. The goal is to use minimal prior knowledge and perform end-to-end training of these models based on game experience.<br />
<br />
The reward structure for games was slightly changed, clipping negative rewards at -1 and positive rewards at 1, since score varies from game to game. This could be problematic since the agent may not properly prioritize higher-scoring actions, but it also helps stabilize the network and allows it to generalize to more games. However, the game itself is not otherwise changed. <br />
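The clipping itself is a one-liner (a minimal sketch of the rule as described above):<br />

```python
def clip_reward(r):
    """Clip raw score changes into [-1, 1] so one learning rate works across games."""
    return max(-1.0, min(1.0, r))

clipped = [clip_reward(r) for r in [250.0, -75.0, 0.0, 0.5]]
```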
<br />
There is also a frame-skipping technique employed, in which the agent only selects an action on every <math>k^{th}</math> frame, with its last action repeated on the skipped frames. This allows the agent to play roughly <math>k</math> times more games, as the network does not have to be run on the skipped frames (<math>k=4</math> here). Furthermore, I believe this creates a more realistic experience for the agent, as human players would not be able to change their own actions every single frame. <br />
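Frame skipping can be sketched as repeating the chosen action on the skipped frames (the `select_action` and `emulator_step` callbacks here are hypothetical stand-ins):<br />

```python
k = 4  # the agent selects a new action only on every k-th frame

def run(frames, select_action, emulator_step):
    """Advance the emulator every frame, but query the policy only every k frames."""
    actions = []
    action = 0  # no-op before the first decision point
    for t in range(frames):
        if t % k == 0:
            action = select_action(t)  # the (expensive) network runs only here
        emulator_step(action)          # the emulator still advances every frame
        actions.append(action)
    return actions

acts = run(8, select_action=lambda t: t, emulator_step=lambda a: None)
```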
<br />
The agents are trained on 50 million frames of game play, which corresponds to about 38 days of game experience. The RMSProp <ref>Hinton, Geoffrey et al. [http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf Overview of Minibatch Gradient Descent.] University of Toronto.</ref> algorithm, which performs stochastic gradient descent in small batches, is used to train the network.<br />
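RMSProp scales each gradient step by a running root-mean-square of recent gradients; a minimal one-parameter sketch (illustrative hyperparameters, not the paper's exact settings):<br />

```python
import math

def rmsprop_step(theta, grad, ms, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update; ms is the running mean of squared gradients."""
    ms = rho * ms + (1 - rho) * grad ** 2
    theta = theta - lr * grad / (math.sqrt(ms) + eps)
    return theta, ms

# Single-step arithmetic check with simple numbers.
theta1, ms1 = rmsprop_step(1.0, 2.0, 0.0, lr=0.5, rho=0.5, eps=0.0)

# Minimizing f(theta) = theta**2 (gradient 2*theta) from theta = 5.0.
theta, ms = 5.0, 0.0
for _ in range(300):
    theta, ms = rmsprop_step(theta, 2 * theta, ms, lr=0.1)
```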
<br />
=== Algorithm Background ===<br />
<br />
At each step in time, the agent selects an action <math>a_t\,</math> from the set of legal game actions <math>\mathbb{A}</math>. The agent observes an image <math>x_t \in \mathbb{R}^d</math> from the emulator, along with a reward <math>r_t\,</math>. It is impossible to fully understand the current game situation from a single screen, so a sequence of actions and observations <math>s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t</math> is the input state. <br />
<br />
Recall that if we define <math>R_t = \sum_{t'=t}^T \gamma^{t'-t}r_{t'}</math>, where <math>\gamma\,</math> is the discount factor and <math>\,T</math> is the step in which the game terminates, then the optimal action-value function can be written compactly as <math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[R_t | s_t=s, a_t=a, \pi\right]</math>, matching the expression given earlier.<br />
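These quantities can be checked numerically; the sketch below computes <math>R_t</math> for a short reward sequence and runs the tabular analogue of an update toward the target <math>y = r + \gamma \max_{a'}Q(s', a')</math> (a simplification of the paper's gradient-based update, with illustrative values for <math>\gamma</math> and the step size):<br />

```python
GAMMA = 0.99  # illustrative discount factor

def discounted_return(rewards, t=0):
    """R_t = sum over t' >= t of GAMMA**(t' - t) * r_{t'}."""
    return sum(GAMMA ** (tp - t) * r for tp, r in enumerate(rewards) if tp >= t)

def q_update(Q, s, a, r, s_next, alpha=0.5):
    """Tabular analogue: move Q(s, a) toward y = r + gamma * max_a' Q(s', a')."""
    y = r + GAMMA * max(Q[s_next].values())
    Q[s][a] += alpha * (y - Q[s][a])

R0 = discounted_return([1.0, 1.0, 1.0])  # 1 + 0.99 + 0.99**2
Q = {0: {"left": 0.0, "right": 0.0}, 1: {"left": 2.0, "right": 4.0}}
q_update(Q, s=0, a="right", r=1.0, s_next=1)  # y = 1 + 0.99 * 4 = 4.96
```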
<br />
== Results ==<br />
<br />
== Bibliography ==<br />
<references /></div>
<hr />
<div>
<br />
[[File:Network_Architecture.JPG | center]]<br />
<br />
=== Training Details ===<br />
<br />
Forty-nine Atari games were considered as experiments. A unique DQN is trained for each game, but the same structure, algorithm, and global parameters (e.g. <math>C</math> or <math>m</math> as above, among others) were used throughout. The value of the global parameters was selected by performing an informal search on a small subset of the 49 games. The reward structure for games was slightly changed, clipping negative rewards at -1 and positive rewards at 1, since score varies from game to game. This could be problematic since the agent may not properly prioritize higher-scoring actions, but it also helps stabilize the network and allows it to generalize to more games. However, the game itself is not otherwise changed. <br />
<br />
There is also a frame-skipping technique employed, in which the agent only performs an action every <math>k^{th}</math> frame to allow the agent to play <math>k</math> times more games, as the network does not have to be trained on the skipped frames (<math>k=4</math> here). Furthermore, I believe this creates a more realistic experience for the agent, as human players would not be able to change their own actions every single frame. The agents are trained on 50 million frames of game play, which is about 38 days. The RMSProp <ref>Hinton, Geoffrey et al.[http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf Overview of Minibatch Gradient Descent.] University of Toronto.</ref> algorithm, which performs stochastic gradient descent in small batches, is used to train the network.<br />
<br />
== Results ==<br />
<br />
== Bibliography ==<br />
<references /></div>
Alcateri
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=human-level_control_through_deep_reinforcement_learning&diff=25636
human-level control through deep reinforcement learning
2015-10-30T01:47:32Z
<p>Alcateri: /* Training Details */</p>
<hr />
<div>== Introduction ==<br />
<br />
Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. [http://www.gatsby.ucl.ac.uk/~dayan/papers/dw01.pdf "Reinforcement Learning."] Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a desirable reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.<br />
<br />
When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations from high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. Thus, the models need to be capable of dealing with large volumes of data. The human brain is adept at this type of learning, using dopamine signalling in neurons in a manner structurally similar to reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. [http://www.gatsby.ucl.ac.uk/~dayan/papers/sdm97.pdf A neural substrate of prediction and reward.] Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have been unable to replicate this human capability, as they have only performed well on fully observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. [http://www.aaai.org/Papers/Symposia/Fall/1993/FS-93-02/FS93-02-003.pdf TD-Gammon, a self-teaching backgammon program, achieves master's play.] AAAI Technical Report (1993)</ref>.<br />
<br />
In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et. al. [http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf "Human-level control through deep reinforcement learning."] Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.<br />
<br />
== Problem Description ==<br />
<br />
The goal of this research is to create a framework that can excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network is used to approximate the so-called optimal action-value function <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right]</math><br />
<br />
which is the maximum expected sum of rewards <math>r_t\,</math> discounted by <math>\,\gamma</math> at each timestep <math>t\,</math>, achievable by following some behaviour policy <math>\pi = P\left(a|s\right)</math> after making an observation <math>\,s</math> and taking an action <math>\,a</math> <ref name = "main"></ref>. <br />
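<br />
As a concrete check of this definition, the discounted sum inside the expectation can be computed directly for a short, made-up reward sequence (an illustrative sketch only; the reward values are not from the paper):<br />

```python
# Discounted return G = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
# evaluated on a finite, illustrative reward sequence.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    # Iterating backwards folds in one factor of gamma per step.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 1.0], gamma=0.5))  # 1 + 0.5*0 + 0.25*1 = 1.25
```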
<br />
=== Instability of Neural Networks as Function Estimate ===<br />
<br />
Unfortunately, current methods which use deep networks to estimate <math>Q\,</math> suffer from instability or divergence for the following reasons: <br />
<br />
# Correlation within the sequence of observations<br />
# Small updates to <math>Q\,</math> can significantly change the policy, and thus the data distribution<br />
# The action values <math>Q\,</math> are correlated with the target values <math>\,y = r_t + \gamma \max_{a'}Q(s', a')</math><br />
<br />
=== Overcoming Instability ===<br />
<br />
One method of overcoming this instability is through the use of a biologically-inspired mechanism called experience replay. This involves storing <math>e_t = \left(s_t, a_t, r_t, s_{t+1}\right)</math> - known as the "experiences" - at each time step in a dataset <math>D_t = \left(e_1, e_2, \ldots, e_t\right)</math>. Then, during each learning update, minibatch updates are performed, where points are sampled uniformly from <math>\,D_t</math>. In practice, only <math>N</math> experiences are stored, where <math>N</math> is some large, finite number (e.g. <math>N=10^6</math>). Incorporating experience replay into the algorithm removes the correlation between observations and smooths out the data distribution, since it takes an average over the past <math>N</math> states. This makes instability or divergence much less likely.<br />
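<br />
A replay memory of this kind can be sketched in a few lines; the class name, tiny capacity, and toy experience tuples below are illustrative, not taken from the paper's implementation:<br />

```python
import random
from collections import deque

class ReplayMemory:
    """Stores the N most recent experiences e_t = (s_t, a_t, r_t, s_{t+1})
    and samples uniform minibatches, breaking up temporal correlation."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences fall off the front

    def store(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Uniform sampling from the stored experiences.
        return random.sample(list(self.buffer), batch_size)

memory = ReplayMemory(capacity=5)
for t in range(8):
    memory.store((t, 0, 0.0, t + 1))   # toy (s, a, r, s') tuples
print(len(memory.buffer))               # only the 5 most recent survive
```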
<br />
Another method used to combat instability is to use a separate network for generating the targets <math>y_i</math> as opposed to the same network. This is implemented by cloning the network <math>\,Q</math> every <math>\,C</math> iterations, and using this static, cloned network to generate the next <math>\,C</math> target values. This additional modification reduces correlation between the target values and the action values, providing more stability to the network. In this paper, <math>\,C = 10^4</math>.<br />
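<br />
The cloning schedule can be illustrated abstractly; the dictionaries below merely stand in for the online and target parameter sets, and the tiny clone interval is for readability (the paper uses <math>\,C = 10^4</math>):<br />

```python
import copy

C = 3  # clone interval (illustratively small; the paper uses C = 10^4)

def make_target(r, q_next_values, gamma=0.99):
    # y = r + gamma * max_a' Q_target(s', a'), computed from the frozen clone
    return r + gamma * max(q_next_values)

q_online = {"w": 0.0}          # stand-in for the online network's parameters
q_target = copy.deepcopy(q_online)

for step in range(1, 7):
    q_online["w"] += 1.0       # stand-in for one gradient update of the online net
    if step % C == 0:          # every C steps, clone online -> target
        q_target = copy.deepcopy(q_online)

print(q_online["w"], q_target["w"])  # 6.0 6.0 (last clone happened at step 6)
```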
<br />
=== Data & Preprocessing ===<br />
<br />
The data used for this experiment is initially frame-by-frame pixel data from the Atari 2600 emulator, along with the game score (and the number of lives in applicable games). The frames are 210 x 160 colour images, so some preprocessing is performed to simplify the data. The first step to encode a single frame is to take the maximum value for each pixel colour value over the frame being encoded and the previous frame <ref name = "main"></ref>. This removes flickering between frames, as sometimes images are only shown on every even or odd frame. Then, the image is converted to greyscale and downsampled to 84 x 84. This process is applied to the <math>m</math> most recent frames (here <math>m=4</math>), and the resulting 84 x 84 x 4 stack is the input to the network. <br />
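<br />
The preprocessing steps can be sketched on toy inputs; plain lists stand in for full 210 x 160 frames, the luminance weights are a common convention rather than the paper's exact ones, and naive subsampling stands in for the true 84 x 84 resize:<br />

```python
def max_pool_frames(frame_a, frame_b):
    # Per-pixel maximum over two consecutive frames removes Atari flicker.
    return [[max(pa, pb) for pa, pb in zip(ra, rb)]
            for ra, rb in zip(frame_a, frame_b)]

def to_grey(rgb_frame):
    # Simple luminance conversion (standard weights, assumed here).
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_frame]

def subsample(frame, step):
    # Crude stand-in for the bilinear downsampling to 84 x 84.
    return [row[::step] for row in frame[::step]]

prev = [[1, 0], [0, 3]]
curr = [[0, 2], [0, 1]]
print(max_pool_frames(prev, curr))  # [[1, 2], [0, 3]]
```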
<br />
== Model Architecture ==<br />
<br />
The framework for building an agent that can learn from environmental inputs is an iterative reward-based system. The agent will perform actions and constantly re-assess how its actions have affected its environment. To model this type of behaviour, the agent attempts to build up a model that relates actions to rewards over time. The underlying model relating actions to rewards, <math>Q</math>, is estimated by a deep convolutional network, and is updated at every step in time.<br />
<br />
The structure of the network itself is as follows. There are separate output units for each possible action, and the only input to the network is the state representation. The outputs are the predicted Q-values for each action performed on the input state. The first hidden layer convolves 32 8x8 filters with stride 4, then applies a rectified nonlinear function <ref>Jarrett K. et al. [http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf What is the best multi-stage architecture for object recognition?] Proc. IEEE Int. Conf. Comput. Vis. 2146-2153 (2009)</ref>. In the next hidden layer, 64 4x4 filters with stride 2 are convolved and again followed by a rectified nonlinearity. The next layer is the final convolutional layer, with 64 3x3 filters of stride 1, followed by the rectifier. The final hidden layer in the network is fully-connected, with 512 rectifying units. The output layer is a fully-connected linear layer with a single output for each valid action, the number of which ranged from 4 to 18 depending on the game <ref name = "main"></ref>.<br />
<br />
[[File:Network_Architecture.JPG | center]]<br />
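<br />
The spatial sizes implied by these three convolutional layers can be verified with the standard valid-convolution formula (size - filter)/stride + 1, assuming no padding:<br />

```python
def conv_out(size, filt, stride):
    # Output size of a valid (no-padding) convolution: (size - filter) // stride + 1
    return (size - filt) // stride + 1

size = 84
for filt, stride, n_filters in [(8, 4, 32), (4, 2, 64), (3, 1, 64)]:
    size = conv_out(size, filt, stride)
    print(f"{n_filters} filters -> {size}x{size} feature maps")

# Flattened input to the 512-unit fully-connected layer:
print(size * size * 64)  # 7*7*64 = 3136
```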
<br />
=== Training Details ===<br />
<br />
Forty-nine Atari games were considered as experiments. A separate DQN is trained for each game, but the same structure, algorithm, and global parameters (e.g. <math>C</math> or <math>m</math> as above, among others) were used throughout. The values of the global parameters were selected by performing an informal search on a small subset of the 49 games. The reward structure for games was slightly changed, clipping negative rewards at -1 and positive rewards at 1, since score scales vary from game to game. This could be problematic since the agent may not properly prioritize higher-scoring actions, but it also helps stabilize the network and allows it to generalize across games. However, the games themselves are not otherwise changed. <br />
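<br />
The reward clipping can be sketched as a simple saturating function (a minimal illustration, not the paper's code):<br />

```python
def clip_reward(r):
    # Clip raw score deltas into [-1, 1] so a single learning rate
    # works across games with very different score scales.
    return max(-1.0, min(1.0, r))

print([clip_reward(r) for r in [-500.0, -0.5, 0.0, 7.0]])  # [-1.0, -0.5, 0.0, 1.0]
```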
<br />
There is also a frame-skipping technique employed, in which the agent only performs an action every <math>k^{th}</math> frame to allow the agent to play <math>k</math> times more games, as the network does not have to be trained on the skipped frames (<math>k=4</math> here). Furthermore, I believe this creates a more realistic experience for the agent, as human players would not be able to change their own actions every single frame. The agents are trained on 50 million frames of game play, which corresponds to about 38 days of game experience. The RMSProp <ref>Hinton, Geoffrey et al. [http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf Overview of Minibatch Gradient Descent.] University of Toronto.</ref> algorithm, which performs stochastic gradient descent in small batches, is used to train the network.<br />
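<br />
Frame skipping amounts to repeating each chosen action on the intervening frames; a toy loop (illustrative names and policy, not the paper's code) makes the bookkeeping explicit:<br />

```python
K = 4  # act every K-th frame; the chosen action is repeated in between

def run_episode(num_frames, select_action):
    actions_taken, decisions = [], 0
    action = 0  # assume a no-op default action before the first decision
    for frame in range(num_frames):
        if frame % K == 0:            # the network only runs on these frames
            action = select_action(frame)
            decisions += 1
        actions_taken.append(action)  # the action is repeated on skipped frames
    return decisions, actions_taken

decisions, actions = run_episode(10, lambda f: f)  # toy policy: action = frame index
print(decisions)  # 3 decisions for 10 frames: at frames 0, 4, 8
```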
<br />
== Results ==<br />
<br />
== Bibliography ==<br />
<references /></div>
<hr />
<div>== Introduction ==<br />
<br />
Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. [http://www.gatsby.ucl.ac.uk/~dayan/papers/dw01.pdf "Reinforcement Learning."] Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a desirable reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.<br />
<br />
When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations from high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. Thus, the models need to be capable of dealing with large volumes of data. The human brain is adept at this type of learning, using systems involving dopamine in the neurons with a similar structure to reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. [http://www.gatsby.ucl.ac.uk/~dayan/papers/sdm97.pdf A neural substrate of prediction and reward.] Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have only been able replicate the success of humans, as they have only performed well on fully-observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. [http://www.aaai.org/Papers/Symposia/Fall/1993/FS-93-02/FS93-02-003.pdf TD-Gammon, a self-teaching backgammon program, achieves master's play.] AAAI Techinical Report (1993)</ref>.<br />
<br />
In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et. al. [http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf "Human-level control through deep reinforcement learning."] Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.<br />
<br />
== Methodology ==<br />
<br />
=== Problem Description ===<br />
<br />
The goal of this research is to create a framework that would be able to excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN network is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network will be used to approximate the so-called optimal action-value function <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right]</math><br />
<br />
which is the maximum cumulative sum of rewards <math>r_t\,</math> discounted by <math>\,\gamma</math> at each timestep <math>t\,</math>. This sum can be achieved from a policy <math>\pi = P\left(a|s\right)</math> after making an observation <math>\,s</math> and taking an action <math>\,a</math> <ref name = "main"></ref>. <br />
<br />
==== Instability of Neural Networks as Function Estimate ====<br />
<br />
Unfortunately, current methods which use deep networks to estimate <math>Q\,</math>suffer from instability or divergence for the following reasons: <br />
<br />
# Correlation within sequence of observations<br />
# Small updates to <math>Q\,</math>can significantly change the policy, and thus the data distribution<br />
# The action values <math>Q\,</math>are correlated with the target values <math>\,y = r_t + \gamma \max_{a'}Q(s', a')</math><br />
<br />
==== Overcoming Instability ====<br />
<br />
One method of overcoming this instability is through the use of a biologically-inspired mechanism called experience replay. This involves storing <math>e_t = \left(s_t, a_t, r_t, s_{t+1}\right)</math> - known as the "experiences" - at each time step in a dataset <math>D_t = \left(e_1, e_2, \ldots, e_t\right)</math>. Then, during each learning update, minibatch updates are performed, where points are sampled uniformly from <math>\,D_t</math>. In practice, only <math>N</math> experiences are stored, where <math>N</math> is some large, finite number (e.g. <math>N=10^6</math>). Incorporating experience replay into the algorithm removes the correlation between observations and smooths out the data distribution, since it takes an average over the past <math>N</math> states. This makes it much more unlikely that instability or divergence will occur.<br />
<br />
Another method used to combat instability is to use a separate network for generating the targets <math>y_i</math> as opposed to the same network. This is implemented by cloning the network <math>\,Q</math> every <math>\,C</math> iterations, and using this static, cloned network to generate the next <math>\,C</math> target values. This additional modification reduces correlation between the target values and the action values, providing more stability to the network. In this paper, <math>\,C = 10^4</math>.<br />
<br />
=== Data & Preprocessing ===<br />
<br />
The data used for this experiment is initially frame-by-frame pixel data from the Atari 2600 emulator, along with the game score (and the number of lives in applicable games). The frames are 210 x 160 images with colour, so some preprocessing is performed to simplify the data. The first step to encode a single frame is to take the maximum value for each pixel colour value over the frame being encoded and the previous frame <ref name = "main"></ref>. This removes flickering between frames, as sometimes images are only shown on every even or odd frame. Then, the image is converted to greyscale and downsampled to 84 x 84. This process is applied to the <math>m</math> most recent frames (here <math>m=4</math>), and these 84x84x4 images are the inputs to the network. <br />
<br />
=== Model Architecture ===<br />
<br />
The framework for building an agent that can learn from environmental inputs is an iterative reward-based system. The agent will perform actions and constantly re-assess how its actions have affected its environment. To model this type of behaviour, the agent attempts to build up a model that relates actions to rewards over time. The underlying model relating actions to rewards, <math>Q</math>, is estimated by a deep convolutional network, and is updated at every step in time.<br />
<br />
The structure of the network itself is as follows. The only input to the network is the state representation, and there is a separate output unit for each possible action; the outputs are the predicted Q-values for each action performed on the input state. The first hidden layer convolves 32 8x8 filters with stride 4, then applies a rectified nonlinearity <ref>Jarrett K. et al. [http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf What is the best multi-stage architecture for object recognition?] Proc. IEEE Int. Conf. Comput. Vis. 2146-2153 (2009)</ref>. The second hidden layer convolves 64 4x4 filters with stride 2, again followed by a rectified nonlinearity. The final convolutional layer has 64 3x3 filters with stride 1, followed by the rectifier. The last hidden layer in the network is fully connected, with 512 rectifying units. The output layer is a fully-connected linear layer with a single output for each valid action; the number of valid actions ranged from 4 to 18 depending on the game<ref name = "main"></ref>.<br />
<br />
[[File:Network_Architecture.JPG | center]]<br />
<br />
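The spatial dimensions of the convolutional stack can be verified with the standard output-size formula, assuming no padding:

```python
# out = (in - filter) // stride + 1, applied layer by layer to the 84x84 input
def conv_out(size, filt, stride):
    return (size - filt) // stride + 1

sizes = []
h = 84
for filt, stride in [(8, 4), (4, 2), (3, 1)]:
    h = conv_out(h, filt, stride)
    sizes.append(h)
print(sizes)  # each conv layer's output width: 20, 9, 7

# The 64 final 3x3 feature maps are 7x7, so the fully-connected layer
# of 512 rectifying units sees 64 * 7 * 7 = 3136 inputs.
```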
=== Training Details ===<br />
<br />
Forty-nine Atari games were considered as experiments. A different DQN is trained for each game, but the same structure, algorithm, and global parameters (e.g. <math>C</math> and <math>m</math> above, among others) were used throughout. The values of the global parameters were selected by performing an informal search on a small subset of the 49 games. The reward structure was slightly changed, clipping negative rewards at -1 and positive rewards at 1, since the scale of scores varies from game to game. This could be problematic, since the agent may not properly prioritize higher-scoring actions, but it also helps stabilize the network and allows it to generalize to more games. The games themselves are not otherwise changed. <br />
<br />
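The reward clipping amounts to clamping every raw score change into [-1, 1]:

```python
def clip_reward(r):
    # all large positive scores become 1, all negative scores become -1
    return max(-1.0, min(1.0, r))

print([clip_reward(r) for r in [250.0, -37.0, 0.0]])  # [1.0, -1.0, 0.0]
```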
A frame-skipping technique is also employed, in which the agent selects an action only on every <math>k^{th}</math> frame (<math>k=4</math> here). This allows the agent to play roughly <math>k</math> times more games with the same computation, since the network does not have to be trained on the skipped frames. Arguably, this also creates a more realistic experience for the agent, as human players cannot change their actions on every single frame. The agents are trained on 50 million frames of game play, which corresponds to about 38 days of game experience. Training uses the RMSProp <ref>Hinton, Geoffrey et al. [http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf Overview of Minibatch Gradient Descent.] University of Toronto.</ref> algorithm, which performs stochastic gradient descent in small batches, with a minibatch size of 32.<br />
<br />
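A single RMSProp parameter update can be sketched as follows; the hyperparameter values are illustrative defaults, not necessarily the paper's settings:

```python
import math

def rmsprop_step(w, grad, cache, lr=0.00025, decay=0.95, eps=0.01):
    # keep a running average of squared gradients...
    cache = decay * cache + (1 - decay) * grad ** 2
    # ...and scale each step by its root, per Hinton's lecture notes
    w = w - lr * grad / math.sqrt(cache + eps)
    return w, cache

w, cache = 1.0, 0.0
for grad in [0.5, -0.2, 0.1]:   # stand-in minibatch gradients
    w, cache = rmsprop_step(w, grad, cache)
```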
== Results ==<br />
<br />
== Bibliography ==<br />
<references /></div>
<hr />
<div>== Introduction ==<br />
<br />
Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. [http://www.gatsby.ucl.ac.uk/~dayan/papers/dw01.pdf "Reinforcement Learning."] Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a desirable reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.<br />
<br />
When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations from high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. Thus, the models need to be capable of dealing with large volumes of data. The human brain is adept at this type of learning, using systems involving dopamine in the neurons with a similar structure to reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. [http://www.gatsby.ucl.ac.uk/~dayan/papers/sdm97.pdf A neural substrate of prediction and reward.] Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have only been able replicate the success of humans, as they have only performed well on fully-observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. [http://www.aaai.org/Papers/Symposia/Fall/1993/FS-93-02/FS93-02-003.pdf TD-Gammon, a self-teaching backgammon program, achieves master's play.] AAAI Techinical Report (1993)</ref>.<br />
<br />
In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et. al. [http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf "Human-level control through deep reinforcement learning."] Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.<br />
<br />
== Methodology ==<br />
<br />
=== Problem Description ===<br />
<br />
The goal of this research is to create a framework that would be able to excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN network is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network will be used to approximate the so-called optimal action-value function <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right]</math><br />
<br />
which is the maximum cumulative sum of rewards <math>r_t\,</math> discounted by <math>\,\gamma</math> at each timestep <math>t\,</math>. This sum can be achieved from a policy <math>\pi = P\left(a|s\right)</math> after making an observation <math>\,s</math> and taking an action <math>\,a</math> <ref name = "main"></ref>. <br />
<br />
==== Instability of Neural Networks as Function Estimate ====<br />
<br />
Unfortunately, current methods which use deep networks to estimate <math>Q\,</math>suffer from instability or divergence for the following reasons: <br />
<br />
# Correlation within sequence of observations<br />
# Small updates to <math>Q\,</math>can significantly change the policy, and thus the data distribution<br />
# The action values <math>Q\,</math>are correlated with the target values <math>\,y = r_t + \gamma \max_{a'}Q(s', a')</math><br />
<br />
==== Overcoming Instability ====<br />
<br />
One method of overcoming this instability is through the use of a biologically-inspired mechanism called experience replay. This involves storing <math>e_t = \left(s_t, a_t, r_t, s_{t+1}\right)</math> - known as the "experiences" - at each time step in a dataset <math>D_t = \left(e_1, e_2, \ldots, e_t\right)</math>. Then, during each learning update, minibatch updates are performed, where points are sampled uniformly from <math>\,D_t</math>. In practice, only <math>N</math> experiences are stored, where <math>N</math> is some large, finite number (e.g. <math>N=10^6</math>). Incorporating experience replay into the algorithm removes the correlation between observations and smooths out the data distribution, since it takes an average over the past <math>N</math> states. This makes it much more unlikely that instability or divergence will occur.<br />
<br />
Another method used to combat instability is to use a separate network for generating the targets <math>y_i</math> as opposed to the same network. This is implemented by cloning the network <math>\,Q</math> every <math>\,C</math> iterations, and using this static, cloned network to generate the next <math>\,C</math> target values. This additional modification reduces correlation between the target values and the action values, providing more stability to the network. In this paper, <math>\,C = 10^4</math>.<br />
<br />
=== Data & Preprocessing ===<br />
<br />
The data used for this experiment is initially frame-by-frame pixel data from the Atari 2600 emulator, along with the game score (and the number of lives in applicable games). The frames are 210 x 160 images with colour, so some preprocessing is performed to simplify the data. The first step to encode a single frame is to take the maximum value for each pixel colour value over the frame being encoded and the previous frame <ref name = "main"></ref>. This removes flickering between frames, as sometimes images are only shown on every even or odd frame. Then, the image is converted to greyscale and downsampled to 84 x 84. This process is applied to the <math>m</math> most recent frames (here <math>m=4</math>), and these 84x84x4 images are the inputs to the network. <br />
<br />
=== Model Architecture ===<br />
<br />
The framework for building an agent that can learn from environmental inputs is an iterative reward-based system. The agent will perform actions and constantly re-assess how its actions have affected its environment. To model this type of behaviour, the agent attempts to build up a model that relates actions to rewards over time. The underlying model relating actions to rewards (<math>Q\,</math>) is estimated by a deep convolutional network, and is updated at every step in time.<br />
<br />
The structure of the network itself is as follows. There are separate output units for each possible action, and the only input to the network is the state representation. The outputs are the predicted Q-values for each action performed on the input state. The first hidden layer convolves 32 8x8 filters with stride 4, then applies a rectified nonlinear function <ref>Jarrett K. et. al. [http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf What is the best multi-stage architecture for object recognition?] Proc. IEEE Int. Conf. Comput. Vis. 2146-2153 (2009)</ref>. In the next hidden layer, 64 4x4 filters with stride 2 are convolved and again followed by rectified non-linearity. The next layer is the final convolutional layer, with 64 3x3 filters of stride 1, followed by the rectifier. The final hidden layer in the network is fully-connected, with 512 rectifying units. The output layer is a fully-connected linear layer with a single output for each valid action, of which there ranged from 4 to 18 in any particular game<ref name = "main"></ref>.<br />
<br />
=== Training Details ===<br />
<br />
Forty-nine Atari games were considered as experiments.<br />
<br />
== Results ==<br />
<br />
== Bibliography ==<br />
<references /></div>Alcaterihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=human-level_control_through_deep_reinforcement_learning&diff=25623human-level control through deep reinforcement learning2015-10-29T22:42:09Z<p>Alcateri: </p>
<hr />
<div>== Introduction ==<br />
<br />
Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. [http://www.gatsby.ucl.ac.uk/~dayan/papers/dw01.pdf "Reinforcement Learning."] Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a desirable reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.<br />
<br />
When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations from high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. Thus, the models need to be capable of dealing with large volumes of data. The human brain is adept at this type of learning, using systems involving dopamine in the neurons with a similar structure to reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. [http://www.gatsby.ucl.ac.uk/~dayan/papers/sdm97.pdf A neural substrate of prediction and reward.] Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have only been able replicate the success of humans, as they have only performed well on fully-observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. [http://www.aaai.org/Papers/Symposia/Fall/1993/FS-93-02/FS93-02-003.pdf TD-Gammon, a self-teaching backgammon program, achieves master's play.] AAAI Techinical Report (1993)</ref>.<br />
<br />
In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et. al. [http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf "Human-level control through deep reinforcement learning."] Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.<br />
<br />
== Methodology ==<br />
<br />
=== Problem Description ===<br />
<br />
The goal of this research is to create a framework that would be able to excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN network is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network will be used to approximate the so-called optimal action-value function <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right]</math><br />
<br />
which is the maximum cumulative sum of rewards <math>r_t\,</math> discounted by <math>\,\gamma</math> at each timestep <math>t\,</math>. This sum can be achieved from a policy <math>\pi = P\left(a|s\right)</math> after making an observation <math>\,s</math> and taking an action <math>\,a</math> <ref name = "main"></ref>. <br />
<br />
==== Instability of Neural Networks as Function Estimate ====<br />
<br />
Unfortunately, current methods which use deep networks to estimate <math>Q\,</math>suffer from instability or divergence for the following reasons: <br />
<br />
# Correlation within sequence of observations<br />
# Small updates to <math>Q\,</math>can significantly change the policy, and thus the data distribution<br />
# The action values <math>Q\,</math>are correlated with the target values <math>\,y = r_t + \gamma \max_{a'}Q(s', a')</math><br />
<br />
==== Overcoming Instability ====<br />
<br />
One method of overcoming this instability is through the use of a biologically-inspired mechanism called experience replay. This involves storing <math>e_t = \left(s_t, a_t, r_t, s_{t+1}\right)</math> - known as the "experiences" - at each time step in a dataset <math>D_t = \left(e_1, e_2, \ldots, e_t\right)</math>. Then, during each learning update, minibatch updates are performed, where points are sampled uniformly from <math>\,D_t</math>. In practice, only <math>N</math> experiences are stored, where <math>N</math> is some large, finite number (e.g. <math>N=10^6</math>). Incorporating experience replay into the algorithm removes the correlation between observations and smooths out the data distribution, since it takes an average over the past <math>N</math> states. This makes it much more unlikely that instability or divergence will occur.<br />
<br />
Another method used to combat instability is to use a separate network for generating the targets <math>y_i</math> as opposed to the same network. This is implemented by cloning the network <math>\,Q</math> every <math>\,C</math> iterations, and using this static, cloned network to generate the next <math>\,C</math> target values. This additional modification reduces correlation between the target values and the action values, providing more stability to the network. In this paper, <math>\,C = 10^4</math>.<br />
<br />
=== Data & Preprocessing ===<br />
<br />
The data used for this experiment is initially frame-by-frame pixel data from the Atari 2600 emulator, along with the game score (and the number of lives in applicable games). The frames are 210 x 160 images with colour, so some preprocessing is performed to simplify the data. The first step to encode a single frame is to take the maximum value for each pixel colour value over the frame being encoded and the previous frame <ref name = "main"></ref>. This removes flickering between frames, as sometimes images are only shown on every even or odd frame. Then, the image is converted to greyscale and downsampled to 84 x 84. This process is applied to the <math>m</math> most recent frames (here <math>m=4</math>), and these are the inputs to the network. <br />
<br />
=== Model Architecture ===<br />
<br />
The framework for building an agent that can learn from environmental inputs is an iterative reward-based system. The agent will perform actions and constantly re-assess how its actions have affected its environment. To model this type of behaviour, the agent attempts to build up a model that relates actions to rewards over time. The underlying model relating actions to rewards (<math>Q\,</math>) is estimated by a deep convolutional network, and is updated at every step in time.<br />
<br />
The structure of the network itself is as follows. There are separate output units for each possible action, and the only input to the network is the state representation. The outputs are the predicted Q-values for each action performed on the input state.<br />
<br />
== Results ==<br />
<br />
== Bibliography ==<br />
<references /></div>Alcaterihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=human-level_control_through_deep_reinforcement_learning&diff=25622human-level control through deep reinforcement learning2015-10-29T22:41:22Z<p>Alcateri: /* Methodology */</p>
<hr />
<div>== Introduction ==<br />
<br />
Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. [http://www.gatsby.ucl.ac.uk/~dayan/papers/dw01.pdf "Reinforcement Learning."] Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a desirable reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.<br />
<br />
When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations from high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. Thus, the models need to be capable of dealing with large volumes of data. The human brain is adept at this type of learning, using systems involving dopamine in the neurons with a similar structure to reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. [http://www.gatsby.ucl.ac.uk/~dayan/papers/sdm97.pdf A neural substrate of prediction and reward.] Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have only been able replicate the success of humans, as they have only performed well on fully-observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. [http://www.aaai.org/Papers/Symposia/Fall/1993/FS-93-02/FS93-02-003.pdf TD-Gammon, a self-teaching backgammon program, achieves master's play.] AAAI Techinical Report (1993)</ref>.<br />
<br />
In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et. al. [http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf "Human-level control through deep reinforcement learning."] Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.<br />
<br />
== Methodology ==<br />
<br />
=== Problem Description ===<br />
<br />
The goal of this research is to create a framework that would be able to excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN network is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network will be used to approximate the so-called optimal action-value function <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right]</math><br />
<br />
which is the maximum cumulative sum of rewards <math>r_t\,</math> discounted by <math>\,\gamma</math> at each timestep <math>t\,</math>. This sum can be achieved from a policy <math>\pi = P\left(a|s\right)</math> after making an observation <math>\,s</math> and taking an action <math>\,a</math> <ref name = "main"></ref>. <br />
<br />
==== Instability of Neural Networks as Function Estimate ====<br />
<br />
Unfortunately, current methods which use deep networks to estimate <math>Q\,</math>suffer from instability or divergence for the following reasons: <br />
<br />
# Correlation within sequence of observations<br />
# Small updates to <math>Q\,</math>can significantly change the policy, and thus the data distribution<br />
# The action values <math>Q\,</math>are correlated with the target values <math>\,y = r_t + \gamma \max_{a'}Q(s', a')</math><br />
<br />
==== Overcoming Instability ====<br />
<br />
One method of overcoming this instability is through the use of a biologically-inspired mechanism called experience replay. This involves storing <math>e_t = \left(s_t, a_t, r_t, s_{t+1}\right)</math> - known as the "experiences" - at each time step in a dataset <math>D_t = \left(e_1, e_2, \ldots, e_t\right)</math>. Then, during each learning update, minibatch updates are performed, where points are sampled uniformly from <math>\,D_t</math>. In practice, only <math>N</math> experiences are stored, where <math>N</math> is some large, finite number (e.g. <math>N=10^6</math>). Incorporating experience replay into the algorithm removes the correlation between observations and smooths out the data distribution, since it takes an average over the past <math>N</math> states. This makes it much more unlikely that instability or divergence will occur.<br />
<br />
Another method used to combat instability is to use a separate network for generating the targets <math>y_i</math> as opposed to the same network. This is implemented by cloning the network <math>\,Q</math> every <math>\,C</math> iterations, and using this static, cloned network to generate the next <math>\,C</math> target values. This additional modification reduces correlation between the target values and the action values, providing more stability to the network. In this paper, <math>\,C = 10^4</math>.<br />
<br />
=== Data & Preprocessing ===<br />
<br />
The data used for this experiment is initially frame-by-frame pixel data from the Atari 2600 emulator, along with the game score (and the number of lives in applicable games). The frames are 210 x 160 images with colour, so some preprocessing is performed to simplify the data. The first step to encode a single frame is to take the maximum value for each pixel colour value over the frame being encoded and the previous frame <ref name = "main></ref>. This removes flickering between frames, as sometimes images are only shown on every even or odd frame. Then, the image is converted to greyscale and downsampled to 84 x 84. This process is applied to the <math>m</math> most recent frames (here <math>m=4</math>), and these are the inputs to the network. <br />
<br />
=== Model Architecture ===<br />
<br />
The framework for building an agent that can learn from environmental inputs is an iterative reward-based system. The agent will perform actions and constantly re-assess how its actions have affected its environment. To model this type of behaviour, the agent attempts to build up a model that relates actions to rewards over time. The underlying model relating actions to rewards (<math>Q\,</math>) is estimated by a deep convolutional network, and is updated at every step in time.<br />
<br />
The structure of the network itself is as follows. There are separate output units for each possible action, and the only input to the network is the state representation. The outputs are the predicted Q-values for each action performed on the input state.<br />
<br />
== Results ==<br />
<br />
== Bibliography ==<br />
<references /></div>Alcaterihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=human-level_control_through_deep_reinforcement_learning&diff=25608human-level control through deep reinforcement learning2015-10-29T16:34:24Z<p>Alcateri: Created page with "== Introduction == Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>..."</p>
<hr />
<div>== Introduction ==<br />
<br />
Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. [http://www.gatsby.ucl.ac.uk/~dayan/papers/dw01.pdf "Reinforcement Learning."] Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a desirable reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.<br />
<br />
When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations from high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. Thus, the models need to be capable of dealing with large volumes of data. The human brain is adept at this type of learning, using systems involving dopamine in the neurons with a similar structure to reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. [http://www.gatsby.ucl.ac.uk/~dayan/papers/sdm97.pdf A neural substrate of prediction and reward.] Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have only been able replicate the success of humans, as they have only performed well on fully-observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. [http://www.aaai.org/Papers/Symposia/Fall/1993/FS-93-02/FS93-02-003.pdf TD-Gammon, a self-teaching backgammon program, achieves master's play.] AAAI Techinical Report (1993)</ref>.<br />
<br />
In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et. al. [http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf "Human-level control through deep reinforcement learning."] Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.<br />
<br />
== Methodology ==<br />
<br />
=== Problem Description ===<br />
<br />
The goal of this research is to create a framework that would be able to excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN network is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network will be used to approximate the so-called optimal action-value function <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right]</math><br />
<br />
which is the maximum cumulative sum of rewards <math>r_t\,</math> discounted by <math>\,\gamma</math> at each timestep <math>t\,</math>. This sum can be achieved from a policy <math>\pi = P\left(a|s\right)</math> after making an observation <math>\,s</math> and taking an action <math>\,a</math> <ref name = "main"></ref>. <br />
<br />
==== Instability of Neural Networks as Function Estimate ====<br />
<br />
Unfortunately, current methods which use deep networks to estimate <math>Q\,</math>suffer from instability or divergence for the following reasons: <br />
<br />
# Correlation within sequence of observations<br />
# Small updates to <math>Q\,</math>can significantly change the policy, and thus the data distribution<br />
# The action values <math>Q\,</math>are correlated with the target values <math>\,y = r_t + \gamma \max_{a'}Q(s', a')</math><br />
<br />
==== Overcoming Instability ====<br />
<br />
One method of overcoming this instability is through the use of a biologically-inspired mechanism called experience replay. This involves storing <math>e_t = \left(s_t, a_t, r_t, s_{t+1}\right)</math> - known as the "experiences" - at each time step in a dataset <math>D_t = \left(e_1, e_2, \ldots, e_t\right)</math>. Then, during each learning update, minibatch updates are performed, where points are sampled uniformly from <math>\,D_t</math>. In practice, only <math>N</math> experiences are stored, where <math>N</math> is some large, finite number (e.g. <math>N=10^6</math>). Incorporating experience replay into the algorithm removes the correlation between observations and smooths out the data distribution, since it takes an average over the past <math>N</math> states. This makes it much more unlikely that instability or divergence will occur.<br />
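The replay memory described above can be sketched as a fixed-capacity buffer; this is a minimal illustration in Python (the class and method names are our own, not from the paper):<br />

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores the last N experiences e_t = (s_t, a_t, r_t, s_{t+1})
    and samples minibatches uniformly from them."""

    def __init__(self, capacity=10**6):
        # deque drops the oldest experience automatically once full,
        # so only the most recent `capacity` experiences are kept
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # uniform sampling breaks the correlation between
        # consecutive observations within an episode
        return random.sample(self.buffer, batch_size)
```

During training, a transition would be added after every environment step, and a minibatch drawn from the buffer for each learning update.<br />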
<br />
Another method used to combat instability is to use a separate network for generating the targets <math>y_i</math> as opposed to the same network. This is implemented by cloning the network <math>\,Q</math> every <math>\,C</math> iterations, and using this static, cloned network to generate the next <math>\,C</math> target values. This additional modification reduces correlation between the target values and the action values, providing more stability to the network. In this paper, <math>\,C = 10^4</math>.<br />
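The target-network cloning can be sketched as follows; this is a schematic with our own function name, and the actual gradient step on the Atari inputs is omitted:<br />

```python
import copy

def train_with_target_network(q_params, steps, C=10_000, update=None):
    """Keep a frozen copy of the online network's parameters and refresh
    it every C iterations; the targets y would be computed from the
    frozen copy rather than the constantly changing online network."""
    target_params = copy.deepcopy(q_params)
    for step in range(steps):
        # here one would compute y = r + gamma * max_a' Q(s', a'; target_params)
        # and take a gradient step on q_params (represented by `update`)
        if update is not None:
            update(q_params)
        if (step + 1) % C == 0:
            # clone Q every C iterations; the clone stays static until then
            target_params = copy.deepcopy(q_params)
    return target_params
```
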
<br />
<br />
== Results ==<br />
<br />
== Bibliography ==<br />
<references /></div>Alcaterihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=parsing_natural_scenes_and_natural_language_with_recursive_neural_networks&diff=25552parsing natural scenes and natural language with recursive neural networks2015-10-23T16:15:07Z<p>Alcateri: /* Recursive Neural Networks for Structure Prediction */ Improved readability of math expressions by inserting "\,"</p>
<hr />
<div>= Introduction = <br />
<br />
<br />
This paper uses Recursive Neural Networks (RNN) to find a recursive structure that is commonly found in the inputs of different modalities such as natural scene images or natural language sentences. This is the first deep learning work which learns full scene segmentation, annotation and classification. The same algorithm can be used both to provide a competitive syntactic parser for natural language sentences from the Penn Treebank and to outperform alternative approaches for semantic scene segmentation, annotation and classification. <br />
<br />
For vision applications, the approach differs from previous works in that it uses off-the-shelf vision features of segments obtained from oversegmented images instead of learning features from raw images. In addition, the same network can be used recursively to achieve classification instead of building a hierarchy with a convolutional neural network.<br />
<br />
Also, this particular approach for NLP is different in that it handles variable sized sentences in a natural way and captures the recursive nature of natural language. Furthermore, it jointly learns parsing decisions, categories for each phrase and phrase feature embeddings which capture the semantics of their constituents.<br />
<br />
= Core Idea =<br />
<br />
The following figure describes the recursive structure that is present in the images and the sentences.<br />
<br />
<center><br />
[[File:Pic1.png | frame | center |Fig 1. Illustration of the RNN Parsing Images and Text ]]<br />
</center><br />
<br />
Images are first oversegmented into regions, which are then mapped to semantic feature vectors using a neural network. These features are used as input to the RNN, which decides whether or not to merge neighbouring regions. The decision is based on a score that is higher if the neighbouring regions share the same class label.<br />
<br />
In total, the RNN computes three outputs: <br />
* A score indicating whether the neighbouring regions should be merged<br />
* A new semantic feature representation for the larger, merged region<br />
* A class label<br />
<br />
The same procedure is applied to parsing of words too. The semantic features are given as an input to the RNN, then they are merged into phrases in a syntactically and semantically meaningful order.<br />
<br />
= Input Representation =<br />
<br />
Each image is divided into 78 segments, and 119 features (described by Gould et al.<ref><br />
Gould, Stephen, Richard Fulton, and Daphne Koller. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5459211&tag=1 "Decomposing a scene into geometric and semantically consistent regions."] Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009.<br />
</ref>) are extracted from each segment. These features include color and texture features, boosted pixel classifier scores (trained on the labelled training data), as well as appearance and shape features. <br />
<br />
Each segment's feature vector is then mapped to a semantic representation by a single neural network layer with a logistic activation function, as follows:<br />
<br />
''<math>\,a_i=f(W^{sem}F_i + b^{sem})</math>''<br />
<br />
where W is the weight matrix that we want to learn, F is the feature vector, b is the bias and f is the activation function. In these experiments, the standard sigmoid function <math>f(x)=\tfrac{1}{1 + e^{-x}}</math> was used.<br />
<br />
For the sentences, each word is represented by an <math>n</math>-dimensional vector (<math>n=100</math> in the paper). The values of these vectors are learned to capture co-occurrence statistics, and they are stored as columns in a matrix <math>L \in \mathbb{R}^{n \times |V|}</math>, where <math>|V|\,</math> is the size of the vocabulary (i.e., the total number of unique words that might occur). To extract the semantic representation of a word, a binary vector <math>\,e_k</math> with all zeros except for the <math>\,k^{th}</math> index can be used, where <math>\,k</math> corresponds to the word's column index in <math>\,L</math>. Given this vector, the semantic representation of the word is obtained by<br />
<br />
''<math>a_i=Le_k\,</math>''.<br />
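Both input mappings can be illustrated in a few lines of NumPy. The shapes follow the paper (119 segment features, <math>n=100</math>), but the weights here are random placeholders rather than learned parameters:<br />

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n = 100          # size of the semantic representation (as in the paper)
n_feat = 119     # hand-engineered features per image segment
V = 5            # toy vocabulary size (the paper uses the full vocabulary)

# image segment: a_i = f(W_sem F_i + b_sem)
W_sem = rng.normal(size=(n, n_feat))
b_sem = np.zeros(n)
F_i = rng.normal(size=n_feat)
a_segment = sigmoid(W_sem @ F_i + b_sem)

# word: a_i = L e_k, i.e. selecting column k of the embedding matrix L
L = rng.normal(size=(n, V))
k = 3
e_k = np.zeros(V)
e_k[k] = 1.0
a_word = L @ e_k                 # identical to L[:, k]
```

Note that multiplying by the one-hot vector <math>\,e_k</math> just selects a column of <math>\,L</math>; real implementations index the matrix directly.<br />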
<br />
= Recursive Neural Networks for Structure Prediction =<br />
<br />
In our discriminative parsing architecture, the goal is to learn a function ''f : X → Y,'' where Y is the set of all possible binary parse trees. <br />
An input x consists of two parts: (i) A set of activation vectors <math>\{a_1 , . . . , a_{N_{segs}}\}</math>, which represent input elements such as image segments or words of a sentence. (ii) A symmetric adjacency matrix ''A'', where ''A(i, j) = 1,'' if segment i neighbors j. This matrix defines which elements can be merged. For sentences, this matrix has a special form with 1’s only on the first diagonal below and above the main diagonal.<br />
<br />
The following figure illustrates what the inputs to the RNN look like and what the correct label is. For images, there can be more than one correct binary parse tree, but for sentences there is only one correct tree. A correct tree is one in which segments belonging to the same class are merged into one superclass before being merged with segments from different superclasses.<br />
<center><br />
[[File:pic2.png | frame | center |Fig 2. Illustration of the RNN Training Inputs]]<br />
</center><br />
<br />
The structural loss margin for RNN to predict the tree is defined as follows<br />
<br />
<math>\Delta(x,l,y^{\mathrm{proposed}})=\kappa \sum_{d \in N(y^{\mathrm{proposed}})} 1\{subTree(d) \notin Y (x, l)\}</math><br />
<br />
where the summation is over all non-terminal nodes and <math>\kappa</math> is a parameter. ''Y(x,l)'' is the set of correct trees corresponding to input ''x'' and label ''l''. <math>1\{\dots\}</math> is the indicator function, equal to one when its argument holds and zero otherwise. To express this in somewhat more natural terms, any subtree that does not occur in any of the ground-truth trees increases the loss by <math>\kappa</math>.<br />
<br />
Given the training set, the algorithm will search for a function f with small expected loss on unseen inputs, i.e. <br />
[[File:pic3.png]]<br />
where θ are all the parameters needed to compute a score s with an RNN. The score of a tree y is high if the algorithm is confident that the structure of the tree is correct.<br />
<br />
An additional constraint imposed is that the score of the highest-scoring correct tree should exceed the scores of other trees by the margin defined by the structural loss function, so that the model outputs as high a score as possible on the correct tree and as low a score as possible on wrong trees.<br />
This constraint can be expressed as <br />
[[File:pic4.png]]<br />
<br />
With these constraints minimizing the following objective function ''maximizes'' the correct tree’s score and minimizes (up to a margin) the score of the highest scoring but incorrect tree. [[File:pic5.png]]<br />
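Schematically, the margin-based risk for one training example looks like the following; this is a sketch with our own names (`scores` maps candidate trees to their RNN scores, `losses` to their structural losses <math>\Delta</math>), not the authors' implementation:<br />

```python
def margin_risk(scores, losses, correct_trees):
    """Schematic max-margin risk for one example: the loss-augmented
    score of the best competing tree should not exceed the score of
    the best correct tree (whose structural loss is zero)."""
    best_correct = max(scores[y] for y in correct_trees)
    worst_violator = max(scores[y] + losses[y] for y in scores)
    # zero when the correct tree wins by the required margin
    return max(0.0, worst_violator - best_correct)
```

Minimizing the average of this risk over the training set (plus a regularizer on <math>\theta</math>) yields the objective described above.<br />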
<br />
For learning the RNN structure, the authors used activation vectors and an adjacency matrix as inputs, together with a greedy approximation, since there are no efficient dynamic programming algorithms for their RNN setting.<br />
<br />
With an adjacency matrix A, the algorithm finds neighboring segments and adds their activations to a set of potential child node pairs:<br />
<br />
::::<math>\,C = \{ [a_i, a_j]: A(i, j) = 1 \}</math><br />
<br />
So for example, from the image in Fig 2. we would have the following pairs: <br />
<br />
::::<math>\,C = \{[a_1, a_2], [a_1, a_3], [a_2, a_1], [a_2, a_4], [a_3, a_1], [a_3, a_4], [a_4, a_2], [a_4, a_3], [a_4, a_5], [a_5, a_4]\}</math><br />
<br />
These pairs are concatenated and given as inputs to the neural network. Potential parent representations for possible child nodes are calculated with:<br />
<br />
::::<math>\,p(i, j) = f(W[c_i: c_j] + b)</math><br />
<br />
<br />
And the local score with:<br />
<br />
::::<math>\,s(i, j) = W^{score} p(i, j) </math><br />
<br />
Once the scores for all pairs are calculated, three steps are performed:<br />
<br />
1. The highest-scoring pair ''<math>\,[a_i, a_j]</math>'' is removed from the set of potential child node pairs ''<math>\,C</math>'', as well as any other pair containing either ''<math>\,a_i</math>'' or ''<math>\,a_j</math>''.<br />
<br />
2. The adjacency matrix ''<math>\,A</math>'' is updated with a new row and column reflecting the new segment and its child segments.<br />
<br />
3. Potential new child pairs are added to ''<math>\,C</math>''.<br />
<br />
Steps 1-3 are repeated until all pairs are merged and only one parent activation is left in the set ''<math>\,C</math>''. The last remaining activation is at the root of the Recursive Neural Network that represents the whole image. <br />
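The greedy merging procedure (steps 1-3) can be sketched as follows. This is our own simplified rendering, not the authors' code, and the weights are assumed to be already learned:<br />

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def greedy_parse(activations, A, W, b, W_score):
    """Greedily merge the highest-scoring neighboring pair until a
    single root activation remains.  Returns the root activation and
    the list of merges (child_i, child_j, parent_id)."""
    acts = dict(enumerate(activations))            # node id -> activation
    neighbors = {i: set(np.flatnonzero(A[i])) for i in range(len(activations))}
    next_id = len(activations)
    merges = []
    while len(acts) > 1:
        # score every potential child pair [a_i, a_j] with A(i, j) = 1
        best = None
        for i in acts:
            for j in neighbors[i]:
                if j not in acts:
                    continue                       # j was already merged away
                p = sigmoid(W @ np.concatenate([acts[i], acts[j]]) + b)
                s = float(W_score @ p)
                if best is None or s > best[0]:
                    best = (s, i, j, p)
        s, i, j, p = best
        # step 1: remove the winning pair (pairs containing i or j are
        # implicitly removed because i and j leave `acts`)
        del acts[i], acts[j]
        # step 2: the new node inherits the neighbors of its children
        neighbors[next_id] = (neighbors[i] | neighbors[j]) - {i, j}
        for nb in neighbors[next_id]:
            neighbors[nb].add(next_id)
        # step 3: the parent activation becomes a potential child itself
        acts[next_id] = p
        merges.append((i, j, next_id))
        next_id += 1
    return acts.popitem()[1], merges
```

For a sentence of <math>N</math> words, the chain-structured adjacency matrix guarantees exactly <math>N-1</math> merges.<br />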
<br />
The quality of the overall structure, relative to other candidate structures, is determined simply by the sum of all the local decisions:<br />
<br />
::::<math>s(RNN(\theta,x_i,\widehat y))=\sum_{d\in N(\widehat y)}s_d</math><br />
<br />
=== Unsupervised Recursive Autoencoder for Structure Prediction<ref>http://nlp.stanford.edu/pubs/SocherPenningtonHuangNgManning_EMNLP2011.pdf</ref> ===<br />
Instead of using scores (as described above) to predict the tree structure, we can also use the reconstruction error to predict the structure. <br />
<br />
How do we find reconstruction error?<br />
<br />
1. <math>\,p</math> is the parent representation for children <math>\,[c_1;c_2]</math> (same as before):<br />
<math>\, p=f(W^{(1)}[c_1;c_2]+b^{(1)}) </math><br />
<br />
2. One way of assessing how well this <math>\,p</math> represents its children is to reconstruct the children in a reconstruction layer: <br />
<math>[c_1^';c_2^']=W^{(2)}p+b^{(2)} </math><br />
3. Then the reconstruction error is defined below; the goal is to minimize <math>\,E_{rec}([c_1;c_2])</math>.<br />
<math>E_{rec}([c_1;c_2])=\frac{1}{2}||[c_1;c_2]-[c_1^';c_2^']||^2 </math><br />
<br />
How to construct the tree?<br />
<br />
* It first takes the first pair of neighboring vectors <math>\, (c_1;c_2)=(x_1;x_2) </math> and saves the parent node and the resulting reconstruction error. The network is then shifted by one position, takes as input the vectors <math> \,(c_1;c_2)=(x_2;x_3) </math>, and obtains <math> \,p,\, E_{rec}</math>. The process repeats until it hits the last pair. <br />
* Select the pair with lowest <math>\,E_{rec}</math>. <br />
<br />
e.g. Given the sequence <math>\,(x_1,x_2,x_3,x_4)</math>, suppose the pair <math>\,(x_3, x_4)</math> gives the lowest <math>\,E_{rec}</math>. The new sequence then consists of <math>\,(x_1, x_2 , p(3,4))</math>.<br />
<br />
* The process repeats and treats the new vector <math>\,p(3,4)</math> like any other vector.<br />
* The process continues until it reaches a deterministic choice of collapsing the remaining two states into one parent. The tree is then recovered.<br />
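The RAE greedy tree construction can be sketched similarly; again this is a simplified rendering with our own names, using the logistic function as <math>f</math> and random placeholder weights:<br />

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rae_greedy_tree(xs, W1, b1, W2, b2):
    """Greedy unsupervised RAE parsing: at each step, merge the
    adjacent pair whose reconstruction error E_rec is lowest."""
    seq = list(xs)
    tree = []
    while len(seq) > 1:
        best = None
        for i in range(len(seq) - 1):
            c = np.concatenate([seq[i], seq[i + 1]])
            p = sigmoid(W1 @ c + b1)          # parent representation
            c_rec = W2 @ p + b2               # reconstructed children
            e_rec = 0.5 * np.sum((c - c_rec) ** 2)
            if best is None or e_rec < best[0]:
                best = (e_rec, i, p)
        e_rec, i, p = best
        tree.append((i, i + 1, e_rec))
        seq[i:i + 2] = [p]                    # the parent replaces the pair
    return seq[0], tree
```

The parent vector produced at each merge is treated like any other vector in subsequent iterations, exactly as described above.<br />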
<br />
= Learning =<br />
The objective function ''J'' is not differentiable due to the hinge loss. Therefore, we must opt for the subgradient method (Ratliff et al., 2007), which computes a gradient-like quantity called the subgradient. Let <math>\theta = (W^{sem},W,W^{score},W^{label})</math>; then the gradient becomes:<br />
<br />
:<math><br />
\frac{\partial J}{\partial \theta} = \frac{1}{n} \sum_{i}\left(\frac{\partial s(\hat{y}_i)}{\partial \theta} - \frac{\partial s(y^*_i)}{\partial \theta}\right) + \lambda\theta,<br />
</math><br />
<br />
where <math>s(\hat{y}_i) = s(RNN(\theta,x_i,\hat{y}_{max(\tau(x_i))}))</math> is the score of the highest-scoring tree overall and <math>s(y^*_i) = s(RNN(\theta,x_i,y_{max(Y(x_i,l_i))}))</math> is the score of the highest-scoring correct tree. In order to compute this gradient, we calculate the derivative using backpropagation through structure (Goller & Küchler, 1996). L-BFGS was used over the complete training data to minimize the objective. This may cause problems for non-differentiable functions, but none were observed in practice.<br />
<br />
L-BFGS is short for limited-memory BFGS, an iterative method for solving unconstrained nonlinear optimization problems using a limited amount of computer memory. It is thus particularly suited to problems with very large numbers of variables (e.g., >1000)<ref><br />
https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm<br />
</ref>.<br />
<br />
= Results =<br />
<br />
The parameters to tune in this algorithm are ''n'', the size of the hidden layer; ''κ'', the penalization term for incorrect parsing decisions; and ''λ'', the regularization parameter. It was found that the method was quite robust, varying in performance by only a few percent across parameter combinations. The parameter values used were ''n = 100, κ = 0.05<br />
and λ = 0.001.''<br />
<br />
Additional resources that are helpful for replicating and extending this work can be found [http://www.socher.org/index.php/Main/ParsingNaturalScenesAndNaturalLanguageWithRecursiveNeuralNetworks here] on the first author's personal website. This includes the source code for the project, as well as download links for the datasets used.<br />
<br />
== Scene understanding ==<br />
<br />
<br />
For training and testing, the researchers opted for the Stanford Background dataset—a dataset that can roughly be categorized into three types: city, countryside and sea-side. The team labelled the images with these three labels, and an SVM was trained using the average over all nodes' activations in the tree as features. With an accuracy of 88.1%, this algorithm outperforms the state-of-the-art features for scene categorization, Gist descriptors, which obtained only 84.0%.<br />
The results are summarized in the following figure.<br />
[[File:pic6.png]]<br />
<br />
A single neural network layer followed by a softmax layer is also tested in this paper, which performed about 2% worse than the full RNN model.<br />
<br />
In order to show that the learned feature representations captured important appearance and label information, the researchers visualized nearest-neighbour super segments. The team computed nearest neighbours across all images and all such subtrees. The figure below shows the results. The first image in each row is a random subtree's top node, and the remaining nodes are the closest subtrees in the dataset in terms of Euclidean distance between the vector representations.<br />
<br />
[[File:pic7.png]]<br />
<br />
== Natural language processing ==<br />
<br />
The method was also tested on natural language processing with the Wall Street Journal section of the Penn Treebank and was evaluated with the F-measure (Manning & Schütze, 1999). While the widely used Berkeley parser was not outperformed, the scores are close (91.63% for the Berkeley parser vs. 90.29% for this method). Interestingly, no syntactic information of the child nodes is provided by the parser to the parent nodes; all syntactic information used is encoded in the learned continuous representations.<br />
<br />
Similar to the nearest-neighbour scene subtrees, nearest neighbours for multiword phrases were collected. For example, "All the figures are adjusted for seasonal variations" is a close neighbour to "All the numbers are adjusted for seasonal fluctuations".<br />
<br />
= Related work =<br />
Based on this work, the authors published another paper that improves semantic representations using Long Short-Term Memory (LSTM) networks, a type of recurrent neural network with a more complex computational unit. It outperforms existing systems on semantic relatedness and sentiment classification tasks.<ref><br />
Tai K S, Socher R, Manning C D. Improved semantic representations from tree-structured long short-term memory networks[J]. arXiv preprint arXiv:1503.00075, 2015.<br />
</ref><br />
<br />
=Reference=<br />
<references /></div>