from Machine Learning to Machine Reasoning
<hr />
<div>== Introduction ==<br />
Learning and reasoning are both essential abilities associated with intelligence. Consequently, machine learning and machine reasoning have received considerable attention given the short history of computer science. The statistical nature of machine learning is now well understood, but the ideas behind machine reasoning remain much more elusive. Converting ordinary data into a set of logical rules proves to be very challenging: searching the discrete space of symbolic formulas leads to combinatorial explosion <ref>Lighthill, J. [http://www.math.snu.ac.kr/~hichoi/infomath/Articles/Lighthill%20Report.pdf "Artificial intelligence: a general survey."] In Artificial intelligence: a paper symposium. Science Research Council.</ref>. Algorithms for probabilistic inference <ref>Pearl, J. [http://bayes.cs.ucla.edu/BOOK-2K/neuberg-review.pdf "Causality: models, reasoning, and inference."] Cambridge: Cambridge University Press.</ref> still suffer from unfavourable computational properties <ref>Roth, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.6074&rep=rep1&type=pdf "On the hardness of approximate reasoning"] Artificial Intelligence, 82, 273–302.</ref>. Tractable inference algorithms do exist, but they come at the price of reduced expressive power in both logical and probabilistic inference.<br />
<br />
Humans display neither of these limitations.<br />
<br />
The ability to reason is not the same as the ability to make logical inferences. The way that humans reason provides evidence to suggest the existence of a middle layer, already a form of reasoning, but not yet formal or logical. Informal logic is attractive because we hope to avoid the computational complexity that is associated with combinatorial searches in the vast space of discrete logic propositions.<br />
<br />
This paper shows how deep learning and multi-task learning can be leveraged as a rudimentary form of reasoning to help solve a task of interest.<br />
<br />
This approach is explored through a number of auxiliary tasks.<br />
<br />
== Auxiliary Tasks ==<br />
<br />
The usefulness of auxiliary tasks was examined in the context of two problems: face-based identification and natural language processing. Both examples show how an easier task (determining whether two face images show the same person) can be used to boost performance on a harder task (identifying faces) through transfer.<br />
<br />
'''Face-based Identification'''<br />
<br />
Identifying a person from face images is challenging. It remains expensive to collect and label millions of images representing the face of each subject with a good variety of positions and contexts. However, it is easier to collect training data for the slightly different task of telling whether two faces in images represent the same person or not: two faces in the same picture are likely to belong to two different people; two faces in successive video frames are likely to belong to the same person. These two tasks have much in common: image analysis primitives, feature extraction, and part recognizers trained on the auxiliary task can help solve the original task.<br />
<br />
The figure below illustrates a transfer learning strategy involving three trainable modules. The preprocessor P computes a compact face representation from the image, the comparator D compares two such representations, and the classifier C identifies the person. We first assemble two instances of P and one comparator D and train this model with abundant labels for the auxiliary task. Then we assemble another instance of P with the classifier C and train the resulting model using a small number of labelled examples from the original task.<br />
<br />
[[File:figure1.JPG | center]]<br />
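This wiring can be sketched in code. The following is a minimal, hypothetical Python illustration (all module internals and numbers are invented; the real P, D, and C are trained neural modules): the key point is that the same P instance serves both tasks.<br />

```python
class Preprocessor:              # P: image -> compact face representation
    def __init__(self):
        self.weights = [0.1, 0.2, 0.3]   # toy parameters, shared across uses

    def __call__(self, image):
        return [w * x for w, x in zip(self.weights, image)]

def comparator(r1, r2):          # D: same person iff representations are close
    dist = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return dist < 0.5

def classifier(r, prototypes):   # C: nearest labelled prototype wins
    return min(prototypes,
               key=lambda name: sum((a - b) ** 2
                                    for a, b in zip(prototypes[name], r)))

P = Preprocessor()               # one shared preprocessor instance
# Auxiliary task: abundant pairs train P and D (training loop omitted).
same = comparator(P([1.0, 1.0, 1.0]), P([1.0, 1.1, 1.0]))
# Original task: reuse the SAME P, attach C, train on few labelled examples.
prototypes = {"alice": P([1.0, 1.0, 1.0]), "bob": P([5.0, 5.0, 5.0])}
who = classifier(P([1.1, 0.9, 1.0]), prototypes)
```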
<br />
'''Natural Language Processing'''<br />
<br />
The auxiliary task in this case (left diagram of the figure below) is identifying whether a sentence is correct or not. Training on this task creates an embedding for words in a 50-dimensional space. This embedding can then be used on the primary problem (right diagram of the figure below) of producing tags for the words. Note the "W" modules shared between the tasks.<br />
<br />
[[File:word_transfer.png | center]]<br />
<br />
== Reasoning Revisited ==<br />
Little attention has been paid to the rules that describe how to assemble trainable models that perform specific tasks. However, these composition rules play an extremely important role, as they describe algebraic manipulations that let us combine previously acquired knowledge in order to create a model that addresses a new task.<br />
<br />
We now draw a bold parallel: "algebraic manipulation of previously acquired knowledge in order to answer a new question" is a plausible definition of the word "reasoning".<br />
<br />
Composition rules can be described with very different levels of sophistication. For instance, graph transformer networks (depicted in the figure below) <ref>Bottou, L., LeCun, Y., & Bengio, Y. [http://www.iro.umontreal.ca/~lisa/pointeurs/bottou-lecun-bengio-97.pdf "Global training of document processing systems using graph transformer networks."] In Proc. of computer vision and pattern recognition (pp. 489–493). New York: IEEE Press.</ref> construct specific recognition and training models for each input image using graph transduction algorithms. The specification of the graph transducers should then be viewed as a description of the composition rules.<br />
<br />
[[File:figure5.JPG | center]]<br />
<br />
== Probabilistic Models ==<br />
Graphical models describe the factorization of joint probability distributions into elementary conditional distributions with specific independence assumptions. The probabilistic rules then induce an algebraic structure on the space of conditional probability distributions, describing relations over an arbitrary set of random variables. Many refinements have been devised to make the parametrization more explicit. The plate notation<ref name=BuW><br />
Buntine, Wray L. [http://arxiv.org/pdf/cs/9412102.pdf "Operations for learning with graphical models"] in The Journal of Artificial Intelligence Research, (1994).<br />
</ref> compactly represents large graphical models with repeated structures that usually share parameters. More recent works propose considerably richer languages to describe large probabilistic graphical models. Such higher-order languages for describing probabilistic models are expressions of the composition rules described in the previous section.<br />
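The kind of factorization a graphical model encodes can be made concrete with a toy example. The following hypothetical Python snippet (probability tables invented for illustration) factorizes a joint distribution over two binary variables into P(a)·P(b|a) and recovers a marginal by summing out:<br />

```python
# Factorized model P(a, b) = P(a) * P(b | a) over binary variables.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.9, 1: 0.1},
               1: {0: 0.2, 1: 0.8}}

def joint(a, b):
    # the graphical-model factorization in one line
    return p_a[a] * p_b_given_a[a][b]

# sanity check: the factorized joint sums to 1
total = sum(joint(a, b) for a in (0, 1) for b in (0, 1))

# marginal P(b = 1), obtained by summing out a
p_b1 = sum(joint(a, 1) for a in (0, 1))
```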
<br />
== Reasoning Systems ==<br />
We are no longer fitting a simple statistical model to data; instead, we are dealing with a more complex object consisting of (a) an algebraic space of models, and (b) composition rules that establish a correspondence between the space of models and the space of questions of interest. We call such an object a "reasoning system".<br />
<br />
Reasoning systems vary widely in expressive power, predictive ability, and computational requirements. A few examples include:<br />
*''First order logic reasoning'' - Consider a space of models composed of functions that predict the truth value of first order logic formulas as a function of their free variables. This space is highly constrained by algebraic structure and hence, if we know some of these functions, we can apply logical inference to deduce or constrain other functions. First order logic is highly expressive because the bulk of mathematics can be formalized as first order logic statements <ref>Hilbert, D., & Ackermann, W.[https://www.math.uwaterloo.ca/~snburris/htdocs/scav/hilbert/hilbert.html "Grundzüge der theoretischen Logik."] Berlin: Springer.</ref>. However, it is not sufficient for expressing natural language: every first order logic formula can be expressed in natural language, but the converse is not true. Finally, first order logic usually leads to computationally expensive algorithms.<br />
<br />
*''Probabilistic reasoning'' - Consider a space of models formed by all the conditional probability distributions associated with a set of predefined random variables. These conditional distributions are highly constrained by algebraic structure and hence, we can apply Bayesian inference to make deductions. Probabilistic models are computationally cheaper, but this comes at the price of lower expressive power: probability theory can be described in first order logic, but the converse is not true.<br />
<br />
*''Causal reasoning'' - The events "it is raining" and "people carry open umbrellas" are highly correlated and mutually predictive: if people carry open umbrellas, then it is likely that it is raining. Correlation does not, however, tell you the consequences of an intervention: banning umbrellas will not stop the rain.<br />
<br />
*''Newtonian Mechanics'' - Classical mechanics is an example of the great predictive powers of causal reasoning. Newton's three laws of motion make very accurate predictions about the motion of bodies in our universe.<br />
<br />
*''Spatial reasoning'' - A change in a visual scene arising from a change in the observer's viewpoint is also subject to algebraic constraints.<br />
<br />
*''Social reasoning'' - Changes of viewpoints also play a very important role in social interactions.<br />
<br />
*''Non-falsifiable reasoning'' - Examples of non-falsifiable reasoning include mythology and astrology. Just like non-falsifiable statistical models, non-falsifiable reasoning systems are unlikely to have useful predictive capabilities.<br />
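The contrast between the probabilistic and causal bullets above can be made concrete. The following toy Python sketch (all numbers invented) shows that observing umbrellas raises the probability of rain via Bayes' rule, while intervening on umbrellas leaves the probability of rain untouched, in the spirit of Pearl's do-operator:<br />

```python
# Toy umbrella/rain example; probabilities are invented for illustration.
p_rain = 0.3
p_umbrella_given = {True: 0.9, False: 0.1}   # P(umbrella | rain)

def p_rain_given_umbrella():
    # Bayes' rule: P(rain | umbrella observed)
    num = p_rain * p_umbrella_given[True]
    den = num + (1 - p_rain) * p_umbrella_given[False]
    return num / den

# Observing umbrellas makes rain more likely...
posterior = p_rain_given_umbrella()

# ...but intervening on the effect (banning umbrellas, do(umbrella = 0))
# does not change the probability of its cause:
p_rain_after_ban = p_rain
```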
<br />
It is desirable to map the universe of reasoning systems, but unfortunately, we cannot expect such theoretical advances on schedule. We can, however, nourish our intuitions by empirically exploring the capabilities of algebraic structures designed for specific application domains.<br />
<br />
The replication of essential human cognitive processes such as scene analysis, language understanding, and social interactions forms an important class of applications. These processes probably include a form of logical reasoning, because we are able to explain our conclusions with logical arguments. However, the actual processes happen without conscious involvement, suggesting that the full complexity of logical reasoning is not required.<br />
<br />
The following sections describe more specific ideas investigating reasoning systems suitable for natural language processing and vision tasks.<br />
<br />
== Association and Dissociation ==<br />
We consider again a collection of trainable modules. The word embedding module W computes a continuous representation for each word of the dictionary. The association module A is a trainable function that takes two vectors in the representation space and produces a single vector in the same space, which is supposed to represent the association of the two inputs. Given a sentence segment composed of ''n'' words, the figure below shows how ''n-1'' applications of the association module reduce the sentence segment to a single vector. We would like this vector to be a representation of the meaning of the sentence, and each intermediate result to represent the meaning of the corresponding sentence fragment.<br />
<br />
[[File:figure6.JPG | center]]<br />
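As a minimal sketch of the reduction above, the snippet below applies a toy association module (elementwise averaging, standing in for the trained module A) ''n-1'' times to collapse a three-word segment into a single vector. The word vectors are invented:<br />

```python
def associate(u, v):
    # toy stand-in for the trained association module A
    return [(a + b) / 2 for a, b in zip(u, v)]

def reduce_right(vectors):
    # right-branching bracketing: (w1 (w2 (w3 ...)))
    out = vectors[-1]
    for v in reversed(vectors[:-1]):
        out = associate(v, out)
    return out

words = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "sat": [1.0, 1.0]}
# two applications of A reduce the three-word segment to one vector
meaning = reduce_right([words["the"], words["cat"], words["sat"]])
```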
<br />
There are many ways of bracketing the same sentence, each yielding a different interpretation. The figure below, for example, corresponds to the bracketing "''((the cat) (sat (on (the mat))))''". In order to determine which bracketing splits the sentence into the most meaningful fragments, we introduce a new scoring module R which takes a sentence fragment and measures how meaningful the corresponding fragment is.<br />
<br />
[[File:figure7.JPG | center]]<br />
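Finding the bracketing that maximizes the summed R scores can be done with a CKY-style dynamic program over spans. The sketch below is hypothetical: A and R are untrained stand-ins (averaging and coordinate sum), whereas the paper trains both modules:<br />

```python
def associate(u, v):
    # toy stand-in for the trained association module A
    return [(a + b) / 2 for a, b in zip(u, v)]

def score(vec):
    # toy stand-in for the trained saliency module R
    return sum(vec)

def best_bracketing(vectors):
    n = len(vectors)
    # best[(i, j)] = (total score, representation) for the span i..j inclusive
    best = {(i, i): (0.0, vectors[i]) for i in range(n)}  # words score 0
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span - 1
            candidates = []
            for k in range(i, j):          # try every split point
                ls, lv = best[(i, k)]
                rs, rv = best[(k + 1, j)]
                v = associate(lv, rv)
                candidates.append((ls + rs + score(v), v))
            best[(i, j)] = max(candidates, key=lambda c: c[0])
    return best[(0, n - 1)]

total, rep = best_bracketing([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
```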
<br />
The idea is to apply this R module to every intermediate result and sum all of the scores to get a global score. The task, then, is to find a bracketing that maximizes this score. There is also the challenge of training these modules to achieve the desired function. The figure below illustrates a model inspired by Collobert et al.<ref>Collobert, R., & Weston, J. [https://aclweb.org/anthology/P/P07/P07-1071.pdf "Fast semantic extraction using a novel neural network architecture."] In Proc. 45th annual meeting of the association of computational linguistics (ACL) (pp. 560–567).</ref><ref>Collobert, R. [http://ronan.collobert.com/pub/matos/2011_parsing_aistats.pdf "Deep learning for efficient discriminative parsing."] In Proc. artificial intelligence and statistics (AISTAT).</ref> This is a stochastic gradient descent method: during each iteration, a short sentence is randomly selected from a large corpus and bracketed as shown in the figure. An arbitrary word is then replaced by a random word from the vocabulary. The parameters of all the modules are then adjusted using a simple gradient descent step.<br />
<br />
[[File:figure8.JPG | center]]<br />
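One training iteration of the corrupted-word scheme just described might be sketched as follows. Everything here is a toy stand-in (one-dimensional embeddings, a hinge-style ranking loss, the gradient update omitted); the real system trains W, A, and R jointly:<br />

```python
import random

rng = random.Random(0)
vocab = ["the", "cat", "sat", "mat", "dog"]
emb = {w: [rng.uniform(-1, 1)] for w in vocab}   # 1-D embeddings for brevity
weight = [0.5]                                   # toy scoring parameter

def sentence_score(words):
    # stand-in for the global score produced by the A and R modules
    return sum(weight[0] * emb[w][0] for w in words)

def training_step(words, margin=1.0):
    corrupted = list(words)
    i = rng.randrange(len(words))
    corrupted[i] = rng.choice(vocab)   # may occasionally pick the same word
    # ranking loss: real sentence should outscore corrupted one by a margin
    loss = max(0.0, margin - sentence_score(words) + sentence_score(corrupted))
    # (gradient step on emb and weight omitted in this sketch)
    return loss

l = training_step(["the", "cat", "sat"])
```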
<br />
In order to investigate how well the system maps words to the representation space, all two-word sequences of the 500 most common words were constructed and mapped into the representation space. The figure below shows the closest neighbors in the representation space of some of these sequences.<br />
<br />
[[File:figure9.JPG | center]]<br />
<br />
The dissociation module D is the inverse of the association module: a trainable function that computes two representation-space vectors from a single vector. When its input is a meaningful output of the association module, its output should be the two inputs of the association module. Stacking one instance of the association module and one instance of the dissociation module is equivalent to an auto-encoder.<br />
<br />
The association and dissociation modules can be seen as analogous to the <code>cons</code>, <code>car</code>, and <code>cdr</code> primitives of the Lisp programming language. These primitives are used to construct a new object from two individual objects (<code>cons</code>, "association") or to extract the individual objects (<code>car</code> and <code>cdr</code>, "dissociation") from a constructed object. However, there is an important difference. The representation in Lisp is discrete, whereas the representation here lives in a continuous vector space. This limits the depth of the structures that can be constructed (because of limited numerical precision), while at the same time making other vectors in the numerical proximity of a representation also meaningful. The latter property makes search algorithms more efficient, as it is possible to follow a gradient (instead of performing discrete jumps). Note that the presented idea of association and dissociation in a vector space is very similar to what is known as Vector Symbolic Architectures.<ref><br />
[http://arxiv.org/abs/cs/0412059 Gayler, Ross W. "Vector symbolic architectures answer Jackendoff's challenges for cognitive neuroscience." arXiv preprint cs/0412059 (2004).]<br />
</ref><br />
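The cons/car/cdr parallel can be shown with a deliberately trivial sketch: an association module that concatenates and a dissociation module that splits, so the round trip is exact. Real modules keep the representation dimension fixed and are trained, which is why their round trip is only approximate and depth is limited by numerical precision:<br />

```python
def associate(u, v):
    # A as a Lisp-style cons: two 1-D "vectors" -> one 2-D vector
    return [u[0], v[0]]

def dissociate(w):
    # D as car and cdr: exactly invert A
    return [w[0]], [w[1]]

u, v = [3.0], [7.0]
# stacking A and D gives an identity auto-encoder in this toy setting
u2, v2 = dissociate(associate(u, v))
```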
<br />
[[File:figure10.JPG | center]]<br />
<br />
Association and dissociation modules are not limited to natural language processing tasks. A number of state-of-the-art systems for scene categorization and object recognition use a combination of strong local features, such as SIFT or HOG features, consolidated along a pyramidal structure. A similar pyramidal structure has been associated with the visual cortex. However, pyramidal structures work poorly as image segmentation tools. Take, for example, the figure below, which shows that a large convolutional neural network provides good object recognition accuracy but coarse segmentation. This poor performance is due to the fixed geometry of the spatial pooling layers. The lower layers aggregate local features according to a predefined pattern and pass them to upper levels; this aggregation causes poor spatial and orientation accuracy. One approach to resolving this drawback is a parsing mechanism in which intermediate representations can be attached to patches of the image. <br />
<br />
The association and dissociation modules of the sort described in this section have been given a more general treatment in recent work on recursive neural networks, which similarly apply a single function to a sequence of inputs in a pairwise fashion to build up distributed representations of data (e.g. natural language sentences or segmented images).<ref><br />
[http://www.socher.org/uploads/Main/SocherHuvalManningNg_EMNLP2012.pdf Socher, R. et al. "Semantic Compositionality through Recursive Matrix-Vector Spaces" EMNLP (2012).]<br />
</ref><ref><br />
[http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf Socher, R. et al. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank" EMNLP (2013).]<br />
</ref>. A standard recurrent network can also be thought of as a special case of this approach in which the recursive application always proceeds left to right through the input sequence (i.e. there is no branching in the tree produced by unfolding the recursion through time). <br />
<br />
<br />
[[File:figure11.JPG | center]]<br />
<br />
Finally, we envision modules that convert image representations into sentence representations and vice versa. Given an image, we could parse it and convert the final image representation into a sentence representation. Conversely, given a sentence, we could produce a sketch of the associated image by similar means.<br />
<br />
== Universal Parser ==<br />
The figure below shows a model of short-term memory (STM) capable of two possible actions: (1) inserting a new representation vector into the short-term memory, and (2) applying the association module A to two representation vectors taken from the short-term memory and replacing them with the combined representation vector. Each application of the association module is scored using the saliency scoring module R. The algorithm terminates when the STM contains a single representation vector and there are no more representation vectors to insert.<br />
<br />
[[File:figure12.JPG | center]]<br />
<br />
The algorithm design choices determine which data structure is most appropriate for implementing the STM. An English sentence arrives as a left-to-right sequence of words, so it is attractive to implement the STM as a stack and construct a shift/reduce parser.<br />
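A greedy version of this shift/reduce procedure can be sketched as follows. The association module A and saliency module R are untrained toys (averaging and coordinate sum), and the greedy policy is an assumption; a real system would be trained and might search rather than act greedily:<br />

```python
def associate(u, v):
    # toy stand-in for the trained association module A
    return [(a + b) / 2 for a, b in zip(u, v)]

def saliency(vec):
    # toy stand-in for the trained saliency module R
    return sum(vec)

def parse(vectors):
    stack, queue = [], list(vectors)
    while queue or len(stack) > 1:
        can_reduce = len(stack) >= 2
        if queue and (not can_reduce or
                      saliency(queue[0]) >=
                      saliency(associate(stack[-2], stack[-1]))):
            stack.append(queue.pop(0))                 # SHIFT the next word in
        else:
            v = stack.pop()
            stack.append(associate(stack.pop(), v))    # REDUCE the top two
    return stack[0]   # single vector left: the sentence representation
```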
<br />
== More Modules ==<br />
The previous sections discussed the association and dissociation modules. Here, we discuss a few more modules that perform predefined transformations on natural language sentences; modules that implement specific visual reasoning primitives; and modules that bridge the representations of sentences and the representations of images.<br />
<br />
*Operator grammars <ref>Harris, Z. S. [https://books.google.ca/books/about/Mathematical_structures_of_language.html?id=qsbuAAAAMAAJ&redir_esc=y "Mathematical structures of language."] Volume 21 of Interscience tracts in pure and applied mathematics.</ref> provide a mathematical description of natural languages based on transformation operators.<br />
*There is also a natural framework for such enhancements in the case of vision. Modules working on the representation vectors can model the consequences of various interventions.<br />
<br />
== Representation Space ==<br />
The previous models have functions operating on a low-dimensional vector space, but modules with similar algebraic properties could be defined on different representation spaces. Such choices have a considerable impact on the computational and practical aspects of the training algorithms.<br />
*In order to provide sufficient capacity, the trainable functions must often be designed with nonlinear parameterizations. The algorithms are simple extensions of multilayer network training procedures, using back-propagation and stochastic gradient descent.<br />
*Sparse vectors in much higher dimensional spaces are attractive because they provide the opportunity to rely more on trainable modules with linear parameterization.<br />
*The representation space can also be a space of probability distributions defined on a vector of discrete random variables. With this representation, learning and inference algorithms can be expressed in terms of stochastic sampling techniques; Gibbs sampling and Markov-chain Monte-Carlo are two prominent techniques for this purpose.<br />
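As an illustration of the sampling view, the toy Gibbs sampler below (model and coupling constant invented for illustration) alternately resamples two coupled binary variables from their conditionals; a positive coupling makes them agree more often than not:<br />

```python
import math
import random

def gibbs(steps, coupling=1.5, seed=0):
    # Toy model: P(a, b) proportional to exp(coupling * a * b), a, b in {0, 1}.
    rng = random.Random(seed)
    a, b = 0, 0
    agree = 0
    for _ in range(steps):
        # resample a from P(a | b), then b from P(b | a)
        p_a1 = math.exp(coupling * b) / (1 + math.exp(coupling * b))
        a = 1 if rng.random() < p_a1 else 0
        p_b1 = math.exp(coupling * a) / (1 + math.exp(coupling * a))
        b = 1 if rng.random() < p_b1 else 0
        agree += a == b
    return agree / steps

agreement = gibbs(5000)   # fraction of samples where a == b
```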
<br />
== Conclusions ==<br />
The research directions outlined in this paper are intended to advance the practical and conceptual understanding of the relationship between machine learning and machine reasoning. Instead of trying to bridge the gap between machine learning and "all-purpose" inference mechanisms, we can algebraically enrich the set of manipulations applicable to a training system and build reasoning abilities from the ground up.<br />
<br />
== Bibliography ==<br />
<references /></div>
<hr />
<div>== Introduction ==<br />
Learning and reasoning are both essential abilities associated with intelligence. Consequently, machine learning and machine reasoning have received considerable attention given the short history of computer science. The statistical nature of machine learning is now understood but the ideas behind machine reasoning are much more elusive. Converting ordinary data into a set of logical rules proves to be very challenging: searching the discrete space of symbolic formulas leads to combinatorial explosion <ref>Lighthill, J. [http://www.math.snu.ac.kr/~hichoi/infomath/Articles/Lighthill%20Report.pdf "Artificial intelligence: a general survey."] In Artificial intelligence: a paper symposium. Science Research Council.</ref>. Algorithms for probabilistic inference <ref>Pearl, J. [http://bayes.cs.ucla.edu/BOOK-2K/neuberg-review.pdf "Causality: models, reasoning, and inference."] Cambridge: Cambridge University Press.</ref> still suffer from unfavourable computational properties <ref>Roth, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.6074&rep=rep1&type=pdf "On the hardness of approximate reasoning"] Artificial Intelligence, 82, 273–302.</ref>. Algorithms for inference do exist but they do however, come at a price of reduced expressive capabilities in logical inference and probabilistic inference.<br />
<br />
Humans display neither of these limitations.<br />
<br />
The ability to reason is not the same as the ability to make logical inferences. The way that humans reason provides evidence to suggest the existence of a middle layer, already a form of reasoning, but not yet formal or logical. Informal logic is attractive because we hope to avoid the computational complexity that is associated with combinatorial searches in the vast space of discrete logic propositions.<br />
<br />
This paper shows how deep learning and multi-task learning can be leveraged as a rudimentary form of reasoning to help solve a task of interest.<br />
<br />
This approach is explored along a number of auxiliary tasks.<br />
<br />
== Auxiliary Tasks ==<br />
<br />
The usefulness of auxiliary tasks were examined within the contexts of two problems; face-based identification and natural language processing. Both these examples show how an easier task (determining whether two faces are different) can be used to boost performance on a harder task (identifying faces) using inference.<br />
<br />
'''Face-based Identification'''<br />
<br />
Identifying a person from face images is challenging. It remains expensive to collect and label millions of images representing the face of each subject with a good variety of positions and contexts. However, it is easier to collect training data for a slightly different task of telling whether two faces in images represent the same person or not: two faces in the same picture are likely to belong to two different people; two faces in successive video frames are likely to belong to the same person. These two tasks have much in common image analysis primitives, feature extraction, part recognizers trained on the auxiliary task can help solve the original task.<br />
<br />
Figure below illustrates a transfer learning strategy involving three trainable models. The preprocessor P computes a compact face representation of the image and the comparator labels the face. We first assemble two preprocessors P and one comparator D and train this model with abundant labels for the auxiliary task. Then we assemble another instance of P with classifier C and train the resulting model using a restrained number of labelled examples from the original task.<br />
<br />
[[File:figure1.JPG | center]]<br />
<br />
'''Natural Language Processing'''<br />
<br />
The auxiliary task in this case (left diagram of figure below) is identifying if a sentence is correct or not. This creates embedding for works in a 50 dimensional space. This embedding can than be used on the primary problem (right diagram of the figure below) of producing tags for the works. Note the shared classification "W" modules shared between the tasks.<br />
<br />
[[File:word_transfer.png | center]]<br />
<br />
== Reasoning Revisited ==<br />
Little attention has been paid to the rules that describe how to assemble trainable models that perform specific tasks. However, these composition rules play an extremely important rule as they describe algebraic manipulations that let us combine previously acquire knowledge in order to create a model that addresses a new task.<br />
<br />
We now draw a bold parallel: "algebraic manipulation of previously acquired knowledge in order to answer a new question" is a plausible definition of the word "reasoning".<br />
<br />
Composition rules can be described with very different levels of sophistication. For instance, graph transformer networks (depicted in the figure below) <ref>Bottou, L., LeCun, Y., & Bengio, Y. [http://www.iro.umontreal.ca/~lisa/pointeurs/bottou-lecun-bengio-97.pdf "Global training of document processing systems using graph transformer networks."] In Proc. of computer vision and pattern recognition (pp. 489–493). New York: IEEE Press.</ref> construct specific recognition and training models for each input image using graph transduction algorithms. The specification of the graph transducers then should be viewed as a description of the composition rules.<br />
<br />
[[File:figure5.JPG | center]]<br />
<br />
== Probabilistic Models ==<br />
Graphical models describe the factorization of joint probability distributions into elementary conditional distributions with specific independence assumptions. The probabilistic rules then induce an algebraic structure on the space of conditional probability distributions, describing relations in an arbitrary set of random variables. Many refinements have been devised to make the parametrization more explicit. The plate notation<ref name=BuW><br />
Buntine, Wray L [http://arxiv.org/pdf/cs/9412102.pdf"Operations for learning with graphical models"] in The Journal of Artificial Intelligence Research, (1994).<br />
</ref> compactly represents large graphical models with repeated structures that usually share parameters. More recent works propose considerably richer languages to describe large graphical probabilistic models. Such high order languages for describing probabilistic models are expressions of the composition rules described in the previous section.<br />
<br />
== Reasoning Systems ==<br />
We are no longer fitting a simple statistical model to data and instead, we are dealing with a more complex model consisting of (a) an algebraic space of models, and (b) composition rules that establish a correspondence between the space of models and the space of questions of interest. We call such an object a "reasoning system".<br />
<br />
Reasoning systems are unpredictable and thus vary in expressive power, predictive abilities and computational examples. A few examples include:<br />
*''First order logic reasoning'' - Consider a space of models composed of functions that predict the truth value of first order logic as a function of its free variables. This space is highly constrained by algebraic structure and hence, if we know some of these functions, we can apply logical inference to deduce or constrain other functions. First order logic is highly expressive because the bulk of mathematics can be formalized as first order logic statements <ref>Hilbert, D., & Ackermann, W.[https://www.math.uwaterloo.ca/~snburris/htdocs/scav/hilbert/hilbert.html "Grundzüge der theoretischen Logik."] Berlin: Springer.</ref>. However, this is not sufficient in expressing natural language: every first order logic formula can be expressed in natural language but the converse is not true. Finally, first order logic usually leads to computationally expensive algorithms.<br />
<br />
*''Probabilistic reasoning'' - Consider a space of models formed by all the conditional probability distributions associated with a set of predefined random variables. These conditional distributions are highly constrained by algebraic structure and hence, we can apply Bayesian inference to form deductions. Probability models are more computationally inexpensive but this comes at a price of lower expressive power: probability theory can be describe by first order logic but the converse is not true.<br />
<br />
*''Causal reasoning'' - The event "it is raining" and "people carry open umbrellas" is highly correlated and predictive: if people carry open umbrellas, then it is likely that it is raining. This does not, however, tell you the consequences of an intervention: banning umbrellas will not stop the train.<br />
<br />
*''Newtonian Mechanics'' - Classical mechanics is an example of the great predictive powers of causal reasoning. Newton's three laws of motion make very accurate predictions on the motion of bodies on our universe.<br />
<br />
*''Spatial reasoning'' - A change in visual scene with respect to one's change in viewpoint is also subjected to algebraic constraints.<br />
<br />
*''Social reasoning'' - Changes of viewpoints also play a very important role in social interactions.<br />
<br />
*''Non-falsifiable reasoning'' - Examples of non-falsifiable reasoning include mythology and astrology. Just like non-falsifiable statistical models, non-falsifiable reasoning systems are unlikely to have useful predictive capabilities.<br />
<br />
It is desirable to map the universe of reasoning system, but unfortunately, we cannot expect such theoretical advances on schedule. We can however, nourish our intuitions by empirically exploring the capabilities of algebraic structures designed for specific applicative domains.<br />
<br />
The replication of essential human cognitive processes such as scene analysis, language understanding, and social interactions form an important class of applications. These processes probably include a form of logical reasoning because are able to explain our conclusions with logical arguments. However, the actual processes happen without conscious involvement suggesting that the full complexity of logic reasoning is not required.<br />
<br />
The following sections describe more specific ideas investigating reasoning systems suitable for natural language processing and vision tasks.<br />
<br />
== Association and Dissociation ==<br />
We consider again a collection of trainable modules. The word embedding module W computes a continuous representation for each word of the dictionary. The association module is a trainable function that takes two vectors representation space and produces a single vector in the same space, which is suppose to represent the association of the two inputs. Given a sentence segment composed of ''n'' words, the figure below shows how ''n-1'' applications of the association module reduce the sentence segment to a single vector. We would like this vector to be a representation of the meaning of this sentence and each intermediate result to represent the meaning of the corresponding sentence fragment.<br />
<br />
[[File:figure6.JPG | center]]<br />
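As a concrete illustration, the reduction above can be sketched numerically. The tanh-of-concatenation parameterization, the toy dimensionality, and the random untrained weights below are assumptions for illustration only; the paper does not prescribe a specific functional form for the association module.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4  # toy representation-space dimensionality

# Hypothetical untrained parameters of the association module A.
W_assoc = rng.standard_normal((DIM, 2 * DIM)) * 0.1

def associate(u, v):
    """A(u, v): map two representation-space vectors to one vector
    in the same space."""
    return np.tanh(W_assoc @ np.concatenate([u, v]))

# Toy word-embedding module: one random vector per word.
vocab = {w: rng.standard_normal(DIM)
         for w in ["the", "cat", "sat", "on", "mat"]}

def reduce_left_to_right(words):
    """n-1 applications of A collapse an n-word segment to one vector."""
    vec = vocab[words[0]]
    for w in words[1:]:
        vec = associate(vec, vocab[w])
    return vec

sentence_vec = reduce_left_to_right(["the", "cat", "sat", "on", "the", "mat"])
```

Here the bracketing is fixed left-to-right; the sections below discuss how to score and search over alternative bracketings.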
<br />
There are many ways of bracketing the same sentence, each yielding a different meaning. The figure below, for example, corresponds to the bracketing "''((the cat) (sat (on (the mat))))''". In order to determine which bracketing splits the sentence into the most meaningful fragments, we introduce a new scoring module R, which takes a sentence fragment and measures how meaningful that fragment is.<br />
<br />
[[File:figure7.JPG | center]]<br />
<br />
The idea is to apply this R module to every intermediate result and sum all of the scores to get a global score. The task then is to find a bracketing that maximizes this score. There is also the challenge of training these modules to achieve the desired function. The figure below illustrates a model inspired by Collobert et al.<ref>Collobert, R., & Weston, J. [https://aclweb.org/anthology/P/P07/P07-1071.pdf "Fast semantic extraction using a novel neural network architecture."] In Proc. 45th annual meeting of the association of computational linguistics (ACL) (pp. 560–567).</ref><ref>Collobert, R. [http://ronan.collobert.com/pub/matos/2011_parsing_aistats.pdf "Deep learning for efficient discriminative parsing."] In Proc. artificial intelligence and statistics (AISTAT).</ref> This is a stochastic gradient descent method: during each iteration, a short sentence is randomly selected from a large corpus and bracketed as shown in the figure. An arbitrary word is then replaced by a random word from the vocabulary. The parameters of all the modules are then adjusted using a simple gradient descent step.<br />
<br />
[[File:figure8.JPG | center]]<br />
<br />
In order to investigate how well the system maps words to the representation space, all two-word sequences of the 500 most common words were constructed and mapped into the representation space. The figure below shows the closest neighbors in the representation space of some of these sequences.<br />
<br />
[[File:figure9.JPG | center]]<br />
<br />
The dissociation module D is the opposite of the association module, that is, a trainable function that computes two representation space vectors from a single vector. When its input is a meaningful output of the association module, its output should be the two inputs of the association module. Stacking one instance of the association module and one instance of the dissociation module is equivalent to an auto-encoder.<br />
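The auto-encoder view can be sketched as follows. The linear parameterizations and the random untrained weights are assumptions for illustration; training would minimize the reconstruction loss shown here over many input pairs.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 4

# Hypothetical linear association A and dissociation D modules.
Wa = rng.standard_normal((DIM, 2 * DIM)) * 0.1   # A: R^{2d} -> R^d
Wd = rng.standard_normal((2 * DIM, DIM)) * 0.1   # D: R^d -> R^{2d}

def associate(u, v):
    """A(u, v): combine two vectors into one."""
    return Wa @ np.concatenate([u, v])

def dissociate(z):
    """D(z): split one vector back into two."""
    out = Wd @ z
    return out[:DIM], out[DIM:]

def autoencoder_loss(u, v):
    """Stacking A then D should reproduce the two inputs: the squared
    reconstruction error is exactly an auto-encoder objective."""
    u_hat, v_hat = dissociate(associate(u, v))
    return float(np.sum((u - u_hat) ** 2) + np.sum((v - v_hat) ** 2))

u, v = rng.standard_normal(DIM), rng.standard_normal(DIM)
loss = autoencoder_loss(u, v)
```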
<br />
The association and dissociation modules can be seen as analogous to the <code>cons</code>, <code>car</code>, and <code>cdr</code> primitives of the Lisp programming language. These primitives are used to construct a new object from two individual objects (<code>cons</code>, "association") or extract the individual objects (<code>car</code> and <code>cdr</code>, "dissociation") from a constructed object. However, there is an important difference. The representation in Lisp is discrete, whereas the representation here is in a continuous vector space. This limits the depth of structures that can be constructed (because of limited numerical precision), while at the same time it makes other vectors in numerical proximity of a representation also meaningful. This latter property makes search algorithms more efficient as it is possible to follow a gradient (instead of performing discrete jumps). Note that the presented idea of association and dissociation in a vector space is very similar to what is known as Vector Symbolic Architectures.<ref><br />
[http://arxiv.org/abs/cs/0412059 Gayler, Ross W. "Vector symbolic architectures answer Jackendoff's challenges for cognitive neuroscience." arXiv preprint cs/0412059 (2004).]<br />
</ref><br />
<br />
[[File:figure10.JPG | center]]<br />
<br />
Association and dissociation modules are not limited to natural language processing tasks. A number of state-of-the-art systems for scene categorization and object recognition use a combination of strong local features, such as SIFT or HOG features, consolidated along a pyramidal structure. A similar pyramidal structure has been associated with the visual cortex. However, pyramidal structures work poorly as image segmentation tools. Take, for example, the figure below, which shows that a large convolutional neural network provides good object recognition accuracy but coarse segmentation. This poor performance is due to the fixed geometry of the spatial pooling layers: the lower layers aggregate local features according to a predefined pattern and pass them to upper layers, and this aggregation causes poor spatial and orientation accuracy. One approach for resolving this drawback is a parsing mechanism in which intermediate representations can be attached to image patches.<br />
<br />
Association-dissociation modules of the sort described in this section have been given a more general treatment in recent work on recursive neural networks, which similarly apply a single function to a sequence of inputs in a pairwise fashion to build up distributed representations of data (e.g. natural language sentences or segmented images).<ref><br />
[http://www.socher.org/uploads/Main/SocherHuvalManningNg_EMNLP2012.pdf Socher, R. et al. "Semantic Compositionality through Recursive Matrix-Vector Spaces" EMNLP (2012).]<br />
</ref><ref><br />
[http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf Socher, R. et al. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank" EMNLP (2013).]<br />
</ref>. A standard recurrent network can also be thought of as a special case of this approach in which the recursive application always proceeds left to right through the input sequence (i.e. there is no branching in the tree produced by unfolding the recursion through time). <br />
<br />
<br />
[[File:figure11.JPG | center]]<br />
<br />
Finally, we envision modules that convert image representations into sentence representations and conversely. Given an image, we could parse the image and convert the final image representation into a sentence representation. Conversely, given a sentence, we could produce a sketch of the associated image by similar means.<br />
<br />
== Universal Parser ==<br />
The figure below shows a model of short-term memory (STM) capable of two possible actions: (1) inserting a new representation vector into the short-term memory, and (2) applying the association module A to two representation vectors taken from the short-term memory and replacing them with the combined representation vector. Each application of the association module is scored using the saliency scoring module R. The algorithm terminates when the STM contains a single representation vector and there are no more representation vectors to insert.<br />
<br />
[[File:figure12.JPG | center]]<br />
<br />
The algorithm design choices determine which data structure is most appropriate for implementing the STM. In the English language, sentences are written as sequences of words separated by spaces, so it is attractive to implement the STM as a stack and construct a shift/reduce parser.<br />
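The stack-based STM can be sketched as a greedy shift/reduce loop. The association and saliency modules below are hypothetical untrained stand-ins, and the greedy reduce-when-salient policy is one possible search strategy, not a procedure prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
DIM = 4
Wa = rng.standard_normal((DIM, 2 * DIM)) * 0.1  # association module A
r = rng.standard_normal(DIM) * 0.1              # saliency scorer R

def associate(u, v):
    return np.tanh(Wa @ np.concatenate([u, v]))

def saliency(vec):
    return float(r @ vec)

def shift_reduce_parse(word_vecs):
    """Greedy shift/reduce: the STM is a stack. Reduce (action 2) when
    the combined representation scores well, otherwise shift the next
    word vector onto the stack (action 1)."""
    stack, buffer = [], list(word_vecs)
    total_score = 0.0
    while buffer or len(stack) > 1:
        can_reduce = len(stack) >= 2
        if can_reduce:
            combined = associate(stack[-2], stack[-1])
            reduce_score = saliency(combined)
        if can_reduce and (not buffer or reduce_score > 0.0):
            stack[-2:] = [combined]       # replace top two with A(u, v)
            total_score += reduce_score   # accumulate R scores
        else:
            stack.append(buffer.pop(0))   # insert next vector into STM
    return stack[0], total_score

vecs = [rng.standard_normal(DIM) for _ in range(5)]
root, score = shift_reduce_parse(vecs)
```

The loop terminates exactly under the condition stated above: the stack holds a single vector and the input buffer is empty.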
<br />
== More Modules ==<br />
The previous sections discussed the association and dissociation modules. Here, we discuss a few more modules that perform predefined transformations on natural language sentences; modules that implement specific visual reasoning primitives; and modules that bridge the representations of sentences and the representations of images.<br />
<br />
*Operator grammars <ref>Harris, Z. S. [https://books.google.ca/books/about/Mathematical_structures_of_language.html?id=qsbuAAAAMAAJ&redir_esc=y "Mathematical structures of language."] Volume 21 of Interscience tracts in pure and applied mathematics.</ref> provide a mathematical description of natural languages based on transformation operators.<br />
*There is also a natural framework for such enhancements in the case of vision. Modules working on the representation vectors can model the consequences of various interventions.<br />
<br />
== Representation Space ==<br />
The previous models use functions operating on a low-dimensional vector space, but modules with similar algebraic properties could be defined on different representation spaces. Such choices have a considerable impact on the computational and practical aspects of the training algorithms.<br />
*In order to provide sufficient capabilities, the trainable functions must often be designed with nonlinear parameterizations. The algorithms are simple extensions of the multilayer network training procedures, using back-propagation and stochastic gradient descent.<br />
*Sparse vectors in much higher dimensional spaces are attractive because they provide the opportunity to rely more on trainable modules with linear parameterization.<br />
*The representation space can also be a space of probability distributions defined on a vector of discrete random variables.<br />
<br />
== Conclusions ==<br />
The research directions outlined in this paper are intended to advance the practical and conceptual understanding of the relationship between machine learning and machine reasoning. Instead of trying to bridge the gap between machine learning systems and "all-purpose" inference mechanisms, we can instead algebraically enrich the set of manipulations applicable to a training system and build reasoning abilities from the ground up.<br />
<br />
== Bibliography ==<br />
<references /></div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Sparse_Rectifier_Neural_Networks&diff=27348deep Sparse Rectifier Neural Networks2015-12-17T22:28:24Z<p>Mgohari2: </p>
<hr />
<div>= Introduction =<br />
<br />
Machine learning scientists and computational neuroscientists approach neural networks differently. Machine learning scientists aim to obtain models that are easy to train and generalize well, while neuroscientists' objective is to produce models that faithfully represent biological data. In other words, machine learning scientists care more about efficiency, while neuroscientists care more about the biological plausibility of the model.<br />
<br />
In this paper, the authors show that two common gaps between computational neuroscience models and machine learning neural network models can be bridged by the rectifier activation function. One is the gap between deep networks learnt with and without unsupervised pre-training; the other is the gap between the activation function and sparsity in neural networks.<br />
<br />
== Biological Plausibility and Sparsity ==<br />
<br />
In the brain, neurons rarely fire at the same time, as a way to balance quality of representation and energy conservation. This is in stark contrast to sigmoid neurons, which fire at 1/2 of their maximum rate at zero input. A solution to this problem is to use a rectifier neuron, which does not fire at its zero value. This rectified linear unit is inspired by a common biological model of the neuron, the leaky integrate-and-fire (LIF) model, described by Dayan and Abbott<ref><br />
Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems<br />
</ref>. Its response function is illustrated in the figure below (middle).<br />
<br />
<gallery><br />
Image:sig_neuron.png|Sigmoid and TANH Neuron<br />
Image:lif_neuron.png|Leaky Integrate Fire Neuron<br />
Image:rect_neuron.png|Rectified Linear Neuron<br />
</gallery><br />
<br />
Given that the rectifier neuron has a larger range of inputs that will be output as zero, its representation will naturally be more sparse. In the paper, the two most salient advantages of sparsity are:<br />
<br />
- '''Information Disentangling''' As opposed to a dense representation, where every slight input change results in a considerable output change, the non-zero items of a sparse representation remain almost constant under slight input changes.<br />
<br />
- '''Variable Dimensionality''' A sparse representation can effectively choose how many dimensions to use to represent a variable, since it chooses how many non-zero elements contribute. Thus, the precision is variable, allowing for more efficient representation of complex items.<br />
<br />
Further benefits of a sparse representation, and of rectified linear neurons in particular, are better linear separability (because the input is represented in a higher-dimensional space) and lower computational complexity (most units are off, and for the active units only a linear function has to be computed).<br />
<br />
However, it should also be noted that sparsity reduces the capacity of the model because each unit takes part in the representation of fewer values.<br />
<br />
== Advantages of rectified linear units ==<br />
<br />
The rectifier activation function <math>\,max(0, x)</math> allows a network to easily obtain sparse representations since only a subset of hidden units will have a non-zero activation value for some given input and this sparsity can be further increased through regularization methods. Therefore, the rectified linear activation function will utilize the advantages listed in the previous section for sparsity.<br />
<br />
For a given input, only a subset of hidden units in each layer will have non-zero activation values. The rest of the hidden units will have zero activations and are essentially turned off. Each hidden unit activation value is then composed of a linear combination of the active (non-zero) hidden units in the previous layer, due to the linearity of the rectified linear function. By repeating this through each layer, one can see that the neural network is actually an exponentially increasing number of linear models that share parameters, since the later layers reuse the same values from the earlier layers. Since the active part of the network is linear, the gradient is easy to compute and travels back through the active nodes without the vanishing gradient problem caused by non-linear sigmoid or tanh functions. In addition to the standard ReLU, there are three modified variants: Leaky, Parametric, and Randomized Leaky ReLU.<br />
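The standard rectifier and the three variants mentioned can be written down directly. The slope parameters below are common illustrative defaults, not values taken from this paper; the final lines illustrate the sparsity claim on random inputs.

```python
import numpy as np

def relu(x):
    """Standard rectifier: max(0, x)."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: a small fixed slope alpha on the negative side."""
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):
    """Parametric ReLU: same form, but alpha is a learned parameter."""
    return np.where(x > 0, x, alpha * x)

def rrelu(x, low=1 / 8, high=1 / 3, rng=np.random.default_rng(0)):
    """Randomized Leaky ReLU: alpha sampled from [low, high] in training."""
    alpha = rng.uniform(low, high, size=np.shape(x))
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
out = relu(x)  # zero for all non-positive inputs

# Sparsity: on zero-mean Gaussian pre-activations, roughly half the
# units are exactly off after the standard rectifier.
h = relu(np.random.default_rng(1).standard_normal(1000))
sparsity = np.mean(h == 0)
```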
<br />
The sparsity and linear model can be seen in the figure the researchers made:<br />
<br />
[[File:RLU.PNG]]<br />
<br />
Each layer is a linear combination of the previous layer.<br />
<br />
== Potential problems of rectified linear units ==<br />
<br />
The zero derivative below zero in rectified neurons blocks the back-propagation of the gradient during learning. This effect was investigated using a smooth variant of the rectification non-linearity (the softplus activation). Surprisingly, the results suggest that the hard rectification performs better. The authors hypothesize that the hard rectification is not a problem as long as the gradient can be propagated along some paths through the network, and that the complete shut-off with the hard rectification sharpens the credit attribution to neurons in the learning phase.<br />
<br />
Furthermore, the unbounded nature of the rectification non-linearity can lead to numerical instabilities if activations grow too large. To circumvent this, an <math>L_1</math> regularizer is used. Also, if symmetry is required, it can be obtained by using two rectifier units with shared parameters, though this requires twice as many hidden units as a network with a symmetric activation function.<br />
<br />
Finally, rectifier networks are subject to ill conditioning of the parametrization. Biases and weights can be scaled in different (and consistent) ways while preserving the same overall network function.<br />
<br />
This paper also addresses several difficulties that arise when using rectifier activations in a stacked denoising auto-encoder. The authors experimented with several strategies to solve these problems.<br />
<br />
1. Use a softplus activation function for the reconstruction layer, along with a quadratic cost: <math> L(x, \theta) = ||x-log(1+exp(f(\tilde{x}, \theta)))||^2</math><br />
<br />
2. Scale the rectifier activation values between 0 and 1, then use a sigmoid activation function for the reconstruction layer, along with a cross-entropy reconstruction cost: <math> L(x, \theta) = -xlog(\sigma(f(\tilde{x}, \theta))) - (1-x)log(1-\sigma(f(\tilde{x}, \theta))) </math><br />
<br />
The first strategy yields better generalization on image data and the second on text data.<br />
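The two reconstruction costs can be written down directly from the formulas above. The input vector and the pre-activation <code>f</code> below are arbitrary illustrations, not values from the paper.

```python
import numpy as np

def softplus(x):
    """Smooth rectifier: log(1 + exp(x))."""
    return np.log1p(np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss_quadratic(x, pre_activation):
    """Strategy 1: softplus reconstruction + quadratic cost
    (reported to work better on image data)."""
    return float(np.sum((x - softplus(pre_activation)) ** 2))

def loss_cross_entropy(x, pre_activation, eps=1e-12):
    """Strategy 2: activations scaled to [0, 1], sigmoid reconstruction
    + cross-entropy cost (reported to work better on text data)."""
    p = sigmoid(pre_activation)
    return float(-np.sum(x * np.log(p + eps) + (1 - x) * np.log(1 - p + eps)))

x = np.array([0.2, 0.8, 0.0])       # target (scaled to [0, 1])
f = np.array([0.1, -0.3, 0.5])      # f(x~, theta): hypothetical pre-activation
l1, l2 = loss_quadratic(x, f), loss_cross_entropy(x, f)
```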
<br />
= Experiments =<br />
<br />
Networks with rectifier neurons were applied to the domains of image recognition and sentiment analysis. The datasets for image recognition included black-and-white (MNIST, NISTP), colour (CIFAR10), and stereo (NORB) images.<br />
<br />
The datasets for sentiment analysis were taken from opentable.com and Amazon. The task for both was to predict the star rating from the text of the review.<br />
<br />
== Results ==<br />
<br />
'''Results from image classification'''<br />
[[File:rectifier_res_1.png]]<br />
<br />
'''Results from sentiment classification'''<br />
[[File:rectifier_res_2.png]]<br />
<br />
For the image recognition tasks, they find that there is almost no improvement when using unsupervised pre-training with rectifier activations, contrary to what is experienced using tanh or softplus; the rectifier network achieves its best performance even without unsupervised pre-training.<br />
<br />
In the NORB and sentiment analysis cases, the network benefited greatly from pre-training. However, the benefit in NORB diminished as the training set size grew.<br />
<br />
The result from the Amazon dataset was 78.95%, while the state of the art was 73.72%.<br />
<br />
The sparsity achieved with the rectified linear neurons helps to diminish the gap between networks with unsupervised pre-training and no pre-training.<br />
<br />
== Discussion / Criticism ==<br />
<br />
* Rectifier neurons really aren't biologically plausible for a variety of reasons. Namely, the neurons in the cortex do not have tuning curves resembling the rectifier. Additionally, the ideal sparsity of the rectifier networks was between 50 and 80%, while the brain is estimated to have a sparsity of around 95 to 99%.<br />
<br />
* The sparsity property encouraged by ReLU is a double-edged sword. While sparsity encourages information disentangling, efficient variable-size representation, linear separability, and increased robustness, as suggested by the authors of this paper, Szegedy et al.<ref>Szegedy, Christian, et al. "Going deeper with convolutions." arXiv preprint arXiv:1409.4842 (2014).</ref> argue that computing on sparse non-uniform data structures is very inefficient: the overhead and cache misses would make sparse data structures hard to justify computationally.<br />
<br />
* ReLU does not suffer from the vanishing gradient problem.<br />
<br />
* ReLU units can be prone to "die"; in other words, a unit may output the same value regardless of its input. This occurs when a large negative bias is learnt, causing the output of the ReLU to be zero; the unit then gets stuck at zero because the gradient at zero is zero. Techniques such as Leaky ReLU and Maxout mitigate this problem.<br />
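The dying-ReLU effect in the last point can be illustrated numerically. The bias value below is an artificial example chosen to kill the unit; the leaky slope is a common illustrative default.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(1000)   # typical zero-mean inputs
w, b = 1.0, -10.0               # a learnt large negative bias

pre = w * x + b
relu_out = np.maximum(0.0, pre)
# The unit is "dead": it outputs zero for essentially every input, and
# the gradient of max(0, .) is zero wherever the pre-activation <= 0,
# so gradient descent can never push the bias back.
dead = bool(np.all(relu_out == 0))

# Leaky ReLU keeps a small slope alpha on the negative side, so the
# gradient never fully vanishes and the unit can recover.
alpha = 0.01
leaky_out = np.where(pre > 0, pre, alpha * pre)
leaky_grad = np.where(pre > 0, 1.0, alpha)
```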
<br />
= Bibliography =<br />
<references /></div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=distributed_Representations_of_Words_and_Phrases_and_their_Compositionality&diff=27327distributed Representations of Words and Phrases and their Compositionality2015-12-15T21:18:24Z<p>Mgohari2: </p>
<hr />
<div>= Introduction =<br />
<br />
This paper<ref><br />
Mikolov, Tomas, et al. [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf "Distributed representations of words and phrases and their compositionality."] Advances in neural information processing systems. 2013.<br />
</ref> presents several extensions of the Skip-gram model introduced by Mikolov et al. <ref name=MiT> Mikolov, Tomas, ''et al'' [http://arxiv.org/pdf/1301.3781v3.pdf"Efficient Estimation of Word Representations in Vector Space"] in ICLR Workshop, (2013). </ref>. The Skip-gram model is an efficient method for learning high-quality vector representations of words from large amounts of unstructured text data. The word representations computed using this model are very interesting because the learned vectors explicitly encode many linguistic regularities and patterns. Somewhat surprisingly, many of these patterns can be represented as linear translations. For example, the result of a vector calculation vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector. The authors of this paper show that subsampling of frequent words during training results in a significant speedup and improves accuracy of the representations of less frequent words. In addition, a simplified variant of Noise Contrastive Estimation (NCE) <ref name=GuM><br />
Gutmann, Michael U, ''et al'' [http://www.cs.helsinki.fi/u/ahyvarin/papers/Gutmann12JMLR.pdf "Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics"] in The Journal of Machine Learning Research, (2012).<br />
</ref> for training the Skip-gram model is presented that results in faster training and better vector representations for frequent words, compared to the more complex hierarchical softmax that was used in the prior work <ref name=MiT></ref>. The paper also shows that a non-obvious degree of language understanding can be obtained by using basic mathematical operations on the word vector representations. For example, vec(“Russia”) + vec(“river”) is close to vec(“Volga River”), and vec(“Germany”) + vec(“capital”) is close to vec(“Berlin”).<br />
<br />
= The Skip-gram Model =<br />
<br />
The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document. More formally, given a sequence of training words <math>w_1, w_2,..., w_T</math> the objective of the Skip-gram model is to maximize the average log probability:<br />
<br />
<math><br />
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c\leq j\leq c,\, j\neq 0} \log p(w_{t+j}|w_t)<br />
</math><br />
<br /><br />
<br /><br />
where <math>c</math> is the size of the training context (which can be a function of the center word <math>w_t</math>) and <math>p(w_{t+j}|w_t)</math> is defined using softmax function:<br />
<br />
<math><br />
p(w_O|w_I) = \frac{exp ({v'_{w_O}}^T v_{w_I})}{\sum_{w=1}^{W} exp ({v'_{w}}^T v_{w_I})}<br />
</math><br />
<br />
Here, <math>v_w</math> and <math>v'_w</math> are the “''input''” and “''output''” vector representations of <math>w</math>, and <math>W</math> is the number of words in the vocabulary.<br />
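The full-softmax probability can be sketched directly from this definition. The toy vocabulary size, dimensionality, and random vectors below are illustrative only; the point is that the denominator costs O(W) per evaluation, which is what the approximations in the next sections avoid.

```python
import numpy as np

rng = np.random.default_rng(4)
W, DIM = 10, 5                          # toy vocabulary size and dimension
V_in = rng.standard_normal((W, DIM))    # "input" vectors v_w
V_out = rng.standard_normal((W, DIM))   # "output" vectors v'_w

def p_out_given_in(o, i):
    """Full softmax p(w_O | w_I): one dot product per vocabulary word,
    hence O(W) cost per evaluation."""
    scores = V_out @ V_in[i]            # v'_w^T v_{w_I} for every w
    scores -= scores.max()              # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

probs = np.array([p_out_given_in(o, 3) for o in range(W)])
```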
<br />
== Hierarchical Softmax ==<br />
<br />
Hierarchical Softmax is a computationally efficient approximation of the full softmax <ref name=MoF><br />
Morin, Frederic, ''et al'' [http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf "Hierarchical probabilistic neural network language model"] in Proceedings of the international workshop on artificial intelligence and statistics, (2005).<br />
</ref>. Hierarchical Softmax evaluates only about <math>log_2(W)</math> output nodes instead of evaluating all <math>W</math> nodes in the neural network to obtain the probability distribution.<br />
<br />
The hierarchical softmax uses a binary tree representation of the output layer with the <math>W</math> words as its leaves and, for each node, explicitly represents the relative probabilities of its child nodes. These define a random walk that assigns probabilities to words.<br />
<br />
Let <math>n(w,j)</math> be the <math>j^{th}</math> node on the path from the root to <math>w</math>, and let <math>L(w)</math> be the length of this path, so <math>n(w,1) = root</math> and <math>n(w,L(w)) = w</math>. In addition, for any inner node <math>n</math>, let <math>ch(n)</math> be an arbitrary fixed child of <math>n</math> and let <math>[[x]]</math> be 1 if <math>x</math> is true and -1 otherwise. Then the hierarchical softmax defines <math>p(w_O|w_I )</math> as follows:<br />
<br />
<math><br />
p(w|w_I) = \prod_{j=1}^{L(w)-1} \sigma ([[n(w,j+1)=ch(n(w,j))]]{v'_{n(w,j)}}^T v_{W_I}) <br />
</math><br />
<br />
where<br />
<br />
<math><br />
\sigma (x)=\frac{1}{1+exp(-x)}<br />
</math><br />
<br />
In this paper, a binary Huffman tree is used as the structure for the hierarchical softmax because it assigns short codes to the frequent words which results in fast training. It has been observed before that grouping words together by their frequency works well as a very simple speedup technique for the neural network based language models <ref name=MiT></ref><ref name=MiT2><br />
Mikolov, Tomas, ''et al'' [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5947611"Extensions of recurrent neural network language model."] in Acoustics, Speech and Signal Processing (ICASSP), (2011).<br />
</ref>.<br />
<br />
<br />
<br />
== Negative Sampling==<br />
<br />
Noise Contrastive Estimation (NCE) is an alternative to the hierarchical softmax. NCE posits that a good model should be able to differentiate data from noise by means of logistic regression. While NCE can be shown to approximately maximize the log probability of the softmax, the Skip-gram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality. Negative sampling (NEG) is defined by the objective:<br />
<br />
<math><br />
log \sigma ({v'_{W_O}}^T v_{W_I})+\sum_{i=1}^{k} \mathbb{E}_{w_i\sim P_n(w)}[log \sigma ({-v'_{W_i}}^T v_{W_I})]<br />
</math><br />
<br />
The main difference between the Negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples. And while NCE approximately maximizes the log probability of the softmax, this property is not important for our application.<br /><br />
Both NCE and NEG have the noise distribution <math>P_n(w)</math> as a free parameter. The authors investigated a number of choices for <math>P_n(w)</math> and found that the unigram distribution <math>U(w)</math> raised to the 3/4 power (i.e., <math>U(w)^{3/4}/Z</math>) significantly outperformed the unigram and the uniform distributions, for both NCE and NEG, on every task tried, including language modeling.<br />
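The NEG objective and the <math>U(w)^{3/4}</math> noise distribution can be sketched as follows. The toy counts and random vectors are illustrative only; a real implementation would update the vectors by gradient ascent on this objective.

```python
import numpy as np

rng = np.random.default_rng(5)
counts = np.array([100, 50, 20, 5, 1], dtype=float)  # toy unigram counts

# Noise distribution P_n(w) = U(w)^{3/4} / Z.
noise = counts ** 0.75
noise /= noise.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_objective(v_out_pos, v_in, V_out_neg):
    """NEG objective for one (input, output) pair with k negatives:
    log sigma(v'_O^T v_I) + sum_k log sigma(-v'_k^T v_I)."""
    pos = np.log(sigmoid(v_out_pos @ v_in))
    neg = np.log(sigmoid(-V_out_neg @ v_in)).sum()
    return float(pos + neg)

DIM, k = 4, 3
v_in = rng.standard_normal(DIM)
v_pos = rng.standard_normal(DIM)
neg_ids = rng.choice(len(counts), size=k, p=noise)  # sample k negatives
V_neg = rng.standard_normal((k, DIM))
obj = neg_objective(v_pos, v_in, V_neg)
```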
<br />
==Subsampling of Frequent Words==<br />
In very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g., “in”, “the”, and “a”). Such words usually provide less information about the surrounding words than rarer words do (e.g., "the" provides little information about the next word because it co-occurs with a huge number of words), and the representation of a frequent word is unlikely to change significantly after many further iterations. <br />
<br />
To counter the imbalance between the rare and frequent words, a simple subsampling approach is used. Each word <math>w_i</math> in the training set is discarded with probability computed by the formula:<br />
<br />
<math><br />
P(w_i)=1-\sqrt{\frac{t}{f(w_i)}}<br />
</math><br />
<br />
where <math>f(w_i)</math> is the frequency of word <math>w_i</math> and <math>t</math> is a chosen threshold, typically around <math>10^{−5}</math>.<br />
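The discard probability can be computed directly from this formula. Clipping at zero for words rarer than the threshold is an implementation detail assumed here; the example frequencies are illustrative.

```python
import numpy as np

def discard_prob(freq, t=1e-5):
    """P(w_i) = 1 - sqrt(t / f(w_i)), clipped at 0: words with
    frequency below the threshold t are never discarded."""
    return max(0.0, 1.0 - float(np.sqrt(t / freq)))

# A very frequent word (illustrative frequency ~0.07, like "the") is
# discarded almost always; a word rarer than t is always kept.
p_frequent = discard_prob(0.07)
p_rare = discard_prob(1e-6)
```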
<br />
= Empirical Results=<br />
<br />
The Hierarchical Softmax (HS), Noise Contrastive Estimation, Negative Sampling, and subsampling of the training words are evaluated on the analogical reasoning task <ref name=MiT></ref>. The task consists of analogies such as “Germany” : “Berlin” :: “France” : ?, which are solved by finding a vector ''x'' such that vec(''x'') is closest to vec(“Berlin”) - vec(“Germany”) + vec(“France”) according to the cosine distance. This specific example is considered to have been answered correctly if ''x'' is “Paris”. The task has two broad categories: the syntactic analogies (such as “quick” : “quickly” :: “slow” : “slowly”) and the semantic analogies, such as the country to capital city relationship.<br />
<br />
For training the Skip-gram models, a large dataset consisting of various news articles is used (an internal Google dataset with one billion words). All words that occurred less than 5 times in the training data were discarded, which resulted in a vocabulary of size 692K. The performance of various Skip-gram models on the word analogy test set is reported in Table 1. The table shows that Negative Sampling outperforms the Hierarchical Softmax on the analogical reasoning task, and has even slightly better performance than the Noise Contrastive Estimation. The subsampling of the frequent words improves the training speed several times and makes the word representations significantly more accurate.<br />
<br />
<center><br />
[[File:Tb_1.PNG | frame | center |Table 1. Accuracy of various Skip-gram 300-dimensional models on the analogical reasoning task as defined in <ref name=MiT></ref>. NEG-''k'' stands for Negative Sampling with ''k'' negative samples for each positive sample; NCE stands for Noise Contrastive Estimation and HS-Huffman stands for the Hierarchical Softmax with the frequency-based Huffman codes. ]]<br />
</center><br />
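The analogy evaluation above is vector arithmetic plus a cosine-similarity nearest-neighbor search over the vocabulary, excluding the three query words. The toy embeddings below are constructed to satisfy the linear regularity rather than learned, so this is a sketch of the evaluation procedure, not of a trained model.

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical toy embeddings; real vectors come from a trained model.
words = ["germany", "berlin", "france", "paris", "river"]
E = {w: rng.standard_normal(8) for w in words}
# Force the linear regularity for illustration:
E["paris"] = E["berlin"] - E["germany"] + E["france"]

def answer_analogy(a, b, c, embeddings):
    """Solve a : b :: c : ? by cosine similarity to vec(b)-vec(a)+vec(c),
    excluding the three query words from the candidates."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for w, v in embeddings.items():
        if w in (a, b, c):
            continue
        sim = (v / np.linalg.norm(v)) @ target
        if sim > best_sim:
            best, best_sim = w, sim
    return best

ans = answer_analogy("germany", "berlin", "france", E)
```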
<br />
=Learning Phrases=<br />
<br />
Many phrases have a meaning that is not a simple composition of the meanings of its individual words. To learn vector representation for phrases, we first find words that appear frequently together, and infrequently in other contexts. For example, “''New York Times''” and “''Toronto Maple Leafs''” are replaced by unique tokens in the training data, while a bigram “''this is''” will remain unchanged. This way, we can form many reasonable phrases without greatly increasing the size of the vocabulary; in theory, we can train the Skip-gram model using all n-grams, but that would be too memory intensive. A simple data-driven approach, where phrases are formed based on the unigram and bigram counts is applied to identify the phrases. In this approach, a ''score'' is calculated as:<br />
<br />
<math><br />
score(w_i,w_j)=\frac{count(w_iw_j)-\delta}{count(w_i)count(w_j)}<br />
</math><br />
<br />
The <math>\delta</math> is used as a discounting coefficient and prevents too many phrases consisting of very infrequent words from being formed. The bigrams with ''scores'' above the chosen threshold are then used as phrases. The quality of the phrase representations is evaluated using a new analogical reasoning task that involves phrases. Table 2 shows examples of the five categories of analogies used in this task.<br />
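The unigram/bigram phrase score can be sketched directly from the formula. The tiny corpus and the discount value below are illustrative; a real run would use corpus-scale counts and repeated passes with a decreasing threshold.

```python
from collections import Counter

def phrase_scores(tokens, delta=5.0):
    """score(w_i, w_j) = (count(w_i w_j) - delta) /
    (count(w_i) * count(w_j)), computed for every bigram."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {bg: (c - delta) / (unigrams[bg[0]] * unigrams[bg[1]])
            for bg, c in bigrams.items()}

tokens = ("new york " * 8 + "new day york city ").split()
scores = phrase_scores(tokens, delta=2.0)
# "new york" co-occurs far more often than chance, so it scores high;
# the discount drives the score of a one-off bigram like "new day"
# below zero, so it would never pass a positive threshold.
```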
<br />
<center><br />
[[File:Tb_2.PNG | frame | center |Table 2. Examples of the analogical reasoning task for phrases (the full test set has 3218 examples). The goal is to compute the fourth phrase using the first three. Our best model achieved an accuracy of 72% on this dataset.]]<br />
</center><br />
<br />
==Phrase Skip-Gram Results==<br />
<br />
First, the phrase based training corpus is constructed and then Skip-gram models are trained using different hyperparameters. Table 3 shows the results using vector dimensionality 300 and context size 5. This setting already achieves good performance on the phrase dataset, and allowed us to quickly compare the Negative Sampling and the Hierarchical Softmax, both with and without subsampling of the frequent tokens. The results show that while Negative Sampling achieves a respectable accuracy even with ''k = 5'', using ''k = 15'' achieves considerably better performance. Also, the subsampling can result in faster training and can also improve accuracy, at least in some cases.<br />
<br />
<center><br />
[[File:Tb_3.PNG | frame | center |Table 3. Accuracies of the Skip-gram models on the phrase analogy dataset. The models were trained on approximately one billion words from the news dataset.]]<br />
</center><br />
<br />
<br />
The amount of training data was increased to 33 billion words in order to maximize the accuracy on the phrase analogy task. Hierarchical softmax, a dimensionality of 1000, and the entire sentence as the context were used. This resulted in a model that reached an accuracy of 72%. Reducing the size of the training dataset to 6 billion words lowered the accuracy to 66%, which suggests that a large amount of training data is crucial. To gain further insight into how different the representations learned by different models are, nearest neighbors of infrequent phrases were inspected manually using various models. Table 4 shows a sample of such a comparison. Consistent with the previous results, it seems that the best representations of phrases are learned by a model with the hierarchical softmax and subsampling.<br />
<br />
<center><br />
[[File:Tb_4.PNG | frame | center |Table 4. Examples of the closest entities to the given short phrases, using two different models.]]<br />
</center><br />
<br />
=Additive Compositionality=<br />
<br />
The word and phrase representations learned by the Skip-gram model exhibit a linear structure that makes it possible to perform precise analogical reasoning using simple vector arithmetic. Moreover, the Skip-gram representations exhibit another kind of linear structure that makes it possible to meaningfully combine words by an element-wise addition of their vector representations. This phenomenon is illustrated in Table 5. The additive property of the vectors can be explained by inspecting the training objective. The word vectors are in a linear relationship with the inputs to the softmax nonlinearity. As the word vectors are trained to predict the surrounding words in the sentence, the vectors can be seen as representing the distribution of the contexts in which a word appears. These values are related logarithmically to the probabilities computed by the output layer, so the sum of two word vectors is related to the product of the two context distributions. The product works here as an AND function: words that are assigned high probabilities by both word vectors will have high probability, and the other words will have low probability.<br />
<br />
<center><br />
[[File:Tb_5.PNG | frame | center |Table 5. Vector compositionality using element-wise addition. Four closest tokens to the sum of two vectors are shown, using the best Skip-gram model.]]<br />
</center><br />
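The composition-by-addition idea can be made concrete with a short sketch. The vectors and vocabulary below are toy values invented purely for illustration, not real Skip-gram output; nearest neighbors are ranked by cosine similarity, the measure used in the analogy evaluation:<br />

```python
import numpy as np

# Toy 3-dimensional "embeddings"; the values are invented for illustration
# and are NOT real Skip-gram vectors.
vocab = {
    "russia": np.array([0.9, 0.1, 0.0]),
    "river":  np.array([0.0, 0.8, 0.3]),
    "volga":  np.array([0.6, 0.7, 0.2]),
    "berlin": np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(query, exclude=()):
    """Word whose vector is closest to `query` by cosine similarity."""
    candidates = {w: v for w, v in vocab.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(candidates[w], query))

# Element-wise addition acts like an AND over the two context distributions.
composed = vocab["russia"] + vocab["river"]
print(nearest(composed, exclude=("russia", "river")))  # prints "volga"
```

In a trained model the same search runs over hundreds of thousands of vectors, but the operation is unchanged: add the vectors, then take the cosine nearest neighbor.<br />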
<br />
=Comparison to Published Word Representations=<br />
<br />
Table 6 shows an empirical comparison between different neural network-based word representations by listing the nearest neighbors of infrequent words. These examples show that the big Skip-gram model trained on a large corpus visibly outperforms all the other models in the quality of the learned representations. This can be attributed in part to the fact that this model has been trained on about 30 billion words, which is about two to three orders of magnitude more data than the typical size used in the prior work. Interestingly, although the training set is much larger, the training time of the Skip-gram model was just a fraction of the time required by the previous model architectures.<br />
<br />
<center><br />
[[File:Tb_6.PNG | frame | center |Table 6. Examples of the closest tokens given various well-known models and the Skip-gram model trained on phrases using over 30 billion training words. An empty cell means that the word was not in the vocabulary.]]<br />
</center><br />
<br />
=Conclusion=<br />
<br />
This work has the following key contributions:<br />
<br />
1. It shows how to train distributed representations of words and phrases with the Skip-gram model and demonstrates that these representations exhibit a linear structure that makes precise analogical reasoning possible.<br />
<br />
2. The model architecture is computationally efficient, which made it possible to train models on several orders of magnitude more data than the previously published models.<br />
<br />
3. It introduces the Negative Sampling algorithm, an extremely simple training method that learns accurate representations, especially for frequent words.<br />
<br />
4. The choice of the training algorithm and the hyper-parameters is a task-specific decision. It is shown that the most crucial decisions affecting performance are the choice of the model architecture, the size of the vectors, the subsampling rate, and the size of the training window.<br />
<br />
5. The word vectors can be meaningfully combined using just simple vector addition. Another approach for learning phrase representations presented in this paper is to simply represent each phrase with a single token. Combining these two approaches gives a simple yet powerful way to represent longer pieces of text while keeping the computational complexity minimal.<br />
<br />
Le et al.<ref><br />
Le Q, Mikolov T. [http://arxiv.org/pdf/1405.4053v2.pdf "Distributed Representations of Sentences and Documents"]. Proceedings of the 31st International Conference on Machine Learning, 2014 </ref> have applied the ideas of this paper to learning paragraph vectors. In that later work, paragraph vectors are used for predicting the next word: every word and every paragraph is mapped to a unique vector, stored as a column of the matrix W or D respectively, and the paragraph vector is concatenated with the context word vectors to predict the next word. <br />
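A minimal sketch of this paragraph-vector forward pass, not the authors' implementation: the sizes, the use of a single context word, and all parameter values are illustrative assumptions.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

V, P, d = 10, 3, 4                # vocab size, paragraph count, vector dim
W = rng.normal(size=(d, V))       # word vectors, one per column
D = rng.normal(size=(d, P))       # paragraph vectors, one per column
U = rng.normal(size=(V, 2 * d))   # softmax weights over concatenated features

def predict_next(paragraph_id, context_word_id):
    """Concatenate the paragraph vector with a (single, for brevity)
    context word vector, then softmax over the vocabulary."""
    h = np.concatenate([D[:, paragraph_id], W[:, context_word_id]])
    logits = U @ h
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    return probs / probs.sum()

p = predict_next(0, 5)
print(p.shape, round(float(p.sum()), 6))  # prints "(10,) 1.0"
```

Training would backpropagate the prediction loss into U, W, and the paragraph column of D; at inference time, only a new paragraph's column is fitted.<br />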
<br />
= Recursive Autoencoder =<br />
<br />
This section is based on the paper 'Semi-supervised recursive autoencoders for predicting sentiment distributions'.<ref> Socher, et al. [http://www.socher.org/uploads/Main/SocherPenningtonHuangNgManning_EMNLP2011.pdf] </ref><br />
=== Other techniques for sentence representation ===<br />
<br />
The idea of the Recursive Autoencoder is summarized in the figure below. The example illustrates the application of a recursive autoencoder to a binary tree.<br />
<center><br />
[[File:Recur-auto.png]]<br />
</center><br />
<br />
Given a list of word vectors <math> x = (x_1, ..., x_m)</math>, the binary tree defines triplets of parents with their children: <math> (y_1 \rightarrow x_3x_4), (y_2 \rightarrow x_2y_1), (y_3 \rightarrow x_1y_2) </math>.<br />
<br />
The first parent <math> y_1 </math> is computed from the children <math> (c_1, c_2) = (x_3, x_4)</math> as <math> y_1 = p = f(W^{(1)}[c_1; c_2] + b^{(1)})</math>, where <math>W^{(1)}</math> is a parameter matrix and <math>b^{(1)}</math> is a bias term. <br />
<br />
The autoencoder then reconstructs the children: <math> [c_1'; c_2'] = W^{(2)}p + b^{(2)}</math>. The objective of this method is to minimize the mean squared error between the original children <math>[c_1; c_2]</math> and the reconstruction <math>[c_1'; c_2']</math>.<br />
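One encode/reconstruct step can be sketched as follows, assuming tanh for the nonlinearity <math>f</math> and random toy parameters (in the actual method these are learned by backpropagation through the tree):<br />

```python
import numpy as np

rng = np.random.default_rng(42)
d = 5  # word-vector dimensionality

# Randomly initialized toy parameters; in the actual method they are learned.
W1, b1 = rng.normal(size=(d, 2 * d)), np.zeros(d)       # encoder
W2, b2 = rng.normal(size=(2 * d, d)), np.zeros(2 * d)   # decoder

def encode(c1, c2):
    """Parent p = f(W1 [c1; c2] + b1), here with f = tanh."""
    return np.tanh(W1 @ np.concatenate([c1, c2]) + b1)

def reconstruction_error(c1, c2):
    """MSE between the original children [c1; c2] and their reconstruction."""
    p = encode(c1, c2)
    reconstruction = W2 @ p + b2
    target = np.concatenate([c1, c2])
    return float(np.mean((reconstruction - target) ** 2))

x3, x4 = rng.normal(size=d), rng.normal(size=d)
err = reconstruction_error(x3, x4)  # training would minimize this over the tree
```

Applying the same step recursively, with each parent reused as a child higher up, yields a fixed-size vector for the whole sentence.<br />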
<br />
=Resources=<br />
<br />
The code for training the word and phrase vectors based on this paper is available in the open source project [https://code.google.com/p/word2vec/ word2vec]. This project also contains a set of pre-trained 300-dimensional vectors for 3 million words and phrases.<br />
<br />
=References=<br />
<references /></div>
<hr />
<div>= Introduction =<br />
<br />
This paper<ref><br />
Mikolov, Tomas, et al. [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf "Distributed representations of words and phrases and their compositionality."] Advances in neural information processing systems. 2013.<br />
</ref> presents several extensions of the Skip-gram model introduced by Mikolov et al. <ref name=MiT> Mikolov, Tomas, ''et al'' [http://arxiv.org/pdf/1301.3781v3.pdf"Efficient Estimation of Word Representations in Vector Space"] in ICLR Workshop, (2013). </ref>. The Skip-gram model is an efficient method for learning high-quality vector representations of words from large amounts of unstructured text data. The word representations computed using this model are very interesting because the learned vectors explicitly encode many linguistic regularities and patterns. Somewhat surprisingly, many of these patterns can be represented as linear translations. For example, the result of a vector calculation vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector. The authors of this paper show that subsampling of frequent words during training results in a significant speedup and improves accuracy of the representations of less frequent words. In addition, a simplified variant of Noise Contrastive Estimation (NCE) <ref name=GuM><br />
Gutmann, Michael U, ''et al'' [http://www.cs.helsinki.fi/u/ahyvarin/papers/Gutmann12JMLR.pdf"Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics"] in The Journal ofMachine Learning Research, (2012).<br />
</ref>. for training the Skip-gram model is presented that results in faster training and better vector representations for frequent words, compared to more complex hierarchical softmax that was used in the prior work <ref name=MiT></ref>. It also shows that a non-obvious degree of language understanding can be obtained by using basic mathematical operations on the word vector representations. For example, vec(“Russia”) + vec(“river”) is close to vec(“Volga River”), and vec(“Germany”) + vec(“capital”) is close to vec(“Berlin”).<br />
<br />
= The Skip-gram Model =<br />
<br />
The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document. More formally, given a sequence of training words <math>w_1, w_2,..., w_T</math> the objective of the Skip-gram model is to maximize the average log probability:<br />
<br />
<math><br />
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c\leq j\leq c} log(p(w_{t+j}|w_t))<br />
</math><br />
<br /><br />
<br /><br />
where <math>c</math> is the size of the training context (which can be a function of the center word <math>w_t</math>) and <math>p(w_{t+j}|w_t)</math> is defined using softmax function:<br />
<br />
<math><br />
p(w_O|w_I) = \frac{exp ({v'_{W_O}}^T v_{W_I})}{\sum{w=1}^{W} exp ({v'_{W}}^T v_{W_I})}<br />
</math><br />
<br />
Here, <math>v_w</math> and <math>v'_w</math> are the “''input''” and “''output''” vector representations of <math>w</math>, and <math>W</math> is the number of words in the vocabulary.<br />
<br />
== Hierarchical Softmax ==<br />
<br />
Hierarchical Softmax is a computationally efficient approximation of the full softmax <ref name=MoF><br />
Morin, Frederic, ''et al'' [http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf"Hierarchical probabilistic neural network language model"] in Proceedings of the international workshop on artificial intelligence and statistics, (2015).<br />
</ref>. Hierarchical Softmax evaluate only about <math>log_2(W)</math> output nodes instead of evaluating <math>W</math> nodes in the neural network to obtain the probability distribution.<br />
<br />
The hierarchical softmax uses a binary tree representation of the output layer with the <math>W</math> words as its leaves and, for each node, explicitly represents the relative probabilities of its child nodes. These define a random walk that assigns probabilities to words.<br />
<br />
Let <math>n(w,j)</math> be the <math>j^{th}</math> node on the path from the root to <math>w</math>, and let <math>L(w)</math> be the length of this path, so <math>n(w,1) = root</math> and <math>n(w,L(w)) = w</math>. In addition, for any inner node <math>n</math>, let <math>ch(n)</math> be an arbitrary fixed child of <math>n</math> and let <math>[[x]]</math> be 1 if <math>x</math> is true and -1 otherwise. Then the hierarchical softmax defines <math>p(w_O|w_I )</math> as follows:<br />
<br />
<math><br />
p(w|w_I) = \prod_{j=1}^{L(w)-1} \sigma ([[n(w,j+1)=ch(n(w,j))]]{v'_{n(w,j)}}^T v_{W_I}) <br />
</math><br />
<br />
where<br />
<br />
<math><br />
\sigma (x)=\frac{1}{1+exp(-x)}<br />
</math><br />
<br />
In this paper, a binary Huffman tree is used as the structure for the hierarchical softmax because it assigns short codes to the frequent words which results in fast training. It has been observed before that grouping words together by their frequency works well as a very simple speedup technique for the neural network based language models <ref name=MiT></ref><ref name=MiT2><br />
Mikolov, Tomas, ''et al'' [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5947611"Extensions of recurrent neural network language model."] in Acoustics, Speech and Signal Processing (ICASSP), (2011).<br />
</ref>.<br />
<br />
== Negative Sampling==<br />
<br />
Noise Contrastive Estimation (NCE) is an alternative to the hierarchical softmax. NCE indicates that a good model should be able to differentiate data from noise by means of logistic regression. While NCE can be shown to approximately maximize the log probability of the softmax, the Skipgram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality. Negative sampling (NEG) is defined by the objective:<br />
<br />
<math><br />
log \sigma ({v'_{W_O}}^T v_{W_I})+\sum_{i=1}^{k} \mathbb{E}_{w_i\sim P_n(w)}[log \sigma ({-v'_{W_i}}^T v_{W_I})]<br />
</math><br />
<br />
The main difference between Negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples. And while NCE approximately maximizes the log probability of the softmax, this property is not important for our application.<br /><br />
Both NCE and NEG have the noise distribution <math>P_n(w)</math> as a free parameter. We investigated a number of choices for <math>P_n(w)</math> and found that the unigram distribution <math>U(w)</math> raised to the 3/4 power (i.e., <math>U(w)^{3/4}/Z</math>) significantly outperformed the unigram and the uniform distributions, for both NCE and NEG, on every task we tried, including language modeling.<br />
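A minimal sketch of the smoothed noise distribution and negative-sample draws follows; the word counts are invented for the example, not taken from the paper's corpus.<br />

```python
import random

# Illustrative unigram counts (made up for this example).
counts = {"the": 1000, "cat": 50, "sat": 20, "zygote": 2}

# Raise the unigram distribution to the 3/4 power and renormalize by Z.
weights = {w: c ** 0.75 for w, c in counts.items()}
Z = sum(weights.values())
P_n = {w: v / Z for w, v in weights.items()}

def draw_negatives(k, rng=random):
    """Draw k negative samples w_i ~ P_n(w)."""
    words = list(P_n)
    return rng.choices(words, weights=[P_n[w] for w in words], k=k)

# The 3/4 power flattens the distribution: frequent words are sampled
# less often than their raw frequency suggests, rare words more often.
raw_share = counts["the"] / sum(counts.values())
smoothed_share = P_n["the"]
```

In this toy case "the" covers about 93% of the raw counts but only about 86% of the smoothed noise distribution, which is the flattening effect that helped in the experiments.<br />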
<br />
==Subsampling of Frequent Words==<br />
In very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g., “in”, “the”, and “a”). Such words usually provide less information about the surrounding words than rarer words do (e.g., "the" provides little information about the next word because it co-occurs with a huge number of words), and the representation of a frequent word is unlikely to change significantly after many iterations. <br />
<br />
To counter the imbalance between the rare and frequent words, a simple subsampling approach is used. Each word <math>w_i</math> in the training set is discarded with probability computed by the formula:<br />
<br />
<math><br />
P(w_i)=1-\sqrt{\frac{t}{f(w_i)}}<br />
</math><br />
<br />
where <math>f(w_i)</math> is the frequency of word <math>w_i</math> and <math>t</math> is a chosen threshold, typically around <math>10^{-5}</math>.<br />
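The discard rule above can be sketched directly; the relative frequencies below are invented for illustration.<br />

```python
import math
import random

def discard_prob(freq, t=1e-5):
    """P(w_i) = 1 - sqrt(t / f(w_i)); words with f(w_i) <= t are never discarded."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

def subsample(tokens, freqs, t=1e-5, rng=random):
    """Drop each token independently with its discard probability."""
    return [w for w in tokens if rng.random() >= discard_prob(freqs[w], t)]

# Illustrative relative frequencies (made up): "the" is very frequent,
# "france" is rare, so "the" is discarded with probability ~0.99 while
# "france" is always kept.
freqs = {"the": 0.05, "france": 1e-6}
kept = subsample(["the", "france"], freqs)
```

Note the clamp to zero: for rare words the formula would otherwise give a negative probability, so they are simply never discarded.<br />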
<br />
= Empirical Results=<br />
<br />
The Hierarchical Softmax (HS), Noise Contrastive Estimation, Negative Sampling, and subsampling of the training words are evaluated with the help of the analogical reasoning task <ref name=MiT></ref>. The task consists of analogies such as “Germany” : “Berlin” :: “France” : ?, which are solved by finding a vector ''x'' such that vec(''x'') is closest to vec(“Berlin”) - vec(“Germany”) + vec(“France”) according to the cosine distance. This specific example is considered to have been answered correctly if ''x'' is “Paris”. The task has two broad categories: the syntactic analogies (such as “quick” : “quickly” :: “slow” : “slowly”) and the semantic analogies, such as the country to capital city relationship.<br />
<br />
For training the Skip-gram models, a large dataset consisting of various news articles is used (an internal Google dataset with one billion words). All words that occurred less than 5 times in the training data were discarded, which resulted in a vocabulary of size 692K. The performance of various Skip-gram models on the word analogy test set is reported in Table 1. The table shows that Negative Sampling outperforms the Hierarchical Softmax on the analogical reasoning task, and has even slightly better performance than the Noise Contrastive Estimation. The subsampling of the frequent words improves the training speed several times and makes the word representations significantly more accurate.<br />
<br />
<center><br />
[[File:Tb_1.PNG | frame | center |Table 1. Accuracy of various Skip-gram 300-dimensional models on the analogical reasoning task as defined in <ref name=MiT></ref>. NEG-''k'' stands for Negative Sampling with ''k'' negative samples for each positive sample; NCE stands for Noise Contrastive Estimation and HS-Huffman stands for the Hierarchical Softmax with the frequency-based Huffman codes. ]]<br />
</center><br />
<br />
=Learning Phrases=<br />
<br />
Many phrases have a meaning that is not a simple composition of the meanings of their individual words. To learn vector representations for phrases, we first find words that appear frequently together, and infrequently in other contexts. For example, “''New York Times''” and “''Toronto Maple Leafs''” are replaced by unique tokens in the training data, while a bigram “''this is''” will remain unchanged. This way, we can form many reasonable phrases without greatly increasing the size of the vocabulary; in theory, we can train the Skip-gram model using all n-grams, but that would be too memory intensive. A simple data-driven approach, where phrases are formed based on the unigram and bigram counts, is applied to identify the phrases. In this approach, a ''score'' is calculated as:<br />
<br />
<math><br />
score(w_i,w_j)=\frac{count(w_iw_j)-\delta}{count(w_i)count(w_j)}<br />
</math><br />
<br />
Here, <math>\delta</math> is used as a discounting coefficient that prevents the formation of too many phrases consisting of very infrequent words. The bigrams with ''scores'' above the chosen threshold are then used as phrases. The quality of the phrase representations is evaluated using a new analogical reasoning task that involves phrases. Table 2 shows examples of the five categories of analogies used in this task.<br />
<br />
<center><br />
[[File:Tb_2.PNG | frame | center |Table 2. Examples of the analogical reasoning task for phrases (the full test set has 3218 examples). The goal is to compute the fourth phrase using the first three. Our best model achieved an accuracy of 72% on this dataset.]]<br />
</center><br />
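The bigram scoring rule can be sketched as follows; all counts, the discount <math>\delta = 5</math>, and the threshold are invented for illustration.<br />

```python
def phrase_score(w1, w2, unigram, bigram, delta=5):
    """score(w_i, w_j) = (count(w_i w_j) - delta) / (count(w_i) * count(w_j))."""
    return (bigram.get((w1, w2), 0) - delta) / (unigram[w1] * unigram[w2])

# Illustrative counts (made up): "new york" co-occurs far more often than
# chance predicts, while "this is" is frequent only because both of its
# words are individually frequent.
unigram = {"new": 100, "york": 40, "this": 10000, "is": 20000}
bigram = {("new", "york"): 35, ("this", "is"): 500}

threshold = 1e-4
ny = phrase_score("new", "york", unigram, bigram)  # passes the threshold
ti = phrase_score("this", "is", unigram, bigram)   # does not
```

With these numbers only "new york" would be merged into a single token, which is the intended behavior of the score.<br />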
<br />
==Phrase Skip-Gram Results==<br />
<br />
First, the phrase based training corpus is constructed and then Skip-gram models are trained using different hyperparameters. Table 3 shows the results using vector dimensionality 300 and context size 5. This setting already achieves good performance on the phrase dataset, and allowed us to quickly compare the Negative Sampling and the Hierarchical Softmax, both with and without subsampling of the frequent tokens. The results show that while Negative Sampling achieves a respectable accuracy even with ''k = 5'', using ''k = 15'' achieves considerably better performance. Also, the subsampling can result in faster training and can also improve accuracy, at least in some cases.<br />
<br />
<center><br />
[[File:Tb_3.PNG | frame | center |Table 3. Accuracies of the Skip-gram models on the phrase analogy dataset. The models were trained on approximately one billion words from the news dataset.]]<br />
</center><br />
<br />
<br />
The amount of the training data was increased to 33 billion words in order to maximize the accuracy on the phrase analogy task. Hierarchical softmax, dimensionality of 1000, and the entire sentence for the context were used. This resulted in a model that reached an accuracy of 72%. Reducing the size of the training dataset to 6 billion words lowered the accuracy to 66%, which suggests that a large amount of training data is crucial. To gain further insight into how different the representations learned by different models are, nearest neighbors of infrequent phrases were inspected manually using various models. Table 4 shows a sample of such a comparison. Consistent with the previous results, it seems that the best representations of phrases are learned by a model with the hierarchical softmax and subsampling.<br />
<br />
<center><br />
[[File:Tb_4.PNG | frame | center |Table 4. Examples of the closest entities to the given short phrases, using two different models.]]<br />
</center><br />
<br />
=Additive Compositionality=<br />
<br />
The word and phrase representations learned by the Skip-gram model exhibit a linear structure that makes it possible to perform precise analogical reasoning using simple vector arithmetics. Also, the Skip-gram representations exhibit another kind of linear structure that makes it possible to meaningfully combine words by an element-wise addition of their vector representations. This phenomenon is illustrated in Table 5. The additive property of the vectors can be explained by inspecting the training objective. The word vectors are in a linear relationship with the inputs to the softmax nonlinearity. As the word vectors are trained to predict the surrounding words in the sentence, the vectors can be seen as representing the distribution of the context in which a word appears. These values are related logarithmically to the probabilities computed by the output layer, so the sum of two word vectors is related to the product of the two context distributions. The product works here as the AND function: words that are assigned high probabilities by both word vectors will have high probability, and the other words will have low probability.<br />
<br />
<center><br />
[[File:Tb_5.PNG | frame | center |Table 5. Vector compositionality using element-wise addition. Four closest tokens to the sum of two vectors are shown, using the best Skip-gram model.]]<br />
</center><br />
<br />
=Comparison to Published Word Representations=<br />
<br />
Table 6 shows the empirical comparison between different neural network-based representations of words by showing the nearest neighbors of infrequent words. These examples show that the big Skip-gram model trained on a large corpus visibly outperforms all the other models in the quality of the learned representations. This can be attributed in part to the fact that this model has been trained on about 30 billion words, which is about two to three orders of magnitude more data than the typical size used in the prior work. Interestingly, although the training set is much larger, the training time of the Skip-gram model is just a fraction of the time required by the previous model architectures.<br />
<br />
<center><br />
[[File:Tb_6.PNG | frame | center |Table 6. Examples of the closest tokens given various well-known models and the Skip-gram model trained on phrases using over 30 billion training words. An empty cell means that the word was not in the vocabulary.]]<br />
</center><br />
<br />
=Conclusion=<br />
<br />
This work has the following key contributions:<br />
<br />
1. This work shows how to train distributed representations of words and phrases with the Skip-gram model and demonstrates that these representations exhibit linear structure that makes precise analogical reasoning possible.<br />
<br />
2. It presents a computationally efficient model architecture, which makes it possible to train models on several orders of magnitude more data than the previously published models.<br />
<br />
3. It introduces the Negative Sampling algorithm, an extremely simple training method that learns accurate representations, especially for frequent words.<br />
<br />
4. The choice of the training algorithm and the hyper-parameter selection is a task specific decision. It is shown that the most crucial decisions that affect the performance are the choice of the model architecture, the size of the vectors, the subsampling rate, and the size of the training window.<br />
<br />
5. The word vectors can be meaningfully combined using just simple vector addition. Another approach for learning representations of phrases presented in this paper is to simply represent the phrases with a single token. Combining these two approaches gives a powerful yet simple way to represent longer pieces of text, while having minimal computational complexity.<br />
<br />
= Recursive Autoencoder =<br />
<br />
This section is based on the paper 'Semi-supervised recursive autoencoders for predicting sentiment distributions'.<ref> Socher, et al. [http://www.socher.org/uploads/Main/SocherPenningtonHuangNgManning_EMNLP2011.pdf] </ref><br />
=== Other techniques for sentence representation ===<br />
<br />
The idea of the Recursive Autoencoder is summarized in the figure below. The example illustrates a recursive autoencoder applied to a binary tree.<br />
<center><br />
[[File:Recur-auto.png]]<br />
</center><br />
<br />
Assume we are given a list of word vectors <math> x = (x_1, ..., x_m)</math> together with a binary tree structure specified as a list of parent-children triplets: <math> (y_1 \rightarrow x_3x_4), (y_2 \rightarrow x_2y_1), (y_3 \rightarrow x_1y_2) </math>.<br />
<br />
The first parent <math> y_1 </math> is computed from the children <math> (c_1, c_2) = (x_3, x_4)</math> as <math> p=f(W^{(1)}[c_1; c_2] + b^{(1)})</math>, where <math>W^{(1)}</math> is a parameter matrix and <math>b^{(1)}</math> is a bias term. <br />
<br />
The autoencoder then reconstructs the children as <math> [c'_1; c'_2] = W^{(2)}p + b^{(2)}</math>. The objective of this method is to minimize the mean squared error between the original children <math>[c_1; c_2]</math> and the reconstruction <math>[c'_1; c'_2]</math>.<br />
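A single encode/reconstruct step of this autoencoder can be sketched as below. The embedding size, the random weights, and the choice of <math>f = \tanh</math> are all assumptions made for illustration (the nonlinearity is not specified in the text above).<br />

```python
import math
import random

random.seed(0)

def tanh_vec(v):
    return [math.tanh(x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def vadd(a, b):
    return [x + y for x, y in zip(a, b)]

n = 2  # toy embedding size (illustrative)
# W1: n x 2n encoder matrix, W2: 2n x n decoder matrix, biases b1, b2.
W1 = [[random.uniform(-0.5, 0.5) for _ in range(2 * n)] for _ in range(n)]
b1 = [0.0] * n
W2 = [[random.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(2 * n)]
b2 = [0.0] * (2 * n)

def encode(c1, c2):
    """Parent p = f(W1 [c1; c2] + b1), here with f = tanh (assumed)."""
    return tanh_vec(vadd(matvec(W1, c1 + c2), b1))

def reconstruct(p):
    """Reconstructed children [c1'; c2'] = W2 p + b2."""
    return vadd(matvec(W2, p), b2)

def reconstruction_error(c1, c2):
    """Squared error between the original and reconstructed children."""
    rec = reconstruct(encode(c1, c2))
    return sum((a - b) ** 2 for a, b in zip(rec, c1 + c2))

x3, x4 = [0.1, -0.2], [0.3, 0.4]
y1 = encode(x3, x4)  # parent of the two rightmost words in the example
err = reconstruction_error(x3, x4)
```

Training would adjust <math>W^{(1)}, W^{(2)}, b^{(1)}, b^{(2)}</math> by gradient descent to drive this reconstruction error down over all nodes of the tree.<br />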
<br />
=Resources=<br />
<br />
The code for training the word and phrase vectors based on this paper is available in the open source project [https://code.google.com/p/word2vec/ word2vec]. This project also contains a set of pre-trained 300-dimensional vectors for 3 million words and phrases.<br />
<br />
=References=<br />
<references /></div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=distributed_Representations_of_Words_and_Phrases_and_their_Compositionality&diff=27324distributed Representations of Words and Phrases and their Compositionality2015-12-15T20:43:57Z<p>Mgohari2: </p>
<hr />
<div>= Introduction =<br />
<br />
This paper<ref><br />
Mikolov, Tomas, et al. [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf "Distributed representations of words and phrases and their compositionality."] Advances in neural information processing systems. 2013.<br />
</ref> presents several extensions of the Skip-gram model introduced by Mikolov et al. <ref name=MiT> Mikolov, Tomas, ''et al'' [http://arxiv.org/pdf/1301.3781v3.pdf"Efficient Estimation of Word Representations in Vector Space"] in ICLR Workshop, (2013). </ref>. The Skip-gram model is an efficient method for learning high-quality vector representations of words from large amounts of unstructured text data. The word representations computed using this model are very interesting because the learned vectors explicitly encode many linguistic regularities and patterns. Somewhat surprisingly, many of these patterns can be represented as linear translations. For example, the result of a vector calculation vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector. The authors of this paper show that subsampling of frequent words during training results in a significant speedup and improves accuracy of the representations of less frequent words. In addition, a simplified variant of Noise Contrastive Estimation (NCE) <ref name=GuM><br />
Gutmann, Michael U, ''et al'' [http://www.cs.helsinki.fi/u/ahyvarin/papers/Gutmann12JMLR.pdf "Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics"] in The Journal of Machine Learning Research, (2012).<br />
</ref> for training the Skip-gram model is presented; it results in faster training and better vector representations for frequent words than the more complex hierarchical softmax used in the prior work <ref name=MiT></ref>. It also shows that a non-obvious degree of language understanding can be obtained by using basic mathematical operations on the word vector representations. For example, vec(“Russia”) + vec(“river”) is close to vec(“Volga River”), and vec(“Germany”) + vec(“capital”) is close to vec(“Berlin”).<br />
<br />
= The Skip-gram Model =<br />
<br />
The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document. More formally, given a sequence of training words <math>w_1, w_2,..., w_T</math> the objective of the Skip-gram model is to maximize the average log probability:<br />
<br />
<math><br />
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c\leq j\leq c} log(p(w_{t+j}|w_t))<br />
</math><br />
<br /><br />
<br /><br />
where <math>c</math> is the size of the training context (which can be a function of the center word <math>w_t</math>) and <math>p(w_{t+j}|w_t)</math> is defined using softmax function:<br />
<br />
<math><br />
p(w_O|w_I) = \frac{exp ({v'_{W_O}}^T v_{W_I})}{\sum_{w=1}^{W} exp ({v'_{w}}^T v_{W_I})}<br />
</math><br />
<br />
Here, <math>v_w</math> and <math>v'_w</math> are the “''input''” and “''output''” vector representations of <math>w</math>, and <math>W</math> is the number of words in the vocabulary.<br />
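As a concrete toy illustration of the softmax above, the following sketch computes <math>p(w_O|w_I)</math> for a hand-made three-word vocabulary; all words and vectors are invented for the example, not taken from the paper.<br />

```python
import math

def softmax_prob(w_O, w_I, v_in, v_out):
    """p(w_O | w_I) = exp(v'_{w_O} . v_{w_I}) / sum_w exp(v'_w . v_{w_I})."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = {w: dot(v_out[w], v_in[w_I]) for w in v_out}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    Z = sum(exps.values())
    return exps[w_O] / Z

# Toy "input" and "output" vectors for a 3-word vocabulary (illustrative).
v_in = {"paris": [1.0, 0.0], "france": [0.0, 1.0], "car": [-1.0, 0.0]}
v_out = {"paris": [0.2, 0.9], "france": [0.9, 0.1], "car": [-0.8, 0.1]}

p = {w: softmax_prob(w, "paris", v_in, v_out) for w in v_out}
```

The denominator sums over the entire vocabulary <math>W</math>, which is exactly the cost that the hierarchical softmax and negative sampling are designed to avoid.<br />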
<br />
== Hierarchical Softmax ==<br />
<br />
Hierarchical Softmax is a computationally efficient approximation of the full softmax <ref name=MoF><br />
Morin, Frederic, ''et al'' [http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf "Hierarchical probabilistic neural network language model"] in Proceedings of the international workshop on artificial intelligence and statistics, (2005).<br />
</ref>. The hierarchical softmax evaluates only about <math>\log_2(W)</math> output nodes, instead of evaluating all <math>W</math> output nodes in the neural network, to obtain the probability distribution.<br />
<br />
The hierarchical softmax uses a binary tree representation of the output layer with the <math>W</math> words as its leaves and, for each node, explicitly represents the relative probabilities of its child nodes. These define a random walk that assigns probabilities to words.<br />
<br />
Let <math>n(w,j)</math> be the <math>j^{th}</math> node on the path from the root to <math>w</math>, and let <math>L(w)</math> be the length of this path, so <math>n(w,1) = root</math> and <math>n(w,L(w)) = w</math>. In addition, for any inner node <math>n</math>, let <math>ch(n)</math> be an arbitrary fixed child of <math>n</math> and let <math>[[x]]</math> be 1 if <math>x</math> is true and -1 otherwise. Then the hierarchical softmax defines <math>p(w_O|w_I )</math> as follows:<br />
<br />
<math><br />
p(w|w_I) = \prod_{j=1}^{L(w)-1} \sigma ([[n(w,j+1)=ch(n(w,j))]]{v'_{n(w,j)}}^T v_{W_I}) <br />
</math><br />
<br />
where<br />
<br />
<math><br />
\sigma (x)=\frac{1}{1+exp(-x)}<br />
</math><br />
<br />
In this paper, a binary Huffman tree is used as the structure for the hierarchical softmax because it assigns short codes to the frequent words which results in fast training. It has been observed before that grouping words together by their frequency works well as a very simple speedup technique for the neural network based language models <ref name=MiT></ref><ref name=MiT2><br />
Mikolov, Tomas, ''et al'' [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5947611 "Extensions of recurrent neural network language model."] in Acoustics, Speech and Signal Processing (ICASSP), (2011).<br />
</ref>.<br />
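A Huffman tree over word frequencies can be built with a heap. The simplified sketch below (with invented frequencies) returns each word's code length, showing that frequent words get short codes and therefore short root-to-leaf paths; this is a depth-only illustration, not word2vec's actual tree construction code.<br />

```python
import heapq
import itertools

def huffman_code_lengths(freqs):
    """Build a binary Huffman tree over the vocabulary and return the
    depth (code length) of each word. Frequent words end up near the root."""
    counter = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(f, next(counter), {w: 0}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every word in them one level deeper.
        merged = {w: d + 1 for w, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]

# Illustrative frequencies (made up).
lengths = huffman_code_lengths({"the": 500, "of": 300, "cat": 10, "zygote": 1})
```

With these counts "the" sits one sigmoid evaluation from the root while "zygote" needs three, which is why Huffman coding speeds up hierarchical-softmax training on skewed word distributions.<br />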
<br />
== Negative Sampling==<br />
<br />
Noise Contrastive Estimation (NCE) is an alternative to the hierarchical softmax. The premise of NCE is that a good model should be able to differentiate data from noise by means of logistic regression. While NCE can be shown to approximately maximize the log probability of the softmax, the Skip-gram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality. Negative sampling (NEG) is defined by the objective:<br />
<br />
<math><br />
log \sigma ({v'_{W_O}}^T v_{W_I})+\sum_{i=1}^{k} \mathbb{E}_{w_i\sim P_n(w)}[log \sigma ({-v'_{W_i}}^T v_{W_I})]<br />
</math><br />
<br />
The main difference between Negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples. And while NCE approximately maximizes the log probability of the softmax, this property is not important for our application.<br /><br />
Both NCE and NEG have the noise distribution <math>P_n(w)</math> as a free parameter. We investigated a number of choices for <math>P_n(w)</math> and found that the unigram distribution <math>U(w)</math> raised to the 3/4 power (i.e., <math>U(w)^{3/4}/Z</math>) significantly outperformed the unigram and the uniform distributions, for both NCE and NEG, on every task we tried, including language modeling.<br />
<br />
==Subsampling of Frequent Words==<br />
In very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g., “in”, “the”, and “a”). Such words usually provide less information about the surrounding words than rarer words do (e.g., "the" provides little information about the next word because it co-occurs with a huge number of words), and the representation of a frequent word is unlikely to change significantly after many iterations. <br />
<br />
To counter the imbalance between the rare and frequent words, a simple subsampling approach is used. Each word <math>w_i</math> in the training set is discarded with probability computed by the formula:<br />
<br />
<math><br />
P(w_i)=1-\sqrt{\frac{t}{f(w_i)}}<br />
</math><br />
<br />
where <math>f(w_i)</math> is the frequency of word <math>w_i</math> and <math>t</math> is a chosen threshold, typically around <math>10^{-5}</math>.<br />
<br />
= Empirical Results=<br />
<br />
The Hierarchical Softmax (HS), Noise Contrastive Estimation, Negative Sampling, and subsampling of the training words are evaluated with the help of the analogical reasoning task <ref name=MiT></ref>. The task consists of analogies such as “Germany” : “Berlin” :: “France” : ?, which are solved by finding a vector ''x'' such that vec(''x'') is closest to vec(“Berlin”) - vec(“Germany”) + vec(“France”) according to the cosine distance. This specific example is considered to have been answered correctly if ''x'' is “Paris”. The task has two broad categories: the syntactic analogies (such as “quick” : “quickly” :: “slow” : “slowly”) and the semantic analogies, such as the country to capital city relationship.<br />
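The analogy evaluation can be sketched with tiny hand-made vectors. Real Skip-gram vectors are learned from data, so the vectors below are purely illustrative, constructed so that the country-to-capital offset is shared.<br />

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def solve_analogy(a, b, c, vectors):
    """Find x maximizing cos(vec(x), vec(b) - vec(a) + vec(c)),
    excluding the three query words themselves."""
    target = [vb - va + vc for va, vb, vc in
              zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy vectors whose geometry mimics the capital-city offset (invented).
vectors = {
    "germany": [1.0, 0.0], "berlin": [1.0, 1.0],
    "france":  [0.0, 0.3], "paris":  [0.0, 1.3],
    "car":     [-1.0, -1.0],
}
answer = solve_analogy("germany", "berlin", "france", vectors)
```

Excluding the query words themselves is the usual convention for this benchmark, since vec(b) - vec(a) + vec(c) often lies closest to vec(c).<br />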
<br />
For training the Skip-gram models, a large dataset consisting of various news articles is used (an internal Google dataset with one billion words). All words that occurred less than 5 times in the training data were discarded, which resulted in a vocabulary of size 692K. The performance of various Skip-gram models on the word analogy test set is reported in Table 1. The table shows that Negative Sampling outperforms the Hierarchical Softmax on the analogical reasoning task, and has even slightly better performance than the Noise Contrastive Estimation. The subsampling of the frequent words improves the training speed several times and makes the word representations significantly more accurate.<br />
<br />
<center><br />
[[File:Tb_1.PNG | frame | center |Table 1. Accuracy of various Skip-gram 300-dimensional models on the analogical reasoning task as defined in <ref name=MiT></ref>. NEG-''k'' stands for Negative Sampling with ''k'' negative samples for each positive sample; NCE stands for Noise Contrastive Estimation and HS-Huffman stands for the Hierarchical Softmax with the frequency-based Huffman codes. ]]<br />
</center><br />
<br />
=Learning Phrases=<br />
<br />
Many phrases have a meaning that is not a simple composition of the meanings of their individual words. To learn vector representations for phrases, we first find words that appear frequently together, and infrequently in other contexts. For example, “''New York Times''” and “''Toronto Maple Leafs''” are replaced by unique tokens in the training data, while a bigram “''this is''” will remain unchanged. This way, we can form many reasonable phrases without greatly increasing the size of the vocabulary; in theory, we can train the Skip-gram model using all n-grams, but that would be too memory intensive. A simple data-driven approach, where phrases are formed based on the unigram and bigram counts, is applied to identify the phrases. In this approach, a ''score'' is calculated as:<br />
<br />
<math><br />
score(w_i,w_j)=\frac{count(w_iw_j)-\delta}{count(w_i)count(w_j)}<br />
</math><br />
<br />
Here, <math>\delta</math> is used as a discounting coefficient that prevents the formation of too many phrases consisting of very infrequent words. The bigrams with ''scores'' above the chosen threshold are then used as phrases. The quality of the phrase representations is evaluated using a new analogical reasoning task that involves phrases. Table 2 shows examples of the five categories of analogies used in this task.<br />
<br />
<center><br />
[[File:Tb_2.PNG | frame | center |Table 2. Examples of the analogical reasoning task for phrases (the full test set has 3218 examples). The goal is to compute the fourth phrase using the first three. Our best model achieved an accuracy of 72% on this dataset.]]<br />
</center><br />
<br />
==Phrase Skip-Gram Results==<br />
<br />
First, the phrase based training corpus is constructed and then Skip-gram models are trained using different hyperparameters. Table 3 shows the results using vector dimensionality 300 and context size 5. This setting already achieves good performance on the phrase dataset, and allowed us to quickly compare the Negative Sampling and the Hierarchical Softmax, both with and without subsampling of the frequent tokens. The results show that while Negative Sampling achieves a respectable accuracy even with ''k = 5'', using ''k = 15'' achieves considerably better performance. Also, the subsampling can result in faster training and can also improve accuracy, at least in some cases.<br />
<br />
<center><br />
[[File:Tb_3.PNG | frame | center |Table 3. Accuracies of the Skip-gram models on the phrase analogy dataset. The models were trained on approximately one billion words from the news dataset.]]<br />
</center><br />
<br />
<br />
The amount of the training data was increased to 33 billion words in order to maximize the accuracy on the phrase analogy task. Hierarchical softmax, dimensionality of 1000, and the entire sentence for the context were used. This resulted in a model that reached an accuracy of 72%. Reducing the size of the training dataset to 6 billion words lowered the accuracy to 66%, which suggests that a large amount of training data is crucial. To gain further insight into how different the representations learned by different models are, nearest neighbors of infrequent phrases were inspected manually using various models. Table 4 shows a sample of such a comparison. Consistent with the previous results, it seems that the best representations of phrases are learned by a model with the hierarchical softmax and subsampling.<br />
<br />
<center><br />
[[File:Tb_4.PNG | frame | center |Table 4. Examples of the closest entities to the given short phrases, using two different models.]]<br />
</center><br />
<br />
=Additive Compositionality=<br />
<br />
The word and phrase representations learned by the Skip-gram model exhibit a linear structure that makes it possible to perform precise analogical reasoning using simple vector arithmetics. Also, the Skip-gram representations exhibit another kind of linear structure that makes it possible to meaningfully combine words by an element-wise addition of their vector representations. This phenomenon is illustrated in Table 5. The additive property of the vectors can be explained by inspecting the training objective. The word vectors are in a linear relationship with the inputs to the softmax nonlinearity. As the word vectors are trained to predict the surrounding words in the sentence, the vectors can be seen as representing the distribution of the context in which a word appears. These values are related logarithmically to the probabilities computed by the output layer, so the sum of two word vectors is related to the product of the two context distributions. The product works here as the AND function: words that are assigned high probabilities by both word vectors will have high probability, and the other words will have low probability.<br />
<br />
<center><br />
[[File:Tb_5.PNG | frame | center |Table 5. Vector compositionality using element-wise addition. Four closest tokens to the sum of two vectors are shown, using the best Skip-gram model.]]<br />
</center><br />
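The element-wise addition described above can be sketched with toy vectors; the dimensions loosely stand for "Russia-related" and "water-related" context, and all numbers are hand-made for illustration (real vectors are learned, of course).<br />

```python
import math

def nearest(target, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity
    to the target vector, skipping the words in `exclude`."""
    def cos(a, b):
        d = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return d / (na * nb)
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cos(vectors[w], target))

# Toy vectors (invented): "volga" sits high on both context dimensions.
vectors = {
    "russia": [1.0, 0.1], "river": [0.1, 1.0],
    "volga":  [0.9, 0.9], "moscow": [1.0, 0.0], "ocean": [0.0, 1.0],
}
combined = [a + b for a, b in zip(vectors["russia"], vectors["river"])]
closest = nearest(combined, vectors, exclude=("russia", "river"))
```

The sum scores high on both context dimensions at once, so only a word whose contexts overlap with both operands, like "volga" here, comes out on top; this is the AND-like behavior described above.<br />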
<br />
=Comparison to Published Word Representations=<br />
<br />
Table 6 shows the empirical comparison between different neural network-based representations of words by showing the nearest neighbors of infrequent words. These examples show that the big Skip-gram model trained on a large corpus visibly outperforms all the other models in the quality of the learned representations. This can be attributed in part to the fact that this model has been trained on about 30 billion words, which is about two to three orders of magnitude more data than the typical size used in the prior work. Interestingly, although the training set is much larger, the training time of the Skip-gram model is just a fraction of the time required by the previous model architectures.<br />
<br />
<center><br />
[[File:Tb_6.PNG | frame | center |Table 6. Examples of the closest tokens given various well-known models and the Skip-gram model trained on phrases using over 30 billion training words. An empty cell means that the word was not in the vocabulary.]]<br />
</center><br />
<br />
=Conclusion=<br />
<br />
This work has the following key contributions:<br />
<br />
1. This work shows how to train distributed representations of words and phrases with the Skip-gram model and demonstrates that these representations exhibit linear structure that makes precise analogical reasoning possible.<br />
<br />
2. It presents a computationally efficient model architecture, which makes it possible to train models on several orders of magnitude more data than the previously published models.<br />
<br />
3. It introduces the Negative Sampling algorithm, an extremely simple training method that learns accurate representations, especially for frequent words.<br />
<br />
4. The choice of the training algorithm and the hyper-parameter selection is a task specific decision. It is shown that the most crucial decisions that affect the performance are the choice of the model architecture, the size of the vectors, the subsampling rate, and the size of the training window.<br />
<br />
5. The word vectors can be meaningfully combined using just simple vector addition. Another approach for learning representations of phrases presented in this paper is to simply represent the phrases with a single token. Combining these two approaches gives a powerful yet simple way to represent longer pieces of text, while having minimal computational complexity.<br />
<br />
= Recursive Autoencoder =<br />
<br />
This section is based on the paper 'Semi-supervised recursive autoencoders for predicting sentiment distributions'.<ref> Socher, et al. [http://www.socher.org/uploads/Main/SocherPenningtonHuangNgManning_EMNLP2011.pdf] </ref><br />
=== Other techniques for sentence representation ===<br />
<br />
The idea of the Recursive Autoencoder is summarized in the figure below. The example illustrates a recursive autoencoder applied to a binary tree.<br />
<center><br />
[[File:Recur-auto.png]]<br />
</center><br />
<br />
Assume we are given a list of word vectors <math> x = (x_1, ..., x_m)</math> together with a binary tree structure specified as a list of parent-children triplets: <math> (y_1 \rightarrow x_3x_4), (y_2 \rightarrow x_2y_1), (y_3 \rightarrow x_1y_2) </math>.<br />
<br />
The first parent <math> y_1 </math> is computed from the children <math> (c_1, c_2) = (x_3, x_4)</math> as <math> p=f(W^{(1)}[c_1; c_2] + b^{(1)})</math>, where <math>W^{(1)}</math> is a parameter matrix and <math>b^{(1)}</math> is a bias term. <br />
<br />
The autoencoder then reconstructs the children as <math> [c'_1; c'_2] = W^{(2)}p + b^{(2)}</math>. The objective of this method is to minimize the mean squared error between the original children <math>[c_1; c_2]</math> and the reconstruction <math>[c'_1; c'_2]</math>.<br />
<br />
=Resources=<br />
<br />
The code for training the word and phrase vectors based on this paper is available in the open source project [https://code.google.com/p/word2vec/ word2vec]. This project also contains a set of pre-trained 300-dimensional vectors for 3 million words and phrases.<br />
<br />
=References=<br />
<references /></div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=continuous_space_language_models&diff=27321continuous space language models2015-12-15T15:26:35Z<p>Mgohari2: </p>
<hr />
<div>= Introduction =<br />
This paper describes the use of a neural network language model for large vocabulary continuous speech recognition.<br />
The underlying idea of this approach is to attack the data sparseness problem by performing the language model probability<br />
estimation in a continuous space. Highly efficient learning algorithms are described that enable the use of training<br />
corpora of several hundred million words. It is also shown that this approach can be incorporated into a large vocabulary<br />
continuous speech recognizer using a lattice rescoring framework at very low additional processing cost.<br />
<br />
<br />
In certain fields of study such as speech recognition or machine translation, for some acoustic signal <math>\,x</math> or the source sentence to be translated <math>\,e</math>, it is common to model these problems as finding the sequence of words <math>\,w^*</math> that has the highest probability of occurring given <math>\,x</math> or <math>\,e</math>. This can be written as:<br />
<br />
<math>w^* = \arg\max_{w}\ P(w|x) = \arg\max_{w}\ P(x|w)P(w)</math><br />
<br />
An acoustic or translation model can then be used for <math>\,P(x|w)</math>, similar to the idea behind LDA and QDA, and it remains to create a language model <math>\,P(w)</math> to estimate the probability of any sequence of words <math>\,w</math>.<br />
<br />
This is commonly done through the back-off n-grams model and the purpose behind this research paper is to use a neural network to better estimate <math>\,P(w)</math>.<br />
<br />
= Back-off n-grams Model =<br />
<br />
A sequence of words will be defined as <math>\,w^i_1=(w_1,w_2,\dots,w_i)</math> and the formula for the probability <math>\,P(w)</math> can be rewritten as:<br />
<br />
<math>P(w^n_1)=P(w_1,w_2,\dots,w_n)=P(w_1)\prod_{i=2}^n P(w_i|w^{i-1}_1)</math><br />
<br />
It is common to estimate <math>\,P(w_i|w^{i-1}_1)</math> through:<br />
<br />
<math>\,P(w_i|w^{i-1}_1)\approx\frac{\mbox{number of occurrences of the sequence}\ (w_1,\dots,w_i)}{\mbox{number of occurrences of the sequence}\ (w_1,\dots,w_{i-1})}</math><br />
<br />
However, it is practically impossible to have a training set large enough to contain every possible sequence of words if the sequence is long enough and some sequences would have an incorrect probability of 0 simply because it is not in the training set. This is known as the data sparseness problem. This problem is commonly resolved by considering only the last n-1 words instead of the whole context. However, even for small n, certain sequences could still be missing.<br />
<br />
To solve this issue, a technique called back-off n-grams is used and the general formula goes as follows:<br />
<br />
<math>\,P(w_i|w^{i-1}_1) = \begin{cases} <br />
\frac{\mbox{number of occurrences of the sequence}\ (w_1,\dots,w_i)}{\mbox{number of occurrences of the sequence}\ (w_1,\dots,w_{i-1})}, & \mbox{if the number of occurrences of}\ (w_1,\dots,w_i)\ \mbox{is greater than some constant}\ K \\<br />
\alpha P(w_i|w^{i-1}_2), & \mbox{otherwise} <br />
\end{cases}</math><br />
<br />
<math>\,\alpha</math> is typically a discounting factor that is less than 1 to account for the lack of direct data. It usually depends on the word sequence.<br />
<br />
The general algorithm is: if the data set contains the sequence, calculate the probability directly; otherwise, apply a discounting factor and compute the conditional probability with the first word of the context removed. For example, if the word sequence is "The dog barked" and it does not exist in the training set, then the formula would be written as:<br />
<br />
<math>\,P(\mbox{barked}|\mbox{the,dog}) \approx \alpha P(\mbox{barked}|\mbox{dog})</math><br />
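The back-off recursion above can be sketched as follows. This is a simplified illustration only: the toy corpus, the fixed discounting factor alpha = 0.4, and K = 0 are made-up choices, whereas real implementations use context-dependent discounting and smoothing:<br />

```python
from collections import Counter

def backoff_prob(seq, counts, alpha=0.4, K=0):
    # P(w_i | w_1..w_{i-1}) with back-off: use the relative frequency when the
    # full sequence was seen more than K times; otherwise discount by alpha
    # and drop the first word of the context.
    context = seq[:-1]
    if counts[seq] > K and counts[context] > 0:
        return counts[seq] / counts[context]
    if len(seq) == 1:   # nothing left to back off to: crude unigram estimate
        total = sum(c for s, c in counts.items() if len(s) == 1)
        return counts[seq] / total if total else 0.0
    return alpha * backoff_prob(seq[1:], counts, alpha, K)

# count all 1-, 2- and 3-grams of a toy corpus
corpus = "the dog ran the dog sat the cat ran".split()
counts = Counter()
for n in (1, 2, 3):
    for i in range(len(corpus) - n + 1):
        counts[tuple(corpus[i:i + n])] += 1

# "cat dog sat" never occurs as a trigram, so the model backs off:
# alpha * P(sat | dog) = 0.4 * 0.5
p = backoff_prob(("cat", "dog", "sat"), counts)
```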
<br />
= Model =<br />
The neural network language model has to perform two tasks: first, project all words of the context<br />
<math>\,h_j</math> = <math>\,w_{j-n+1}^{j-1}</math> onto a continuous space, and second, calculate the language model probability <math>P(w_{j}=i|h_{j})</math>. <br />
The researchers for this paper sought to find a better model for this probability than the back-off n-grams model. Their approach is to map the (n-1)-word context onto a multi-dimensional continuous space using one neural network layer, followed by another layer that estimates the probabilities of all possible next words. The formulas and model go as follows:<br />
<br />
For some sequence of n-1 words, encode each word using a 1-of-K encoding, i.e. 1 at the word's index and zero everywhere else. Label these 1-of-K encodings <math>(w_{j-n+1},\dots,w_{j-1})</math> for the (n-1)-word context preceding the j'th word in some larger context.<br />
<br />
Let P be a projection matrix common to all n-1 words and let<br />
<br />
<math>\,a_i=Pw_{j-n+i},i=1,\dots,n-1</math><br />
<br />
Let H be the weight matrix from the projection layer to the hidden layer; the state of the hidden layer is then:<br />
<math>\,h=tanh(Ha + b)</math> where <math>\,a</math> is the concatenation of all <math>\,a_i</math> and <math>\,b</math> is a bias vector.<br />
<br />
Finally, the output vector would be:<br />
<br />
<math>\,o=Vh+k</math> where V is the weight matrix from hidden to output and k is another bias vector. <math>\,o</math> would be a vector with the same dimension as the total vocabulary size, and the probabilities can be calculated from <math>\,o</math> by applying the softmax function.<br />
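A minimal sketch of this forward pass (shared projection, tanh hidden layer, softmax output) is shown below. The dimensions and random weights are illustrative only, not those used in the paper, and the biases are set to zero for brevity:<br />

```python
import math, random

random.seed(0)
V, D, Hdim, n = 6, 3, 4, 3   # vocab size, projection dim, hidden size, n-gram order

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

P = rand_matrix(D, V)              # shared projection matrix (one column per word)
H = rand_matrix(Hdim, D * (n - 1)) # projection -> hidden weights
Vout = rand_matrix(V, Hdim)        # hidden -> output weights (the paper's V)

def forward(context_ids):
    # 1-of-K input: multiplying by P just selects a column per context word.
    a = []
    for w in context_ids:
        a += [P[d][w] for d in range(D)]
    h = [math.tanh(z) for z in matvec(H, a)]   # biases omitted (zero) here
    o = matvec(Vout, h)
    m = max(o)                                 # softmax with max-shift for stability
    e = [math.exp(z - m) for z in o]
    s = sum(e)
    return [x / s for x in e]

# probabilities of every word in the vocabulary following context words (1, 4)
probs = forward([1, 4])
```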
<br />
The following figure shows the architecture of the neural network language model. <math>\,h_j</math> denotes the context <math>\,w_{j-n+1}^{j-1}</math>. P is the size of one projection, and H and N are the<br />
sizes of the hidden and output layers, respectively. When short-lists are used the size of the output layer is much smaller than the size<br />
of the vocabulary.<br />
<br />
[[File:Q3.png]]<br />
<br />
In contrast to standard language modeling where we want to know the probability of a word i given its<br />
context, <math>P(w_{j} = i|h_{j}) </math>, the neural network simultaneously predicts the language model probability of all words<br />
in the word list:<br />
<br />
[[File:Q4.png]]<br />
<br />
= Optimization and Training =<br />
The training was done with standard back-propagation on minimizing the error function:<br />
<br />
<math>\,E=-\sum_{i=1}^N t_i\ \log p_i + \epsilon(\sum_{i,j}h^2_{ij}+\sum_{i,j}v^2_{ij})</math><br />
<br />
<math>\,t_i</math> is the desired output vector and the summations inside the epsilon bracket are regularization terms to prevent overfitting of <math>\,H</math> and <math>\,V</math>.<br />
<br />
The researchers used stochastic gradient descent to prevent having to sum over millions of examples worth of error and this sped up training time.<br />
<br />
An issue the researchers ran into with this model was that calculating language model probabilities took much longer than with the traditional back-off n-grams model, which reduced its suitability for real-time prediction. To solve this issue, several optimization techniques were used.<br />
<br />
===Lattice rescoring===<br />
<br />
It is common to keep track of additional possible solutions instead of just the most obviously likely solution in a lattice structure, i.e. a tree like structure where branches can merge and each branch represents a possible solution. For example from the paper using a tri-gram model, i.e. predict third word from first two words, the following lattice structure was formed:<br />
<br />
[[File:Lattice.PNG]]<br />
<br />
Any two nodes whose branches carry the same words can be merged. For example, "a,problem" was merged in the middle of the lattice because the tri-gram model would estimate the same probability at that point for both branches. Similarly, "that_is,not" and "there_is,not" cannot be merged because the preceding two words used for prediction are different.<br />
<br />
After this structure is created with a traditional back-off n-grams model, the neural network is then used to re-score the lattice and the re-scored lattice is used to make predictions.<br />
<br />
===Short List===<br />
<br />
In any language, there is usually a small set of commonly used words that forms almost all written or spoken thought. The short-list idea is that rather than calculating a probability for every word, including the rarest ones, the neural network only calculates probabilities for a small subset of the most common words. This way, the output vector can be significantly shrunk from <math>\,\mbox{N}</math> to some much smaller number <math>\,\mbox{S}</math>.<br />
<br />
If any rare words do occur, their probabilities are calculated using the traditional back-off n-grams model. The formula then goes as follows from the paper:<br />
<br />
[[File:shortlist.PNG]]<br />
<br />
where L denotes the event that <math>\,w_t</math> is in the short-list.<br />
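The short-list combination can be sketched as follows, assuming the usual formulation in which the neural network's probability (normalized over the short-list) is rescaled by the back-off mass carried by the short-list in the current context; all numbers and word choices here are made up:<br />

```python
def shortlist_prob(w, h, nn_prob, backoff_prob, shortlist):
    # If w is in the short-list, rescale the NN probability by the back-off
    # mass P_L(h) that the short-list words carry in this context; otherwise
    # fall back to the back-off n-gram probability directly.
    if w in shortlist:
        mass = sum(backoff_prob(s, h) for s in shortlist)
        return nn_prob(w, h) * mass
    return backoff_prob(w, h)

# Made-up toy distributions for a single fixed context h = None.
shortlist = {"the", "a", "is"}
bo = {"the": 0.3, "a": 0.2, "is": 0.1, "zyzzyva": 0.01}   # back-off model
nn = {"the": 0.5, "a": 0.3, "is": 0.2}                    # NN, normalized over short-list

p_the = shortlist_prob("the", None, lambda w, h: nn[w], lambda w, h: bo[w], shortlist)
p_rare = shortlist_prob("zyzzyva", None, lambda w, h: nn.get(w, 0.0), lambda w, h: bo[w], shortlist)
```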
<br />
===Sorting and Bunch===<br />
<br />
The neural network predicts all the probabilities based on some sequence of words. Suppose the probabilities of two different sequences are required, where sequence 1 is <math>\,w=(w_1,\dots,w_{i-1},w_i)</math> and sequence 2 is <math>\,w'=(w_1,\dots,w_{i-1},w'_i)</math>, so they differ only in the last word. Then only a single feed through the neural network is required, because the output vector computed from the context <math>\,(w_1,\dots,w_{i-1})</math> contains the probabilities of both <math>\,w_i</math> and <math>\,w'_i</math> being next. Therefore it is efficient to merge sequences that share the same context.<br />
<br />
Modern computers are also highly optimized for linear algebra, and it is more efficient to run multiple examples through the matrix equations at the same time. The researchers called this bunching, and simple testing showed that processing 128 examples at once decreased processing time by a factor of 10 compared to one at a time.<br />
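Bunching can be illustrated with a single matrix-matrix product: stacking B projected contexts as columns lets one product replace B separate matrix-vector products. The sizes below are illustrative, and the use of NumPy here is an assumption of this sketch, not part of the paper:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, Hdim = 128, 50, 200        # bunch size, projected-input dim, hidden size
H = rng.standard_normal((Hdim, D))
A = rng.standard_normal((D, B))  # B projected contexts stacked as columns

# One matrix-matrix product replaces B separate matrix-vector products,
# which is much friendlier to optimized linear-algebra libraries.
h_bunch = np.tanh(H @ A)         # hidden states for all B examples at once

# Equivalent (but slower, if repeated B times) one-example computation:
h_single = np.tanh(H @ A[:, 0])
```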
<br />
= Training and Usage =<br />
<br />
The researchers used numerous optimization techniques during training and their results were summarized in the paper as follows:<br />
<br />
[[File:fast_training.PNG]]<br />
<br />
Since the model only predicts based on the last n-1 words, at certain points there will be fewer than n-1 preceding words and adjustments must be made. The researchers considered two possibilities: using the traditional model for these shorter n-grams, or padding the missing words with a filler word up to a length of n-1. After some testing, they found that requests for short n-gram probabilities were relatively infrequent, and they decided to use the traditional back-off n-gram model for these cases.<br />
<br />
= Results =<br />
<br />
In general the results were quite good. When this neural network + back-off n-grams hybrid was used in combination with a number of acoustic speech recognition models, perplexity (lower is better) decreased by about 10% in a number of cases compared with the traditional back-off-only model. Some of their results are summarized as follows:<br />
<br />
[[File:results1.PNG]]<br />
<br />
[[File:results2.PNG]]<br />
<br />
The following figure shows the word error rates on the 2003 evaluation test set for the back-off LM and the hybrid LM, trained only on CTS data (left bars for<br />
each system) and interpolated with the broadcast news LM (right bars for each system).<br />
<br />
[[File:Q6.png]]<br />
<br />
A perplexity reduction of about 9% relative is obtained independently of the size of the language model<br />
training data. This gain decreases to approximately 6% after interpolation with the back-off language model<br />
trained on the additional BN corpus of out-of domain data. It can be seen that the perplexity of the hybrid<br />
language model trained only on the CTS data is better than that of the back-off reference language model<br />
trained on all of the data (45.5 with respect to 47.5). Despite these rather small gains in perplexity, consistent<br />
word error reductions were observed.<br />
<br />
= Conclusion =<br />
<br />
This paper described the theory and an experimental evaluation of a new approach to language modeling for large vocabulary continuous speech recognition, based on the idea of projecting the words onto a continuous space and performing the probability estimation in this space. The method is fast enough that the neural network language model can be used in a real-time speech recognizer. The necessary capacity of the neural network is an important issue. Three possibilities were explored: increasing the size of the hidden layer, training several networks and interpolating them together, and using large projection layers. Increasing the size of the hidden layer gave only modest improvements in word error,<br />
at the price of very long training times. In this respect, the second solution is more interesting as the networks<br />
can be trained in parallel. Large projection layers appear to be the best choice as this has little impact on the<br />
complexity during training or recognition. The neural network language model is able to cover different speaking styles, ranging from rather well formed speech with few errors (broadcast news) to very relaxed speaking with many errors in syntax and semantics (meetings and conversations). It is claimed that the combination of the developed neural network and a back-off language model can be considered a serious alternative to the commonly used back-off language models alone.<br />
<br />
This paper also proposes to investigate new training criteria for the neural network language model. Language<br />
models are almost exclusively trained independently from the acoustic model by minimizing the perplexity<br />
on some development data, and it is well known that improvements in perplexity do not necessarily<br />
lead to reductions in the word error rate.<br />
<br />
The continuous representation of the words in the neural network language model offers new ways to perform<br />
constrained language model adaptation. For example, the continuous representation of the words can be<br />
changed so that the language model predictions are improved on some adaptation data, e.g., by moving some<br />
words closer together which appear often in similar contexts. The idea is to apply a transformation on the<br />
continuous representation of the words by adding an adaptation layer between the projection layer and the<br />
hidden layer. This layer is initialized with the identity transformation and then learned by training the neural<br />
network on the adaptation data. Several variants of this basic idea are possible, for example using shared<br />
block-wise transformations in order to reduce the number of free parameters.<br />
In comparison with back-off language models, whose complexity increases exponentially with the length of the context, the complexity of neural network language models increases<br />
linearly with the order of the n-gram and with the size of the vocabulary. This linear increase in parameters is an important practical advantage that<br />
enables us to consider longer span language models with a negligible increase of the memory and time complexity. <br />
<br />
<br />
The underlying idea of the continuous space language model described here is to perform the probability<br />
estimation in a continuous space. Although only neural networks were investigated in this work, the approach<br />
is not inherently limited to this type of probability estimator. Other promising candidates include Gaussian<br />
mixture models and radial basis function networks. These models are interesting since they can be more easily<br />
trained on large amounts of data than neural networks, and the limitation of a short-list at the output may not<br />
be necessary. The use of Gaussians makes it also possible to structure the model by sharing some Gaussians<br />
using statistical criteria or high-level knowledge. On the other hand, Gaussian mixture models are a non-discriminative<br />
approach. Comparing them with neural networks could provide additional insight into the success<br />
of the neural network language model.<br />
<br />
= Source =<br />
Schwenk, H. Continuous space language models. Computer Speech<br />
Lang. 21, 492–518 (2007). ISIArticle</div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=continuous_space_language_models&diff=27320continuous space language models2015-12-15T15:24:10Z<p>Mgohari2: </p>
<hr />
<div>= Introduction =<br />
This paper describes the use of a neural network language model for large vocabulary continuous speech recognition.<br />
The underlying idea of this approach is to attack the data sparseness problem by performing the language model probability<br />
estimation in a continuous space. Highly efficient learning algorithms are described that enable the use of training<br />
corpora of several hundred million words. It is also shown that this approach can be incorporated into a large vocabulary<br />
continuous speech recognizer using a lattice re scoring framework at a very low additional processing time<br />
<br />
<br />
In certain fields of study such as speech recognition or machine translation, for some acoustic signal <math>\,x</math> or the source sentence to be translated <math>\,e</math>, it is common to model these problems as finding the sequence of words <math>\,w^*</math> that has the highest probability of occurring given <math>\,x</math> or <math>\,e</math>. This can be written as:<br />
<br />
<math>w^* = arg\ \underset {w}{max} P(w|x) = arg\ \underset{w}{max} P(x|w)P(w)</math><br />
<br />
An acoustic or translation model can then be used for <math>\,P(x|w)</math>, similar to the idea behind LDA and QDA, and it remains to create a language model <math>\,P(w)</math> to estimate the probability of any sequence of words <math>\,w</math>.<br />
<br />
This is commonly done through the back-off n-grams model and the purpose behind this research paper is to use a neural network to better estimate <math>\,P(w)</math>.<br />
<br />
= Back-off n-grams Model =<br />
<br />
A sequence of words will be defined as <math>\,w^i_1=(w_1,w_2,\dots,w_i)</math> and the formula for the probability <math>\,P(w)</math> can be rewritten as:<br />
<br />
<math>P(w^n_1)=P(w_1,w_2,\dots,w_n)=P(w_1)\prod_{i=2}^n P(w_i|w^{i-1}_1)</math><br />
<br />
It is common to estimate <math>\,P(w_i|w^{i-1}_1)</math> through:<br />
<br />
<math>\,P(w_i|w^{i-1}_1)\approx\frac{\mbox{number of occurrence of the sequence} (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence} (w_1,\dots,w_{i-1})}</math><br />
<br />
However, it is practically impossible to have a training set large enough to contain every possible sequence of words if the sequence is long enough and some sequences would have an incorrect probability of 0 simply because it is not in the training set. This is known as the data sparseness problem. This problem is commonly resolved by considering only the last n-1 words instead of the whole context. However, even for small n, certain sequences could still be missing.<br />
<br />
To solve this issue, a technique called back-off n-grams is used and the general formula goes as follows:<br />
<br />
<math>\,P(w_i|w^{i-1}_1) = \begin{cases} <br />
\frac{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_{i-1})}, & \mbox{if number of occurrence of}\ (w_1,\dots,w_i)\ \mbox{is greater than some constant K} \\<br />
\alpha P(w_i|w^{i-1}_2), & \mbox{otherwise} <br />
\end{cases}</math><br />
<br />
<math>\,\alpha</math> is typically a discounting factor that is less than 1 to account for the lack of direct data. It usually depends on the word sequence.<br />
<br />
The general algorithm is then, if the data set does contain the sequence then calculate probability directly. Otherwise, apply a discounting factor and calculate the conditional probability with the first word in the sequence removed. For example, if the word sequence was "The dog barked" and it did not exist in the training set then the formula would be written as:<br />
<br />
<math>\,P(\mbox{barked}|\mbox{the,dog}) \approx \alpha P(\mbox{barked}|\mbox{dog})</math><br />
<br />
= Model =<br />
The neural network language model has to perform two tasks: first, project all words of the context<br />
<math>\,h_j</math> = <math>\,w_{j-n+1}^{j-1}</math> onto a continuous space, and second, calculate the language model probability <math>P(w_{j}=i|h_{j})</math>. <br />
The researchers for this paper sought to find a better model for this probability than the back-off n-grams model. Their approach was to map the n-1 words sequence onto a multi-dimension continuous space using a layer of neural network followed by another layer to estimate the probabilities of all possible next words. The formulas and model goes as follows:<br />
<br />
For some sequence of n-1 words, encode each word using 1 of K encoding, i.e. 1 where the word is indexed and zero everywhere else. Label each 1 of K encoding by <math>(w_{j-n+1},\dots,w_j)</math> for some n-1 word sequence at the j'th word in some larger context.<br />
<br />
Let P be a projection matrix common to all n-1 words and let<br />
<br />
<math>\,a_i=Pw_{j-n+i},i=1,\dots,n-1</math><br />
<br />
Let H be the weight matrix from the projection layer to the hidden layer and the state of H would be:<br />
<math>\,h=tanh(Ha + b)</math> where A is the concatenation of all <math>\,a_i</math> and <math>\,b</math> is some bias vector<br />
<br />
Finally, the output vector would be:<br />
<br />
<math>\,o=Vh+k</math> where V is the weight matrix from hidden to output and k is another bias vector. <math>\,o</math> would be a vector with same dimensions as the total vocabulary size and the probabilities can be calculated from <math>\,o</math> by applying the softmax function.<br />
<br />
The following figure shows the Architecture of the neural network language model. <math>\,h_j</math> denotes the context <math>\,w_{j-n+1}^{j-1}</math>. P is the size of one projection and H and N is the<br />
size of the second hidden and output layer, respectively. When short-lists are used the size of the output layer is much smaller than the size<br />
of the vocabulary.<br />
<br />
[[File:Q3.png]]<br />
<br />
In contrast to standard langua[[File:Qq.png]]ge modeling where we want to know the probability of a word i given its<br />
context, <math>P(w_{j} = i|h_{j}) </math>, the neural network simultaneously predicts the language model probability of all words<br />
in the word list:<br />
<br />
[[File:Q4.png]]<br />
<br />
= Optimization and Training =<br />
The training was done with standard back-propagation on minimizing the error function:<br />
<br />
<math>\,E=\sum_{i=1}^N t_i\ log p_i + \epsilon(\sum_{i,j}h^2_{ij}+\sum_{i,j}v^2_{ij})</math><br />
<br />
<math>\,t_i</math> is the desired output vector and the summations inside the epsilon bracket are regularization terms to prevent overfitting of <math>\,H</math> and <math>\,V</math>.<br />
<br />
The researchers used stochastic gradient descent to prevent having to sum over millions of examples worth of error and this sped up training time.<br />
<br />
An issue the researchers ran into using this model was that it took a long time to calculate language model probabilities compared to traditional back-off n-grams model and reduced its suitability for real time predictions. To solve this issue, several optimization techniques were used.<br />
<br />
===Lattice rescoring===<br />
<br />
It is common to keep track of additional possible solutions instead of just the most obviously likely solution in a lattice structure, i.e. a tree like structure where branches can merge and each branch represents a possible solution. For example from the paper using a tri-gram model, i.e. predict third word from first two words, the following lattice structure was formed:<br />
<br />
[[File:Lattice.PNG]]<br />
<br />
Any particular branch where two nodes have the same words can be merged. For example, "a,problem" was merged in the middle of the lattice because the tri-gram model would estimate the same probability at the point for both branch. Similary, "that_is,not" and "there_is,not" cannot be merged before the preceding two words to predict with are different.<br />
<br />
After this structure is created with a traditional back-off n-grams model, the neural network is then used to re-score the lattice and the re-scored lattice is used to make predictions.<br />
<br />
===Short List===<br />
<br />
In any language, there is usually a small set of commonly used words that form almost all of written or spoken thought. The short-list idea is that rather than calculating every single probability for even the rarest words, the neural network only calculates a small subset of the most common words. This way, the output vector can be significantly shrunk from <math>\,\mbox{N}</math> to some much smaller number <math>\,\mbox{S}</math>.<br />
<br />
If any rare words do occur, their probabilities are calculated using the traditional back-off n-grams model. The formula then goes as follows from the paper:<br />
<br />
[[File:shortlist.PNG]]<br />
<br />
Where L is the event that <math>\,w_t</math> is in the short-list.<br />
<br />
===Sorting and Bunch===<br />
<br />
The neural network predicts all the probabilities based on some sequence of words. If the probability of two different sequences of words are required but their relationship is such that for sequence 1, <math>\,w=(w_1,\dots,w_{i-1},w_i)</math> and sequence 2, <math>\,w^'=(w_1,\dots,w_{i-1},w^'_i)</math>, they differ only in the last word. Then only a single feed through the neural network is required. This is because the output vector using the context <math>\,(w_1,\dots,w_{i-1})</math> would predict the probabilities for both <math>\,w_i</math> and <math>\,w^'_i</math> being next. Therefore it is efficient to merge any sequence who have the same context.<br />
<br />
Modern day computers are also very optimized for linear algebra and it is more efficient to run multiple examples at the same time through the matrix equations. The researchers called this bunching and simple testing showed that this decreased processing time by a factor of 10 when using 128 examples at once compared to 1.<br />
<br />
= Training and Usage =<br />
<br />
The researchers used numerous optimization techniques during training and their results were summarized in the paper as follows:<br />
<br />
[[File:fast_training.PNG]]<br />
<br />
Since the model only trains to predict based on the last n-1 words, at certain points there will be less than n-1 words and adjustments must be made. The researchers considered two possibilities, using traditional models for these n-grams or filling up the n-k words with some filler word up to n-1. After some testing, they found that requests for small n-gram probabilities were pretty low and they decided to use traditional back-off n-gram model for these cases.<br />
<br />
= Results =<br />
<br />
In general the results were quite good. When this neural network + back-off n-grams hybrid was used in combination with a number of acoustic speech recognition models, they found that perplexity, lower the better, decreased by about 10% in a number of cases compared with traditional back-off n-grams only model. Some of their results are summarized as follows:<br />
<br />
[[File:results1.PNG]]<br />
<br />
[[File:results2.PNG]]<br />
<br />
The following figure shows the word error rates on the 2003 evaluation test set for the back-off LM and the hybrid LM, trained only on CTS data (left bars for<br />
each system) and interpolated with the broadcast news LM (right bars for each system).<br />
<br />
[[File:Q6.png]]<br />
<br />
A perplexity reduction of about 9% relative is obtained independently of the size of the language model<br />
training data. This gain decreases to approximately 6% after interpolation with the back-off language model<br />
trained on the additional BN corpus of out-of domain data. It can be seen that the perplexity of the hybrid<br />
language model trained only on the CTS data is better than that of the back-off reference language model<br />
trained on all of the data (45.5 with respect to 47.5). Despite these rather small gains in perplexity, consistent<br />
word error reductions were observed.<br />
<br />
= Conclusion =<br />
<br />
This paper described the theory and an experimental evaluation of a new approach to language modeling for large vocabulary continuous speech recognition based on the idea to project the words onto a continuous space and to perform the probability estimation in this space. This method is fast to the level that the neural network language model can be used in a real-time speech recognizer. The necessary capacity of the neural network is an important issue. Three possibilities were explored: increasing the size of the hidden layer, training several networks and interpolating them together, and using large projection layers. Increasing the size of the hidden layer gave only modest improvements in word error,<br />
at the price of very long training times. In this respect, the second solution is more interesting as the networks<br />
can be trained in parallel. Large projection layers appear to be the best choice as this has little impact on the<br />
complexity during training or recognition.The neural network language model is able to cover different speaking styles, ranging from rather well formed speech with few errors (broadcast news) to very relaxed speaking with many errors in syntax and semantics (meetings and conversations). It is claimed that the combination of the developed neural network and a back-off language model can be considered as a serious alternative to the commonly used back-off language models alone.<br />
<br />
This paper also proposes to investigate new training criteria for the neural network language model. Language<br />
models are almost exclusively trained independently from the acoustic model by minimizing the perplexity<br />
on some development data, and it is well known that improvements in perplexity do not necessarily<br />
lead to reductions in the word error rate.<br />
<br />
The continuous representation of the words in the neural network language model offers new ways to perform<br />
constrained language model adaptation. For example, the continuous representation of the words can be<br />
changed so that the language model predictions are improved on some adaptation data, e.g., by moving some<br />
words closer together which appear often in similar contexts. The idea is to apply a transformation on the<br />
continuous representation of the words by adding an adaptation layer between the projection layer and the<br />
hidden layer. This layer is initialized with the identity transformation and then learned by training the neural<br />
network on the adaptation data. Several variants of this basic idea are possible, for example using shared<br />
block-wise transformations in order to reduce the number of free parameters. In comparison with back-off language models whose complexity increases exponentially with the length of context, complexity of neural network language model increases<br />
linearly with the order of the n-gram and with the size of the vocabulary. This linearly increase in parameters is an important practical advantage that<br />
enables us to consider longer span language models with a negligible increase of the memory and time complexity. <br />
<br />
<br />
The underlying idea of the continuous space language model described here is to perform the probability estimation in a continuous space. Although only neural networks were investigated in this work, the approach is not inherently limited to this type of probability estimator. Other promising candidates include Gaussian mixture models and radial basis function networks. These models are interesting since they can be trained on large amounts of data more easily than neural networks, and the limitation of a short-list at the output may not be necessary. The use of Gaussians also makes it possible to structure the model by sharing some Gaussians using statistical criteria or high-level knowledge. On the other hand, Gaussian mixture models are a non-discriminative approach; comparing them with neural networks could provide additional insight into the success of the neural network language model.<br />
<br />
= Source =<br />
Schwenk, H. Continuous space language models. Computer Speech and Language, 21(3), 492–518 (2007).</div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=generating_text_with_recurrent_neural_networks&diff=27297generating text with recurrent neural networks2015-12-14T19:04:58Z<p>Mgohari2: </p>
<hr />
<div>= Introduction =<br />
<br />
The goal of this paper is to introduce a new type of recurrent neural network for character-level language modelling that allows the input character at a given timestep to multiplicatively gate the connections that make up the hidden-to-hidden layer weight matrix. The paper also introduces a solution to the problem of vanishing and exploding gradients by applying a technique called Hessian-Free optimization to effectively train a recurrent network that, when unrolled in time, has approximately 500 layers. At the date of publication, this network was arguably the deepest neural network ever trained successfully. <br />
<br />
Strictly speaking, a language model is a probability distribution over sequences of words or characters, and such models are typically used to predict the next character or word in a sequence given some number of preceding characters or words. Recurrent neural networks are naturally applicable to this task, since they make predictions based on a current input and a hidden state whose value is determined by some number of previous inputs. Alternative methods that the authors compare their results to include a hierarchical Bayesian model called a 'sequence memoizer' <ref> Wood, F., C. Archambeau, J. Gasthaus, L. James, and Y.W. Teh. [http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/airg/readings/2012_02_28_a_stochastic_memoizer.pdf "A Stochastic Memoizer for Sequence Data"] ICML, (2009) </ref> and a mixture of context models referred to as PAQ <ref> Mahoney, M. [https://repository.lib.fit.edu/bitstream/handle/11141/154/cs-2005-16.pdf?sequence=1&isAllowed=y "Adaptive Weighing of Context Models for Lossless Data Compression"], Florida Institute of Technology Technical Report, (2005) </ref>, which actually includes word-level information (rather than strictly character-level information). The multiplicative RNN introduced in this paper improves on the state-of-the-art for solely character-level language modelling, but is somewhat worse than the state-of-the-art for text compression. <br />
<br />
To give a brief review, an ordinary recurrent neural network is parameterized by three weight matrices, <math>\ W_{hi} </math>, <math>\ W_{hh} </math>, and <math>\ W_{oh} </math>, and functions to map a sequence of <math> N </math> input states <math>\ [i_1, ... , i_N] </math> to a sequence of hidden states <math>\ [h_1, ... , h_N] </math> and a sequence of output states <math>\ [o_1, ... , o_N] </math>. The matrix <math>\ W_{hi} </math> parameterizes the mapping from the current input state to the current hidden state, while the matrix <math>\ W_{hh} </math> parameterizes the mapping from the previous hidden state to current hidden state, such that the current hidden state is function of the previous hidden state and the current input state. Finally, the matrix <math>\ W_{oh} </math> parameterizes the mapping from the current hidden state to the current output state. So, at a given timestep <math>\ t </math>, the values of the hidden state and output state are as follows:<br />
<br />
<br />
:<math>\ h_t = \tanh(W_{hi}i_t + W_{hh}h_{t-1} + b_h) </math><br />
<br />
<br />
:<math>\ o_t = W_{oh}h_t + b_o </math> <br />
<br />
<br />
where <math>\ b_o</math> and <math>\ b_h</math> are bias vectors. Typically, the output state is converted into a probability distribution over characters or words using the softmax function. The network can then be treated as a generative model of text by sampling from this distribution and providing the sampled output as the input to the network at the next timestep.<br />
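The recurrent update and the generative sampling loop described above can be sketched as follows. Randomly initialized weights stand in for trained parameters; the hidden size is illustrative, while the 86-character vocabulary matches the experiments reported later in the summary:

```python
import numpy as np

rng = np.random.default_rng(1)

V, H = 86, 32   # vocabulary size (86 characters, as in the experiments) and hidden size (illustrative)

# Randomly initialized parameters stand in for trained weights.
W_hi = rng.standard_normal((H, V)) * 0.01
W_hh = rng.standard_normal((H, H)) * 0.01
W_oh = rng.standard_normal((V, H)) * 0.01
b_h, b_o = np.zeros(H), np.zeros(V)

def step(i_t, h_prev):
    """One timestep: new hidden state and a softmax distribution over characters."""
    h_t = np.tanh(W_hi @ i_t + W_hh @ h_prev + b_h)
    o_t = W_oh @ h_t + b_o
    e = np.exp(o_t - o_t.max())          # numerically stable softmax
    return h_t, e / e.sum()

# Generative mode: sample a character and feed it back as the next input.
h = np.zeros(H)
i = np.eye(V)[0]                          # one-hot encoding of an arbitrary start character
for _ in range(5):
    h, p = step(i, h)
    c = rng.choice(V, p=p)                # sample the next character index
    i = np.eye(V)[c]
```

With untrained weights the samples are of course meaningless; the point is only the mechanics of the hidden-state update and the sample-and-feed-back loop.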
<br />
Recurrent networks are known to be very difficult to train due to the existence of a highly unstable relationship between a network's parameters and the gradient of its cost function. Intuitively, the surface of the cost function is intermittently punctuated by abrupt changes (giving rise to exploding gradients) and nearly flat plateaus (giving rise to vanishing gradients) that can effectively become poor local minima when a network is trained through gradient descent. Techniques for improving training include the use of Long Short-Term Memory networks <ref> Hochreiter, Sepp, and Jürgen Schmidhuber. [http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf "Long short-term memory."] Neural computation 9.8 (1997): 1735-1780. </ref>, in which memory units are used to selectively preserve information from previous states, and the use of Echo State networks, <ref> Jaeger, H. and H. Haas. [http://www.sciencemag.org/content/304/5667/78.short "Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication."] Science, 304.5667 (2004): 78-80. </ref> which learn only the output weights on a network with recurrent connections that implement a wide range of time-varying patterns. In this paper, the method of Hessian-free optimization is used instead of these alternatives. <br />
<br />
[[File:RNN.png | frame | centre | A depiction of a recurrent neural network unrolled through three time steps.]]<br />
<br />
= Hessian-Free Optimization = <br />
<br />
While this optimization technique is described elsewhere in Martens (2010) <ref> Martens, J. [http://icml2010.haifa.il.ibm.com/papers/458.pdf "Deep learning via Hessian-free optimization."] ICML, (2010) </ref><br />
, its use is essential to obtaining the successful results reported in this paper. In brief, the technique uses information about the second derivatives of the cost function to perform more intelligent parameter updates. This information is helpful because in cases where the gradient is changing very slowly along a particular dimension, it is more efficient to take larger steps in the direction of descent along that dimension. Alternatively, if the gradient is changing very rapidly along a particular dimension, then it makes sense to take smaller steps to avoid 'bouncing' off of a steep incline in the cost function and moving to a less desirable location in parameter space. The relevant second-order information is computed using the method of finite differences to avoid computing the Hessian of the cost function explicitly. In fact, instead of computing and inverting the Hessian in the update equations, the Gauss-Newton approximation to the Hessian is used, which is quite accurate in practice and much cheaper to compute. <br />
<br />
What is important about this technique is that it provides a solution to the problem of vanishing and exploding gradients during the training of recurrent neural networks. Vanishing gradients are accommodated by descending much more rapidly along the cost function in areas where it has relatively low curvature (e.g., when the cost function is nearly flat), while exploding gradients are accommodated by descending much more slowly in areas where it has relatively high curvature (e.g., near a steep cliff). The figure below illustrates how Hessian-free optimization improves the training of neural networks in general. <br />
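To illustrate how curvature information can be obtained without ever forming the Hessian, the sketch below approximates a Hessian-vector product by finite differences of the gradient, <math>Hv \approx (\nabla f(\theta + \epsilon v) - \nabla f(\theta))/\epsilon</math>, on a toy quadratic whose Hessian is known. This illustrates only the finite-difference idea, not the full Hessian-free procedure (which additionally uses conjugate gradients and the Gauss-Newton approximation):

```python
import numpy as np

# Toy quadratic f(theta) = 0.5 * theta^T A theta, whose true Hessian is A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def grad(theta):
    # Gradient of the quadratic: A theta.
    return A @ theta

theta = np.array([0.5, -1.0])
v = np.array([1.0, 2.0])
eps = 1e-6

# Hessian-vector product via finite differences of the gradient: the full
# (and, for big models, enormous) Hessian matrix is never constructed.
Hv = (grad(theta + eps * v) - grad(theta)) / eps
```

For this quadratic the approximation matches `A @ v` up to floating-point error; for a neural network's cost function the same trick gives the curvature along a chosen direction at the cost of one extra gradient evaluation.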
<br />
[[File:HFF.png | frame | centre | On the left is training with naive gradient descent, and on the right is training via the use of 2nd order information about the cost function.]]<br />
<br />
= Multiplicative Recurrent Neural Networks = <br />
<br />
The authors report that using a standard recurrent neural network trained via Hessian-free optimization produces only mediocre results. As such, they introduce a new architecture called a multiplicative recurrent neural network (MRNN). The motivating intuition behind this architecture is that the input at a given time step should both additively contribute to the hidden state (through the mapping performed by the input-to-hidden weights) and additionally determine the weights on the recurrent connections to the hidden state. This approach came from viewing an RNN as a model of a tree in which each node is a hidden state vector and each edge is labelled by a character that determines how the parent node gives rise to the child node. In other words, the idea is to define a unique weight matrix <math>\ W_{hh} </math> for each possible input. The reason this design is hypothesized to improve the predictive adequacy of the model is due to the idea that the ''conjunction'' of the input at one time step and the hidden state at the previous time step is important. Capturing this conjunction requires the input to influence the contribution of the previous hidden state to the current hidden state. Otherwise, the previous hidden state and the current input will make entirely independent contributions to the calculation of the current hidden state. Formally, this changes the calculation of the hidden state at a given time step as follows:<br />
<br />
<br />
:<math>\ h_t = \tanh(W_{hi}i_t + W^{i_t}_{hh}h_{t-1} + b_h) </math><br />
<br />
<br />
where <math>\ W^{i_t}_{hh} </math> is an input-specific hidden-to-hidden weight matrix. As a first approach to implementing this MRNN, the authors suggest using a tensor of rank 3 to store the hidden-to-hidden weights. The idea is that the tensor stores one weight matrix per possible input; when the input is provided as a one-hot vector, tensor contraction (i.e. a generalization of matrix multiplication) can be used to extract the 'slice' of the tensor that contains the appropriate set of weights. One problem with this approach is that it quickly becomes impractical to store the hidden-to-hidden weights as a tensor if the hidden state has a large number of dimensions. For instance, if a network's hidden layer encodes a vector with 1000 dimensions, then the number of parameters in the tensor that need to be learned will be equal to <math>\ 1000^2 \times N </math>, where <math>\ N </math> is the vocabulary size. In short, this method will add many millions of parameters to a model for a non-trivially sized vocabulary. <br />
<br />
To fix this problem, the tensor is factored using a technique described in Taylor & Hinton (2009) <ref>Taylor, G. and G. Hinton. [http://www.cs.toronto.edu/~fritz/absps/fcrbm_icml.pdf "Factored Conditional Restricted Boltzmann Machines for Modeling Motion Style"] ICML (2009) </ref>. The idea is to define three matrices <math>\ W_{fh} </math>, <math>\ W_{fi} </math>, and <math>\ W_{hf} </math> that approximate the use of a tensor in determining the value of <math>\ W^{i_t}_{hh} </math> as follows:<br />
<br />
<br />
:<math>\ W^{i_t}_{hh} = W_{hf} \cdot diag(W_{fi}i_t) \cdot W_{fh} </math><br />
<br />
<br />
Intuitively, this factorization produces two vectors from the current input state and the previous hidden state, takes their element-wise product, and applies a linear transformation to produce the input to the hidden layer at the current timestep. The triangle units in the figure below indicate where the element-wise product occurs, and the connections into and out of these units are parameterized by the matrices <math>\ W_{fh} </math>, <math>\ W_{fi} </math>, and <math>\ W_{hf} </math>. The element-wise multiplication is implemented by diagonalizing the matrix-vector product <math>\ W_{fi}i_t </math>, and if the dimensionality of this matrix-vector product (i.e. the dimensionality of the layer of multiplicative units) is allowed to be arbitrarily large, then this factorization is just as expressive as using a tensor to store the hidden-to-hidden weights. <br />
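A quick numerical check (with made-up sizes and random weights) that the factored form reproduces the explicit input-specific weight matrix, while only ever computing an element-wise product of two factor vectors:

```python
import numpy as np

rng = np.random.default_rng(2)

V, H, F = 86, 16, 16   # vocabulary, hidden units, and factors (illustrative sizes)

W_hi = rng.standard_normal((H, V)) * 0.01
W_fi = rng.standard_normal((F, V)) * 0.01   # input -> factors
W_fh = rng.standard_normal((F, H)) * 0.01   # previous hidden state -> factors
W_hf = rng.standard_normal((H, F)) * 0.01   # factors -> hidden
b_h = np.zeros(H)

i_t = np.eye(V)[7]                 # one-hot current character
h_prev = rng.standard_normal(H)

# Explicit input-specific recurrent weights, W_hh^{i_t} = W_hf . diag(W_fi i_t) . W_fh ...
W_hh_it = W_hf @ np.diag(W_fi @ i_t) @ W_fh
h_explicit = np.tanh(W_hi @ i_t + W_hh_it @ h_prev + b_h)

# ... computed in practice as an element-wise product of two factor vectors,
# never materializing the per-character matrix.
f_t = (W_fi @ i_t) * (W_fh @ h_prev)
h_factored = np.tanh(W_hi @ i_t + W_hf @ f_t + b_h)

assert np.allclose(h_explicit, h_factored)
```

The factored path costs three matrix-vector products per step regardless of the vocabulary size, versus <math>\ H^2 \times V </math> stored parameters for the full tensor.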
<br />
[[File:MRNN.png | frame | centre | A depiction of a multiplicative recurrent neural network unrolled through three time steps.]]<br />
<br />
In the experiments described below, an MRNN is trained via Hessian-free optimization on sequences of 250 characters. The first 50 characters were used to condition the hidden state, so only 200 predictions are generated per sequence. 1500 hidden units were used, along with 1500 factors (i.e. multiplicative gates, or the triangles in the figure above), yielding an unrolled network of 500 layers if the multiplicative units are treated as forming a layer. Training was performed with a parallelized system consisting of 8 GPUs. A vocabulary of 86 characters was used in all cases.<br />
<br />
= Quantitative Experiments =<br />
<br />
To compare the performance of the MRNN to that of the sequence memoizer and PAQ, three 100 MB datasets were used: a selection of wikipedia articles, a selection of New York Times articles, and a corpus of all available articles published in NIPS and JMLR. The last 10 million characters in each dataset were held out for testing. Additionally, the MRNN was trained on the larger corpora from which the wikipedia text and NYT articles were drawn (i.e. all of wikipedia, and the entire set of NYT articles). <br />
<br />
The models were evaluated by calculating the number of bits per character achieved by each model on the 3 test sets. This metric is essentially a measure of model perplexity, which defines how well a given model predicts the data it is being tested on. If the number of bits per character is high, this means that the model is, on average, highly uncertain about the value of each character in the test set. If the number of bits per character is low, then the model is less uncertain about the value of each character in the test set. One way to think about this quantity is as the average amount of additional information (in bits) needed by the model to exactly identify the value of each character in the test set. So, a lower measure is better, indicating that the model achieves a good representation of the underlying data. (it is sometimes helpful to think of a language model as a compressed representation of a text corpus). <br />
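A minimal sketch of the metric (the per-character probabilities below are invented for illustration): bits per character is the average of <math>-\log_2 p(c_t)</math> over the test characters, where <math>p(c_t)</math> is the probability the model assigned to the character that actually occurred.

```python
import numpy as np

# Model's probability for each character that actually occurred (made-up values).
p_correct = np.array([0.5, 0.25, 0.9, 0.1])

# Bits per character: average surprisal, in bits, over the test characters.
bpc = float(np.mean(-np.log2(p_correct)))
```

A model that always assigned probability 0.5 to the true character would score exactly 1 bit per character; higher confidence in the right characters drives the score down, which is why lower is better.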
<br />
As illustrated in the table below, the MRNN achieves a lower number of bits per character than the hierarchical Bayesian model, but a higher number of bits per character than the PAQ model (which, recall, is not a strictly character-level model). The numbers in brackets indicate the bits per character achieved on the training data, and the column labelled 'Full Set' reports the results of training the MRNN on the full wikipedia and NYT corpora. <br />
<br />
[[File:bits.png | frame | centre | Bits per character achieved by each model on each dataset.]]<br />
<br />
These results indicate that the MRNN beat the existing state-of-the-art for pure character-level language modelling at the time of publication. <br />
<br />
= Qualitative Experiments =<br />
<br />
By examining the output of the MRNN, it is possible to see what kinds of linguistic patterns it is able to learn. Most striking is the fact that the model consistently produces correct words from a fairly sophisticated vocabulary. The model is also able to balance parentheses and quotation marks over many time steps, and it occasionally produces plausible non-words such as 'cryptoliation' and 'homosomalist'. The text in the figure below was produced by running the model in generative mode less than 10 times using the phrase 'The meaning of life is' as an initial input, and then selecting the most interesting output sequence. The model was trained on wikipedia to produce the results in the figure below. The character '?' indicates an unknown item, and some of the spacing and punctuation oddities are due to preprocessing and are apparently common in the dataset. <br />
<br />
[[File:text.png | frame | centre | A selection of text generated by an MRNN initialized with the sequence "The meaning of life is...".]]<br />
<br />
Another interesting qualitative demonstration of the model's abilities involves initializing the model with a more complicated sequence and seeing what sort of continuations it produces. In the figure below, a number of sampled continuations of the phrase 'England, Spain, France, Germany' are shown. Generally, the model is able to provide continuations that preserve the list-like structure of the phrase. Moreover, the model is also able to recognize that the list is a list of locations, and typically offers additional locations as its predicted continuation of the sequence. <br />
<br />
[[File:locations.png | frame | centre | Selections of text generated by an MRNN initialized with the sequence "England, Spain, France, Germany".]]<br />
<br />
What is particularly impressive about these results is the fact that the model is learning a distribution over sequences of characters only. From this distribution, a broad range of syntactic and lexical knowledge emerges. It is also worth noting that it is much more efficient to train a model with a small character-level vocabulary than it is to train a model with a word-level vocabulary (which can have tens of thousands of items). As such, the character-level MRNN is able to scale to large datasets quite well.<br />
<br />
Moreover, the authors find that the MRNN is sensitive to notational cues such as an opening bracket even when the exact string does not occur in the training set. They claim that any method based on precise context matches is fundamentally incapable of utilizing long contexts, because the probability that a long context occurs more than once is very small.<br />
<br />
= Discussion =<br />
<br />
One aspect of this work worth considering is the degree to which input-dependent gating of the information passed from hidden state to hidden state actually improves results over and above a standard recurrent neural network. Presumably, Hessian-free optimization also allows one to successfully train a standard recurrent network, so it would be helpful to see a direct comparison between such a network and an MRNN. MRNNs already learn surprisingly good language models using only 1500 hidden units, and unlike other approaches such as the sequence memoizer and PAQ, they are easy to extend along various dimensions. Without such a comparison, however, it is hard to discern the relative importance of the optimization technique and the network architecture in achieving the good language modelling results reported in this paper.<br />
The MRNN assigns probability to plausible words that do not exist in the training set. This is a desirable property, as it enables the MRNN to deal with real words that it did not see in the training set. One advantage of this model is that it avoids using a huge softmax over all known words by predicting the next word through a sequence of character predictions, whereas some word-level language models instead make up binary spellings of words so that they can be predicted one bit at a time.<br />
<br />
= Bibliography = <br />
<references /></div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=mULTIPLE_OBJECT_RECOGNITION_WITH_VISUAL_ATTENTION&diff=27078mULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION2015-12-04T17:33:10Z<p>Mgohari2: </p>
<hr />
<div>= Introduction =<br />
<br />
Recognizing multiple objects in images has been one of the most important goals of computer vision. Previous work on the classification of sequences of characters often employed a sliding-window detector with an individual character classifier. However, these systems can involve setting components in a case-specific manner for determining possible object locations. In this paper an attention-based model for recognizing multiple objects in images is presented. The proposed model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image. It has been shown that the proposed method is more accurate than the state-of-the-art convolutional networks and uses fewer parameters and less computation.<br />
One of the main drawbacks of convolutional networks (ConvNets) is their poor scalability with increasing input image size, so efficient implementations of these models have become necessary. In this work, the authors take inspiration from the way humans perform visual sequence recognition tasks such as reading: continually moving the fovea to the next relevant object or character, recognizing the individual object, and adding the recognized object to an internal representation of the sequence. The proposed system is a deep recurrent neural network that at each step processes a multi-resolution crop of the input image, called a "glimpse". The network uses information from the glimpse to update its internal representation of the input, and outputs the next glimpse location and possibly the next object in the sequence. The process continues until the model decides that there are no more objects to process.<br />
<br />
= Deep Recurrent Visual Attention Model:=<br />
<br />
For simplicity, the authors first describe how the model can be applied to classifying a single object, and later show how it can be extended to multiple objects. Processing an image <math>x</math> with an attention-based model is a sequential process with <math>N</math> steps, where each step consists of a glimpse. At each step <math>n</math>, the model receives a location <math>l_n</math> along with a glimpse observation <math>x_n</math> taken at location <math>l_n</math>. The model uses the observation to update its internal state and outputs the location <math>l_{n+1}</math> to process at the next time-step. A graphical representation of the proposed model is shown in Figure 1.<br />
<br />
[[File:0.PNG | center]]<br />
<br />
The above model can be broken down into a number of sub-components, each mapping some input into a vector output. In this paper the term “network” is used to describe these sub-components.<br />
<br />
Glimpse Network:<br />
<br />
The job of the glimpse network is to extract a set of useful features from a glimpse of the raw visual input at a given location. The glimpse network is a non-linear function that receives the current input image patch, or glimpse (<math>x_n</math>), and its location tuple (<math>l_n</math>) as input, and outputs a vector encoding which features are present at that location. <br />
There are two separate networks within the glimpse network, each with its own input. The first, which extracts features of the image patch, takes the patch as input and consists of three convolutional hidden layers (without any pooling layers) followed by a fully connected layer. Separately, the location tuple is mapped using a fully connected hidden layer. The element-wise multiplication of the two output vectors then produces the final glimpse feature vector <math>g_n</math>.<br />
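A minimal sketch of how the two streams are combined. Each sub-network is reduced to a random feature vector for illustration; in the actual model the patch stream is convolutional plus fully connected and the location stream is a fully connected layer, as described above:

```python
import numpy as np

rng = np.random.default_rng(3)

D = 12   # glimpse feature dimensionality (illustrative)

# Stand-ins for the outputs of the two sub-networks:
patch_features = rng.standard_normal(D)   # "what": features of the image patch x_n
loc_features = rng.standard_normal(D)     # "where": features of the location tuple l_n

# The final glimpse feature vector is their element-wise product,
# so "what" information is gated by "where" information.
g_n = patch_features * loc_features
```

The multiplicative combination means a feature only survives into <math>g_n</math> when both streams agree it matters, rather than each stream contributing independently as an additive combination would.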
<br />
Recurrent Network:<br />
<br />
The recurrent network aggregates information extracted from the individual glimpses and combines it in a coherent manner that preserves spatial information. The glimpse feature vector <math>g_n</math> from the glimpse network is supplied as input to the recurrent network at each time step.<br />
The recurrent network consists of two recurrent layers. Two outputs of the recurrent layers are defined as <math>r_n^{(1)}</math> and <math>r_n^{(2)}</math>.<br />
<br />
Emission Network:<br />
<br />
The emission network takes the current state of recurrent network as input and makes a prediction on where to extract the next image patch for the glimpse network. It acts as a controller that directs attention based on the current internal states from the recurrent network. It consists of a fully connected hidden layer that maps the feature vector <math>r_n^{(2)}</math> from the top recurrent layer to a coordinate tuple <math>l_{n+1}</math>.<br />
<br />
Context Network:<br />
<br />
The context network provides the initial state for the recurrent network, and its output is used by the emission network to predict the location of the first glimpse. The context network <math>C(\cdot)</math> takes a down-sampled low-resolution version of the whole input image <math>I_{coarse}</math> and outputs a fixed-length vector <math>c_I</math>. The contextual information provides sensible hints on where the potentially interesting regions are in a given image. The context network employs three convolutional layers that map the coarse image <math>I_{coarse}</math> to a feature vector.<br />
<br />
Classification Network:<br />
<br />
The classification network outputs a prediction for the class label y based on the final feature vector <math>r_N^{(1)}</math> of the lower recurrent layer. The classification network has one fully connected hidden layer and a softmax output layer for the class y.<br />
<br />
In order to prevent the model from learning to classify based on contextual information alone, rather than by combining information from different glimpses, the context network and the classification network are connected to different recurrent layers in the deep model. This helps the deep recurrent attention model learn to look at locations that are relevant for classifying the objects of interest.<br />
<br />
= Learning Where and What=<br />
<br />
Given the class labels <math>y</math> of an image <math>I</math>, learning can be formulated as a supervised classification problem with the cross-entropy objective function. The attention model predicts the class label conditioned on intermediate latent location variables <math>l</math> from each glimpse and extracts the corresponding patches. We can thus maximize the likelihood of the class label by marginalizing over the glimpse locations.<br />
<br />
[[File:2eq.PNG | center]]<br />
<br />
Using some simplifications, the practical algorithm to train the deep attention model can be expressed as:<br />
<br />
[[File:3.PNG | center]]<br />
<br />
where <math>\tilde{l^m}</math> is a sampled location for glimpse <math>m</math>. This means that we can sample the glimpse location prediction from the model after each glimpse. In the above equation, the log-likelihood (in the second term) has an unbounded range that can introduce substantial variance in the gradient estimator and sometimes induce an undesirably large gradient update that is backpropagated through the rest of the model. In this paper, that term is therefore replaced with a 0/1 discrete indicator function <math>R</math>, and a baseline technique <math>b</math> is used to reduce the variance of the estimator. <br />
<br />
[[File:4eq.PNG | center]]<br />
<br />
So the gradient update can be expressed as following:<br />
<br />
[[File:5.PNG | center]]<br />
<br />
In fact, by using the 0/1 indicator function, the learning rule from the above equation is equivalent to the REINFORCE learning rule, where <math>R</math> is the reward.<br />
During inference, the feedforward location prediction can be used as a deterministic prediction of the location coordinates at which to extract the next input image patch for the model. Alternatively, the marginalized objective function suggests a procedure for estimating the expected class prediction by using samples of location sequences <math>\{\tilde{l_1^m},\dots,\tilde{l_N^m}\}</math> and averaging their predictions.<br />
<br />
[[File:6.PNG | center]]<br />
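The REINFORCE-with-baseline estimator used above can be illustrated on a toy one-dimensional "location" policy. The Gaussian policy, the reward region, and the baseline value below are all made up for illustration; the point is that subtracting the baseline <math>b</math> reduces the variance of the gradient estimate without changing its expectation:

```python
import numpy as np

rng = np.random.default_rng(4)

mu, sigma, b = 0.0, 1.0, 0.5   # policy mean/std and baseline (all made up)
grads = []
for _ in range(1000):
    l = rng.normal(mu, sigma)                # sampled glimpse location
    R = 1.0 if abs(l - 0.3) < 0.5 else 0.0   # toy 0/1 reward: "looked near the object"
    # Score function for a Gaussian policy: d/dmu log N(l; mu, sigma) = (l - mu) / sigma^2
    grads.append((R - b) * (l - mu) / sigma**2)

grad_estimate = float(np.mean(grads))
```

Here the averaged estimate (noisily) points <math>\mu</math> toward the rewarded region around 0.3; in the paper the same estimator adjusts the emission network's location predictions toward glimpses that lead to correct classifications.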
<br />
= Multi Object/Sequence Classification as a Visual Attention Task=<br />
<br />
The proposed attention model can be easily extended to solve classification tasks involving multiple objects. To train the recurrent network in this case, the multiple object labels for a given image need to be cast into an ordered sequence <math>\{y_1,\dots,y_S\}</math>. Assuming there are <math>S</math> targets in an image, the objective function for the sequential prediction is:<br />
<br />
[[File:7.PNG | center]]<br />
<br />
= Experiments:=<br />
<br />
To show the effectiveness of the deep recurrent attention model (DRAM), multi-object classification tasks are investigated on two different datasets: MNIST and multi-digit SVHN.<br />
<br />
MNIST Dataset Results:<br />
<br />
Two main evaluations of the method are performed using the MNIST dataset:<br />
<br />
1)Learning to find digits<br />
<br />
2)Learning to do addition (The model has to find where each digit is and add them up. The task is to predict the sum of the two digits in the image as a classification problem)<br />
<br />
The results for both experiments are shown in Table 1 and Table 2. As shown in the tables, the DRAM model with a context network significantly outperforms the other models.<br />
<br />
[[File:8.PNG | center]]<br />
<br />
SVHN Dataset Results:<br />
<br />
The publicly available multi-digit street view house number (SVHN) dataset consists of images of digits taken from pictures of house fronts. This experiment is more challenging, and a model was trained to classify all the digits in an image sequentially. Two different models are implemented in this experiment.<br />
First, the label sequence ordering is chosen to go from left to right, following the natural ordering of the house number. In this case, there is a performance gap between the state-of-the-art deep ConvNet and a single DRAM that "reads" from left to right. Therefore, a second recurrent attention model is trained to "read" the house numbers from right to left as a backward model. The forward and backward models can share the same weights for their glimpse networks, but they have different weights for their recurrent and emission networks. The model performance is shown in Table 3:<br />
<br />
[[File:9.PNG | center]]<br />
<br />
As shown in the table, the proposed deep recurrent attention model (DRAM) outperforms the state-of-the-art deep ConvNets on the standard SVHN sequence recognition task.<br />
<br />
= Discussion and Conclusion:=<br />
<br />
Because the recurrent attention models process only a selected subset of the input, they have a lower computational cost than a ConvNet that looks over an entire image. Also, they can naturally work on images of different sizes with the same computational cost, independent of the input dimensionality. Moreover, the attention-based model is less prone to over-fitting than ConvNets, likely because of the stochasticity in the glimpse policy during training. Duvedi and Shah <ref><br />
Duvedi C and Shah P. [http://vision.stanford.edu/teaching/cs231n/reports/cduvedi_report.pdf Multi-Glance Attention Models For Image Classification ], <br />
</ref> developed a two-glance approach that uses a combination of convolutional neural networks and recurrent neural networks. In this approach, the RNN generates a location for a glimpse within the image, and a CNN extracts features from a fixed-size glimpse at the selected location. The RNN then generates the location of the next glance, and the process continues over the whole picture, combining the features of all relevant patches together.<br />
<br />
<br />
= References=<br />
<references /></div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=mULTIPLE_OBJECT_RECOGNITION_WITH_VISUAL_ATTENTION&diff=27077mULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION2015-12-04T17:28:54Z<p>Mgohari2: </p>
<hr />
<div>= Introduction =<br />
<br />
Recognizing multiple objects in images has been one of the most important goals of computer vision. Previous work in this classification of sequences of characters often employed a sliding window detector with an individual character-classifier. However, these systems can involve setting components in a case-specific manner for determining possible object locations. In this paper an attention-based model for recognizing multiple objects in images is presented. The proposed model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image. It has been shown that the proposed method is more accurate than the state-of-the-art convolutional networks and uses fewer parameters and less computation.<br />
One of the main drawbacks of convolutional networks (ConvNets) is their poor scalability with increasing input image size, so efficient implementations of these models have become necessary. In this work, the authors take inspiration from the way humans perform visual sequence recognition tasks such as reading: continually moving the fovea to the next relevant object or character, recognizing the individual object, and adding the recognized object to an internal representation of the sequence. The proposed system is a deep recurrent neural network that at each step processes a multi-resolution crop of the input image, called a “glimpse”. The network uses information from the glimpse to update its internal representation of the input, and outputs the next glimpse location and possibly the next object in the sequence. The process continues until the model decides that there are no more objects to process.<br />
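As an illustration of the glimpse idea, the multi-resolution crop can be sketched as a few concentric patches around a location, with the coarser ones downsampled to a common resolution. The patch sizes and block-averaging scheme below are illustrative choices, not taken from the paper:<br />

```python
import numpy as np

def extract_glimpse(image, center, size=8, scales=2):
    """Extract `scales` concentric square crops around `center`, each
    twice as large as the previous, downsampled by block averaging to
    a common size x size resolution (illustrative scheme)."""
    r, c = center
    patches = []
    for s in range(scales):
        half = size * (2 ** s) // 2
        # pad so crops near the border stay in bounds
        padded = np.pad(image, half, mode="constant")
        crop = padded[r:r + 2 * half, c:c + 2 * half]
        k = 2 ** s  # block-averaging factor back to size x size
        crop = crop.reshape(size, k, size, k).mean(axis=(1, 3))
        patches.append(crop)
    return np.stack(patches)  # shape: (scales, size, size)

image = np.arange(32 * 32, dtype=float).reshape(32, 32)
glimpse = extract_glimpse(image, center=(16, 16))
print(glimpse.shape)  # (2, 8, 8)
```

The finest patch keeps full resolution around the fixation point, while the coarser patch summarizes a wider surrounding region at the same cost.<br />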
<br />
= Deep Recurrent Visual Attention Model:=<br />
<br />
For simplicity, the authors first describe how the model can be applied to classifying a single object and later show how it can be extended to multiple objects. Processing an image <math>x</math> with an attention-based model is a sequential process with N steps, where each step consists of a glimpse. At each step <math>n</math>, the model receives a location <math>l_n</math> along with a glimpse observation <math>x_n</math> taken at location <math>l_n</math>. The model uses the observation to update its internal state and outputs the location <math>l_{n+1}</math> to process at the next time-step. A graphical representation of the proposed model is shown in Figure 1.<br />
<br />
[[File:0.PNG | center]]<br />
<br />
The above model can be broken down into a number of sub-components, each mapping some input into a vector output. In this paper the term “network” is used to describe these sub-components.<br />
<br />
Glimpse Network:<br />
<br />
The job of the glimpse network is to extract a set of useful features from the location of a glimpse of the raw visual input. The glimpse network is a non-linear function that receives the current input image patch, or glimpse (<math>x_n</math>), and its location tuple (<math>l_n</math>) as input and outputs a feature vector encoding which features are present at that location. <br />
There are two separate networks within the glimpse network, each of which has its own input. The first, which extracts features of the image patch, takes an image patch as input and consists of three convolutional hidden layers without any pooling layers, followed by a fully connected layer. Separately, the location tuple is mapped using a fully connected hidden layer. The element-wise multiplication of the two output vectors then produces the final glimpse feature vector <math>g_n</math>.<br />
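The combination of the two pathways can be sketched as follows; the single linear layers below are illustrative stand-ins for the trained convolutional and fully connected sub-networks, and all dimensions are made up:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Stand-ins for the trained sub-networks: a single linear layer replaces
# the three-conv-layer "what" pathway, purely for shape illustration.
d_feat = 16
W_what = rng.normal(size=(d_feat, 8 * 8))   # image-patch pathway
W_where = rng.normal(size=(d_feat, 2))      # location-tuple pathway

def glimpse_features(patch, location):
    what = relu(W_what @ patch.ravel())     # features of the crop
    where = relu(W_where @ location)        # features of (row, col)
    return what * where                     # element-wise combination

g = glimpse_features(rng.normal(size=(8, 8)), np.array([0.25, -0.5]))
print(g.shape)  # (16,)
```

The element-wise product gates "what" features by "where" features, so the same visual pattern can produce different glimpse vectors depending on its location.<br />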
<br />
Recurrent Network:<br />
<br />
The recurrent network aggregates information extracted from the individual glimpses and combines the information in a coherent manner that preserves spatial information. The glimpse feature vector <math>g_n</math> from the glimpse network is supplied as input to the recurrent network at each time step.<br />
The recurrent network consists of two recurrent layers. Two outputs of the recurrent layers are defined as <math>r_n^{(1)}</math> and <math>r_n^{(2)}</math>.<br />
<br />
Emission Network:<br />
<br />
The emission network takes the current state of recurrent network as input and makes a prediction on where to extract the next image patch for the glimpse network. It acts as a controller that directs attention based on the current internal states from the recurrent network. It consists of a fully connected hidden layer that maps the feature vector <math>r_n^{(2)}</math> from the top recurrent layer to a coordinate tuple <math>l_{n+1}</math>.<br />
<br />
Context Network:<br />
<br />
The context network provides the initial state for the recurrent network, and its output is used by the emission network to predict the location of the first glimpse. The context network C(.) takes a down-sampled low-resolution version of the whole input image <math>I_{coarse}</math> and outputs a fixed-length vector <math>c_I</math>. The contextual information provides sensible hints on where the potentially interesting regions are in a given image. The context network employs three convolutional layers that map the coarse image <math>I_{coarse}</math> to a feature vector.<br />
<br />
Classification Network:<br />
<br />
The classification network outputs a prediction for the class label y based on the final feature vector <math>r_N^{(1)}</math> of the lower recurrent layer. The classification network has one fully connected hidden layer and a softmax output layer for the class y.<br />
<br />
To prevent the model from learning from contextual information alone, rather than by combining information from different glimpses, the context network and classification network are connected to different recurrent layers in the deep model. This helps the deep recurrent attention model learn to look at locations that are relevant for classifying objects of interest.<br />
<br />
= Learning Where and What=<br />
<br />
Given the class label <math>y</math> of an image <math>I</math>, learning can be formulated as a supervised classification problem with the cross-entropy objective function. The attention model predicts the class label conditioned on intermediate latent location variables <math>l</math> from each glimpse and extracts the corresponding patches. We can thus maximize the likelihood of the class label by marginalizing over the glimpse locations.<br />
<br />
[[File:2eq.PNG | center]]<br />
<br />
Using some simplifications, the practical algorithm to train the deep attention model can be expressed as:<br />
<br />
[[File:3.PNG | center]]<br />
<br />
where <math>\tilde{l^m}</math> is a sampled location for glimpse “m”. This means that we can sample the glimpse location prediction from the model after each glimpse. In the above equation, the log likelihood (in the second term) has an unbounded range that can introduce high variance into the gradient estimator and sometimes induce an undesirably large gradient update that is backpropagated through the rest of the model. In this paper, that term is therefore replaced with a 0/1 discrete indicator function <math>R</math>, and a baseline technique <math>b</math> is used to reduce the variance of the estimator. <br />
<br />
[[File:4eq.PNG | center]]<br />
<br />
So the gradient update can be expressed as follows:<br />
<br />
[[File:5.PNG | center]]<br />
<br />
In fact, by using the 0/1 indicator function, the learning rule from the above equation is equivalent to the REINFORCE learning model where R is the expected reward.<br />
During inference, the feed-forward location prediction can be used as a deterministic prediction of<br />
the location coordinates to extract the next input image patch for the model. Alternatively, the marginalized objective function suggests a procedure to estimate the expected class prediction by using samples of location sequences <math>\{\tilde{l_1^m},\dots,\tilde{l_N^m}\}</math> and averaging their predictions.<br />
<br />
[[File:6.PNG | center]]<br />
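As a sketch of the REINFORCE-style estimator described above, the following toy example estimates the gradient of the expected reward for a Gaussian location policy using the score function scaled by <math>(R - b)</math>. The reward definition (1 if the sampled location lands near a target) and all numbers are invented stand-ins for classification success, not the paper's setup:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Score-function (REINFORCE) estimate of the gradient w.r.t. the mean
# of a Gaussian glimpse-location policy:
#   grad_mu log N(l | mu, sigma^2 I) = (l - mu) / sigma^2,
# scaled by (R - b), where R is a 0/1 reward and b a baseline that
# reduces the variance of the estimator.
mu, sigma, b = np.zeros(2), 0.5, 0.5
target = np.array([0.3, -0.2])

locations = mu + sigma * rng.normal(size=(1000, 2))   # l ~ N(mu, sigma^2 I)
R = (np.linalg.norm(locations - target, axis=1) < 0.3).astype(float)
score = (locations - mu) / sigma ** 2
grad = ((R - b)[:, None] * score).mean(axis=0)        # gradient estimate
print(grad.shape)  # (2,)
```

Subtracting the baseline b leaves the estimator unbiased (the score function has zero mean) while shrinking the magnitude of each per-sample term, which is exactly the variance-reduction role it plays in the paper.<br />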
<br />
= Multi Object/Sequence Classification as a Visual Attention Task=<br />
<br />
The proposed attention model can be easily extended to solve classification tasks involving multiple objects. To train the recurrent network in this case, the multiple object labels for a given image need to be cast into an ordered sequence <math>\{y_1, \dots, y_S\}</math>. Assuming there are S targets in an image, the objective function for the sequential prediction is:<br />
<br />
[[File:7.PNG | center]]<br />
<br />
= Experiments:=<br />
<br />
To show the effectiveness of the deep recurrent attention model (DRAM), multi-object classification tasks are investigated on two different datasets: MNIST and multi-digit SVHN.<br />
<br />
MNIST Dataset Results:<br />
<br />
Two main evaluations of the method are performed using the MNIST dataset:<br />
<br />
1) Learning to find digits<br />
<br />
2) Learning to do addition (the model has to find where each digit is and add them up; the task is to predict the sum of the two digits in the image as a classification problem)<br />
<br />
The results for both experiments are shown in Table 1 and Table 2. As shown in the tables, the DRAM model with a context network significantly outperforms the other models.<br />
<br />
[[File:8.PNG | center]]<br />
<br />
SVHN Dataset Results:<br />
<br />
The publicly available multi-digit street view house number (SVHN) dataset consists of images of digits taken from pictures of house fronts. This experiment is more challenging, and a model is trained to classify all the digits in an image sequentially. Two different models are implemented in this experiment:<br />
First, the label sequence ordering is chosen to go from left to right, following the natural ordering of house numbers. In this case, there is a performance gap between the state-of-the-art deep ConvNet and a single DRAM that “reads” from left to right. Therefore, a second recurrent attention model that “reads” the house numbers from right to left is trained as a backward model. The forward and backward models can share the same weights for their glimpse networks, but they have different weights for their recurrent and emission networks. The model performance is shown in Table 3:<br />
<br />
[[File:9.PNG | center]]<br />
<br />
As shown in the table, the proposed deep recurrent attention model (DRAM) outperforms the state-of-the-art deep ConvNets on the standard SVHN sequence recognition task.<br />
<br />
= Discussion and Conclusion:=<br />
<br />
The recurrent attention models, which process only a selected subset of the input, have a lower computational cost than a ConvNet that looks over an entire image. Also, they can naturally work on images of different sizes with the same computational cost, independent of the input dimensionality. Moreover, the attention-based model is less prone to over-fitting than ConvNets, likely because of the stochasticity in the glimpse policy during training. Duvedi and Shah <ref><br />
Duvedi C and Shah P. [http://vision.stanford.edu/teaching/cs231n/reports/cduvedi_report.pdf Multi-Glance Attention Models For Image Classification ]<br />
</ref> developed a two-glance approach that uses a combination of convolutional neural nets and recurrent neural nets. In this approach an RNN generates a location for a glimpse within the image and a CNN extracts features from a fixed-size glimpse at the selected location. The RNN then generates the location of the next glimpse. The process continues over the whole picture, combining the features of all relevant patches together.</div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=semi-supervised_Learning_with_Deep_Generative_Models&diff=27076semi-supervised Learning with Deep Generative Models2015-12-04T16:39:20Z<p>Mgohari2: </p>
<hr />
<div>= Introduction =<br />
<br />
Large labelled data sets have led to massive improvements in the performance of machine learning algorithms, especially supervised neural networks. However, the world in general is not labelled, and there exists a far greater amount of unlabelled data than labelled data. A common situation is to have a comparatively small quantity of labelled data paired with a larger amount of unlabelled data. This leads to the idea of a semi-supervised learning model, where the unlabelled data is used to prime the model for relevant features and the labels are then learned for classification. A prominent example of this type of model is the restricted-Boltzmann-machine-based Deep Belief Network (DBN), where layers of RBMs are trained to learn unsupervised features of the data and a final classification layer is then applied so that labels can be assigned. <br />
Unsupervised learning techniques sometimes create what is known as a generative model, which yields a joint distribution <math>P(x, y)</math> (which can be sampled from). This is contrasted by the supervised discriminative model, which creates a conditional distribution <math>P(y | x)</math>. The paper combines these two methods to achieve high performance on benchmark tasks and uses deep neural networks in an innovative manner to create a layered semi-supervised classification/generation model.<br />
<br />
= Current Models and Limitations =<br />
<br />
The paper claims that existing unlabelled-data models do not scale well to very large sets of unlabelled data. One example discussed is the Transductive SVM, which the authors claim does not scale well and is a problem to optimize. Graph-based models suffer from sensitivity to their graph structure, which may make them rigid. Finally, the authors briefly discuss other neural-network-based methods such as the Manifold Tangent Classifier, which uses contractive auto-encoders (CAEs) to deduce the manifold on which the data lies. Based on the manifold hypothesis, similar data should not lie far from the manifold, and a method called TangentProp can then be used to train a classifier based on the manifold of the data. <br />
<br />
= Proposed Method =<br />
<br />
Rather than use the methods mentioned above, the authors suggest that generative models based on neural networks would be beneficial. Current generative models, however, lack strong inference and scalability. The paper proposes a method for semi-supervised classification that uses variational inference and employs deep neural networks.<br />
<br />
== Latent Feature Discriminative Model (M1) ==<br />
<br />
The first sub-model that is described is used to model latent variables ('''z''') that embed features of the unlabelled data. Classification for this model is done separately based on the learned features from the unlabelled data. The key to this model is that the non-linear transform to capture features is a deep neural network. The generative model is based on the following equations: <br />
<br />
<div style="text-align: center;"><br />
<math>p(\mathbf{z}) = \mathcal{N}(\mathbf{z}|\mathbf{0,I})</math> <br />
<br />
<math>p(\mathbf{x|z}) = f(\mathbf{x};\mathbf{z,\theta})</math><br />
</div><br />
<br />
f is a likelihood function based on the parameters <math>\theta</math>. The parameters are tuned by a deep neural network. The posterior distribution <math>p(\mathbf{z}|\mathbf{x})</math> is sampled to train an arbitrary classifier for class labels <math> y </math>. This approach offers a substantial improvement in the performance of SVMs.<br />
<br />
== Generative Semi-Supervised Model (M2) ==<br />
<br />
The second model is also based on a latent variable '''z''', but the class label <math>y</math> is additionally treated as a latent variable: when y is available it is observed during training, and when it is missing it is treated as latent alongside '''z'''. The following equations describe the generative process, where <math>Cat(y|\mathbf{\pi})</math> is a multinomial distribution. f is used similarly to M1 but with an extra input. Classification is treated as inference, integrating over the class of an unlabelled data sample when y is not available, usually with the posterior <math>p_{\theta}(y|\mathbf{x})</math>.<br />
<br />
<br />
<div style="text-align: center;"><br />
<math>p(y) = Cat(y|\mathbf{\pi})</math><br />
<br />
<math>p(\mathbf{z}) = \mathcal{N}(\mathbf{z}|\mathbf{0,I})</math><br />
<br />
<math>p_{\theta}(\mathbf{x}|y, \mathbf{z}) = f(\mathbf{x};y,\mathbf{z,\theta})</math><br />
</div><br />
Another way to see this model is as a hybrid continuous-discrete mixture model, where the parameters are shared between the different components of the mixture.<br />
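The generative process of M2 can be sketched by ancestral sampling; the linear decoder below is an illustrative stand-in for the deep network f, with made-up dimensions and weights:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

n_classes, d_z, d_x = 3, 2, 5
pi = np.full(n_classes, 1.0 / n_classes)
# Stand-in decoder weights for f(x; y, z, theta); in the paper f is a
# deep neural network, here a single linear map for illustration.
W_y = rng.normal(size=(d_x, n_classes))
W_z = rng.normal(size=(d_x, d_z))

def sample_x():
    y = rng.choice(n_classes, p=pi)      # y ~ Cat(pi)
    z = rng.normal(size=d_z)             # z ~ N(0, I)
    y_onehot = np.eye(n_classes)[y]
    x_mean = W_y @ y_onehot + W_z @ z    # parameters of p(x | y, z)
    return y, x_mean

y, x = sample_x()
print(y, x.shape)
```

Because y and z enter the decoder separately, the class label and the remaining style variation are modelled as independent sources, which is what makes the mixture-model reading above natural.<br />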
<br />
== Stacked Generative Semi-Supervised Model (M1+M2) == <br />
<br />
The two aforementioned models are concatenated to form the final model. M1 is learned first, and M2 then uses the latent variables <math>\mathbf{z_1}</math> from model M1 as its data, as opposed to the raw values '''x'''. The following equations describe the entire model. The distributions <math>p_{\theta}(\mathbf{z1}|y,\mathbf{z2})</math> and <math>p_{\theta}(\mathbf{x|z1})</math> are parametrized as deep neural networks. <br />
<br />
<div style="text-align: center;"><br />
<math>p_{\theta}(\mathbf{x}, y, \mathbf{z1, z2}) = p(y)p(\mathbf{z2})p_{\theta}(\mathbf{z1}|y, \mathbf{z2})p_{\theta}(\mathbf{x|z1})</math><br />
<br />
<br />
</div><br />
<br />
The problem of intractable posterior distributions is solved with the work of Kingma and Welling using variational inference. These inference networks are not described in detail in the paper. The following algorithms show how the optimization for the methods is performed. <br />
<br />
<br />
[[File:Kingma_2014_1.png |centre|thumb|upright=3|]]<br />
<br />
The posterior distributions are, as usual, intractable, but this problem is resolved through the use of a fixed-form distribution <math>q_{\phi}(\mathbf{z|x})</math>, with parameters <math>\phi</math>, that approximates <math>p(\mathbf{z|x})</math>. The distribution <math>q_{\phi}</math> is constructed as an inference network, which allows for the computation of global parameters and does not require an optimization for each individual data point.<br />
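Sampling from <math>q_{\phi}</math> is typically done with the reparameterization trick of Kingma and Welling, writing <math>z = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon</math> with <math>\epsilon \sim \mathcal{N}(0, I)</math>, so that gradients can flow through <math>\mu_{\phi}</math> and <math>\sigma_{\phi}</math>. A minimal sketch with a single linear layer as an illustrative stand-in for the inference network:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

d_x, d_z = 4, 2
# Stand-in parameters phi of the inference network q_phi(z|x);
# a single linear layer replaces the MLP for illustration.
W_mu = rng.normal(size=(d_z, d_x))
W_logsig = rng.normal(size=(d_z, d_x)) * 0.1

def sample_z(x):
    mu = W_mu @ x
    sigma = np.exp(W_logsig @ x)
    eps = rng.normal(size=d_z)      # noise is external to phi,
    return mu + sigma * eps         # so gradients flow through mu, sigma

x = rng.normal(size=d_x)
zs = np.stack([sample_z(x) for _ in range(5000)])
print(zs.mean(axis=0))  # close to W_mu @ x
```

Averaging many reparameterized samples recovers the mean of the Gaussian posterior approximation, while each individual sample remains a differentiable function of the parameters.<br />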
<br />
[[File:kingma_2014_4.png |centre|]]<br />
<br />
In the equations, <math>\,\sigma_{\phi}(x)</math> is a vector of standard deviations, <math>\,\pi_{\theta}(x)</math> is a probability vector, and the functions <math>\,\mu_{\phi}(x), \sigma_{\phi}(x) </math> and <math> \,\pi_{\theta}(x)</math> are treated as MLPs for optimization.<br />
<br />
The above algorithm is not more computationally expensive than approaches based on autoencoders or neural models, and has the advantage of being fully probabilistic. The complexity of a single joint update of M<sub>1</sub> can be written as C<sub>M1</sub> = MSC<sub>MLP</sub>, where M is the batch size, S is the number of samples of ε, and C<sub>MLP</sub> has the form O(KD<sup>2</sup>), where K is the number of layers in the model and D is the average dimension of the layers. The complexity for M<sub>2</sub> has the form LC<sub>M1</sub>, where L is the number of labels. All of the above models can be trained with the EM algorithm, stochastic gradient variational Bayes, or stochastic backpropagation.<br />
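The complexity expressions above can be made concrete with a back-of-the-envelope calculation; the layer count, layer width, batch size, and label count below are invented for illustration:<br />

```python
# Back-of-the-envelope cost of one joint update, following
# C_MLP ~ K * D^2, C_M1 = M * S * C_MLP, C_M2 = L * C_M1.
# All numbers are illustrative, not taken from the paper.
K, D = 3, 500        # layers and average layer width of the MLPs
M, S = 100, 1        # minibatch size and samples of epsilon
L = 10               # number of labels (e.g. ten digit classes)

C_MLP = K * D ** 2
C_M1 = M * S * C_MLP
C_M2 = L * C_M1
print(C_MLP, C_M1, C_M2)  # 750000 75000000 750000000
```

The L factor in C<sub>M2</sub> comes from marginalizing the discrete label for unlabelled points, which is why the label count multiplies the M1 cost.<br />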
<br />
= Results =<br />
<br />
The complexity of M1 can be estimated by using the complexity of the MLP used for the parameters which is equal to <math>C_{MLP} = O(KD^2)</math> with K is the number of layers and D is the average of the neurons in each layer of the network. The total complexity is <math>C_{M1}=MSC_{MLP}</math> with M = size of the mini-batch and S is the number of samples. Similarly the complexity of M2 is <math>C_{M2}=LC_{M1}</math>, where L is the number of labels. Therefore the combined complexity of the model is just a combination of these two complexities. This is equivalent to the lowest complexities of similar models, however, this approach achieves better results as seen in the following table.<br />
<br />
The results are better across all labelled set sizes for the M1+M2 model and drastically better for when the number of labelled data samples is very small (100 out of 50000). <br />
<br />
[[File:kingma_2014_2.png | centre|]]<br />
<br />
The following figure demonstrates the model's ability to generate images through conditional generation. The class label was fixed and then the latent variables, '''z''', were altered. The figure shows how the latent variables were varied and how the generated digits are similar for similar values of '''z'''s. Parts b and c of the figure use a test image to generate images that belong to a similar set of '''z''' values (images that are similar). <br />
<br />
[[File:kingma_2014_3.png |thumb|upright=3|centre|]]<br />
<br />
A commendable part of this paper is that they have actually included their [http://github.com/dpkingma/nips14-ssl source code]. <br />
<br />
= Conclusions and Critique =<br />
<br />
The results using this method are obviously impressive, and the fact that the model achieves this with computation times comparable to the other models is notable. The heavy use of approximate inference methods shows great promise for improving generative models and thus semi-supervised methods. The authors discuss the potential of combining this method with the supervised methods that have given state-of-the-art results in image processing, namely convolutional neural networks. This might be possible as all parameters in their models are optimized using neural networks. The final model acts as an approximate Bayesian inference model. <br />
<br />
The architecture of the model is not made very explicit in the paper; a diagram showing the layout of the entire model would have aided understanding. Another weak point is that the authors fail to compare their method to existing tractable-inference neural network methods: there is no comparison to Sum Product Networks or Deep Belief Networks.</div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Neural_Nets_as_a_Method_for_Quantitative_Structure%E2%80%93Activity_Relationships&diff=26842deep Neural Nets as a Method for Quantitative Structure–Activity Relationships2015-11-23T00:56:19Z<p>Mgohari2: </p>
<hr />
<div>== Introduction ==<br />
This is a summary of the paper "Deep Neural Nets as a Method for Quantitative Structure−Activity Relationships" by Ma et al. <ref> Ma J, Sheridan R. et al. [http://pubs.acs.org/doi/pdf/10.1021/ci500747n.pdf "QSAR deep nets"] Journal of Chemical Information and Modeling. 2015, 55, 263-274</ref>. The paper presents the application of machine learning methods, specifically Deep Neural Networks <ref> Hinton, G. E.; Osindero, S.; Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation 2006, 18, 1527−1554</ref> and Random Forest models <ref> Breiman L. Random Forests, Machine Learning. 2001, 45, 5-32</ref>, in the pharmaceutical industry. To discover a drug, one must select, among chemical compounds with different molecular structures, the combination that achieves the best biological activity. Currently, SAR (QSAR) models are routinely used for this purpose. Structure-Activity Relationship (SAR), or Quantitative SAR (QSAR), is an approach designed to find relationships between the chemical structure and the biological activity (or target property) of studied compounds. SAR models are classification or regression models whose predictors consist of physico-chemical properties or theoretical molecular descriptors and whose response variable can be a biological activity of the chemicals, such as the concentration of a substance required to produce a certain biological response. The basic idea behind these methods is that the activity of molecules is reflected in their structure, and similar molecules have similar activity. So if we learn the activity of a set of molecular structures (or combinations of molecules), we can predict the activity of similar molecules.
QSAR methods are particularly computer-intensive or require the adjustment of many sensitive parameters to achieve good predictions. In this sense, machine learning methods can be helpful, and two of them, support vector machines (SVM) and random forests (RF), are commonly used <ref>Svetnik, V. et al., [http://pubs.acs.org/doi/pdf/10.1021/ci034160g.pdf Random forest: a classification and regression tool for compound classification and QSAR modeling], J. Chem. Inf. Comput. Sci. 2003, 43, 1947−1958</ref>. In this paper the authors investigate the prediction performance of DNNs as a QSAR method and compare it with the performance of RF, which is considered something of a gold standard in this field. <br />
<br />
<br />
== Motivation ==<br />
At the first stage of drug discovery there is a huge number of candidate compounds that can be combined to produce a new drug. This process may involve a large number of compounds (>100,000) and a large number of descriptors (several thousand) with different biological activities. Predicting all biological activities for all compounds would require a large number of experiments. In silico discovery and optimization algorithms can substantially reduce the experimental work that needs to be done. It was hypothesized that DNN models outperform RF models. <br />
<br />
== Methods ==<br />
In order to compare the prediction performance of the methods, DNN and RF models were fitted to 15 data sets from a pharmaceutical company, Merck. The smallest data set has 2092 molecules with 4596 unique AP and DP descriptors. Each molecule is represented by a list of features, i.e. “descriptors” in QSAR nomenclature. The descriptors are substructure descriptors (e.g., atom pairs (AP), MACCS keys, circular fingerprints, etc.) and donor descriptors (DP). Both descriptors are of the following form:<br />
<br />
atom type i − (distance in bonds) − atom type j<br />
<br />
where, for AP, the atom type includes the element, the number of non-hydrogen neighbors, and the number of pi electrons; for DP, the atom type is one of seven (cation, anion, neutral donor, neutral acceptor, polar, hydrophobe, and other). A separate group of 15 different data sets, the Additional Data Sets, was used to validate the conclusions acquired from the Kaggle data sets. Each of these data sets was split into a training and a test set. The metric used to evaluate the prediction performance of the methods is the coefficient of determination (<math>R^2</math>). <br />
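As a reminder of the evaluation metric, the coefficient of determination in its <math>1 - SS_{res}/SS_{tot}</math> form can be computed as below; note that some QSAR work instead reports the squared Pearson correlation between observed and predicted activities, and the paper's exact variant is not restated here:<br />

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
assert r_squared(y, y) == 1.0  # perfect prediction
print(r_squared(y, np.array([1.1, 2.0, 2.9, 4.2])))  # ≈ 0.988
```

A value of 1 means perfect prediction on the test set, 0 means the model does no better than predicting the mean activity, and negative values are possible for models worse than the mean.<br />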
<br />
To run RF, 100 trees were generated with m/3 descriptors used at each branch point, where m was the number of unique descriptors in the training set. Tree nodes with 5 or fewer molecules were not split further. The trees were parallelized, one tree per processor on a cluster, so that larger data sets could be run in a reasonable time.<br />
<br />
The DNNs, with input descriptors X of a molecule and outputs of the form <math>O=f(\sum_{i=1}^{N} w_ix_i+b)</math>, were fitted to the data sets. Since many different parameters, such as the number of layers and neurons, influence the performance of a deep neural net, Ma and his colleagues performed a sensitivity analysis. They trained 71 DNNs with different parameter settings for each data set. The parameters considered were related to: <br />
<br />
-Data (descriptor transformation: no transformation, logarithmic transformation, or binary transformation). <br />
<br />
-Network architecture: number of hidden layers, number of neurons in each hidden layer.<br />
<br />
-Activation functions: sigmoid or rectified linear unit.<br />
<br />
-The DNN training strategy: single training set or joint from multiple sets, percentage of neurons to drop-out in each layer.<br />
<br />
-The mini-batched stochastic gradient descent procedure in the BP algorithm: the minibatch size, number of epochs<br />
<br />
-Control of the gradient descent optimization procedure: learning rate, momentum strength, and weight cost strength.<br />
<br />
In addition to the effect of these parameters on the DNN, the authors were interested in evaluating the consistency of results across a diverse set of QSAR tasks. Due to the time-consuming process of evaluating the effect of the large number of adjustable parameters, a reasonable number of parameter settings was selected by adjusting the values of one or two parameters at a time, and then calculating the <math>R^2</math> for DNNs trained with the selected parameter settings. These results allowed them to focus on a smaller number of parameters, and finally to generate a set of recommended values for all algorithmic parameters, which can lead to consistently good predictions. <br />
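The enumeration of parameter settings for such a sensitivity analysis can be sketched as a full grid, as below; the grid values are illustrative and not the authors' exact choices:<br />

```python
from itertools import product

# Hypothetical grid over a few of the adjustable parameters the paper
# varies; the specific values are illustrative, not the authors' grid.
grid = {
    "transform":     ["none", "log", "binary"],
    "hidden_layers": [1, 2, 3, 4],
    "neurons":       [12, 250, 1000, 4000],
    "activation":    ["sigmoid", "relu"],
}

# One dict per parameter setting: the full Cartesian product.
settings = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(settings))  # 3 * 4 * 4 * 2 = 96
```

A full product grows multiplicatively with each added parameter, which is why the authors instead vary only one or two parameters at a time around a baseline setting.<br />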
<br />
== Results ==<br />
<br />
For the first objective of this paper, comparing the performance of DNNs to RF, over 50 DNNs were trained using different parameter settings. These parameter settings were arbitrarily selected, but they attempted to cover a sufficient range of values for each adjustable parameter. Figure 1 shows the difference in <math>R^2</math> between DNNs and RF for each Kaggle data set. Each column represents a QSAR data set, and each circle represents the improvement of a DNN over RF.<br />
<br />
<br />
<center><br />
[[File: fig1.PNG | frame | center |Figure 1. Overall DNN vs RF using arbitrarily selected parameter values. Each column represents a QSAR data set, and each circle represents the<br />
improvement, measured in <math>R^2</math>, of a DNN over RF ]]<br />
</center><br />
<br />
Comparing the performance of the different models shows that even when the worst DNN parameter setting was used for each QSAR task, the average <math>R^2</math> would be degraded only from 0.423 to 0.412, merely a 2.6% reduction. These results suggest that DNNs can generally outperform RF (see the table below).<br />
<br />
<br />
<center><br />
[[File: table1.PNG | frame | center |Table 1. comparing test <math>R^2</math> of different models ]]<br />
</center><br />
<br />
The difference in <math>R^2</math> between DNN and RF obtained by changing the network architecture is shown in Figure 2. To limit the number of parameter combinations, the number of neurons was fixed to be the same in each hidden layer. Thirty-two DNNs were trained for each data set by varying the number of hidden layers and the number of neurons per layer while the other key adjustable parameters were kept unchanged. It is seen that when the number of hidden layers is two, having a small number of neurons in the layers degrades the predictive capability of DNNs. It can also be seen that, given any number of hidden layers, once the number of neurons per layer is sufficiently large, increasing it further has only a marginal benefit. In Figure 2 we can see that a neural network with only one hidden layer and 12 neurons achieved the same average predictive capability as RF. This size of neural network is indeed comparable with that of the classical neural networks used in QSAR.<br />
<br />
<center><br />
[[File: fig2.PNG | frame | center |Figure 2. Impacts of Network Architecture. Each marker in the plot represents a choice of DNN network architecture. The markers share the same number of hidden layers are connected with a line. The measurement (i.e., y-axis) is the difference of the mean R2 between DNNs and RF. ]]<br />
</center><br />
<br />
To decide which activation function, Sigmoid or ReLU, performs better, at least 15 pairs of DNNs were trained for each data set. Each pair of DNNs shared the same adjustable parameter settings, except that one DNN used ReLU as the activation function while the other used the Sigmoid function. The data sets where ReLU is significantly better than Sigmoid are colored blue and marked at the bottom with “+”s; the differences were tested with a one-sample Wilcoxon test. In contrast, the data set where Sigmoid is significantly better than ReLU is colored black and marked at the bottom with “−”s (Figure 3). In 53.3% (8 out of 15) of the data sets, ReLU is statistically significantly better than Sigmoid. Overall, ReLU improves the average <math>R^2</math> over Sigmoid by 0.016. <br />
<br />
<center><br />
[[File: fig3.PNG | frame | center |Figure 3. Choice of activation functions. Each column represents a QSAR data set, and each circle represents the difference, measured in <math>R^2</math>, of a pair of<br />
DNNs trained with ReLU and Sigmoid, respectively ]]<br />
</center><br />
<br />
Figure 4 presents the difference between joint DNNs trained on multiple data sets and individual DNNs trained on single data sets. Averaged over all data sets, joint DNNs seem to perform better than individually trained ones. However, the size of the training sets plays a critical role in whether a joint DNN is beneficial. For the two largest data sets (i.e., 3A4 and LOGD), the individual DNNs seem better, indicating that joint DNNs are more suitable for data sets that are not very large. <br />
<br />
<center><br />
[[File: fig4.PNG | frame | center |Figure 4. difference between joint DNNs trained with multiple data sets and the individual DNNs trained with single data sets ]]<br />
</center><br />
<br />
The authors refined their selection of DNN adjustable parameters by studying the results of the previous runs. They used the logarithmic transformation, two hidden layers with at least 250 neurons each, and ReLU as the activation function. The results are shown in Figure 5. Comparing these results with those in Figure 1 indicates that there are now 9 out of 15 data sets where DNNs outperform RF even with the “worst” parameter setting, compared with 4 out of 15 previously. The <math>R^2</math> averaged over all DNNs and all 15 data sets is 0.051 higher than that of RF.<br />
<br />
<center><br />
[[File: fig5.PNG | frame | center |Figure 5. DNN vs RF with refined parameter settings ]]<br />
</center><br />
<br />
As a conclusion of the sensitivity analysis carried out in this work, the authors recommend the following settings for the adjustable parameters of DNNs:<br />
-Logarithmic transformation of the input descriptors. <br />
<br />
-Four hidden layers, with 4000, 2000, 1000, and 1000 neurons, respectively.<br />
<br />
-Dropout rates of 0% in the input layer, 25% in each of the first three hidden layers, and 10% in the last hidden layer.<br />
<br />
-ReLU as the activation function.<br />
<br />
-No unsupervised pretraining; the network parameters should be initialized with random values.<br />
<br />
-A large number of epochs.<br />
<br />
-Learning rate of 0.05, momentum strength of 0.9, and weight cost strength of 0.0001.<br />
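As a rough illustration of the recommended network shape, the following numpy sketch wires up the forward pass at inference time. Dropout acts only during training and is therefore omitted; the 4596-descriptor input width, the Gaussian initialization scale, and the weights themselves are illustrative assumptions, not the paper's trained model.<br />

```python
# A minimal numpy sketch of the recommended DNN at inference time:
# four ReLU hidden layers of 4000, 2000, 1000, and 1000 neurons and a
# linear output for the regression target. Dropout regularizes training
# only, so it does not appear in this forward pass.
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [4596, 4000, 2000, 1000, 1000, 1]  # descriptors -> activity

# Random initialization, as recommended (no unsupervised pretraining).
weights = [rng.normal(0.0, 0.01, (m, n))
           for m, n in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def relu(x):
    return np.maximum(x, 0.0)

def predict(x):
    h = x
    for w, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ w + b)              # ReLU in every hidden layer
    return h @ weights[-1] + biases[-1]  # linear output layer

y = predict(rng.normal(size=(5, 4596)))  # 5 hypothetical molecules
```

Training such a network would then use the mini-batch SGD settings listed above (learning rate 0.05, momentum 0.9, weight cost 0.0001).<br />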
<br />
To check the consistency of DNN predictions, which was one of the authors' concerns, they compared the performance of RF with that of DNNs on 15 additional QSAR data sets. Each additional data set was time-split into training and test sets in the same way as the Kaggle data sets. Individual DNNs were trained on the training sets using the recommended parameters, and the <math>R^2</math> of the DNN and RF were calculated on the test sets. The table below presents the results for the additional data sets. The DNN with recommended parameters outperforms RF in 13 out of the 15 additional data sets. The mean <math>R^2</math> of the DNNs is 0.411, while that of the RFs is 0.361, an improvement of 13.9%.<br />
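The evaluation metric can be sketched as below. This computes <math>R^2</math> as the squared Pearson correlation between observed and predicted activities, one common convention in QSAR benchmarking; the activity values are invented for illustration.<br />

```python
# Sketch of the evaluation metric: R^2 on a held-out, time-split test
# set, taken here as the squared Pearson correlation between observed
# and predicted activities.
import numpy as np

def r_squared(y_obs, y_pred):
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    r = np.corrcoef(y_obs, y_pred)[0, 1]  # Pearson correlation
    return r * r

# Illustrative observed/predicted activities, not from the paper:
obs = [5.1, 6.3, 4.8, 7.0, 5.9, 6.1]
pred = [5.0, 6.0, 5.1, 6.8, 5.7, 6.4]
score = r_squared(obs, pred)
```

A mean of such per-data-set scores is what the DNN-vs-RF comparisons above aggregate.<br />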
<br />
<center><br />
[[File: table2.PNG | frame | center |Comparing RF with DNN trained using recommended parameter settings on 15 additional datasets]]<br />
</center><br />
<br />
== Discussion ==<br />
This paper demonstrates that DNNs can, in most cases, be used as a practical QSAR method in place of RF, which is currently the gold standard in the field of drug discovery. Although the magnitude of the improvement in the coefficient of determination over RF is small for some data sets, DNNs are better than RF on average. The paper recommends a set of values for all DNN algorithmic parameters, which are appropriate for large QSAR data sets in an industrial drug discovery environment. The authors also gave recommendations on how RF and DNN can be efficiently sped up using high-performance computing technologies. They suggest that RF can be accelerated using coarse parallelization on a cluster by giving one tree per node. In contrast, DNN can efficiently make use of the parallel computation capability of a modern GPU. <br />
<br />
== Future Works ==<br />
<br />
Contrary to the expectation that unsupervised pretraining plays a critical role in the success of DNNs, in this study it had an adverse effect on the performance of QSAR tasks; this result needs further investigation.<br />
Although the paper made some recommendations about the adjustable parameters of DNNs, there is still a need to develop an effective and efficient strategy for refining these parameters for each particular QSAR task.<br />
The results of the current paper suggest that cross-validation failed to be effective for fine-tuning the algorithmic parameters. Therefore, instead of using automatic methods for tuning DNN parameters, new approaches that can better indicate a DNN’s predictive capability on a time-split test set need to be developed.<br />
<br />
== Bibliography ==<br />
<references /></div>
<hr />
<div>== Introduction ==<br />
This abstract is a summary of the paper "Deep Neural Nets as a Method for Quantitative Structure−Activity Relationships" by Ma J. et al. <ref> Ma J, Sheridan R. et al. [ http://pubs.acs.org/doi/pdf/10.1021/ci500747n.pdf "QSAR deep nets"] Journal of Chemical Information and Modeling. 2015,55, 263-274</ref>. The paper presents the application of machine learning methods, specifically Deep Neural Networks <ref> Hinton, G. E.; Osindero, S.; Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation 2006, 18, 1527−1554</ref> and Random Forest models <ref> Breiman L. Random Forests, Machine Learning. 2001,45, 5-32</ref> in the field of pharmaceutical industry. To discover a drug, it is needed that the best combination of different chemical compounds with different molecular structure was selected in order to achieve the best biological activity. Currently the SAR (QSAR) models are routinely used for this purpose. Structure-Activity Relationship (SAR), or Quantified SAR, is an approach designed to find relationships between chemical structure and biological activity (or target property) of studied compounds. The SAR models are type of classification or regression models where the predictors consist of physio-chemical properties or theoretical molecular and the response variable could be a biological activity of the chemicals, such as concentration of a substance required to give a certain biological response. The basic idea behind these methods is that activity of molecules is reflected in their structure and same molecules have the same activity. So if we learn the activity of a set of molecules structures ( or combinations of molecules) then we can predict the activity of similar molecules. 
QSAR methods are particularly computer intensive or require the adjustment of many sensitive parameters to achieve good prediction.In this sense, the machine learning methods can be helpful and two of those methods: support vector machine (SVM) and random forest (RF) are commonly used <ref>Svetnik, V. et al.,[http://pubs.acs.org/doi/pdf/10.1021/ci034160g.pdf Random forest: a classification and regression tool for compound classification and QSAR modeling,J. Chem. Inf. Comput. Sci.<br />
2003, 43, 1947−1958 </ref>. In this paper the authors investigate the prediction performance of DNN as a QSAR method and compare it with RF performance that is somehow considered as a gold standard in this field. <br />
<br />
<br />
== Motivation ==<br />
At the first stage of drug discovery there are a huge number of candidate compounds that can be combined to produce a new drug. This process may involve a large number of compounds (>100 000) and a large number of descriptors (several thousands) that have different biological activity. Predicting all biological activities for all compounds need a lot number of experiments. The in silico discovery and using the optimization algorithms can substantially reduce the experiment work that need to be done. It was hypothesized that DNN models outperform RF models. <br />
<br />
== Methods ==<br />
In order to compare the prediction performance of methods, DNN and RF fitted to 15 data sets from a pharmaceutical company, Merck. The smallest data set has 2092 molecules with 4596 unique AP, DP descriptors. Each molecule is represented by a list of features, i.e. “descriptors” in QSAR nomenclature. The descriptors are substructure descriptors (e.g., atom pairs (AP), MACCS keys, circular fingerprints, etc.) and donor-descriptors (DP). Both descriptors are of the following form:<br />
<br />
atom type i − (distance in bonds) − atom type j<br />
<br />
Where for AP, atom type includes the element, number of nonhydrogen neighbors, and number of pi electrons. For DP, atom type is one of seven (cation, anion, neutral donor, neutral acceptor, polar, hydrophobe, and other). A separate group of 15 different data sets as Additional Data Sets were used to validate the conclusions acquired from the Kaggle data sets. Each of these data sets was split into train and test set. The metric to evaluate prediction performance of methods is coefficient of determination (<math>R^2</math>. <br />
<br />
To run a RF, 100 trees were generated with m/3 descriptors used at each branch-point, where m was the number of unique descriptors in the training set. The tree nodes with 5 or fewer molecules were not split further. The trees parallelized to run one tree per processor on a cluster to run larger data sets in a reasonable time.<br />
<br />
The DNNs with input descriptors X of a molecule and output of the form <math>O=f(\sum{i=1}{N}w_ix_i+b)</math> were fitted to data sets. Since many different parameters, such as number of layers, neurons, influence the performance of a deep neural net, Ma and his colleagues did a sensitivity analysis. They trained 71 DNNs with different parameters for each set of data. the parameters that they were considered were parameters related to: <br />
<br />
-Data (descriptor transformation: no transformation, logarithmic transformation, or binary transformation. <br />
-Network architecture: number of hidden layers, number of neurons in each hidden layer.<br />
-Activation functions: sigmoid or rectified linear unit.<br />
-The DNN training strategy: single training set or joint from multiple sets, percentage of neurons to drop-out in each layer.<br />
-The mini-batched stochastic gradient descent procedure in the BP algorithm: the minibatch size, number of epochs<br />
-Control the gradient descent optimization procedure: learning rate, momentum strength, and weight cost strength.<br />
<br />
In addition to the effect of these parameters on the DNN, the authors were interested in evaluating consistency of results for a diverse set of QSAR tasks. Due to time-consuming process of evaluating the effect of the large number of adjustable parameters, a reasonable number of parameter settings were selected by adjusting the values of one or two parameters at a time, and then calculate the <math>R^2</math> for DNNs trained with the selected parameter settings. These results allowed them to focus on a smaller number of parameters, and to finally generate a set of recommended values for all algorithmic parameters, which can lead to consistently good predictions. <br />
<br />
== Results ==<br />
<br />
For the first object of this paper that was comparing the performance of DNNs to Rf, over over 50 DNNs were trained using different parameter settings. These parameter settings were arbitrarily selected, but they attempted to cover a sufficient range of values for each adjustable parameter. Figure 1 shows the difference in <math>R^2</math> between DNNs and RF for each kaggle data set. Each column represents a QSAR data set, and each circle represents the improvement of a DNN over RF.<br />
<br />
<br />
<center><br />
[[File: fig1.PNG | frame | center |Figure 1. Overall DNN vs RF using arbitrarily selected parameter values. Each column represents a QSAR data set, and each circle represents the<br />
improvement, measured in <math>R^2</math>, of a DNN over RF ]]<br />
</center><br />
<br />
comparing the performance of different models shows that even when the worst DNN parameter setting was used for each QSAR task, the average R2 would be degraded only from 0.423 to 0.412, merely a 2.6% reduction. These results suggest that DNNs can generally outperform RF( table below).<br />
<br />
<br />
<center><br />
[[File: table1.PNG | frame | center |Table 1. comparing test <math>R^2</math> of different models ]]<br />
</center><br />
<br />
The difference in <math>R^2</math> between DNN and RF by changing the the network architecture is shown in Figure 2. In order to limit the number of different parameter combinations they fixed the number of neurons in each hidden layer. Thirty two DNNs were trained for each data set by varying number of hidden layers and number of neurons in each layer while the other key adjustable parameters were kept unchanged. It is seen that when the number of hidden layers are two, having a small number of neurons in the layers degrade the predictive capability of DNNs. It can also be seen that, given any number of hidden layers, once the number of neurons per layer is sufficiently large, increasing the number of neurons further has only a marginal benefit. In Figure 2 we can see that the neural network with only one hidden layer and 12 neurons in each layer achieved the same average predictive capability as RF . This size of neural network is indeed comparable with that of the classical neural network used in QSAR.<br />
<br />
<center><br />
[[File: fig2.PNG | frame | center |Figure 2. Impacts of Network Architecture. Each marker in the plot represents a choice of DNN network architecture. The markers share the same number of hidden layers are connected with a line. The measurement (i.e., y-axis) is the difference of the mean R2 between DNNs and RF. ]]<br />
</center><br />
<br />
To decide which activation function, Sigmoid or ReLU, performs better, at least 15 pairs of DNNs were trained for each data set. Each pair of DNNs shared the same adjustable parameter settings, except that one DNN used ReLU as the activation function, while the other used Sigmoid function. The data sets where ReLU is significantly better than Sigmoid are colored in blue, and marked at the bottom with “+”s. The difference was tested by one-sample Wilcoxon test. In contrast, the data set where Sigmoid is significantly better than ReLU is colored in black, and marked at the bottom with “−”s( Figure 3). In 53.3% (8 out of 15) data sets, ReLU is statistically significantly better than Sigmoid. Overall ReLU improves the average <math>R^2</math> over Sigmoid by 0.016. <br />
<br />
<center><br />
[[File: fig3.PNG | frame | center |Figure 3. Choice of activation functions. Each column represents a QSAR data set, and each circle represents the difference, measured in <math>R^2</math>, of a pair of<br />
DNNs trained with ReLU and Sigmoid, respectively ]]<br />
</center><br />
<br />
Figure 4 presents the difference between joint DNNs trained with multiple data sets and the individual DNNs trained with single data sets. Average over all data sets, there seems to joint DNN has a better performance rather single training. However, the size of the training sets plays a critical role on whether a joint DNN is beneficial. For the two very largest data sets (i.e., 3A4 and LOGD), the individual DNNs seem better, indicating that joint DNNs are more proper for not much large data sets. <br />
<br />
<center><br />
[[File: fig4.PNG | frame | center |Figure 4. difference between joint DNNs trained with multiple data sets and the individual DNNs trained with single data sets ]]<br />
</center><br />
<br />
The authors refine their selection of DNN adjustable parameters by studying the results of previous runs. They used the logarithmic transformation, two hidden layers, at least 250 hidden layers an activation function of ReLU. The results are shown in Figure 5. Comparison of these results with those in Figure 1 indicates that now there are 9 out of 15 data sets, whereDNNs outperforms RF even with the “worst” parameter setting, compared with 4 out of 15. The <math>R^2</math> averaged over all DNNs and all 15 data sets is 0.051 higher than that of RF.<br />
<br />
<center><br />
[[File: fig5.PNG | frame | center |Figure 5. DNN vs RF with refined parameter settings ]]<br />
</center><br />
<br />
as a conclusion for the sensitivity analysis which had been done in this work, the authors gave a recommendation on the adjustable parameters of DNNs as below:<br />
-logarithmic transformation. <br />
-four hidden layers, with number of neurons to be 4000, 2000, 1000, and 1000, respectively.<br />
-The dropout rates of 0 in the input layer, 25% in the first 3 hidden layer, and 10% in the last hidden layer.<br />
-The activation function of ReLU.<br />
-No unsupervised pretraining. The network parameters should be initialized as random values.<br />
-Large number of epochs.<br />
-Learning rate of 0.05, momentum strength of 0.9, and weight cost strength of 0.0001.<br />
<br />
To check the consistency of DNN predictions, one of the authors' concerns, they compared the performance of RF with that of DNN on 15 additional QSAR data sets. Each additional data set was time-split into training and test sets in the same way as the Kaggle data sets. Individual DNNs were trained on the training sets using the recommended parameters, and the test <math>R^2</math> values of the DNN and RF were calculated on the test sets. The table below presents the results for the additional data sets. The DNN with recommended parameters outperforms RF in 13 out of the 15 additional data sets. The mean <math>R^2</math> of the DNNs is 0.411, while that of the RFs is 0.361, an improvement of 13.9%.<br />
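The 13.9% figure is simply the relative improvement in mean <math>R^2</math>:<br />

```python
# Relative improvement of the mean R^2 of DNNs (0.411) over RFs (0.361).
r2_dnn, r2_rf = 0.411, 0.361
improvement = (r2_dnn - r2_rf) / r2_rf
print(f"{improvement:.1%}")  # 13.9%
```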
<br />
<center><br />
[[File: table2.PNG | frame | center |Comparing RF with DNN trained using recommended parameter settings on 15 additional datasets]]<br />
</center><br />
<br />
== Discussion ==<br />
This paper demonstrates that DNNs can in most cases be used as a practical QSAR method in place of RF, which is currently regarded as the gold standard in the field of drug discovery. Although the magnitude of the change in the coefficient of determination relative to RF is small for some data sets, on average DNNs are better than RF. The paper recommends a set of values for all DNN algorithmic parameters that are appropriate for large QSAR data sets in an industrial drug discovery environment. The authors also give recommendations on how RF and DNN can be efficiently sped up using high-performance computing technologies: RF can be accelerated through coarse parallelization on a cluster, with one tree per node, whereas DNN can efficiently exploit the parallel computation capability of a modern GPU. <br />
<br />
== Future Works ==<br />
<br />
Contrary to the expectation that unsupervised pretraining plays a critical role in the success of DNNs, in this study it had an adverse effect on the performance of QSAR tasks; this finding requires further investigation.<br />
Although the paper makes recommendations for the adjustable parameters of DNNs, an effective and efficient strategy for refining these parameters for each particular QSAR task still needs to be developed.<br />
The results of the current paper suggest that cross-validation is not effective for fine-tuning the algorithmic parameters. Therefore, instead of automatic methods for tuning DNN parameters, new approaches that better indicate a DNN's predictive capability on a time-split test set need to be developed.<br />
<br />
== Bibliography ==<br />
<references /></div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Neural_Nets_as_a_Method_for_Quantitative_Structure%E2%80%93Activity_Relationships&diff=26727deep Neural Nets as a Method for Quantitative Structure–Activity Relationships2015-11-21T03:15:19Z<p>Mgohari2: </p>
<hr />
<div>== Introduction ==<br />
This is a summary of the paper Deep Neural Nets as a Method for Quantitative Structure−Activity Relationships by Ma J. et al., which was published in the Journal of Chemical Information and Modeling <ref> Ma J, Sheridan R. et al. [http://pubs.acs.org/doi/pdf/10.1021/ci500747n.pdf "QSAR deep nets"] Journal of Chemical Information and Modeling. 2015, 55, 263-274</ref>. The paper presents the application of machine learning methods, specifically deep neural networks <ref> Hinton, G. E.; Osindero, S.; Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation 2006, 18, 1527−1554</ref> and random forest models <ref> Breiman L. Random Forests, Machine Learning. 2001, 45, 5-32</ref>, in the field of pharmaceutics. To discover a drug, one must screen a large number of chemical compounds with different molecular structures and select the best candidates based on their biological activity. Currently, (Q)SAR models are routinely used for this purpose. Structure-Activity Relationship (SAR) modelling is an approach designed to find relationships between the chemical structure and the biological activity (or a target property) of the studied compounds. SAR models are classification or regression models in which the predictors are physicochemical properties or theoretical molecular descriptors and the response variable is a biological activity of the chemicals, such as the concentration of a substance required to produce a certain biological response. The basic idea behind these methods is that the activity of molecules is reflected in their structure, so similar molecules have similar activity. Thus, if we learn the activities of a set of molecular structures, we can predict the activity of similar molecules.
QSAR methods that are particularly computer intensive or that require the adjustment of many sensitive parameters to achieve good predictions are less attractive. In this sense, machine learning methods can be helpful, and two of them, the support vector machine (SVM) and the random forest (RF), are commonly used <ref>Svetnik, V. et al., [http://pubs.acs.org/doi/pdf/10.1021/ci034160g.pdf Random forest: a classification and regression tool for compound classification and QSAR modeling], J. Chem. Inf. Comput. Sci. 2003, 43, 1947−1958</ref>. In this paper the authors investigate the prediction performance of DNN as a QSAR method and compare it with that of RF, which is considered a gold standard in this field. <br />
<br />
== Motivation ==<br />
At the first stage of drug discovery there is a huge number of candidate compounds that could be combined to produce a new drug. This process may involve a large number of compounds (>100,000) and a large number of descriptors (several thousand) associated with different biological activities. Predicting the biological activity of all compounds experimentally would require an enormous number of experiments. In silico discovery and the use of optimization algorithms can substantially reduce the experimental work that needs to be done. In this paper, the performance of deep neural nets and random forests in predicting the biological activity of compounds is evaluated on 30 pharmaceutical data sets, and the two approaches are compared using the coefficient of determination.<br />
<br />
== Methods ==<br />
In order to compare the prediction performance of the methods, DNN and RF models were fitted to 15 data sets from the pharmaceutical company Merck. The smallest data set has 2092 molecules with 4596 unique AP and DP descriptors. Each molecule is represented by a list of features, i.e. “descriptors” in QSAR nomenclature. The descriptors are substructure descriptors (e.g., atom pairs (AP), MACCS keys, circular fingerprints, etc.) and donor−acceptor pair descriptors (DP). Both descriptors are of the following form:<br />
<br />
atom type i − (distance in bonds) − atom type j<br />
<br />
For AP, the atom type includes the element, the number of nonhydrogen neighbors, and the number of pi electrons. For DP, the atom type is one of seven (cation, anion, neutral donor, neutral acceptor, polar, hydrophobe, and other). A separate group of 15 different data sets, labeled “Additional Data Sets”, is used to validate the conclusions obtained from the Kaggle data sets. Each data set was split into a training and a test set. The metric used to evaluate the prediction performance of the methods is the coefficient of determination (<math>R^2</math>). <br />
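As a sketch, the test-set <math>R^2</math> can be computed as the squared Pearson correlation between observed and predicted activities, one common definition in QSAR benchmarking (the paper's exact formula may differ). The data below are illustrative.<br />

```python
import numpy as np

def r_squared(y_true, y_pred):
    # Squared Pearson correlation between observed and predicted activity.
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return r ** 2

# Illustrative observed/predicted activities for five test molecules.
y_true = np.array([1.2, 0.7, 2.4, 1.9, 0.3])
y_pred = np.array([1.0, 0.9, 2.1, 2.0, 0.5])
print(r_squared(y_true, y_pred))
```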
<br />
To run an RF, 100 trees were generated with m/3 descriptors considered at each branch point, where m is the number of unique descriptors in the training set. Tree nodes with 5 or fewer molecules were not split further. The trees were parallelized, one tree per processor on a cluster, so that the larger data sets could be run in a reasonable time.<br />
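These RF settings map naturally onto scikit-learn's regressor. The sketch below is an assumed translation, since the authors used their own implementation; `min_samples_split=6` encodes "nodes with 5 or fewer molecules are not split", and the training data are synthetic stand-ins.<br />

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=100,     # 100 trees
    max_features=1 / 3,   # m/3 descriptors tried at each branch point
    min_samples_split=6,  # nodes with 5 or fewer molecules are not split
    n_jobs=-1,            # tree-level parallelism, as on the cluster
    random_state=0,
)

# Synthetic stand-in for a QSAR training set: 200 molecules, 50 descriptors.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, (200, 50)).astype(float)
y = 0.5 * X[:, 0] + rng.normal(0.0, 0.1, 200)  # synthetic "activity"
rf.fit(X, y)
print(rf.predict(X[:5]).shape)  # (5,)
```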
<br />
DNNs with input descriptors <math>x_i</math> of a molecule and output <math>O=f(\sum_{i=1}^{N}w_ix_i+b)</math> were fitted to the data sets. Since many parameters influence the performance of a deep neural net, Ma and his colleagues carried out a sensitivity analysis and trained 71 DNNs with different parameter settings for each data set. The parameters considered were related to the data (options for descriptor transformation: (1) no transformation, (2) logarithmic transformation, (3) binary transformation), to the network architecture (number of hidden layers, number of neurons in each hidden layer, and the activation function: sigmoid or rectified linear unit), and to the DNN training strategy (a single training set or a joint set from multiple data sets, and the percentage of neurons to drop out in each layer). They also considered the parameters of the mini-batched stochastic gradient descent procedure in the BP algorithm (minibatch size and number of epochs) and the parameters controlling the gradient descent optimization (learning rate, momentum strength, and weight cost strength). Besides the effect of these parameters on the DNN, the authors were interested in evaluating the stability of the results over a diverse set of QSAR tasks. Because evaluating the effect of this large number of adjustable parameters is time-consuming, a reasonable number of parameter settings was selected by adjusting the values of one or two parameters at a time, after which the <math>R^2</math> of DNNs trained with the selected settings was calculated. These results allowed them to focus on a smaller number of parameters and finally to produce a set of recommended values for all algorithmic parameters that leads to consistently good predictions. <br />
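The one-or-two-parameters-at-a-time screening can be sketched as follows; the parameter grid and the scoring stub are hypothetical placeholders for "train a DNN with these settings and measure its test <math>R^2</math>".<br />

```python
baseline = {"transform": "log", "n_hidden_layers": 2,
            "neurons_per_layer": 1000, "activation": "ReLU"}

grid = {"transform": ["none", "log", "binary"],
        "activation": ["ReLU", "Sigmoid"],
        "n_hidden_layers": [1, 2, 3, 4]}

def score(cfg):
    # Placeholder: in the real study this would train a DNN with the
    # given settings and return its test R^2.
    return 0.40 + 0.01 * (cfg["activation"] == "ReLU")

results = []
for param, values in grid.items():
    for v in values:
        cfg = dict(baseline, **{param: v})  # vary one parameter at a time
        results.append((param, v, score(cfg)))

print(len(results))  # 9 configurations screened
```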
<br />
== Results ==<br />
For the first objective of this paper, comparing the performance of DNNs to RF, over 50 DNNs were trained using different parameter settings. These parameter settings were arbitrarily selected, but they attempted to cover a sufficient range of values for each adjustable parameter. The figure below shows the difference in <math>R^2</math> between DNNs and RF for each data set. Each column represents a QSAR data set, and each circle represents the improvement of a DNN over RF.<br />
<br />
<br />
<center><br />
[[File: fig1.PNG | frame | center |Overall DNN vs RF using arbitrarily selected parameter values. Each column represents a QSAR data set, and each circle represents the<br />
improvement, measured in <math>R^2</math>, of a DNN over RF ]]<br />
</center><br />
<br />
Comparing the performance of the different models shows that even when the worst DNN parameter setting is used for each QSAR task, the average <math>R^2</math> is degraded only from 0.423 to 0.412, merely a 2.6% reduction. These results suggest that DNNs can generally outperform RF (table below).<br />
<br />
<br />
<center><br />
[[File: table1.PNG | frame | center |comparing test <math>R^2</math> of different models ]]<br />
</center><br />
<br />
The difference in <math>R^2</math> between DNN and RF as the network architecture changes is shown in Figure 2. In order to limit the number of parameter combinations, the number of neurons was fixed to be the same in each hidden layer. Thirty-two DNNs were trained for each data set by varying the number of hidden layers and the number of neurons in each layer while the other key adjustable parameters were kept unchanged. When the number of hidden layers is two, a small number of neurons per layer degrades the predictive capability of DNNs. It can also be seen that, for any number of hidden layers, once the number of neurons per layer is sufficiently large, increasing it further has only a marginal benefit. Figure 2 also shows that the neural network achieves the same average predictive capability as RF when it has only one hidden layer with 12 neurons; a network of this size is comparable with the classical neural networks used in QSAR.<br />
<br />
<center><br />
[[File: fig2.PNG | frame | center |Figure 2. Impact of network architecture. Each marker in the plot represents a choice of DNN network architecture. Markers sharing the same number of hidden layers are connected with a line. The measurement (i.e., y-axis) is the difference in mean <math>R^2</math> between DNNs and RF. ]]<br />
</center><br />
<br />
To decide between the two activation functions, Sigmoid and ReLU, at least 15 pairs of DNNs were trained for each data set. Each pair of DNNs shared the same adjustable parameter settings, except that one DNN used ReLU as the activation function while the other used the Sigmoid function. The data sets where ReLU is significantly better than Sigmoid (the difference was tested with a one-sample Wilcoxon test) are colored in blue and marked at the bottom with “+”s. In contrast, the data sets where Sigmoid is significantly better than ReLU are colored in black and marked at the bottom with “−”s (Figure 3). In 8 of the 15 data sets (53.3%), ReLU is statistically significantly better than Sigmoid. Overall, ReLU improves the average <math>R^2</math> over Sigmoid by 0.016. <br />
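The per-data-set comparison can be sketched with SciPy's one-sample Wilcoxon signed-rank test on the paired <math>R^2</math> differences; the values below are made up for illustration.<br />

```python
from scipy.stats import wilcoxon

# Paired test R^2 values for DNNs differing only in activation function.
r2_relu    = [0.42, 0.45, 0.40, 0.47, 0.44, 0.43, 0.46, 0.41]
r2_sigmoid = [0.40, 0.44, 0.38, 0.45, 0.43, 0.41, 0.42, 0.40]
diffs = [a - b for a, b in zip(r2_relu, r2_sigmoid)]

stat, p = wilcoxon(diffs)  # H0: the median paired difference is zero
print(p)
```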
<br />
<center><br />
[[File: fig3.PNG | frame | center |Figure 3. Choice of activation functions. Each column represents a QSAR data set, and each circle represents the difference, measured in <math>R^2</math>, of a pair of<br />
DNNs trained with ReLU and Sigmoid, respectively ]]<br />
</center><br />
<br />
Figure 4 presents the difference between joint DNNs trained with multiple data sets and individual DNNs trained with single data sets. Averaged over all data sets, the joint DNN appears to perform better. However, the size of the training sets plays a critical role in whether a joint DNN is beneficial: for the two largest data sets (i.e., 3A4 and LOGD), the individual DNNs seem better, indicating that joint DNNs are more suitable for data sets that are not very large. <br />
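A joint (multi-task) DNN shares its hidden layers across data sets while each data set keeps its own output layer. A minimal NumPy sketch, with illustrative sizes and task names (HERG is a hypothetical extra task here):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n_desc, n_hidden = 100, 64
tasks = ["3A4", "LOGD", "HERG"]

W_shared = rng.normal(0.0, 0.1, (n_desc, n_hidden))  # hidden layer shared by all tasks
heads = {t: rng.normal(0.0, 0.1, (n_hidden, 1)) for t in tasks}  # per-task output layers

def predict(x, task):
    h = np.maximum(x @ W_shared, 0.0)  # shared ReLU representation
    return h @ heads[task]             # task-specific activity prediction

x = rng.random((4, n_desc))            # 4 hypothetical molecules
print(predict(x, "3A4").shape)         # (4, 1)
```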
<br />
<center><br />
[[File: fig4.PNG | frame | center |Figure 4. Difference between joint DNNs trained with multiple data sets and individual DNNs trained with single data sets ]]<br />
</center><br />
<br />
The authors refined their selection of the DNN adjustable parameters by studying the previous results. They used the logarithmic transformation, two hidden layers, at least 250 neurons per hidden layer, and the ReLU activation function. The results are shown in Figure 5. Comparing these results with those in Figure 1 indicates that DNNs now outperform RF in 9 out of 15 data sets even with the “worst” parameter setting, compared with 4 out of 15 previously. The <math>R^2</math> averaged over all DNNs and all 15 data sets is 0.051 higher than that of RF.<br />
<br />
<center><br />
[[File: fig5.PNG | frame | center |Figure 5. DNN vs RF with refined parameter settings ]]<br />
</center><br />
<br />
To conclude the sensitivity analysis carried out in this work, the authors recommend the following values for the adjustable parameters of DNNs:<br />
-Logarithmic transformation of the input descriptors.<br />
-Four hidden layers, with 4000, 2000, 1000, and 1000 neurons, respectively.<br />
-Dropout rates of 0 in the input layer, 25% in the first three hidden layers, and 10% in the last hidden layer.<br />
-The ReLU activation function.<br />
-No unsupervised pretraining; the network parameters should be initialized with random values.<br />
-A large number of epochs.<br />
-Learning rate of 0.05, momentum strength of 0.9, and weight cost strength of 0.0001.<br />
<br />
To check the consistency of DNN predictions, one of the authors' concerns, they compared the performance of RF with that of DNN on 15 additional QSAR data sets arbitrarily selected from in-house data. Each additional data set was time-split into training and test sets in the same way as the Kaggle data sets. Individual DNNs were trained on the training sets using the recommended parameters, and the test <math>R^2</math> values of the DNN and RF were calculated on the test sets. The table below presents the results for the additional data sets. The DNN with recommended parameters outperforms RF in 13 out of the 15 additional data sets. The mean <math>R^2</math> of the DNNs is 0.411, while that of the RFs is 0.361, an improvement of 13.9%.<br />
<br />
<center><br />
[[File: table2.PNG | frame | center |Comparing RF with DNN trained using recommended parameter settings on 15 additional datasets]]<br />
</center><br />
<br />
== Discussion ==<br />
This paper demonstrates that DNNs can in most cases be used as a practical QSAR method in place of RF, which is currently regarded as the gold standard in the field of drug discovery. Although the magnitude of the change in the coefficient of determination relative to RF is small for some data sets, on average DNNs are better than RF. The paper recommends a set of values for all DNN algorithmic parameters that are appropriate for large QSAR data sets in an industrial drug discovery environment. The authors also give recommendations on how RF and DNN can be efficiently sped up using high-performance computing technologies: RF can be accelerated through coarse parallelization on a cluster, with one tree per node, whereas DNN can efficiently exploit the parallel computation capability of a modern GPU. <br />
<br />
Contrary to the expectation that unsupervised pretraining plays a critical role in the success of DNNs, in this study it had an adverse effect on the performance of QSAR tasks; this finding requires further investigation.<br />
Another direction for future work is to develop an effective and efficient strategy for refining the adjustable parameters of DNNs for each particular QSAR task.<br />
The results of the current paper suggest that cross-validation is not effective for fine-tuning the algorithmic parameters. Therefore, instead of automatic methods for tuning DNN parameters, new approaches that better indicate a DNN's predictive capability on a time-split test set need to be developed before the benefit of DNNs can be maximized.<br />
<br />
<br />
== Bibliography ==<br />
<references /></div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Neural_Nets_as_a_Method_for_Quantitative_Structure%E2%80%93Activity_Relationships&diff=26726deep Neural Nets as a Method for Quantitative Structure–Activity Relationships2015-11-21T02:50:35Z<p>Mgohari2: Created page with "== Introduction == This abstract is a summary of the paper Deep Neural Nets as a Method for Quantitative Structure−Activity Relationships by Ma J. et al., which published in th..."</p>
<hr />
<div></div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Table2.PNG&diff=26723File:Table2.PNG2015-11-21T02:30:54Z<p>Mgohari2: </p>
<hr />
<div></div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_using_very_large_target_vocabulary_for_neural_machine_translation&diff=26710on using very large target vocabulary for neural machine translation2015-11-20T15:33:16Z<p>Mgohari2: </p>
<hr />
<div>==Overview==<br />
<br />
This is a summary of the paper by S. Jean, K. Cho, R Memisevic, and Y. Bengio entitled "On Using Very Large Target Vocabulary for Neural Machine Translation"<br />
<ref>S. Jean, K. Cho, R Memisevic, and Y. Bengio. [http://arxiv.org/pdf/1412.2007v2.pdf "On Using Very Large Target Vocabulary for Neural Machine Translation"], 2015.</ref><br />
The paper presents an application of importance sampling to neural machine translation with a very large target vocabulary. Despite the advantages of neural networks over statistical machine translation systems such as phrase-based systems, they suffer from some technical problems. Most importantly, they are limited to a small target vocabulary because of the complexity and number of parameters that have to be trained, and the performance of current neural networks degrades rapidly as the number of unknown words in the target vocabulary increases. In this paper, Jean and his colleagues propose a training method based on importance sampling that can use a large target vocabulary without increasing training complexity. The proposed algorithm demonstrates better performance without losing efficiency in time or speed.<br />
<br />
==Methods==<br />
<br />
Recall that classic neural machine translation uses an encoder-decoder network. The encoder reads the source sentence <math>x</math> and encodes it into a sequence of hidden states <math>h</math>, where <math>h_t=f(x_t,h_{t-1})</math>. In the decoding step, another neural network generates the translation <math>y</math> based on the encoded sequence of hidden states <math>h</math>: <math>p(y_t\,|\,y_{<t},x)\propto \exp\{q(y_{t-1}, z_t, c_t)\}</math>, where <math>z_t=g(y_{t-1}, z_{t-1}, c_t)</math> and <math>c_t=r(z_{t-1}, h_1,..., h_T)</math>.<br />
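As a minimal sketch (not from the paper), the encoder recurrence <math>h_t=f(x_t,h_{t-1})</math> can be unrolled as follows, with a toy scalar cell standing in for the learned function <math>f</math>:<br />

```python
import math

def encode(xs, f):
    """Run the encoder recurrence h_t = f(x_t, h_{t-1}) over a source
    sentence and return the sequence of hidden states."""
    h, states = 0.0, []           # h_0 = 0
    for x in xs:
        h = f(x, h)               # new state depends on input and previous state
        states.append(h)
    return states

# Toy scalar 'RNN cell' with fixed weights, standing in for a learned f.
states = encode([1.0, 2.0, 3.0], lambda x, h: math.tanh(0.5 * x + 0.5 * h))
assert len(states) == 3           # one hidden state per source token
```

In a real model the states are vectors and <math>f</math> is a gated recurrent unit or LSTM cell, but the sequential structure is the same.<br />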
<br />
The objective function to be maximized is the log-likelihood of the training corpus: <br />
<math>\theta=\underset{\theta}{\operatorname{argmax}}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\log p(y_t^n\,|\,y_{<t}^n, x^n)</math><br />
<br />
where <math>(x^n, y^n)</math> is the n-th training pair of sentence, and <math>T_n</math> is the length of n-th target sentence <math>y^n</math>.<br />
The proposed model is based on a specific implementation of neural machine translation that uses an attention mechanism, as recently proposed in <ref><br />
Bahdanau et al.,[http://arxiv.org/pdf/1409.0473v6.pdf NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE], 2014<br />
</ref>.<br />
In this implementation, the encoder is a bi-directional recurrent neural network, <math>h_t=[h_t^\leftarrow; h_t^\rightarrow]</math>. The decoder, at each time step, computes the context<br />
vector <math>c_t</math> as a convex sum of the hidden states <math>(h_1,...,h_T)</math> with the coefficients <math>(\alpha_1,...,\alpha_T)</math> computed by<br />
<br />
<math>\alpha_t=\frac{\exp\{a(h_t, z_t)\}}{\sum_{k}\exp\{a(h_k, z_t)\}}</math><br />
where a is a feedforward neural network with a single hidden layer. <br />
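A toy Python illustration of this step (not from the paper): softmax the alignment scores, then form the context vector as a convex combination of the hidden states. The scores below are made-up stand-ins for the output of the network <math>a</math>:<br />

```python
import math

def attention_context(h, scores):
    """Softmax the alignment scores, then return the attention weights
    and the context vector (convex sum of the hidden states h)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]       # subtract max for stability
    z = sum(exps)
    alpha = [e / z for e in exps]                  # weights are positive, sum to 1
    dim = len(h[0])
    c = [sum(a * state[i] for a, state in zip(alpha, h)) for i in range(dim)]
    return alpha, c

h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]           # T = 3 hidden states, d = 2
alpha, c = attention_context(h, [0.1, 0.2, 0.3])   # stand-in scores a(h_k, z_t)
print(round(sum(alpha), 6))  # -> 1.0
```

Because the weights are a softmax output, the context vector always lies in the convex hull of the hidden states.<br />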
Then the probability of the next target word is <br />
<br />
<math>p(y_t\,|\,y_{<t}, x)=\frac{1}{Z} \exp\{w_t^T\phi(y_{t-1}, z_t, c_t)+b_t\}</math>, where <math>\phi</math> is an affine transformation followed by a nonlinear activation, and <math>w_t</math> and <math>b_t</math> are the target word vector and the target word bias, respectively. <math>Z</math> is the normalization constant computed by<br />
<br />
<br />
<math> Z=\sum_{k:y_k\in V}\exp\{w_k^T\phi(y_{t-1}, z_t, c_t)+b_k\}</math>, where <math>V</math> is the set of all target words. <br />
<br />
<br />
Computing the dot product between the feature <math>\phi(y_{t-1}, z_t, c_t)</math> and the word vector <math>w_k</math> for every word in the target vocabulary is computationally expensive and time consuming. <br />
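To make the cost concrete, here is a minimal sketch of the full softmax: one dot product per vocabulary word, so the work per output position grows linearly with |V|. The vectors below are illustrative stand-ins:<br />

```python
import math

def full_softmax(phi, W, b):
    """p(y_t | ...) over the whole target vocabulary V.
    W[k] is the target word vector w_k and b[k] the word bias b_k;
    one dot product per vocabulary word -> O(|V| * dim) per output word."""
    energies = [sum(wi * pi for wi, pi in zip(W[k], phi)) + b[k]
                for k in range(len(W))]            # |V| dot products
    m = max(energies)
    exps = [math.exp(e - m) for e in energies]     # stabilized exponentials
    Z = sum(exps)                                  # normalization constant
    return [e / Z for e in exps]

phi = [0.5, -1.0, 2.0]                             # stand-in feature phi(...)
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # toy |V| = 3
p = full_softmax(phi, W, [0.0, 0.0, 0.0])
assert abs(sum(p) - 1.0) < 1e-9
```

With |V| in the hundreds of thousands, this per-word loop dominates training time, which is exactly what the sampling approach below avoids.<br />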
The approach of this paper maximizes the training objective using only a subset of sampled target words instead of all likely target words. The most naïve way to select such a subset is to take the K most frequent words; however, skipping the remaining words during training defeats the purpose of using a large vocabulary, because in practice it removes many words from the target dictionary. Jean et al. instead propose using an existing word alignment model to align the source and target words in the training corpus and build a dictionary. With this dictionary, for each source sentence a target word set is constructed consisting of the K most frequent words (according to the estimated unigram probability) plus, from the dictionary, at most <math>K\prime</math> likely target words for each source word. K and <math>K\prime</math> may be chosen either to meet the computational requirements or to maximize the translation performance on the development set. <br />
To avoid the growing complexity of computing the normalization constant, the authors propose to use only a small subset <math>V\prime</math> of the target vocabulary at each update<ref><br />
Bengio and Sen´ et al, [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4443871.pdf Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model ],IEEEXplor, 2008<br />
</ref>. <br />
Consider the gradient of the conditional log-probability of <math>y_t</math>. The gradient is composed of a positive and a negative part:<br />
<br />
<br />
<math>\nabla \log p(y_t|y_{<t}, x)=\nabla \varepsilon(y_t)-\sum_{k:y_k\in V} p(y_k|y_{<t}, x)\, \nabla \varepsilon(y_k) </math><br />
where the energy <math>\varepsilon</math> is defined as <math>\varepsilon(y_j)=w_j^T\phi(y_{j-1}, z_j, c_j)+b_j</math>. The second term of the gradient is in essence the expected gradient of the energy, <math>\mathbb E_P[\nabla \varepsilon(y)]</math>, where <math>P</math> denotes <math>p(y|y_{<t}, x)</math>. <br />
The idea of the proposed approach is to approximate this expectation of the gradient by importance sampling with a small number of samples. Given a predefined proposal distribution <math>Q</math> and a set <math>V\prime</math> of samples from <math>Q</math>, the expectation is approximated by <br />
<br />
<math>\mathbb E_P[\nabla \varepsilon(y)] \approx \sum_{k:y_k\in V\prime} \frac{\omega_k}{\sum_{k\prime:y_{k\prime}\in V\prime}\omega_{k\prime}}\,\nabla\varepsilon(y_k)</math>, where <math>\omega_k=\exp\{\varepsilon(y_k)-\log Q(y_k)\}</math><br />
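A minimal numeric sketch of the normalized importance weights (with made-up energies and a uniform proposal <math>Q</math>; not code from the paper):<br />

```python
import math

def importance_weights(energies, log_q):
    """Normalized importance weights omega_k / sum_k' omega_k', with
    omega_k = exp(energy(y_k) - log Q(y_k)), over the sampled set V'."""
    logs = [e - lq for e, lq in zip(energies, log_q)]
    m = max(logs)                          # stabilize before exponentiating
    w = [math.exp(l - m) for l in logs]
    z = sum(w)
    return [wi / z for wi in w]

# Toy example: 4 sampled words, uniform proposal Q(y_k) = 1/4.
energies = [0.2, 1.5, -0.3, 0.8]
log_q = [math.log(0.25)] * 4
omega = importance_weights(energies, log_q)
# The expected gradient is then approximated by
# sum_k omega[k] * grad_energy(y_k), over the sampled words only.
assert abs(sum(omega) - 1.0) < 1e-9
```

The weights are self-normalized over <math>V\prime</math>, so the estimator never touches the full vocabulary.<br />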
<br />
In practice, the training corpus is partitioned, and a subset <math>V\prime</math> of the target vocabulary is defined for each partition prior to training. Before training begins, each target sentence in the training corpus is examined sequentially, accumulating unique target<br />
words until their number reaches the predefined threshold τ. The accumulated vocabulary is used for that partition of the corpus during training. This process is repeated until the end of the training set is reached. <br />
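The partitioning scheme just described can be sketched as follows; the corpus and the threshold τ are toy values, not from the paper:<br />

```python
def partition_corpus(sentences, tau):
    """Group consecutive target sentences into partitions, accumulating
    unique target words until the count would exceed tau; each partition's
    word set plays the role of the per-partition vocabulary V'."""
    partitions, current, vocab = [], [], set()
    for sent in sentences:
        words = set(sent.split())
        if current and len(vocab | words) > tau:
            partitions.append((current, vocab))    # close this partition
            current, vocab = [], set()
        current.append(sent)
        vocab |= words
    if current:
        partitions.append((current, vocab))        # flush the last partition
    return partitions

corpus = ["le chat dort", "le chien dort", "un train rouge part"]
parts = partition_corpus(corpus, tau=5)
assert len(parts) == 2                             # third sentence starts a new partition
```

During training, the softmax for a sentence is then computed only over its partition's vocabulary.<br />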
In this approach, the alignments between target words and source locations are obtained via the alignment model. This is useful when the model generates an ''UNK'' token: once a translation has been generated for a source sentence, each ''UNK'' may be replaced using a translation-specific technique based on the aligned source word. In their experiments, the authors replaced each ''UNK'' token with the aligned source word or its most likely translation as determined by another word alignment model.<br />
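The unknown-word replacement step can be sketched as follows; the sentences, dictionary, and alignment below are illustrative, not from the paper:<br />

```python
def replace_unk(target_tokens, alignment, source_tokens, dictionary):
    """Replace each UNK in the output with the most likely translation of
    its aligned source word, falling back to copying the source word."""
    out = []
    for i, tok in enumerate(target_tokens):
        if tok == "UNK":
            src = source_tokens[alignment[i]]      # aligned source position
            out.append(dictionary.get(src, src))   # dictionary or copy-through
        else:
            out.append(tok)
    return out

src = ["the", "port", "of", "Rotterdam"]
hyp = ["le", "UNK", "de", "UNK"]
align = {1: 1, 3: 3}                               # UNK positions -> source positions
print(replace_unk(hyp, align, src, {"port": "port"}))
# -> ['le', 'port', 'de', 'Rotterdam']
```

The copy-through fallback works well for proper nouns such as place names, which often translate to themselves.<br />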
The proposed approach was evaluated on English→French and English→German translation. The neural machine translation models were trained on the bilingual parallel corpora made available as part of WMT’14. The data sets used for English→French were Europarl v7, Common Crawl, UN, News Commentary, and Gigaword; the data sets for English→German were Europarl v7, Common Crawl, and News Commentary. <br />
The models were evaluated on the WMT’14 test set (news-test-2014), while the concatenation of news-test-2012 and news-test-2013 was used for model selection (development set). Table 1 presents data coverage with respect to the vocabulary size on the target side.<br />
<br />
==Setting==<br />
<br />
As a baseline for English→French translation, the authors used the RNNsearch model proposed by Bahdanau et al. (2014) with 30,000 source and target words; another RNNsearch model was trained for English→German with 50,000 source and target words. Using the proposed approach, a further set of RNNsearch models with much larger vocabularies of 500,000 source and target words was trained for each language pair. Different shortlist sizes were used during training: 15,000 and 30,000 for English→French, and 15,000 and 50,000 for English→German. Performance on the development set was evaluated and reported every twelve hours. For both language pairs, new models were also trained with shortlist sizes of 15,000 and 50,000 by reshuffling the data set at the beginning of each epoch. While this causes a non-negligible amount of overhead, it allows words to be contrasted with different sets of other words in each epoch. Beam search was used to generate translations given a source sentence: the authors keep a set of 12 hypotheses and normalize probabilities by the length of the candidate sentences, a choice made to maximize performance on the development set for K ∈ {15k, 30k, 50k} and K′ ∈ {10, 20}. They also test using a bilingual dictionary to accelerate decoding and to replace unknown words in translations.<br />
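The length normalization used when ranking beam-search hypotheses can be sketched as follows (the hypotheses and log-probabilities are made up); dividing by length removes the bias toward short outputs, since every extra word only lowers the raw log-probability:<br />

```python
def best_hypothesis(hypotheses):
    """Pick the beam-search hypothesis with the highest
    length-normalized log-probability: score = log p / length."""
    return max(hypotheses, key=lambda h: h["logp"] / len(h["tokens"]))

hyps = [
    {"tokens": ["yes"], "logp": -1.0},                  # score -1.0
    {"tokens": ["that", "is", "right"], "logp": -2.4},  # score -0.8, wins
]
best = best_hypothesis(hyps)
assert best["tokens"] == ["that", "is", "right"]
```

Without normalization, the one-word hypothesis would win on raw log-probability (-1.0 > -2.4) despite being a worse translation.<br />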
<br />
==Results==<br />
<br />
The results for English→French translation obtained by the models trained with very large target vocabularies, compared with previously reported results, are shown in the table below.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Method<br />
! RNNsearch<br />
! RNNsearch-LV<br />
! Google<br />
! Phrase-based SMT (Cho et al.)<br />
! Phrase-based SMT (Durrani et al)<br />
|-<br />
| BASIC NMT<br />
| 29.97 (26.58)<br />
| 32.68 (28.76)<br />
| 30.6<br />
| 33.3<br />
| 37.03<br />
|-<br />
| + Candidate List <br />
+ UNK Replace<br />
| 33.08 (29.08)<br />
| 33.36 (29.32)<br />
34.11 (29.98)<br />
| -<br />
33.1<br />
| 33.3<br />
| 37.03<br />
|- <br />
| + Reshuffle (tau=50)<br />
| -<br />
| 34.6 (30.53)<br />
| -<br />
| 33.3<br />
| 37.03<br />
|-<br />
| + Ensemble<br />
| -<br />
| 37.19 (31.98)<br />
| 37.5 <br />
| 33.3<br />
| 37.03<br />
|-<br />
|}<br />
<br />
<br />
The results for English→German translation are shown in the table below.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Method<br />
! RNNsearch<br />
! RNNsearch-LV<br />
! Phrase-based SMT <br />
|-<br />
| BASIC NMT<br />
| 16.46 (17.13)<br />
| 16.95 (17.85)<br />
| 20.67<br />
|-<br />
| + Candidate List <br />
+ UNK Replace<br />
| 18.97 (19.16)<br />
| 17.46 (18.00)<br />
18.89 (19.03)<br />
| 20.67<br />
|- <br />
| + Reshuffle (tau=50)<br />
| -<br />
| 19.4<br />
| 20.67<br />
|-<br />
| + Ensemble<br />
| -<br />
| 21.59<br />
| 20.67 <br />
|-<br />
|}<br />
<br />
It is clear that RNNsearch-LV outperforms the baseline RNNsearch. On the English→French task, RNNsearch-LV approached the performance of the previous best single neural machine translation (NMT) model even without any translation-specific techniques; with them, it outperformed that model. The performance of RNNsearch-LV is also better than that of a standard phrase-based translation system. <br />
For English→German, RNNsearch-LV outperformed the baseline before unknown word replacement, but after replacement the two systems performed similarly. A higher large-vocabulary single-model performance is achieved by reshuffling the data set. The authors were able to surpass the previously reported best translation result on this task by building an ensemble of 8 models. With τ = 15,000, RNNsearch-LV performance worsened slightly, with best BLEU scores (without reshuffling) of 33.76 for English→French and 18.59 for English→German.<br />
<br />
Decoding times for the different models are presented in the table below. While decoding from RNNsearch-LV with the full target vocabulary is slowest, the speed improves substantially if a candidate list is used for decoding each translation. <br />
{| class="wikitable"<br />
|-<br />
! Method <br />
! CPU i7-4820k<br />
! GPU GTX TITAN black<br />
|-<br />
| RNNsearch<br />
| 0.09 s<br />
| 0.02 s<br />
|-<br />
| RNNsearch-LV <br />
| 0.80 s<br />
| 0.25 s<br />
|-<br />
| RNNsearch-LV<br />
+Candidate list<br />
| 0.12 s<br />
| 0.05 s<br />
|}<br />
<br />
The influence of the target vocabulary when translating the test sentences was evaluated for English→French by using the union of a fixed set of 30,000 common words and (at most) K′ likely candidates for each source word. The performance of the system is comparable to the baseline when ''UNK'' tokens are not replaced, but there is less improvement when they are.<br />
The authors found that K is inversely correlated with t. <br />
<br />
<br />
==Conclusion==<br />
<br />
Using importance sampling, an approach was proposed for neural machine translation with a large target vocabulary, without any substantial increase in computational complexity. The BLEU scores of the proposed model show translation performance comparable to state-of-the-art translation systems on both the English→French and English→German tasks.<br />
On both translation tasks, the neural machine translation models trained using the proposed method performed as well as, or better than, models using only limited sets of target words, even when replacing unknown words.<br />
<br />
<br />
== Bibliography ==<br />
<references /></div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_using_very_large_target_vocabulary_for_neural_machine_translation&diff=26679on using very large target vocabulary for neural machine translation2015-11-20T01:07:36Z<p>Mgohari2: </p>
<hr />
<div>'''Overview'''<br />
<br />
This is a summary of the paper by S. Jean, K. Cho, R Memisevic, and Y. Bengio entitled "On Using Very Large Target Vocabulary for Neural Machine Translation"<br />
<ref>S. Jean, K. Cho, R Memisevic, and Y. Bengio. [http://arxiv.org/pdf/1412.2007v2.pdf "On Using Very Large Target Vocabulary for Neural Machine Translation"], 2015.</ref><br />
The paper presents the application of importance sampling for neural machine translation with very large target vocabulary. Despite the advantages of neural networks in translation over the statistical machine translation systems, such as the phrase-based system, they suffer from some technical problems. Most importantly they are limited to work with a small vocabulary data set because of complexity and number of parameters have to be trained. The performance of the current neural nets are rapidly decrease if the number of unidentified words in this target vocabulary increase. In this paper Jean and his colleagues proposed a method of training based on the importance sampling which can uses a large target vocabulary without increasing training complexity. The proposed algorithm demonstrate better performance without losing the efficiency in time or speed.<br />
<br />
'''Methods'''<br />
<br />
Recall that the classic neural machine learning plays as encoder-decoder network. The encoder reads the source sentence x and encode it into a sequence of hidden states of h where <math>h_t=f(x_t,h_{t-1})</math>. In the decoder step, another neural network generates the translation vector of y based on the encoded sequence of hidden states h: <math>p(y_t\,|\,y_{<t},x)\propto exp\{q(y_{t-1}, z_t, c_t)\}</math> where <math>z_t=g(y_{t-1}, z_{t-1}, c_t)</math> and <math>c_t=r(z_{t-1}, h_1,..., H_T)</math><br />
<br />
The objective function which have to be maximized represented by <br />
<math>\theta=argmax\sum_{n=1}^{N}\sum_{t=1}^{T_n}logp(y_t^n\,|\,y_{<t}^n, x^n)</math><br />
<br />
where <math>(x^n, y^n)</math> is the n-th training pair of sentence, and <math>T_n</math> is the length of n-th target sentence <math>y^n</math>.<br />
The proposed model is based on specific implementation of neural machine translation that uses an attention mechanism, as recently proposed in <ref><br />
Bahdanau et al.,[http://arxiv.org/pdf/1409.0473v6.pdf NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE], 2014<br />
</ref>.<br />
Here the encoder is implemented by a bi-directional recurrent neural network, <math>h_t=[\overleftarrow{h}_t; \overrightarrow{h}_t]</math>. The decoder, at each time step, computes the context<br />
vector <math>c_t</math> as a convex sum of the hidden states <math>(h_1, \dots, h_T)</math> with the coefficients <math>(\alpha_1, \dots, \alpha_T)</math> computed by<br />
<br />
<math>\alpha_t=\frac{\exp\{a(h_t, z_t)\}}{\sum_{k}\exp\{a(h_k, z_t)\}}</math><br />
where a is a feedforward neural network with a single hidden layer. <br />
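A minimal sketch of these attention coefficients and the resulting context vector, assuming a toy single-hidden-layer scorer <code>a</code> over the concatenated state pair (all parameter shapes are hypothetical):<br />

```python
import numpy as np

def attention_weights(H, z, v, W):
    # a(h_k, z): single-hidden-layer feedforward scorer (toy version)
    scores = np.array([v @ np.tanh(W @ np.concatenate([h_k, z])) for h_k in H])
    e = np.exp(scores - scores.max())   # numerically stable softmax
    return e / e.sum()                  # the alpha coefficients

rng = np.random.default_rng(1)
d = 3
H = rng.normal(size=(4, d))     # encoder states h_1..h_4
z = rng.normal(size=d)          # previous decoder state
W = rng.normal(size=(d, 2 * d)) # hidden layer of the scorer
v = rng.normal(size=d)          # output layer of the scorer

alpha = attention_weights(H, z, v, W)
c = alpha @ H                   # context vector: convex sum of hidden states
```

Since the weights are positive and sum to one, <code>c</code> is indeed a convex combination of the encoder states.<br />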
Then the probability of the next target word is <br />
<br />
<math>p(y_t\,|\,y_{<t}, x)=\frac{1}{Z} \exp\{w_t^T\phi(y_{t-1}, z_t, c_t)+b_t\}</math>. Here <math>\phi</math> is an affine transformation followed by a nonlinear activation, and <math>w_t</math> and <math>b_t</math> are the target word vector and the target word bias, respectively. Z is the normalization constant, computed by<br />
<br />
<br />
<math> Z=\sum_{k:y_k\in V}\exp\{w_k^T\phi(y_{t-1}, z_t, c_t)+b_k\}</math> where V is the set of all target words. <br />
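The sum over V is what makes Z expensive: one dot product per vocabulary word. A toy numpy sketch (vocabulary and feature sizes are illustrative) makes the cost, linear in |V|, explicit:<br />

```python
import numpy as np

def next_word_probs(phi, W, b):
    # p(y_t | y_<t, x): one dot product per vocabulary word, so computing
    # the normalization constant Z grows linearly with |V|
    logits = W @ phi + b        # shape (|V|,): w_k^T phi + b_k for every k
    logits = logits - logits.max()   # numerical stability
    e = np.exp(logits)
    return e / e.sum()          # divide by Z

rng = np.random.default_rng(2)
V, d = 10_000, 8                # toy vocabulary and feature sizes
W = rng.normal(size=(V, d))     # target word vectors w_k
b = rng.normal(size=V)          # target word biases b_k
phi = rng.normal(size=d)        # feature phi(y_{t-1}, z_t, c_t)

p = next_word_probs(phi, W, b)  # full distribution over the vocabulary
```

With a 500,000-word vocabulary this matrix-vector product dominates each decoding step, which motivates the sampled approximation below.<br />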
<br />
<br />
Computing the dot product between the feature <math>\phi(y_{t-1}, z_t, c_t)</math> and <math>w_k</math> for every word in the target vocabulary is computationally complex and time consuming. <br />
The approach of this paper maximizes the objective using only a subset of sampled target words, instead of all the likely target words. The most naïve way to select such a subset is to take the K most frequent words. However, skipping words during training defeats the purpose of using a large vocabulary, because in practice a large portion of the target dictionary would simply be removed. Jean et al. proposed using an existing word alignment model to align the source and target words in the training corpus and build a dictionary. With this dictionary, for each source sentence a target word set is constructed consisting of the K most frequent words (according to the estimated unigram probability) and, using the dictionary, at most <math>K\prime</math> likely target words for each source word. K and <math>K\prime</math> may be chosen either to meet the computational requirement or to maximize the translation performance on the development set. <br />
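The per-sentence candidate-list construction can be sketched as follows; the <code>dictionary</code> here is a hypothetical stand-in for the output of the external word alignment model, with translations sorted by alignment probability:<br />

```python
def candidate_targets(source_words, freq_list, dictionary, K, K_prime):
    """Per-sentence target vocabulary: the K most frequent target words
    plus at most K' dictionary translations of each source word."""
    cand = set(freq_list[:K])                 # K most frequent target words
    for w in source_words:
        cand.update(dictionary.get(w, [])[:K_prime])
    return cand

freq_list = ["the", "of", "and", "cat", "sits"]           # unigram-sorted
dictionary = {"chat": ["cat", "kitty"], "noir": ["black", "dark"]}
cand = candidate_targets(["le", "chat", "noir"], freq_list, dictionary,
                         K=3, K_prime=1)
# cand covers the frequent words plus one aligned translation per source word
```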
In order to avoid the growing complexity of computing the normalization constant, the authors proposed to use only a small subset <math>V\prime</math> of the target vocabulary at each update<ref><br />
Bengio and Senécal, [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4443871.pdf Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model], IEEE Transactions on Neural Networks, 2008<br />
</ref>. <br />
Let us consider the gradient of the conditional log-probability of the output <math>y_t</math>. The gradient is composed of a positive and a negative part:<br />
<br />
<br />
<math>\nabla \log p(y_t|y_{<t}, x)=\nabla \mathcal{E}(y_t)-\sum_{k:y_k\in V} p(y_k|y_{<t}, x)\, \nabla \mathcal{E}(y_k) </math><br />
where the energy <math>\mathcal{E}</math> is defined as <math>\mathcal{E}(y_j)=w_j^T\phi(y_{j-1}, z_j, c_j)+b_j</math>. The second term of the gradient is in essence the expected gradient of the energy, <math>\mathbb E_P[\nabla \mathcal{E}(y)]</math>, where P denotes <math>p(y|y_{<t}, x)</math>. <br />
The idea of the proposed approach is to approximate this expected gradient by importance sampling with a small number of samples. Given a predefined proposal distribution Q and a set <math>V\prime</math> of samples from Q, the expectation is approximated as <br />
<br />
<math>\mathbb E_P[\nabla \mathcal{E}(y)]\approx \sum_{k:y_k\in V\prime} \frac{w_k}{\sum_{k\prime:y_{k\prime}\in V\prime}w_{k\prime}}\nabla \mathcal{E}(y_k)</math> where <math>w_k=\exp\{\mathcal{E}(y_k)-\log Q(y_k)\}</math><br />
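A small numerical sketch of this self-normalized importance-sampling estimate; the toy energies and gradients are random, and a uniform proposal Q over the sampled subset is assumed for simplicity:<br />

```python
import numpy as np

def is_expected_grad(energies, grads, logQ):
    # importance-sampling estimate of E_P[grad E(y)] using only the sampled
    # subset V': weights w_k = exp(E(y_k) - log Q(y_k)), self-normalized
    logw = energies - logQ
    w = np.exp(logw - logw.max())   # stable; constant cancels on normalization
    w = w / w.sum()
    return w @ grads                # weighted average of the sampled gradients

rng = np.random.default_rng(3)
m, d = 6, 4                         # |V'| sampled words, gradient dimension
energies = rng.normal(size=m)       # E(y_k) for the sampled words
grads = rng.normal(size=(m, d))     # grad E(y_k)
logQ = np.full(m, -np.log(m))       # uniform proposal Q over the subset

est = is_expected_grad(energies, grads, logQ)
```

The estimate touches only |V′| words per update, instead of the full vocabulary sum in the exact gradient.<br />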
<br />
In practice, the training corpus is partitioned and a subset <math>V\prime</math> of the target vocabulary is defined for each partition prior to training. Before training begins, each target sentence in the training corpus is examined sequentially, accumulating unique target<br />
words until the number of unique target words reaches the predefined threshold τ. The accumulated vocabulary is then used for that partition of the corpus during training. This process is repeated until the end of the training set is reached. <br />
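The partitioning procedure can be sketched directly; the threshold τ and the toy corpus below are illustrative:<br />

```python
def partition_by_vocab(target_sentences, tau):
    """Walk the corpus sequentially, accumulating unique target words;
    close the current partition whenever its vocabulary reaches tau."""
    partitions, current, vocab = [], [], set()
    for sent in target_sentences:
        current.append(sent)
        vocab.update(sent)
        if len(vocab) >= tau:
            partitions.append((current, vocab))
            current, vocab = [], set()
    if current:                      # flush the final, possibly smaller block
        partitions.append((current, vocab))
    return partitions

corpus = [["a", "b"], ["b", "c"], ["d", "e"], ["a", "f"]]
parts = partition_by_vocab(corpus, tau=3)
# two partitions; each carries the vocabulary V' used as its sampled subset
```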
This approach also yields the alignments between the target words and the source locations via the alignment model. These are useful when the model generates an unknown-word (''UNK'') token. Once a translation is generated for a source sentence, each ''UNK'' may be replaced using a translation-specific technique based on the aligned source word. In the experiments, the authors replaced each ''UNK'' token with the aligned source word or its most likely translation determined by another word alignment model.<br />
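A sketch of the replacement step, assuming (hypothetically) that the alignment model yields a target-position to source-position map and that the bilingual dictionary is a plain lookup table:<br />

```python
def replace_unk(target_tokens, alignments, source_tokens, lexicon):
    """Replace each UNK in the output with the translation of its aligned
    source word, falling back to copying the source word itself."""
    out = []
    for i, tok in enumerate(target_tokens):
        if tok == "<unk>":
            src = source_tokens[alignments[i]]
            out.append(lexicon.get(src, src))   # translate, or copy verbatim
        else:
            out.append(tok)
    return out

src = ["le", "chat", "Tobermory"]
hyp = ["the", "<unk>", "<unk>"]
align = {1: 1, 2: 2}                  # target position -> source position
lexicon = {"chat": "cat"}             # bilingual dictionary entry
fixed = replace_unk(hyp, align, src, lexicon)
```

Copying the source word verbatim handles names and rare words that no dictionary entry covers.<br />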
The proposed approach was evaluated on English→French and English→German translation. The neural machine translation model was trained on the bilingual, parallel corpora made available as part of WMT’14. The data sets used for English→French were Europarl v7, Common Crawl, UN, News Commentary, and Gigaword. The data sets for English→German were Europarl v7, Common Crawl, and News Commentary. <br />
The models were evaluated on the WMT’14 test set (news-test-2014), while the concatenation of news-test-2012 and news-test-2013 was used for model selection (development set). Table 1 presents data coverage w.r.t. the vocabulary size on the target side.<br />
<br />
'''Setting'''<br />
<br />
As a baseline for English→French translation, the authors used the RNNsearch model proposed by Bahdanau et al. (2014), with 30,000 source and target words; another RNNsearch was trained for English→German translation with 50,000 source and target words. Using the proposed approach, a further set of RNNsearch models (RNNsearch-LV) with much larger vocabularies of 500,000 source and target words was trained for each language pair. Different shortlist sizes were used during training: 15,000 and 30,000 for English→French, and 15,000 and 50,000 for English→German. Performance on the development set was evaluated and reported every twelve hours. For both language pairs, new models were trained with shortlist sizes of 15,000 and 50,000 by reshuffling the dataset at the beginning of each epoch. While this causes a non-negligible amount of overhead, such a change allows words to be contrasted with different sets of other words in each epoch. Beam search was used to generate a translation given a source sentence. The authors keep a set of 12 hypotheses and normalize probabilities by the length of the candidate sentences, a setting chosen to maximize the performance on the development set, for K ∈ {15k, 30k, 50k} and K′ ∈ {10, 20}. They also test using a bilingual dictionary to accelerate decoding and to replace unknown words in translations.<br />
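The length normalization of hypothesis scores can be illustrated with a toy sketch; the hypothesis format here is an assumption, and the example is chosen so that, without dividing by length, beam search would favor the shorter hypothesis:<br />

```python
import math

def best_hypothesis(hyps):
    """Pick the beam hypothesis with the highest length-normalized
    log-probability (toy sketch of the paper's rescoring step)."""
    return max(hyps, key=lambda h: h["logp"] / len(h["tokens"]))

hyps = [
    {"tokens": ["a", "b"],      "logp": math.log(0.20)},  # higher raw logp
    {"tokens": ["a", "b", "c"], "logp": math.log(0.10)},  # higher per-token logp
]
best = best_hypothesis(hyps)
```

Without the division by length, longer candidates are penalized simply for containing more factors below one, which biases beam search toward short translations.<br />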
<br />
'''Results'''<br />
<br />
The results for English→French translation obtained by the models trained with very large target vocabularies, compared with the results of previous models, are reported in the table below.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Method<br />
! RNNsearch<br />
! RNNsearch-LV<br />
! Google<br />
! Phrase-based SMT (Cho et al.)<br />
! Phrase-based SMT (Durrani et al.)<br />
|-<br />
| Basic NMT<br />
| 29.97 (26.58)<br />
| 32.68 (28.76)<br />
| 30.6<br />
| 33.3<br />
| 37.03<br />
|-<br />
| + Candidate list + UNK replace<br />
| 33.08 (29.08)<br />
| 33.36 (29.32) / 34.11 (29.98)<br />
| 33.1<br />
| 33.3<br />
| 37.03<br />
|-<br />
| + Reshuffle (τ=50k)<br />
| -<br />
| 34.6 (30.53)<br />
| -<br />
| 33.3<br />
| 37.03<br />
|-<br />
| + Ensemble<br />
| -<br />
| 37.19 (31.98)<br />
| 37.5<br />
| 33.3<br />
| 37.03<br />
|}<br />
<br />
<br />
The results for English→German translation are shown in the table below.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Method<br />
! RNNsearch<br />
! RNNsearch-LV<br />
! Phrase-based SMT<br />
|-<br />
| Basic NMT<br />
| 16.46 (17.13)<br />
| 16.95 (17.85)<br />
| 20.67<br />
|-<br />
| + Candidate list + UNK replace<br />
| 18.97 (19.16)<br />
| 17.46 (18.00) / 18.89 (19.03)<br />
| 20.67<br />
|-<br />
| + Reshuffle (τ=50k)<br />
| -<br />
| 19.4<br />
| 20.67<br />
|-<br />
| + Ensemble<br />
| -<br />
| 21.59<br />
| 20.67<br />
|}<br />
<br />
It is clear that RNNsearch-LV outperforms the baseline RNNsearch. In the case of the English→French task, RNNsearch-LV approached the performance level of the previous best single neural machine translation (NMT) model even without any translation-specific techniques; with them, it outperformed that model. The performance of RNNsearch-LV is also better than that of a standard phrase-based translation system. <br />
For English→German, RNNsearch-LV outperformed the baseline before unknown-word replacement, but after it the two systems performed similarly. A higher large-vocabulary single-model performance is achieved by reshuffling the dataset. In this case, the authors were able to surpass the previously reported best translation result on this task by building an ensemble of 8 models. With τ = 15,000, RNNsearch-LV performance worsened slightly, with best BLEU scores, without reshuffling, of 33.76 and 18.59 for English→French and English→German, respectively.<br />
<br />
Timing information for decoding with the different models is presented in Table 3. While decoding from RNNsearch-LV with the full target vocabulary is slowest, the speed improves substantially if a candidate list is used for decoding each translation. <br />
The influence of the target vocabulary when translating the test sentences was evaluated for English→French by using the union of a fixed set of 30,000 common words and (at most) K′ likely candidates for each source word. The performance of the system is comparable to the baseline when UNKs are not replaced, but there is not as much improvement when doing so.<br />
The authors found that K is inversely correlated with t. The single-model test BLEU scores for English→French with respect to the number of dictionary entries <math>K\prime</math> allowed for each source word are presented in the figure below.<br />
<br />
<br />
'''Conclusion'''<br />
<br />
Using importance sampling, an approach was proposed for machine translation with a large target vocabulary without any substantial increase in computational complexity. The BLEU scores of the proposed model showed translation performance comparable to state-of-the-art translation systems on both the English→French task and the English→German task.<br />
On English→French and English→German translation tasks, the neural machine translation models trained using the proposed method performed as well as, or better than, those using only limited sets of target words, even when replacing unknown words.<br />
<br />
<br />
== Bibliography ==<br />
<references /></div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_using_very_large_target_vocabulary_for_neural_machine_translation&diff=26678on using very large target vocabulary for neural machine translation2015-11-20T00:41:33Z<p>Mgohari2: </p>
<hr />
<div>'''Overview'''<br />
<br />
This is a summary of the paper by S. Jean, K. Cho, R Memisevic, and Y. Bengio entitled "On Using Very Large Target Vocabulary for Neural Machine Translation". The paper presents the application of importance sampling for neural machine translation with very large target vocabulary. Despite the advantages of neural networks in translation over the statistical machine translation systems, such as the phrase-based system, they suffer from some technical problems. Most importantly they are limited to work with a small vocabulary data set because of complexity and number of parameters have to be trained. The performance of the current neural nets are rapidly decrease if the number of unidentified words in this target vocabulary increase. In this paper Jean and his colleagues proposed a method of training based on the importance sampling which can uses a large target vocabulary without increasing training complexity. The proposed algorithm demonstrate better performance without losing the efficiency in time or speed.<br />
<br />
'''Methods'''<br />
<br />
Recall that the classic neural machine learning plays as encoder-decoder network. The encoder reads the source sentence x and encode it into a sequence of hidden states of h where <math>h_t=f(x_t,h_{t-1})</math>. In the decoder step, another neural network generates the translation vector of y based on the encoded sequence of hidden states h: <math>p(y_t\,|\,y_{<t},x)\propto exp\{q(y_{t-1}, z_t, c_t)\}</math> where <math>z_t=g(y_{t-1}, z_{t-1}, c_t)</math> and <math>c_t=r(z_{t-1}, h_1,..., H_T)</math><br />
<br />
The objective function which have to be maximized represented by <br />
<math>\theta=argmax\sum_{n=1}^{N}\sum_{t=1}^{T_n}logp(y_t^n\,|\,y_{<t}^n, x^n)</math><br />
<br />
where <math>(x^n, y^n)</math> is the n-th training pair of sentence, and <math>T_n</math> is the length of n-th target sentence <math>y^n</math>.<br />
The proposed model is based on specific implementation of neural machine translation that uses an attention mechanism, as recently proposed in <ref><br />
Bahdanau et<br />
al., 2014<br />
</ref>.<br />
In that the encoder is implemented by a bi-directional recurrent neural network,<math>h_t=[h_i^\leftarrow; h_t^\rightarrow]</math>. The decoder, at each time, computes the context<br />
vector ct as a convex sum of the hidden states (h1, . . . , hT ) with the coefficients (α1, . . . , αT) computed by<br />
<br />
<math>\alpha_t=\frac{exp\{a(h_t, z_t)\}}{\sum_{k}exp\{a(h_t, z_t)\}}</math><br />
where a is a feedforward neural network with a single hidden layer. <br />
Then the probability of the next target word is <br />
<br />
<math>p(y_t\ y_{<t}, x)=\frac{1}{Z} exp\{W_t^T\phi(y_{t-1}, z_t, c_t)+b_t\}</math>. In that <math>\phi</math> is an affine transformation followed by a nonlinear activation, <math>w_t</math> and <math>b_t</math> are the target word vector and the target word bias, respectively. Z is the normalization constant computed by<br />
<br />
<br />
<math> Z=\sum_{k:y_k\in V}exp\{W_t^T\phi(y_{t-1}, z_t, c_t)+b_t</math> where V is set of all the target words. <br />
<br />
<br />
The dot product between the feature <math>\phi(y_{t-1}, z_t, c_t)</math> and <math>w_t</math> is required to be done for all words in target vocabulary that is computationally complex and time consuming. <br />
The approach of this paper uses only a subset of sampled target words as a align vector to maximize Eq (6), instead of all the likely target words. The most naïve way to select a subset of target words is selection of K most frequent words. However, This skipping words from training processes is in contrast with using a large vocabulary, because practically we removed a bunch of words from target dictionary. Jean et al., proposed using an existing word alignment model to align the source and target words in the training corpus and build a dictionary. With the dictionary, for each source sentence, we construct a target word set consisting of the K-most frequent words (according to the estimated unigram probability) and, using the dictionary, at most <math>k\prime</math> likely target words for each source word. K and <math>k\prime</math> may be chosen either to meet the computational requirement or to maximize the translation performance on the development set. <br />
In order to avoid the growing complexity of computing the normalization constant, the authors proposed to use only a small subset <math>v\prime</math> of the target vocabulary at each update<ref><br />
Bengio and Sen´ ecal, 2008<br />
</ref>. <br />
Let us consider the gradient of the log probability of the output in conditional probability of <math>y_t</math>. The gradient is composed of a positive and negative part:<br />
<br />
<br />
<math>\bigtriangledown=logp(y_t|Y_{<t}, x_t)=\bigtriangledown \mathbf\varepsilon(y_t)-\sum_{k:y_k\in V} p(y_k|y_{<t}, x) \bigtriangledown \mathbf\varepsilon(y_t) </math><br />
where the energy <math>\mathbf\varepsilon</math> is defined as <math>\mathbf\varepsilon(y_i)=W_j^T\phi(y_{j-1}, Z_j, C_j)+b_j</math>. The second term of gradiant is in essence the expected gradiant of the energy as <math>\mathbb E_P[\bigtriangledown \epsilon(y)]</math> where P denotes <math>p(y|y_{<t}, x)</math>. <br />
The idea of the proposed approach is to approximate this expectation of the gradient by importance sampling with a small number of samples. Given a predefined proposal distribution Q and a set <math>v\prime</math> of samples from Q, we approximate the expectation with <br />
<br />
<math>\mathbb E_P[\bigtriangledown \epsilon(y)]</math> where P denotes <math>p(y|y_{<t}, x)\approx \sum_{k:y_k\in V\prime} \frac{w_k}{\sum_{k\prime:y_k\prime\in V\prime}w_k\prime}\epsilon(y_k)</math> where <math>w_k=exp{\epsilon(y_k)-log Q(y_k)}</math><br />
<br />
In practice, the training corpus is partitioned and a subset <math>v\prime</math> of the target vocabulary is defined for each partition prior to training. Before training begins, each target sentence in the training corpus is sequentially examined and accumulate unique target words until the number of unique target<br />
words reaches the predefined threshold τ . The accumulated vocabulary will be used for this partition of the corpus during training. This processes is repeated until the end of the training set is reached. <br />
0<br />
In this approach the alignments between the target words and source locations via the alignment model is obtained. This is useful when the model generated an Un token. Once a translation is generated given a source sentence, each Un may be replaced using a translation-specific technique based on the aligned source word. The authors in the experiment, replaced each ''Un'' token with the aligned source word or its most likely translation determined by another word alignment model.<br />
The proposed approach was evaluated in English->French and English-German translation. The neural machine translation model was trained by the bilingual, parallel corpora made available as part of WMT’14. The data sets were used for English to French were European v7, Common Crawl, UN, News Commentary, Gigaword. The data sets for English-German were Europarl v7, Common Crawl, News Commentary. <br />
The models were evaluated on the WMT’14 test set (news-test 2014)3 , while the concatenation of news-test-2012 and news-test-2013 is used for model selection (development set). Table 1 presents data coverage w.r.t. the vocabulary size, on the target side.<br />
<br />
'''Setting'''<br />
<br />
As a baseline for English→French translation, the authors used the RNNsearch model proposed by (Bahdanau et al., 2014) with 30,000 source and target words; another RNNsearch was trained for English→German translation with 50,000 source and target words. Using the proposed approach, another set of RNNsearch models (RNNsearch-LV) with much larger vocabularies of 500,000 source and target words was trained for each language pair. Different shortlist sizes were used during training: 15,000 and 30,000 for English→French, and 15,000 and 50,000 for English→German. Performance on the development set was evaluated and reported every twelve hours. For both language pairs, new models were trained with shortlist sizes of 15,000 and 50,000 by reshuffling the dataset at the beginning of each epoch. While this causes a non-negligible amount of overhead, such a change allows words to be contrasted with different sets of other words each epoch. Beam search was used to generate a translation given a source sentence: the authors keep a set of 12 hypotheses and normalize probabilities by the length of the candidate sentences. The candidate-list parameters were chosen to maximize performance on the development set, with K ∈ {15k, 30k, 50k} and K′ ∈ {10, 20}. They also test using a bilingual dictionary to accelerate decoding and to replace unknown words in translations.<br />
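The decoding procedure described above can be illustrated with a toy beam search over a hand-written distribution (the `toy_lm` function below is a stand-in for the trained decoder, not part of the paper):<br />

```python
import math

def beam_search(step_logprobs, beam=3, max_len=4, eos=0):
    """Toy beam search with length-normalized scores. `step_logprobs` maps a
    prefix tuple to {token: log-probability} and stands in for the decoder."""
    beams, done = [((), 0.0)], []
    for _ in range(max_len):
        cand = []
        for prefix, lp in beams:
            for tok, tok_lp in step_logprobs(prefix).items():
                item = (prefix + (tok,), lp + tok_lp)
                (done if tok == eos else cand).append(item)
        if not cand:
            break
        cand.sort(key=lambda x: x[1], reverse=True)
        beams = cand[:beam]       # keep the `beam` best partial hypotheses
    done += beams                 # also consider unfinished hypotheses
    return max(done, key=lambda x: x[1] / len(x[0]))[0]

def toy_lm(prefix):
    """Hypothetical next-token distribution, not a trained model."""
    if not prefix:
        return {1: math.log(0.6), 2: math.log(0.4)}
    return {0: math.log(0.9), 2: math.log(0.1)}
```

Dividing each hypothesis score by its length, as in the summary's description, prevents the search from always preferring the shortest candidates.<br />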
<br />
'''Results'''<br />
<br />
The results for English→French translation obtained by the models trained with very large target vocabularies, compared with the results of previous models, are reported in the table below.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Method<br />
! RNNsearch<br />
! RNNsearch-LV<br />
! Google<br />
! Phrase-based SMT (Cho et al.)<br />
! Phrase-based SMT (Durrani et al.)<br />
|-<br />
| BASIC NMT<br />
| 29.97 (26.58)<br />
| 32.68 (28.76)<br />
| 30.6<br />
| 33.3<br />
| 37.03<br />
|-<br />
| + Candidate List <br />
+ UNK Replace<br />
| 33.08 (29.08)<br />
| 33.36 (29.32)<br />
34.11 (29.98)<br />
| -<br />
33.1<br />
| 33.3<br />
| 37.03<br />
|- <br />
| + Reshuffle (τ=50)<br />
| -<br />
| 34.6 (30.53)<br />
| -<br />
| 33.3<br />
| 37.03<br />
|-<br />
| + Ensemble<br />
| -<br />
| 37.19 (31.98)<br />
| 37.5 <br />
| 33.3<br />
| 37.03<br />
|-<br />
|}<br />
<br />
<br />
The results for English→German translation are shown in the table below.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Method<br />
! RNNsearch<br />
! RNNsearch-LV<br />
! Phrase-based SMT <br />
|-<br />
| BASIC NMT<br />
| 16.46 (17.13)<br />
| 16.95 (17.85)<br />
| 20.67<br />
|-<br />
| + Candidate List <br />
+ UNK Replace<br />
| 18.97 (19.16)<br />
| 17.46 (18.00)<br />
18.89 (19.03)<br />
| 20.67<br />
|- <br />
| + Reshuffle (τ=50)<br />
| -<br />
| 19.4<br />
| 20.67<br />
|-<br />
| + Ensemble<br />
| -<br />
| 21.59<br />
| 20.67 <br />
|-<br />
|}<br />
<br />
It is clear that the RNNsearch-LV outperforms the baseline RNNsearch. In the case of the English→French task, RNNsearch-LV approached the performance level of the previous best single neural machine translation (NMT) model, even without any translation-specific techniques; with these, however, the RNNsearch-LV outperformed it. The performance of the RNNsearch-LV is also better than that of a standard phrase-based translation system. <br />
For English→German, the RNNsearch-LV outperformed the baseline before unknown word replacement, but after doing so, the two systems performed similarly. A higher large-vocabulary single-model performance is achieved by reshuffling the dataset. In this case, the authors were able to surpass the previously reported best translation result on this task by building an ensemble of 8 models. With τ = 15,000, the RNNsearch-LV performance worsened a little, with best BLEU scores, without reshuffling, of 33.76 and 18.59 for English→French and English→German respectively.<br />
<br />
The timing information of decoding for the different models is presented in Table 3 of the paper. While decoding from RNNsearch-LV with the full target vocabulary is slowest, the speed improves substantially if a candidate list is used for decoding each translation. <br />
The influence of the target vocabulary when translating the test sentences by using the union of a fixed set of 30,000 common words and (at most) K′ likely candidates for each source word was evaluated for the English→French model with a shortlist size of 30,000. The performance of this system is comparable to the baseline when UNK tokens are not replaced, but there is not as much improvement when doing so.<br />
The authors found that K is inversely correlated with τ. The single-model test BLEU scores for English→French with respect to the number of dictionary entries <math>K\prime</math> allowed for each source word are presented in a figure in the original paper.<br />
<br />
<br />
'''Conclusion'''<br />
<br />
Using importance sampling, an approach was proposed for machine translation with a large target vocabulary without any substantial increase in computational complexity. The BLEU scores for the proposed model showed translation performance comparable to that of state-of-the-art translation systems on both the English→French and English→German tasks.<br />
On English→French and English→German translation tasks, the neural machine translation models trained using the proposed method performed as well as, or better than, those using only limited sets of target words, even when replacing unknown words.</div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_using_very_large_target_vocabulary_for_neural_machine_translation&diff=26665on using very large target vocabulary for neural machine translation2015-11-19T23:51:40Z<p>Mgohari2: Created page with "'''Overview''' This is a summary of the paper by S. Jean, K. Cho, R Memisevic, and Y. Bengio entitled "On Using Very Large Target Vocabulary for Neural Machine Translation". The..."</p>
<hr />
<div>'''Overview'''<br />
<br />
This is a summary of the paper by S. Jean, K. Cho, R. Memisevic, and Y. Bengio entitled "On Using Very Large Target Vocabulary for Neural Machine Translation". The paper presents an application of importance sampling to neural machine translation with a very large target vocabulary. Despite their advantages over statistical machine translation systems, such as phrase-based systems, neural translation models suffer from some technical problems. Most importantly, they are limited to working with a small target vocabulary because of the complexity and the number of parameters that have to be trained, and their performance degrades rapidly as the number of unknown words in the target vocabulary increases. In this paper, Jean and his colleagues propose a training method based on importance sampling which can use a large target vocabulary without increasing training complexity. The proposed algorithm demonstrates better performance without losing efficiency in time or speed.<br />
<br />
'''Methods'''<br />
<br />
Recall that the classic neural machine translation model acts as an encoder–decoder network. The encoder reads the source sentence x and encodes it into a sequence of hidden states h, where <math>h_t=f(x_t,h_{t-1})</math>. In the decoding step, another neural network generates the translation y based on the encoded sequence of hidden states h: <math>p(y_t\,|\,y_{<t},x)\propto \exp\{q(y_{t-1}, z_t, c_t)\}</math> where <math>z_t=g(y_{t-1}, z_{t-1}, c_t)</math> and <math>c_t=r(z_{t-1}, h_1,..., h_T)</math><br />
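The encoder recurrence <math>h_t=f(x_t,h_{t-1})</math> simply unrolls over the source sequence; a minimal sketch with an arbitrary toy transition in place of the learned RNN cell:<br />

```python
import numpy as np

def encode(xs, f, h0):
    """Unroll the recurrence h_t = f(x_t, h_{t-1}) over the source sequence."""
    hs, h = [], h0
    for x in xs:
        h = f(x, h)
        hs.append(h)
    return hs

# arbitrary toy transition in place of a learned RNN cell
f = lambda x, h: np.tanh(x + 0.5 * h)
hs = encode([np.ones(2), np.zeros(2)], f, np.zeros(2))
```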
<br />
The objective function to be maximized is the log-likelihood <br />
<math>\theta^*=\underset{\theta}{\operatorname{argmax}}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\log p(y_t^n\,|\,y_{<t}^n, x^n)</math><br />
<br />
where <math>(x^n, y^n)</math> is the n-th training pair of sentence, and <math>T_n</math> is the length of n-th target sentence <math>y^n</math>.<br />
The proposed model is based on a specific implementation of neural machine translation that uses an attention mechanism, as recently proposed in (Bahdanau et al., 2014)<ref>Bahdanau et al., 2014</ref>.<br />
In this model, the encoder is implemented by a bi-directional recurrent neural network, <math>h_t=[h_t^\leftarrow; h_t^\rightarrow]</math>, with a gated recurrent unit used for f (see, e.g., (Cho et al., 2014b)). The decoder, at each time step, computes the context vector <math>c_t</math> as a convex sum of the hidden states <math>(h_1, \ldots, h_T)</math> with the coefficients <math>(\alpha_1, \ldots, \alpha_T)</math> computed by<br />
<br />
<math>\alpha_t=\frac{\exp\{a(h_t, z_{t-1})\}}{\sum_k \exp\{a(h_k, z_{t-1})\}}</math> (5)<br />
<br />
where a is a feedforward neural network with a single hidden layer.<br />
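The attention weights are just a softmax over the scores produced by the network a; a small numpy sketch (the random projections below are only a stand-in for the learned single-hidden-layer MLP):<br />

```python
import numpy as np

def attention_weights(h, z_prev, a):
    """alpha_t = softmax over source positions of the scores a(h_t, z_{t-1})."""
    scores = np.array([a(ht, z_prev) for ht in h])
    scores = scores - scores.max()    # numerical stability
    w = np.exp(scores)
    return w / w.sum()

# random projections standing in for the learned single-hidden-layer MLP `a`
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 6))
v = rng.normal(size=4)

def score(ht, z):
    return v @ np.tanh(W @ np.concatenate([ht, z]))

h = [rng.normal(size=3) for _ in range(5)]   # five encoder hidden states
alpha = attention_weights(h, rng.normal(size=3), score)
```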
<br />
<br />
Computing the dot product between the feature <math>\phi(y_{t-1}, z_t, c_t)</math> and the output weight vector <math>w_t</math> for every word in the target vocabulary is computationally complex and time consuming. <br />
The approach of this paper uses only a subset of sampled target words, instead of all the likely target words, when maximizing the objective (Eq. (6) in the paper). The most naive way to select such a subset is to take the K most frequent words. However, skipping the remaining words during training defeats the purpose of using a large vocabulary, because in practice a large portion of the target dictionary would be removed. Instead, Jean et al. propose using an existing word alignment model to align the source and target words in the training corpus and build a dictionary. With the dictionary, for each source sentence, a target word set is constructed consisting of the K most frequent words (according to the estimated unigram probability) and, using the dictionary, at most K′ likely target words for each source word. K and K′ may be chosen either to meet the computational requirements or to maximize the translation performance on the development set. <br />
In order to avoid the growing complexity of computing the normalization constant, the authors propose to use only a small subset <math>V\prime</math> of the target vocabulary at each update (Bengio and Senécal, 2008). <br />
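The per-sentence candidate vocabulary construction can be sketched as follows (a hypothetical helper; `freq_words` is assumed sorted by unigram frequency, and `dictionary` lists alignment-based translations per source word):<br />

```python
def candidate_targets(source, freq_words, dictionary, K, Kp):
    """Candidate target set for one source sentence: the K most frequent
    target words plus up to Kp dictionary translations per source word."""
    cand = set(freq_words[:K])
    for w in source.split():
        cand.update(dictionary.get(w, [])[:Kp])
    return cand

cand = candidate_targets("the cat", ["le", "la", "et"],
                         {"cat": ["chat", "chatte"]}, K=2, Kp=1)
```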
Let us consider the gradient of the log-probability of the output word <math>y_t</math>. The gradient is composed of a positive part and a negative part:<br />
<br />
<br />
<math>\nabla \log p(y_t|y_{<t}, x)=\nabla \epsilon(y_t)-\sum_{k:y_k\in V} p(y_k|y_{<t}, x) \nabla \epsilon(y_k) </math><br />
where the energy <math>\epsilon</math> is defined as <math>\epsilon(y_j)=w_j^T\phi(y_{j-1}, z_j, c_j)+b_j</math>. The second term of the gradient is in essence the expected gradient of the energy, <math>\mathbb E_P[\nabla \epsilon(y)]</math>, where P denotes <math>p(y|y_{<t}, x)</math>. <br />
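As a sanity check, this decomposition can be verified numerically in the simple case where the gradient is taken with respect to the energies themselves, so that <math>\nabla \epsilon(y_k)</math> is the k-th standard basis vector:<br />

```python
import numpy as np

energies = np.array([1.0, -0.5, 0.3, 2.0])
t = 2  # index of the target word y_t

p = np.exp(energies - energies.max())
p = p / p.sum()

# analytic gradient of log p(y_t | .) w.r.t. the energies:
# positive part (one-hot at t) minus the expected part (softmax probabilities)
grad = -p.copy()
grad[t] += 1.0

# finite-difference check of the same gradient
eps = 1e-6
num = np.zeros_like(energies)
for k in range(len(energies)):
    e2 = energies.copy()
    e2[k] += eps
    q = np.exp(e2 - e2.max())
    q = q / q.sum()
    num[k] = (np.log(q[t]) - np.log(p[t])) / eps
```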
<br />
'''Conclusion'''<br />
<br />
Using importance sampling, an approach was proposed for machine translation with a large target vocabulary without any substantial increase in computational complexity. The BLEU scores for the proposed model showed translation performance comparable to that of state-of-the-art translation systems on both the English→French and English→German tasks.<br />
On English→French and English→German translation tasks, the neural machine translation models trained using the proposed method performed as well as, or better than, those using only limited sets of target words, even when replacing unknown words.</div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=f15Stat946PaperSignUp&diff=26511f15Stat946PaperSignUp2015-11-18T20:14:04Z<p>Mgohari2: </p>
<hr />
<div> <br />
=[https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/listofpapers1.pdf List of Papers]=<br />
<br />
= Record your contributions [https://docs.google.com/spreadsheets/d/1A_0ej3S6ns3bBMwWLS4pwA6zDLz_0Ivwujj-d1Gr9eo/edit?usp=sharing here:]=<br />
<br />
Use the following notations:<br />
<br />
S: You have written a summary on the paper<br />
<br />
T: You had technical contribution on a paper (excluding the paper that you present from set A or critique from set B)<br />
<br />
E: You had editorial contribution on a paper (excluding the paper that you present from set A or critique from set B)<br />
<br />
[http://goo.gl/forms/RASFRZXoxJ Your feedback on presentations]<br />
<br />
<br />
=Set A=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="400pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Oct 16 || Pascal Poupart || || Guest Lecturer||||<br />
|-<br />
|Oct 16 || Pascal Poupart || ||Guest Lecturer ||||<br />
|-<br />
|Oct 23 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 23 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 23 ||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Oct 23 || Deepak Rishi || || Parsing natural scenes and natural language with recursive neural networks || [http://www-nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf Paper] || [[Parsing natural scenes and natural language with recursive neural networks | Summary]]<br />
|-<br />
|Oct 30 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 30 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 30 ||Rui Qiao || ||Going deeper with convolutions || [http://arxiv.org/pdf/1409.4842v1.pdf Paper]|| [[GoingDeeperWithConvolutions|Summary]]<br />
|-<br />
|Oct 30 ||Amirreza Lashkari|| 21 ||Overfeat: integrated recognition, localization and detection using convolutional networks. || [http://arxiv.org/pdf/1312.6229v4.pdf Paper]|| [[Overfeat: integrated recognition, localization and detection using convolutional networks|Summary]]<br />
|-<br />
|Makeup Class (TBA) || Peter Blouw|| ||Memory Networks.|| [http://arxiv.org/abs/1410.3916 Paper]|| [[Memory Networks|Summary]]<br />
|-<br />
|Nov 6 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Nov 6 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Nov 6 || Anthony Caterini ||56 || Human-level control through deep reinforcement learning ||[http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf Paper]|| [[Human-level control through deep reinforcement learning|Summary]]<br />
|-<br />
|Nov 6 || Sean Aubin || ||Learning Hierarchical Features for Scene Labeling ||[http://yann.lecun.com/exdb/publis/pdf/farabet-pami-13.pdf Paper]||[[Learning Hierarchical Features for Scene Labeling|Summary]]<br />
|-<br />
|Nov 13|| Mike Hynes || 12 ||Speech recognition with deep recurrent neural networks || [http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf Paper] || [[Graves et al., Speech recognition with deep recurrent neural networks|Summary]]<br />
|-<br />
|Nov 13 || Tim Tse || || Question Answering with Subgraph Embeddings || [http://arxiv.org/pdf/1406.3676v3.pdf Paper] || [[Question Answering with Subgraph Embeddings | Summary ]]<br />
|-<br />
|Nov 13 || Maysum Panju || ||Neural machine translation by jointly learning to align and translate ||[http://arxiv.org/pdf/1409.0473v6.pdf Paper] || [[Neural Machine Translation: Jointly Learning to Align and Translate|Summary]]<br />
|-<br />
|Nov 13 || Abdullah Rashwan || || Deep neural networks for acoustic modeling in speech recognition. ||[http://research.microsoft.com/pubs/171498/HintonDengYuEtAl-SPM2012.pdf paper]|| [[Deep neural networks for acoustic modeling in speech recognition| Summary]]<br />
|-<br />
|Nov 20 || Valerie Platsko || ||Natural language processing (almost) from scratch. ||[http://arxiv.org/pdf/1103.0398.pdf Paper]|| [[Natural language processing (almost) from scratch. | Summary]]<br />
|-<br />
|Nov 20 || Brent Komer || ||Show, Attend and Tell: Neural Image Caption Generation with Visual Attention || [http://arxiv.org/pdf/1502.03044v2.pdf Paper]||[[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention|Summary]]<br />
|-<br />
|Nov 20 || Luyao Ruan || || Dropout: A Simple Way to Prevent Neural Networks from Overfitting || [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf Paper]|| [[dropout | Summary]]<br />
|-<br />
|Nov 20 || Ali Mahdipour || || The human splicing code reveals new insights into the genetic determinants of disease ||[https://www.sciencemag.org/content/347/6218/1254806.full.pdf Paper] || [[Genetics | Summary]]<br />
|-<br />
|Nov 27 ||Mahmood Gohari || ||Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships ||[http://pubs.acs.org/doi/pdf/10.1021/ci500747n paper]||<br />
|-<br />
|Nov 27 || Derek Latremouille || ||Learning Fast Approximations of Sparse Coding || [http://yann.lecun.com/exdb/publis/pdf/gregor-icml-10.pdf Paper] ||<br />
|-<br />
|Nov 27 ||Xinran Liu || ||ImageNet Classification with Deep Convolutional Neural Networks ||[http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Paper]||[[ImageNet Classification with Deep Convolutional Neural Networks|Summary]]<br />
|-<br />
|Nov 27 ||Ali Sarhadi|| ||Strategies for Training Large Scale Neural Network Language Models||||<br />
|-<br />
|Dec 4 || Chris Choi || || On the difficulty of training recurrent neural networks || [http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf Paper] || [[On the difficulty of training recurrent neural networks | Summary]]<br />
|-<br />
|Dec 4 || Fatemeh Karimi || ||Multiple Object Recognition with Visual Attention||[http://arxiv.org/pdf/1412.7755v2.pdf Paper]||<br />
|-<br />
|Dec 4 || Jan Gosmann || || On the Number of Linear Regions of Deep Neural Networks || [http://arxiv.org/abs/1402.1869 Paper] || [[On the Number of Linear Regions of Deep Neural Networks | Summary]]<br />
|-<br />
|Dec 4 || Dylan Drover || 54 || Semi-supervised Learning with Deep Generative Models || [http://papers.nips.cc/paper/5352-semi-supervised-learning-with-deep-generative-models.pdf Paper] || [[Semi-supervised Learning with Deep Generative Models | Summary]]<br />
|-<br />
|}<br />
|}<br />
<br />
=Set B=<br />
<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="400pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Anthony Caterini ||15 ||The Manifold Tangent Classifier ||[http://papers.nips.cc/paper/4409-the-manifold-tangent-classifier.pdf Paper]|| [[The Manifold Tangent Classifier|Summary]]<br />
|-<br />
|Jan Gosmann || || Neural Turing machines || [http://arxiv.org/abs/1410.5401 Paper] || [[Neural Turing Machines|Summary]]<br />
|-<br />
|Brent Komer || || Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers || [http://arxiv.org/pdf/1202.2160v2.pdf Paper] || [[Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers Machines|Summary]]<br />
|-<br />
|Sean Aubin || || Deep Sparse Rectifier Neural Networks || [http://jmlr.csail.mit.edu/proceedings/papers/v15/glorot11a/glorot11a.pdf Paper] || [[Deep Sparse Rectifier Neural Networks|Summary]]<br />
|-<br />
|Peter Blouw|| || Generating text with recurrent neural networks || [http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf Paper] || [[Generating text with recurrent neural networks|Summary]]<br />
|-<br />
|Tim Tse|| || From Machine Learning to Machine Reasoning || [http://research.microsoft.com/pubs/206768/mlj-2013.pdf Paper] || [[From Machine Learning to Machine Reasoning | Summary ]]<br />
|-<br />
|Rui Qiao|| || Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation || [http://arxiv.org/pdf/1406.1078v3.pdf Paper] || [[Learning Phrase Representations|Summary]]<br />
|-<br />
|Fatemeh Karimi|| 23 || Very Deep Convolutional Networks for Large-Scale Image Recognition || [http://arxiv.org/pdf/1409.1556.pdf Paper] || [[Very Deep Convoloutional Networks for Large-Scale Image Recognition|Summary]]<br />
|-<br />
|Amirreza Lashkari|| 43 || Distributed Representations of Words and Phrases and their Compositionality || [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf Paper] || [[Distributed Representations of Words and Phrases and their Compositionality|Summary]]<br />
|-<br />
|Xinran Liu|| 19 || Joint training of a convolutional network and a graphical model for human pose estimation || [http://papers.nips.cc/paper/5573-joint-training-of-a-convolutional-network-and-a-graphical-model-for-human-pose-estimation.pdf Paper] || [[Joint training of a convolutional network and a graphical model for human pose estimation|Summary]]<br />
|-<br />
|Chris Choi|| || Learning Long-Range Vision for Autonomous Off-Road Driving || [http://yann.lecun.com/exdb/publis/pdf/hadsell-jfr-09.pdf Paper] || [[Learning Long-Range Vision for Autonomous Off-Road Driving|Summary]]<br />
|-<br />
|Luyao Ruan|| || Deep Learning of the tissue-regulated splicing code || [http://bioinformatics.oxfordjournals.org/content/30/12/i121.full.pdf+html Paper] || [[Deep Learning of the tissue-regulated splicing code| Summary]]<br />
|-<br />
|Abdullah Rashwan|| || Deep Convolutional Neural Networks For LVCSR || [http://www.cs.toronto.edu/~asamir/papers/icassp13_cnn.pdf paper] || [[Deep Convolutional Neural Networks For LVCSR| Summary]]<br />
|-<br />
|Mahmood Gohari||37 || On using very large target vocabulary for neural machine translation || [http://arxiv.org/pdf/1412.2007v2.pdf paper] || [[On using very large target vocabulary for neural machine translation| Summary]]<br />
|-<br />
|Valerie Platsko|| || Learning Convolutional Feature Hierarchies for Visual Recognition || [http://papers.nips.cc/paper/4133-learning-convolutional-feature-hierarchies-for-visual-recognition Paper] || [[Learning Convolutional Feature Hierarchies for Visual Recognition | Summary]]<br />
|-<br />
|Derek Latremouille|| || The Wake-Sleep Algorithm for Unsupervised Neural Networks || [http://www.gatsby.ucl.ac.uk/~dayan/papers/hdfn95.pdf Paper] || [[The Wake-Sleep Algorithm for Unsupervised Neural Networks | Summary]]<br />
|-<br />
|Ri Wang|| || Continuous space language models || [https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenSemester2_2009_10/sdarticle.pdf Paper] || [[Continuous space language models | Summary]]<br />
|-<br />
|Deepak Rishi|| || Extracting and Composing Robust Features with Denoising Autoencoders || [http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf Paper] || [[Extracting and Composing Robust Features with Denoising Autoencoders | Summary]]<br />
|-<br />
|Maysum Panju|| || A fast learning algorithm for deep belief nets || [https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf Paper] || [[A fast learning algorithm for deep belief nets | Summary]]<br />
|-<br />
|Dylan Drover|| 53 || Deep Generative Stochastic Networks Trainable by Backprop || [http://jmlr.org/proceedings/papers/v32/bengio14.pdf Paper] || [[Deep Generative Stochastic Networks Trainable by Backprop| Summary]]</div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=f15Stat946PaperSignUp&diff=26510f15Stat946PaperSignUp2015-11-18T20:13:10Z<p>Mgohari2: </p>
<hr />
<div> <br />
=[https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/listofpapers1.pdf List of Papers]=<br />
<br />
= Record your contributions [https://docs.google.com/spreadsheets/d/1A_0ej3S6ns3bBMwWLS4pwA6zDLz_0Ivwujj-d1Gr9eo/edit?usp=sharing here:]=<br />
<br />
Use the following notations:<br />
<br />
S: You have written a summary on the paper<br />
<br />
T: You had technical contribution on a paper (excluding the paper that you present from set A or critique from set B)<br />
<br />
E: You had editorial contribution on a paper (excluding the paper that you present from set A or critique from set B)<br />
<br />
[http://goo.gl/forms/RASFRZXoxJ Your feedback on presentations]<br />
<br />
<br />
=Set A=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="400pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Oct 16 || Pascal Poupart || || Guest Lecturer||||<br />
|-<br />
|Oct 16 || Pascal Poupart || ||Guest Lecturer ||||<br />
|-<br />
|Oct 23 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 23 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 23 ||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Oct 23 || Deepak Rishi || || Parsing natural scenes and natural language with recursive neural networks || [http://www-nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf Paper] || [[Parsing natural scenes and natural language with recursive neural networks | Summary]]<br />
|-<br />
|Oct 30 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 30 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 30 ||Rui Qiao || ||Going deeper with convolutions || [http://arxiv.org/pdf/1409.4842v1.pdf Paper]|| [[GoingDeeperWithConvolutions|Summary]]<br />
|-<br />
|Oct 30 ||Amirreza Lashkari|| 21 ||Overfeat: integrated recognition, localization and detection using convolutional networks. || [http://arxiv.org/pdf/1312.6229v4.pdf Paper]|| [[Overfeat: integrated recognition, localization and detection using convolutional networks|Summary]]<br />
|-<br />
|Makeup Class (TBA) || Peter Blouw|| ||Memory Networks.|| [http://arxiv.org/abs/1410.3916 Paper]|| [[Memory Networks|Summary]]<br />
|-<br />
|Nov 6 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Nov 6 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Nov 6 || Anthony Caterini ||56 || Human-level control through deep reinforcement learning ||[http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf Paper]|| [[Human-level control through deep reinforcement learning|Summary]]<br />
|-<br />
|Nov 6 || Sean Aubin || ||Learning Hierarchical Features for Scene Labeling ||[http://yann.lecun.com/exdb/publis/pdf/farabet-pami-13.pdf Paper]||[[Learning Hierarchical Features for Scene Labeling|Summary]]<br />
|-<br />
|Nov 13|| Mike Hynes || 12 ||Speech recognition with deep recurrent neural networks || [http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf Paper] || [[Graves et al., Speech recognition with deep recurrent neural networks|Summary]]<br />
|-<br />
|Nov 13 || Tim Tse || || Question Answering with Subgraph Embeddings || [http://arxiv.org/pdf/1406.3676v3.pdf Paper] || [[Question Answering with Subgraph Embeddings | Summary ]]<br />
|-<br />
|Nov 13 || Maysum Panju || ||Neural machine translation by jointly learning to align and translate ||[http://arxiv.org/pdf/1409.0473v6.pdf Paper] || [[Neural Machine Translation: Jointly Learning to Align and Translate|Summary]]<br />
|-<br />
|Nov 13 || Abdullah Rashwan || || Deep neural networks for acoustic modeling in speech recognition. ||[http://research.microsoft.com/pubs/171498/HintonDengYuEtAl-SPM2012.pdf paper]|| [[Deep neural networks for acoustic modeling in speech recognition| Summary]]<br />
|-<br />
|Nov 20 || Valerie Platsko || ||Natural language processing (almost) from scratch. ||[http://arxiv.org/pdf/1103.0398.pdf Paper]|| [[Natural language processing (almost) from scratch. | Summary]]<br />
|-<br />
|Nov 20 || Brent Komer || ||Show, Attend and Tell: Neural Image Caption Generation with Visual Attention || [http://arxiv.org/pdf/1502.03044v2.pdf Paper]||[[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention|Summary]]<br />
|-<br />
|Nov 20 || Luyao Ruan || || Dropout: A Simple Way to Prevent Neural Networks from Overfitting || [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf Paper]|| [[dropout | Summary]]<br />
|-<br />
|Nov 20 || Ali Mahdipour || || The human splicing code reveals new insights into the genetic determinants of disease ||[https://www.sciencemag.org/content/347/6218/1254806.full.pdf Paper] || [[Genetics | Summary]]<br />
|-<br />
|Nov 27 ||Mahmood Gohari || ||Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships ||[http://pubs.acs.org/doi/pdf/10.1021/ci500747n Paper]||<br />
|-<br />
|Nov 27 || Derek Latremouille || ||Learning Fast Approximations of Sparse Coding || [http://yann.lecun.com/exdb/publis/pdf/gregor-icml-10.pdf Paper] ||<br />
|-<br />
|Nov 27 ||Xinran Liu || ||ImageNet Classification with Deep Convolutional Neural Networks ||[http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Paper]||[[ImageNet Classification with Deep Convolutional Neural Networks|Summary]]<br />
|-<br />
|Nov 27 ||Ali Sarhadi|| ||Strategies for Training Large Scale Neural Network Language Models||||<br />
|-<br />
|Dec 4 || Chris Choi || || On the difficulty of training recurrent neural networks || [http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf Paper] || [[On the difficulty of training recurrent neural networks | Summary]]<br />
|-<br />
|Dec 4 || Fatemeh Karimi || ||Multiple Object Recognition with Visual Attention||[http://arxiv.org/pdf/1412.7755v2.pdf Paper]||<br />
|-<br />
|Dec 4 || Jan Gosmann || || On the Number of Linear Regions of Deep Neural Networks || [http://arxiv.org/abs/1402.1869 Paper] || [[On the Number of Linear Regions of Deep Neural Networks | Summary]]<br />
|-<br />
|Dec 4 || Dylan Drover || 54 || Semi-supervised Learning with Deep Generative Models || [http://papers.nips.cc/paper/5352-semi-supervised-learning-with-deep-generative-models.pdf Paper] || [[Semi-supervised Learning with Deep Generative Models | Summary]]<br />
|-<br />
|}<br />
|}<br />
<br />
=Set B=<br />
<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="400pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Anthony Caterini ||15 ||The Manifold Tangent Classifier ||[http://papers.nips.cc/paper/4409-the-manifold-tangent-classifier.pdf Paper]|| [[The Manifold Tangent Classifier|Summary]]<br />
|-<br />
|Jan Gosmann || || Neural Turing machines || [http://arxiv.org/abs/1410.5401 Paper] || [[Neural Turing Machines|Summary]]<br />
|-<br />
|Brent Komer || || Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers || [http://arxiv.org/pdf/1202.2160v2.pdf Paper] || [[Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers Machines|Summary]]<br />
|-<br />
|Sean Aubin || || Deep Sparse Rectifier Neural Networks || [http://jmlr.csail.mit.edu/proceedings/papers/v15/glorot11a/glorot11a.pdf Paper] || [[Deep Sparse Rectifier Neural Networks|Summary]]<br />
|-<br />
|Peter Blouw|| || Generating text with recurrent neural networks || [http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf Paper] || [[Generating text with recurrent neural networks|Summary]]<br />
|-<br />
|Tim Tse|| || From Machine Learning to Machine Reasoning || [http://research.microsoft.com/pubs/206768/mlj-2013.pdf Paper] || [[From Machine Learning to Machine Reasoning | Summary ]]<br />
|-<br />
|Rui Qiao|| || Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation || [http://arxiv.org/pdf/1406.1078v3.pdf Paper] || [[Learning Phrase Representations|Summary]]<br />
|-<br />
|Fatemeh Karimi|| 23 || Very Deep Convolutional Networks for Large-Scale Image Recognition || [http://arxiv.org/pdf/1409.1556.pdf Paper] || [[Very Deep Convoloutional Networks for Large-Scale Image Recognition|Summary]]<br />
|-<br />
|Amirreza Lashkari|| 43 || Distributed Representations of Words and Phrases and their Compositionality || [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf Paper] || [[Distributed Representations of Words and Phrases and their Compositionality|Summary]]<br />
|-<br />
|Xinran Liu|| 19 || Joint training of a convolutional network and a graphical model for human pose estimation || [http://papers.nips.cc/paper/5573-joint-training-of-a-convolutional-network-and-a-graphical-model-for-human-pose-estimation.pdf Paper] || [[Joint training of a convolutional network and a graphical model for human pose estimation|Summary]]<br />
|-<br />
|Chris Choi|| || Learning Long-Range Vision for Autonomous Off-Road Driving || [http://yann.lecun.com/exdb/publis/pdf/hadsell-jfr-09.pdf Paper] || [[Learning Long-Range Vision for Autonomous Off-Road Driving|Summary]]<br />
|-<br />
|Luyao Ruan|| || Deep Learning of the tissue-regulated splicing code || [http://bioinformatics.oxfordjournals.org/content/30/12/i121.full.pdf+html Paper] || [[Deep Learning of the tissue-regulated splicing code| Summary]]<br />
|-<br />
|Abdullah Rashwan|| || Deep Convolutional Neural Networks For LVCSR || [http://www.cs.toronto.edu/~asamir/papers/icassp13_cnn.pdf paper] || [[Deep Convolutional Neural Networks For LVCSR| Summary]]<br />
|-<br />
|Mahmood Gohari||37 || On using very large target vocabulary for neural machine translation || [http://arxiv.org/pdf/1412.2007v2.pdf paper] || [[On using very large target vocabulary for neural machine translation| Summary]]<br />
|-<br />
|Valerie Platsko|| || Learning Convolutional Feature Hierarchies for Visual Recognition || [http://papers.nips.cc/paper/4133-learning-convolutional-feature-hierarchies-for-visual-recognition Paper] || [[Learning Convolutional Feature Hierarchies for Visual Recognition | Summary]]<br />
|-<br />
|Derek Latremouille|| || The Wake-Sleep Algorithm for Unsupervised Neural Networks || [http://www.gatsby.ucl.ac.uk/~dayan/papers/hdfn95.pdf Paper] || [[The Wake-Sleep Algorithm for Unsupervised Neural Networks | Summary]]<br />
|-<br />
|Ri Wang|| || Continuous space language models || [https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenSemester2_2009_10/sdarticle.pdf Paper] || [[Continuous space language models | Summary]]<br />
|-<br />
|Deepak Rishi|| || Extracting and Composing Robust Features with Denoising Autoencoders || [http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf Paper] || [[Extracting and Composing Robust Features with Denoising Autoencoders | Summary]]<br />
|-<br />
|Maysum Panju|| || A fast learning algorithm for deep belief nets || [https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf Paper] || [[A fast learning algorithm for deep belief nets | Summary]]<br />
|-<br />
|Dylan Drover|| 53 || Deep Generative Stochastic Networks Trainable by Backprop || [http://jmlr.org/proceedings/papers/v32/bengio14.pdf Paper] || [[Deep Generative Stochastic Networks Trainable by Backprop| Summary]]</div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=genetics&diff=26509genetics2015-11-18T20:08:19Z<p>Mgohari2: </p>
<hr />
<div>'''<br />
== Genetic Application of Deep Learning ==<br />
'''<br />
This paper presentation is based on the paper [Hui Y. Xiong ''et al'', Science '''347''', 2015], which reveals the importance of deep learning methods in the genetic study of disease and shows how different machine-learning approaches enable a precise annotation mechanism. These techniques have been applied to a wide variety of diseases, including different cancers, and have led to important achievements in understanding mutation-driven splicing. To reach this goal, various intronic and exonic disease mutations were taken into account to detect the effects of mutation variants. This procedure should help in the prognosis, diagnosis, and/or control of a wide variety of diseases. <br />
<br />
'''<br />
== Introduction ==<br />
'''<br />
Whole-genome sequencing has long been used to detect the genetic sources of disease and unwanted malignancies. The idea is to find a hierarchy of mutations leading to such diseases by looking for alterations, via genetic variations, in the genome, particularly when they occur outside the domains in which protein coding happens. The present paper gives a computational method to detect genetic variants that influence RNA splicing. RNA splicing is a modification of pre-messenger RNA (pre-mRNA) in which introns are removed and exons are joined. Any interruption of this important step of gene expression can lead to various kinds of disease, such as cancers and neurological disorders.<br />
<br />
[[File:Stat1.jpg]]<br />
<br />
'''<br />
<br />
== Rationale ==<br />
'''<br />
<br />
A deep learning algorithm is used to construct a computational model that takes DNA sequences as inputs and predicts splicing in human tissues. The model can then evaluate variants up to 300 nucleotides into an intron and derive a score for how much a variant alters splicing.<br />
<br />
[[File:Stat3.jpg]]<br />
<br />
<br />
<br />
'''<br />
== Materials and Methods ==<br />
'''<br />
<br />
The human splicing regulatory model is learned by a Bayesian machine learning method. 10,698 cassette exons were used as training cases. The goal is to maximize an information-theoretic code quality measure <math>CQ=\sum_e \sum_t D_{KL} (q_{t,e} \| r_t ) - D_{KL} (q_{t,e} \| p_{t,e} ) </math>, where <math>q_{t,e}</math> is the target splicing pattern for exon <math>e</math> in tissue <math>t</math>, <math> r_t </math> is the prediction of an optimized guesser that ignores RNA features, <math>p_{t,e}</math> is the regulatory model's prediction for the exon, and <math>D_{KL}</math> is the Kullback-Leibler divergence between two distributions. CQ is, in fact, a likelihood function of <math>p_{t,e} </math>. <br />
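To make the objective concrete, the CQ measure can be sketched in a few lines of code; this is a toy illustration with hypothetical distributions, not the paper's implementation:<br />

```python
import math

def kl_divergence(q, p):
    """Kullback-Leibler divergence D_KL(q || p) for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def code_quality(targets, baseline, predictions):
    """CQ = sum over (tissue t, exon e) of D_KL(q||r_t) - D_KL(q||p_{t,e})."""
    return sum(kl_divergence(q, baseline[t]) - kl_divergence(q, predictions[(t, e)])
               for (t, e), q in targets.items())

# Toy example: one tissue, one exon, three splicing-level categories.
q = [0.7, 0.2, 0.1]    # target splicing pattern q_{t,e}
r = [1/3, 1/3, 1/3]    # feature-blind guesser r_t
p = [0.6, 0.25, 0.15]  # model prediction p_{t,e}, closer to q than r is
cq = code_quality({("t1", "e1"): q}, {"t1": r}, {("t1", "e1"): p})
```

A positive CQ means the trained prediction beats the feature-blind baseline on the target pattern.<br />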
<br />
Each model is a two-layer neural network whose hidden units are sigmoidal, within a considered tissue. In this case study, nonlinear and tissue-dependent correlations between the RNA features and splicing are considered. In this model, the RNA features provide the inputs to at most 30 hidden variables. Each hidden variable is a sigmoidal nonlinearity of its corresponding input. By applying a softmax function, the nonlinear hidden variables are then used to produce the prediction. Moreover, the tissues are trained jointly as disjoint output units.<br />
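A minimal sketch of this two-layer architecture, with hypothetical weights and only a handful of units (the real models use up to 30 hidden units per tissue):<br />

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def predict_splicing(features, W_hid, b_hid, W_out, b_out):
    """RNA features -> sigmoidal hidden units -> softmax over splicing levels."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(W_hid, b_hid)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(W_out, b_out)]
    return softmax(logits)

# Two hypothetical RNA features, two hidden units, three output categories.
probs = predict_splicing([1.0, -0.5],
                         [[0.2, 0.1], [-0.3, 0.4]], [0.0, 0.0],
                         [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0])
```

The softmax output is a distribution over splicing-level categories for one tissue.<br />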
<br />
Given the complexity of this approach, a maximum likelihood learning method would overfit each model. The main learning algorithms applied in this paper are from <ref>Xiong H.Y. ''et al'', Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context, Bioinformatics 27, pp. 2554-2562, 2011.</ref>. As a generalization of logistic regression, the multinomial regression model is linear in the log-odds-ratio domain and has no hidden variables. The model is then trained using the same objective function, RNA features, splicing patterns, and dataset partitioning as the Bayesian neural network described above.<br />
<br />
'''<br />
== Experimental Validation ==<br />
'''<br />
<br />
<br />
To check the accuracy of the proposed splicing regulatory model, this research uses experimental results from several databases, including RNA-seq data, RT-PCR data, RNA-binding-protein affinity data, splicing factor knockdown data, and phenotypic/genotypic data. <br />
<br />
[[File:Stat2.jpg]]<br />
<br /><br />
[[File:Stat6.jpg]]<br />
<br />
<br />
<br />
'''<br />
<br />
== Genome-wide Analysis ==<br />
'''<br />
<br />
As an important implication of genetic variation for splicing regulation, 658,420 SNVs were mapped to exonic and intronic sequences. The effect of each SNV on splicing regulation was then scored by applying the regulatory model and finding the largest difference in predicted splicing level <math>\Delta \psi</math> across tissues.<br />
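As a rough sketch of this scoring step (the tissue names and psi values below are hypothetical, and the change is taken as the variant-versus-reference difference in predicted splicing level):<br />

```python
def regulatory_score(psi_by_tissue):
    """Score an SNV by the largest predicted splicing-level change across tissues.

    psi_by_tissue: dict mapping tissue -> (psi_reference, psi_variant)."""
    return max(abs(var - ref) for ref, var in psi_by_tissue.values())

# Hypothetical predictions for one SNV in two tissues:
score = regulatory_score({"brain": (0.80, 0.50), "liver": (0.60, 0.55)})
```

Here the SNV scores 0.30, driven by the large predicted change in the first tissue.<br />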
<br />
[[File:Stat5.jpg]]<br />
<br />
[[File:Stat8.jpg]]<br />
<br />
<br />
'''<br />
<br />
== Conclusion ==<br />
'''<br />
<br />
The method introduced in this paper provides a technique for classifying disease-causing variants and for detecting aberrant-splicing malignancies. This computational model was trained to predict splicing from DNA sequence in the absence of disease annotations or other existing population data, and thus can be compared, as a naive approach, against the experimental data. The model therefore provides a way to understand the genetic basis of various diseases.<br />
There are several practical considerations when using Bayesian neural networks; for instance, they are difficult to speed up and scale to a large number of hidden variables because they rely on methods like MCMC. Leung et al. <ref>Leung M, Deep learning of the tissue-regulated splicing code, Bioinformatics 30, 2014.</ref> proposed an architecture that can have thousands of hidden units with multiple nonlinear layers and millions of model parameters. <br />
<br />
[[File:Stat7.jpg]]<br />
<br />
'''<br />
== References ==<br />
'''<br />
<br />
[1] Hui Y. Xiong ''et al'', The human splicing code reveals new insights into the genetic determinants of disease, Science '''347''', 2015.<br />
<br />
[2] Xiong H.Y. ''et al'', Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context, Bioinformatics '''27''', pp. 2554-2562, 2011.</div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=neural_Machine_Translation:_Jointly_Learning_to_Align_and_Translate&diff=26121neural Machine Translation: Jointly Learning to Align and Translate2015-11-11T22:43:13Z<p>Mgohari2: </p>
<hr />
<div><br />
<br />
= Introduction =<br />
<br />
In this paper, Bahdanau et al. (2015) present a new way of using neural networks to perform machine translation. Rather than using the typical RNN encoder-decoder model with fixed-length intermediate vectors, they propose a method that jointly learns alignment and translation and does not restrict the intermediate encoded vectors to any specific fixed length. The result is a translation method comparable in performance to phrase-based systems (the state-of-the-art models that do not use a neural network approach); additionally, the proposed method is more effective than other neural network models when applied to long sentences.<br />
<br />
= Previous methods =<br />
<br />
In order to better appreciate the value of this paper's contribution, it is important to understand how earlier techniques approached the problem of machine translation using neural networks.<br />
<br />
In machine translation, the problem at hand is to identify the target sentence <math>y</math> (in natural language <math>B</math>) that is the most likely corresponding translation to the source sentence <math>x</math> (in natural language <math>A</math>). The authors compactly summarize this problem using the formula <math> \arg\max_{y} P(y|x)</math>.<br />
<br />
Recent neural network approaches proposed by researchers such as Kalchbrenner and Blunsom, Cho et al., and Sutskever et al. build neural machine translation systems that directly learn the conditional probability distribution of the output <math>y</math> given the input <math>x</math>. Current experiments show that neural machine translation, and extensions of existing translation systems that use RNNs, perform better than state-of-the-art systems.<br />
<br />
<br />
== Encoding ==<br />
<br />
Typically, the encoding step iterates through the input vectors in the representation of source sentence <math>x</math> and updates a hidden state with each new token in the input: <math>h_t = f(x_t, h_{t-1})</math>, for some nonlinear function <math>f</math>. After the entire input is read, the resulting fixed-length representation of the entire input sentence <math>x</math> is given by a nonlinear function <math>q</math> of all of the hidden states: <math>c = q(\{h_1, \ldots, h_{T_x}\})</math>. Different methods would use different nonlinear functions and different neural networks, but the essence of the approach is common to all.<br />
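The encoding recursion above can be sketched as follows; the tanh transition and the tiny weight matrices are illustrative choices, not the specific networks used in the papers:<br />

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h):
    """One recursion h_t = f(x_t, h_{t-1}); here f is an affine map through tanh."""
    return [math.tanh(sum(wx * x for wx, x in zip(W_x[i], x_t)) +
                      sum(wh * h for wh, h in zip(W_h[i], h_prev)))
            for i in range(len(W_h))]

def encode(tokens, W_x, W_h, h0):
    """Read the whole input; the summary c = q({h_1..h_T}) is taken as the last state."""
    h = h0
    for x_t in tokens:
        h = rnn_step(x_t, h, W_x, W_h)
    return h

W_x = [[0.5], [-0.3]]            # 2 hidden units, 1-dimensional input tokens
W_h = [[0.1, 0.0], [0.0, 0.1]]
h0 = [0.0, 0.0]
c_short = encode([[1.0], [2.0]], W_x, W_h, h0)
c_long = encode([[1.0], [2.0], [0.5], [-1.0], [3.0]], W_x, W_h, h0)
# Both summaries have the same (fixed) length regardless of input length.
```

This is exactly the property the proposed method later relaxes: no matter how long the sentence, the summary vector has the same size.<br />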
<br />
== Decoding == <br />
<br />
Decoding the fixed-length representation <math>c</math> of <math>x</math> is done by predicting one token of the target sentence <math>y</math> at a time, using the knowledge of all previously predicted words so far. The decoder defines a probability distribution over the possible sentences using a product of conditional probabilities <math>P(y) = \Pi_t P(y_t|\{y_1, \ldots, y_{t-1}\},c)</math>. <br />
<br />
In the neural network approach, the conditional probability of the next output term given the previous ones <math>P(y_t | \{y_1, \ldots, y_{t-1}\},c)</math> is given by the evaluation of a nonlinear function <math>g(y_{t-1}, s_t, c)</math>, where <math>s_t</math> is the hidden state of the RNN.<br />
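The factorization of P(y) into per-step conditional probabilities can be illustrated directly; the step probabilities below are made up, standing in for outputs of the nonlinear function g:<br />

```python
import math

def sequence_log_prob(step_probs):
    """log P(y) = sum_t log P(y_t | y_1..y_{t-1}, c); log space avoids underflow."""
    return sum(math.log(pr) for pr in step_probs)

# Hypothetical per-step conditionals produced by the decoder:
steps = [0.9, 0.5, 0.8]
prob = math.exp(sequence_log_prob(steps))  # 0.9 * 0.5 * 0.8 = 0.36
```

Accumulating log-probabilities rather than multiplying raw probabilities is the standard trick for long sequences, where the raw product would underflow.<br />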
<br />
= The proposed method =<br />
<br />
The method proposed here is different from the traditional approach because it bypasses the fixed-length context vector <math>c</math> altogether, and instead aligns the tokens of the translated sentence <math>y</math> directly with the corresponding tokens of source sentence <math>x</math> as it decides which parts might be most relevant. To accommodate this, a different neural network structure needs to be set up.<br />
<br />
== Encoding ==<br />
<br />
The proposed model does not use an ordinary recurrent neural network to encode the source sentence <math>x</math>, but instead uses a bidirectional recurrent neural network (BiRNN): a model that consists of both a forward and a backward RNN, where the forward RNN reads the input tokens of <math>x</math> in their original order when computing hidden states, and the backward RNN reads the tokens in reverse. Thus each token of <math>x</math> is associated with two hidden states, corresponding to the states it produces in the two RNNs. The annotation vector <math>h_j</math> of the token <math>x_j</math> in <math>x</math> is given by the concatenation of these two hidden state vectors.<br />
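A minimal sketch of how the annotation vectors are assembled from the two RNNs (the state values below are hypothetical):<br />

```python
def birnn_annotations(forward_states, backward_states):
    """Annotation h_j = concatenation of the forward and backward states at token j."""
    assert len(forward_states) == len(backward_states)
    return [f + b for f, b in zip(forward_states, backward_states)]

# Hypothetical 2-dimensional hidden states for a 2-token sentence:
fwd = [[0.1, 0.2], [0.3, 0.4]]
bwd = [[0.9, 0.8], [0.7, 0.6]]
annotations = birnn_annotations(fwd, bwd)  # each h_j is 4-dimensional
```

Each annotation therefore carries information about the whole sentence, with emphasis on the words around position j.<br />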
<br />
== Alignment ==<br />
<br />
An alignment model (in the form of a neural network) is used to measure how well each annotation <math>h_j</math> of the input sentence corresponds to the current state of constructing the translated sentence (represented by the vector <math>s_{i-1}</math>, the hidden state of the RNN that produces the tokens of the output sentence <math>y</math>). This is stored as the energy score <math>e_{ij} = a(s_{i-1}, h_j)</math>. <br />
<br />
The energy scores from the alignment process are used to assign weights <math>\alpha_{ij}</math> to the annotations, effectively trying to determine which of the words in the input is most likely to correspond to the next word that needs to be translated in the current stage of the output sequence:<br />
<br />
<math>\alpha_{ij} = \frac{\exp(e_{ij})}{\Sigma_k \exp(e_{ik})}</math><br />
<br />
The weights are then applied to the annotations to obtain the current context vector input: <br />
<br />
<math>c_i = \Sigma_j \alpha_{ij}h_j</math><br />
<br />
Note that this is where we see one major difference between the proposed method and the previous ones: The context vector, or the representation of the input sentence, is not one fixed-length static vector <math>c</math>; rather, every time we translate a new word in the sentence, a new representation vector <math>c_i</math> is produced. This vector depends on the most relevant words in the source sentence to the current state in the translation (hence it is automatically aligning) and allows the input sentence to have a variable length representation (since each annotation in the input representation produces a new context vector <math>c_i</math>).<br />
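The two formulas above, the softmax weights and the weighted context vector, can be combined into one small routine (the energies and annotations below are made up):<br />

```python
import math

def attention_context(energies, annotations):
    """Softmax the energies e_ij into weights alpha_ij, then form c_i = sum_j alpha_ij h_j."""
    m = max(energies)  # subtract the max for numerical stability
    exps = [math.exp(e - m) for e in energies]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(annotations[0])
    c_i = [sum(a * h[d] for a, h in zip(alphas, annotations)) for d in range(dim)]
    return alphas, c_i

# Three hypothetical source annotations; the second one gets the largest energy:
h_ann = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
alphas, c_i = attention_context([0.1, 2.0, 0.3], h_ann)
```

Because the weights sum to one, each context vector c_i is a convex combination of the annotations, dominated by the source words judged most relevant to the current output position.<br />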
<br />
== Decoding ==<br />
<br />
The decoding is done by using an RNN to model a probability distribution on the conditional probabilities <br />
<br />
<math>P(y_i | y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)</math><br />
<br />
where here, <math>s_i</math> is the RNN hidden state at the previous time step, and <math>c_i</math> is the current context vector representation as discussed above under Alignment.<br />
<br />
Once the encoding and alignment are done, the decoding step is fairly straightforward and corresponds with the typical approach of neural network translation systems, although the context vector representation is now different at each step of the translation.<br />
<br />
== Experiment Settings == <br />
The ACL WMT '14 dataset of English-to-French translations was used to assess the performance of Bahdanau et al. (2015)'s <ref>Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).</ref> RNNsearch and of the RNN Encoder-Decoder proposed by Cho et al. (2014) <ref>Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014a). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014).</ref>. <br />
<br />
The WMT '14 dataset contains the following corpora, totaling 850M words:<br />
* Europarl (61M words)<br />
* News Commentary (5.5M words)<br />
* UN (421M words) <br />
* Crawled corpora (90M and 272.5M words)<br />
<br />
This was reduced to 348M words using the data selection method described by Axelrod et al. (2011)<ref>Axelrod, A., He, X., and Gao, J. (2011). Domain adaptation via pseudo in-domain data selection. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355–362. Association for Computational Linguistics.</ref>.<br />
<br />
Both models were trained in the same manner, by using minibatch stochastic gradient descent (SGD) with size 80 and AdaDelta. Once the model has finished training, beam search is used to decode the computed probability distribution to obtain a translation output.<br />
<br />
= Results =<br />
<br />
The authors performed some experiments using the proposed model of machine translation, calling it "RNNsearch", in comparison with the previous kind of model, referred to as "RNNencdec". Both models were trained on the same datasets for translating English to French, with one dataset containing sentences of length up to 30 words, and the other containing sentences with at most 50. <br />
<br />
Quantitatively, the RNNsearch scores exceed RNNencdec's by a clear margin. The distinction is particularly strong on longer sentences, which the authors note to be a problem area for RNNencdec: information gets lost when trying to "squash" long sentences into fixed-length vector representations.<br />
<br />
The following graph, provided in the paper, shows the performance of RNNsearch compared with RNNencdec, based on the BLEU scores for evaluating machine translation.<br />
<br />
[[File:RNNsearch_Graph.jpg]]<br />
<br />
Qualitatively, the RNNsearch method does a good job of aligning words in the translation process, even when they need to be rearranged in the translated sentence. Long sentences are also handled very well: while RNNencdec is shown to typically lose meaning and effectiveness after a certain number of words into the sentence, RNNsearch seems robust and reliable even for unusually long sentences.<br />
<br />
= Conclusion and Comments =<br />
<br />
Overall, the algorithm proposed by the authors gives a new and seemingly useful approach towards machine translation, particularly for translating long sentences.<br />
<br />
The performance appears to be good, but it would be interesting to see if it can be maintained when translating between languages that are not as closely aligned naturally as English and French usually are. The authors briefly refer to other languages (such as German) but do not provide any experiments or detailed comments to describe how the algorithm would perform in such cases. <br />
<br />
It is also interesting to note that, while the performance was always shown to be better for RNNsearch than for the older RNNencdec model, the former also includes more hidden units overall in its models than the latter. RNNencdec was mentioned as having 1000 hidden units for each of its encoding and decoding RNNs, giving a total of 2000; meanwhile, RNNsearch had 1000 hidden units for each of the forward and backward RNNs in encoding, as well as 1000 more for the decoding RNN, giving a total of 3000. This is a worthy point to take into consideration when judging the relative performance of the two models objectively.<br />
<br />
Compared to some other algorithms, the performance of the proposed algorithm on rare words, even in English-to-French translation, is not good enough. For long sentences with a large number of rare words, an algorithm that uses a deep LSTM to encode the input sequence and a separate deep LSTM to output the translation works more accurately, with a higher BLEU score <ref>Luong M.-T., Sutskever I., Le Q., Vinyals O., Zaremba W. (2015). Addressing the Rare Word Problem in Neural Machine Translation.</ref>. <br />
Another way to explain the performance gains of RNNsearch over RNNencdec is RNNsearch's use of a bidirectional RNN (BiRNN) in its encoder. As explained by Schuster and Paliwal (1997) <ref>Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45 (11), 2673–2681</ref>, compared to a traditional RNN, which only explores past context, a BiRNN considers both past and future contexts.</div>
<hr />
<div>This summary is currently in progress.. Thank you for your patience!<br />
<br />
= Introduction =<br />
<br />
In this paper Bahdanau et al (2015) presents a new way of using neural networks to perform machine translation. Rather than using the typical RNN encoder-decoder model with fixed-length intermediate vectors, they proposed a method that uses a joint learning process for both alignment and translation, and does not restrict intermediate encoded vectors to any specific fixed length. The result is a translation method that is comparable in performance to phrase-based systems (the state-of-the-art effective models that do not use a neural network approach), additionally it has been found the proposed method is more effective compared to other neural network models when applied to long sentences.<br />
<br />
= Previous methods =<br />
<br />
In order to better appreciate the value of this paper's contribution, it is important to understand how earlier techniques approached the problem of machine translation using neural networks.<br />
<br />
In machine translation, the problem at hand is to identify the target sentence <math>y</math> (in natural language <math>B</math>) that is the most likely corresponding translation to the source sentence <math>x</math> (in natural language <math>A</math>). The authors compactly summarize this problem using the formula <math> \arg\max_{y} P(y|x)</math>.<br />
<br />
Recent Neural Network approaches proposed by researchers such as Kalchbrenner and Blunsom, Cho et al., Sutvesker et al. has built a neural machine translation to directly learn the conditional probability distribution between input <math>x</math> and output <math>y</math>. Experiments at current show that neural machine translation or extension of existing translation systems using RNNs perform better compared to state of the art systems.<br />
<br />
<br />
== Encoding ==<br />
<br />
Typically, the encoding step iterates through the input vectors in the representation of source sentence <math>x</math> and updates a hidden state with each new token in the input: <math>h_t = f(x_t, h_{t-1})</math>, for some nonlinear function <math>f</math>. After the entire input is read, the resulting fixed-length representation of the entire input sentence <math>x</math> is given by a nonlinear function <math>q</math> of all of the hidden states: <math>c = q(\{h_1, \ldots, h_{T_x}\})</math>. Different methods would use different nonlinear functions and different neural networks, but the essence of the approach is common to all.<br />
<br />
== Decoding == <br />
<br />
Decoding the fixed-length representation <math>c</math> of <math>x</math> is done by predicting one token of the target sentence <math>y</math> at a time, using the knowledge of all previously predicted words so far. The decoder defines a probability distribution over the possible sentences using a product of conditional probabilities <math>P(y) = \Pi_t P(y_t|\{y_1, \ldots, y_{t-1}\},c)</math>. <br />
<br />
In the neural network approach, the conditional probability of the next output term given the previous ones <math>P(y_t | \{y_1, \ldots, y_{t-1}\},c)</math> is given by the evaluation of a nonlinear function <math>g(y_{t-1}, s_t, c)</math>, where <math>s_t</math> is the hidden state of the RNN.<br />
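One decoder step can be sketched as follows. The affine-map-plus-softmax form of <math>g</math> and the names of the weight matrices are assumptions for illustration; each cited method defines its own <math>g</math>.<br />

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode_step(y_prev, s_prev, c, W_s, U_s, C_s, W_out):
    """Advance the decoder state s_t, then score P(y_t | y_<t, c) = g(y_{t-1}, s_t, c)."""
    s_t = np.tanh(W_s @ y_prev + U_s @ s_prev + C_s @ c)      # new hidden state
    p_t = softmax(W_out @ np.concatenate([y_prev, s_t, c]))   # distribution over vocab
    return s_t, p_t

rng = np.random.default_rng(1)
d_y, d_s, d_c, V = 4, 3, 3, 6                                 # toy dimensions
s, p = decode_step(rng.standard_normal(d_y), np.zeros(d_s), rng.standard_normal(d_c),
                   rng.standard_normal((d_s, d_y)), rng.standard_normal((d_s, d_s)),
                   rng.standard_normal((d_s, d_c)),
                   rng.standard_normal((V, d_y + d_s + d_c)))
```

The probability <math>P(y)</math> of a full sentence is then the product of the selected vocabulary entries of each step's <math>p_t</math>.<br />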
<br />
= The proposed method =<br />
<br />
The method proposed here is different from the traditional approach because it bypasses the fixed-length context vector <math>c</math> altogether, and instead aligns the tokens of the translated sentence <math>y</math> directly with the corresponding tokens of source sentence <math>x</math> as it decides which parts might be most relevant. To accommodate this, a different neural network structure needs to be set up.<br />
<br />
== Encoding ==<br />
<br />
The proposed model does not use an ordinary recurrent neural network to encode the source sentence <math>x</math>, but instead uses a bidirectional recurrent neural network (BiRNN): this is a model that consists of both a forward and a backward RNN, where the forward RNN takes the input tokens of <math>x</math> in their original order when computing hidden states, and the backward RNN takes the tokens in reverse. Thus each token of <math>x</math> is associated with two hidden states, corresponding to the states it produces in the two RNNs. The annotation vector <math>h_j</math> of the token <math>x_j</math> in <math>x</math> is given by the concatenation of these two hidden state vectors.<br />
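The annotation computation can be sketched as follows. The toy tanh step functions are hypothetical stand-ins; the paper's encoder uses gated recurrent units.<br />

```python
import numpy as np

def birnn_annotations(xs, step_fwd, step_bwd, h0):
    """Annotation h_j concatenates the forward state (after x_1..x_j)
    and the backward state (after x_T..x_j)."""
    fwd, h = [], h0
    for x in xs:                       # forward RNN: tokens in order
        h = step_fwd(x, h)
        fwd.append(h)
    bwd, h = [], h0
    for x in reversed(xs):             # backward RNN: tokens reversed
        h = step_bwd(x, h)
        bwd.append(h)
    bwd.reverse()                      # re-align backward states with token positions
    return [np.concatenate(pair) for pair in zip(fwd, bwd)]

rng = np.random.default_rng(2)
Wf, Wb = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
step = lambda W: (lambda x, h: np.tanh(x + W @ h))   # toy RNN cells
xs = [rng.standard_normal(3) for _ in range(4)]
annotations = birnn_annotations(xs, step(Wf), step(Wb), np.zeros(3))
```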
<br />
== Alignment ==<br />
<br />
An alignment model (in the form of a neural network) is used to measure how well each annotation <math>h_j</math> of the input sentence corresponds to the current state of constructing the translated sentence (represented by the vector <math>s_{i-1}</math>, the hidden state of the RNN that generates the tokens of the output sentence <math>y</math>). This is stored as the energy score <math>e_{ij} = a(s_{i-1}, h_j)</math>. <br />
<br />
The energy scores from the alignment process are used to assign weights <math>\alpha_{ij}</math> to the annotations, effectively trying to determine which of the words in the input is most likely to correspond to the next word that needs to be translated in the current stage of the output sequence:<br />
<br />
<math>\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}</math><br />
<br />
The weights are then applied to the annotations to obtain the current context vector input: <br />
<br />
<math>c_i = \sum_j \alpha_{ij}h_j</math><br />
<br />
Note that this is where we see one major difference between the proposed method and the previous ones: The context vector, or the representation of the input sentence, is not one fixed-length static vector <math>c</math>; rather, every time we translate a new word in the sentence, a new representation vector <math>c_i</math> is produced. This vector depends on the most relevant words in the source sentence to the current state in the translation (hence it is automatically aligning) and allows the input sentence to have a variable length representation (since each annotation in the input representation produces a new context vector <math>c_i</math>).<br />
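The two equations above, the softmax over alignment energies and the weighted sum producing <math>c_i</math>, can be sketched directly. The dot-product alignment model here is a hypothetical stand-in; the paper parameterizes <math>a</math> as a small feed-forward network.<br />

```python
import numpy as np

def attention_context(s_prev, annotations, align):
    """alpha_ij = softmax over energies e_ij = a(s_{i-1}, h_j);
    c_i = sum_j alpha_ij * h_j."""
    e = np.array([align(s_prev, h) for h in annotations])
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()                    # normalized attention weights
    c_i = sum(a * h for a, h in zip(alpha, annotations))
    return alpha, c_i

rng = np.random.default_rng(3)
annotations = [rng.standard_normal(6) for _ in range(4)]   # from the BiRNN encoder
s_prev = rng.standard_normal(6)
align = lambda s, h: float(s @ h)                 # toy alignment model a(s, h)
alpha, c_i = attention_context(s_prev, annotations, align)
```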
<br />
== Decoding ==<br />
<br />
The decoding is done by using an RNN to model a probability distribution on the conditional probabilities <br />
<br />
<math>P(y_i | y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)</math><br />
<br />
where here, <math>s_i</math> is the RNN hidden state at time step <math>i</math>, and <math>c_i</math> is the current context vector representation as discussed above under Alignment.<br />
<br />
Once the encoding and alignment are done, the decoding step is fairly straightforward and corresponds with the typical approach of neural network translation systems, although the context vector representation is now different at each step of the translation.<br />
<br />
== Experiment Settings == <br />
The ACL WMT '14 dataset of English-to-French translations was used to assess the performance of the RNNsearch model of Bahdanau et al. (2015) <ref>Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).</ref> against the RNN Encoder-Decoder proposed by Cho et al. (2014) <ref>Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014a). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014).</ref>. <br />
<br />
The WMT '14 dataset actually contains the following corpora, totaling 850M words:<br />
* Europarl (61M words)<br />
* News Commentary (5.5M words)<br />
* UN (421M words) <br />
* Crawled corpora (90M and 272.5M words)<br />
<br />
This was reduced to 348M words using the data selection method described by Axelrod et al. (2011)<ref>Axelrod, A., He, X., and Gao, J. (2011). Domain adaptation via pseudo in-domain data selection. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355–362. Association for Computational Linguistics.</ref>.<br />
<br />
Both models were trained in the same manner, using minibatch stochastic gradient descent (SGD) with a minibatch size of 80 and the AdaDelta adaptive learning rate method. Once a model has finished training, beam search is used to decode the computed probability distribution to obtain a translation output.<br />
<br />
= Results =<br />
<br />
The authors performed some experiments using the proposed model of machine translation, calling it "RNNsearch", in comparison with the previous kind of model, referred to as "RNNencdec". Both models were trained on the same datasets for translating English to French, with one dataset containing sentences of length up to 30 words, and the other containing sentences with at most 50. <br />
<br />
Quantitatively, the RNNsearch scores exceed RNNencdec by a clear margin. The distinction is particularly strong in longer sentences, which the authors note to be a problem area for RNNencdec -- information gets lost when trying to "squash" long sentences into fixed-length vector representations.<br />
<br />
The following graph, provided in the paper, shows the performance of RNNsearch compared with RNNencdec, based on the BLEU scores for evaluating machine translation.<br />
<br />
[[File:RNNsearch_Graph.jpg]]<br />
<br />
Qualitatively, the RNNsearch method does a good job of aligning words in the translation process, even when they need to be rearranged in the translated sentence. Long sentences are also handled very well: while RNNencdec is shown to typically lose meaning and effectiveness after a certain number of words into the sentence, RNNsearch seems robust and reliable even for unusually long sentences.<br />
<br />
= Conclusion and Comments =<br />
<br />
Overall, the algorithm proposed by the authors gives a new and seemingly useful approach towards machine translation, particularly for translating long sentences.<br />
<br />
The performance appears to be good, but it would be interesting to see if it can be maintained when translating between languages that are not as closely aligned naturally as English and French usually are. The authors briefly refer to other languages (such as German) but do not provide any experiments or detailed comments to describe how the algorithm would perform in such cases. <br />
<br />
It is also interesting to note that, while the performance was always shown to be better for RNNsearch than for the older RNNencdec model, the former also includes more hidden units overall than the latter. RNNencdec was mentioned as having 1000 hidden units for each of its encoding and decoding RNNs, giving a total of 2000; meanwhile, RNNsearch had 1000 hidden units for each of the forward and backward RNNs in encoding, as well as 1000 more for the decoding RNN, giving a total of 3000. This is perhaps a worthy point to take into consideration when judging the relative performance of the two models objectively.<br />
<br />
Compared to some other algorithms, the performance of the proposed algorithm on rare words, even in English-to-French translation, is not good enough. For long sentences with a large number of rare words, an algorithm that uses a deep LSTM to encode the input sequence and a separate deep LSTM to output the translation is more accurate and achieves a larger BLEU score <ref>Luong, M.-T., Sutskever, I., Le, Q. V., Vinyals, O., and Zaremba, W. (2015). Addressing the rare word problem in neural machine translation. In Proceedings of ACL.</ref>. <br />
Another way to explain the performance gains of RNNsearch over RNNencdec is RNNsearch's use of a bidirectional RNN (BiRNN) as its encoder. As explained by Schuster and Paliwal (1997) <ref>Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45 (11), 2673–2681.</ref>, compared to a traditional RNN, which only exploits past context, a BiRNN considers both past and future contexts.</div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=show,_Attend_and_Tell:_Neural_Image_Caption_Generation_with_Visual_Attention&diff=26119show, Attend and Tell: Neural Image Caption Generation with Visual Attention2015-11-11T22:08:18Z<p>Mgohari2: /* Motivation */</p>
<hr />
<div>= Introduction =<br />
<br />
This paper introduces an attention-based model that automatically learns to describe the content of images. It is able to focus on salient parts of the image while generating the corresponding word in the output sentence. A visualization is provided showing which part of the image was attended to when generating each specific word in the output. This can be used to get a sense of what is going on in the model and is especially useful for understanding the kinds of mistakes it makes. The model is tested on three datasets: Flickr8k, Flickr30k, and MS COCO.<br />
<br />
= Motivation =<br />
Caption generation, the task of compressing huge amounts of salient visual information into descriptive language, was recently improved by combining convolutional neural networks with recurrent neural networks. However, using representations from the top layer of a convolutional net, which distill the information in an image down to the most salient objects, can lose information that would be useful for richer, more descriptive captions. Retaining this information by using lower-level representations was the motivation for the current work.<br />
<br />
= Contributions = <br />
<br />
* Two attention-based image caption generators using a common framework. A "soft" deterministic attention mechanism and a "hard" stochastic mechanism.<br />
* Show how to gain insight and interpret results of this framework by visualizing "where" and "what" the attention focused on.<br />
* Quantitatively validate the usefulness of attention in caption generation with state of the art performance on three datasets (Flickr8k, Flickr30k, and MS COCO)<br />
<br />
= Previous Work =<br />
<br />
= Model =<br />
<br />
[[File:AttentionNetwork.png]]<br />
<br />
== Encoder: Convolutional Features ==<br />
<br />
The model takes in a single image and generates a caption of arbitrary length. The caption is a sequence of one-hot encoded words from a given vocabulary.<br />
<br />
== Decoder: Long Short-Term Memory Network ==<br />
<br />
[[File:AttentionLSTM.png]]<br />
<br />
== Properties ==<br />
<br />
"Where" the network looks next depends on the sequence of words that has already been generated.<br />
<br />
The attention framework learns latent alignments from scratch instead of explicitly using object detectors. This allows the model to go beyond "objectness" and learn to attend to abstract concepts.<br />
<br />
[[File:AttentionHighlights.png]]<br />
<br />
== Training ==<br />
<br />
Two regularization techniques were used: dropout and early stopping on BLEU score.<br />
<br />
The MS COCO dataset has more than 5 reference sentences for some of the images, while the Flickr datasets have exactly 5. For consistency, the reference sentences for all images in the MS COCO dataset were truncated to 5. Some basic tokenization was also applied to the MS COCO dataset to be consistent with the tokenization in the Flickr datasets.<br />
<br />
On the largest dataset (MS COCO), the attention model took less than 3 days to train on an NVIDIA Titan Black GPU.<br />
<br />
= Results =<br />
<br />
Results are reported with the BLEU and METEOR metrics.<br />
<br />
[[File:AttentionResults.png]]<br />
<br />
[[File:AttentionGettingThingsRight.png]]<br />
<br />
[[File:AttentionGettingThingsWrong.png]]<br />
<br />
=References=<br />
<references /></div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=graves_et_al.,_Speech_recognition_with_deep_recurrent_neural_networks&diff=26118graves et al., Speech recognition with deep recurrent neural networks2015-11-11T21:11:08Z<p>Mgohari2: </p>
<hr />
<div>= Overview =<br />
<br />
This document is a summary of the paper ''Speech recognition with deep recurrent neural networks'' by A. Graves, A.-R. Mohammed, and G. Hinton, which appeared in the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). The first and third authors are Artificial Neural Network (ANN) researchers, while Mohammed works in the field of automatic speech recognition.<br />
<br />
The paper presents the application of bidirectional multilayer Long Short-term Memory (LSTM) ANNs with 1–5 layers to phoneme recognition on the TIMIT acoustic phoneme corpus, which is the standard benchmark in the field of acoustic recognition, extending the previous work by Mohammed and Hinton on this topic using Deep Belief Networks. The TIMIT corpus contains audio recordings of 6300 sentences spoken by 630 (American) English speakers from 8 regions with distinct dialects, where each recording has accompanying manually labelled transcriptions of the phonemes in the audio clips alongside timestamp information. The empirical classification accuracies reported in the literature before the publication of this paper are shown in the timeline below (note that in this figure, the accuracy metric is 100% - PER, where PER is the phoneme classification error rate).<br />
<br />
The deep LSTM networks presented with 3 or more layers obtain phoneme classification error rates of 19.6% or less, with one model obtaining 17.7%, which was the best result reported in the literature at the time, outperforming the previous record of 20.7% achieved by Mohammed et al. Furthermore, the error rate decreases monotonically with LSTM network depth for 1–5 layers. While the bidirectional LSTM model performs well on the TIMIT corpus, any potential advantage of bidirectional over unidirectional LSTM network models cannot be determined from this paper, since the performance comparison is across different numbers of iterations taken in the optimization algorithm used to train the models, and multiple trials for statistical validity were not performed.<br />
<br />
<br />
[[File:timit.png | frame | center |Timeline of percentage phoneme recognition accuracy achieved on the core TIMIT corpus, from Lopes and Perdigao, 2011. ]]<br />
<br />
== Motivation ==<br />
Neural networks have been trained for speech recognition problems, though usually in combination with hidden Markov models. The authors of this paper argue that, given that speech is an inherently dynamic process, RNNs should be the ideal choice for such a problem. There have been attempts to train RNNs for speech recognition <ref>A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling Un-segmented Sequence Data with Recurrent Neural Networks,” in ICML, Pittsburgh, USA, 2006.</ref> <ref>A. Graves, Supervised sequence labelling with recurrent neural networks, vol. 385, Springer, 2012.</ref> <ref>A. Graves, “Sequence transduction with recurrent neural networks,” in ICML Representation Learning Workshop, 2012.</ref> and to train RNNs with LSTM for recognizing cursive handwriting <ref>A. Graves, S. Fernandez, M. Liwicki, H. Bunke, and J. Schmidhuber, “Unconstrained Online Handwriting Recognition with Recurrent Neural Networks,” in NIPS, 2008.</ref>, but neither had made an impact on speech recognition. The authors drew inspiration from convolutional neural networks, where multiple layers are stacked on top of each other, and combined this depth with LSTM-based RNNs.<br />
<br />
However, instead of using a conventional RNN, which only considers previous context, a bidirectional RNN <ref>M. Schuster and K. K. Paliwal, “Bidirectional Recurrent Neural Networks,” IEEE Transactions on Signal Processing, vol. 45, pp. 2673–2681, 1997.</ref> was used to consider both forward and backward contexts. This is in part because the authors saw no reason not to exploit future context, since the speech utterances are transcribed all at once. Additionally, a BRNN has the added benefit of being able to consider the entire forward and backward context, not just some predefined window.<br />
<br />
[[File:brnn.png|center|600px]]<br />
<br />
= Deep RNN models considered by Graves et al. =<br />
<br />
In this paper, Graves et al. use deep LSTM network models. We briefly review recurrent neural networks, which form the basis of the more complicated LSTM network that has composite <math>\mathcal{H}</math> functions instead of sigmoids and additional parameter vectors associated with the ''state'' of each neuron. Finally, a description of ''bidirectional'' ANNs is given, which is used throughout the numerical experiments.<br />
<br />
== Recurrent Neural Networks ==<br />
<br />
Recall that a standard 1-layer recurrent neural network (RNN) computes the hidden vector sequence <math>{\boldsymbol h} = ({{\mathbf{h}}}_1,\ldots,{{\mathbf{h}}}_T)</math> and output vector sequence <math>{{\boldsymbol {{\mathbf{y}}}}}= ({{\mathbf{y}}}_1,\ldots,{{\mathbf{y}}}_T)</math> from an input vector sequence <math>{{\boldsymbol {{\mathbf{x}}}}}= ({{\mathbf{x}}}_1,\ldots,{{\mathbf{x}}}_T)</math> through the following equation where the index is from <math>t=1</math> to <math>T</math>:<br />
<br />
<math>{{\mathbf{h}}}_t = \begin{cases}<br />
{\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{\mathbf{b_{h}}}}}\right) &\quad t = 1\\<br />
{\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{{\mathbf{W}}}_{h h}}}{{\mathbf{h}}}_{t-1} + {{{\mathbf{b_{h}}}}}\right) &\quad \text{else}<br />
\end{cases}</math><br />
<br />
and<br />
<br />
<math>{{\mathbf{y}}}_t = {{{{\mathbf{W}}}_{h y}}}{{\mathbf{h}}}_t + {{{\mathbf{b_{y}}}}}.</math><br />
<br />
The <math>{{\mathbf{W}}}</math> terms are the parameter matrices with subscripts denoting the layer location (<span>e.g. </span><math>{{{{\mathbf{W}}}_{x h}}}</math> is the input-hidden weight matrix), and the offset <math>b</math> terms are bias vectors with appropriate subscripts (<span>e.g. </span><math>{{{\mathbf{b_{h}}}}}</math> is hidden bias vector). The function <math>{\mathcal{H}}</math> is an elementwise vector function with a range of <math>[0,1]</math> for each component in the hidden layer.<br />
<br />
This paper considers multilayer RNN architectures, with the same hidden layer function used for all <math>N</math> layers. In this model, the hidden vector in the <math>n</math>th layer, <math>{\boldsymbol h}^n</math>, is generated by the rule<br />
<br />
<math>{{\mathbf{h}}}^n_t = {\mathcal{H}}\left({{\mathbf{W}}}_{h^{n-1}h^{n}} {{\mathbf{h}}}^{n-1}_t +<br />
{{\mathbf{W}}}_{h^{n}h^{n}} {{\mathbf{h}}}^n_{t-1} + {{{\mathbf{b_{h}}}}}^n \right),</math><br />
<br />
where <math>{\boldsymbol h}^0 = {{\boldsymbol {{\mathbf{x}}}}}</math>. The final network output vector in the <math>t</math>th step of the output sequence, <math>{{\mathbf{y}}}_t</math>, is<br />
<br />
<math>{{\mathbf{y}}}_t = {{{\mathbf{W}}}_{h^N y}} {{\mathbf{h}}}^N_t + {{{\mathbf{b_{y}}}}}.</math><br />
<br />
This is pictured in the figure below for an arbitrary layer and time step.<br />
[[File:rnn_graves.png | frame | center |Fig 1. Schematic of a Recurrent Neural Network at an arbitrary layer and time step. ]]<br />
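The stacked recurrence above can be sketched in NumPy as follows (a minimal illustration, taking the logistic sigmoid as the elementwise <math>\mathcal{H}</math> with range <math>[0,1]</math>):<br />

```python
import numpy as np

def H(z):
    """Elementwise logistic sigmoid: each component lies in [0,1], as required of H."""
    return 1.0 / (1.0 + np.exp(-z))

def deep_rnn_forward(xs, layers, W_hy, b_y):
    """layers[n] = (W_in, W_rec, b); layer n consumes layer n-1's hidden sequence."""
    seq = xs                                       # h^0 = x
    for W_in, W_rec, b in layers:
        h, out = np.zeros(W_rec.shape[0]), []
        for v in seq:
            h = H(W_in @ v + W_rec @ h + b)        # h^n_t
            out.append(h)
        seq = out
    return [W_hy @ h + b_y for h in seq]           # y_t from the top layer h^N_t

rng = np.random.default_rng(4)
xs = [rng.standard_normal(4) for _ in range(6)]    # 6 time steps, 4-dim inputs
layers = [(rng.standard_normal((3, 4)), rng.standard_normal((3, 3)), np.zeros(3)),
          (rng.standard_normal((3, 3)), rng.standard_normal((3, 3)), np.zeros(3))]
ys = deep_rnn_forward(xs, layers, rng.standard_normal((2, 3)), np.zeros(2))
```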
<br />
== Long Short-term Memory Architecture ==<br />
<br />
Graves et al. consider a Long Short-Term Memory (LSTM) architecture from Gers et al. This model replaces <math>\mathcal{H}(\cdot)</math> by a composite function that incurs additional parameter matrices, and hence a higher-dimensional model. Each neuron in the network (<span>i.e. </span> row of a parameter matrix <math>{{\mathbf{W}}}</math>) has an associated state vector <math>{{\mathbf{c}}}_t</math> at step <math>t</math>, which is a function of the previous <math>{{\mathbf{c}}}_{t-1}</math>, the input <math>{{\mathbf{x}}}_t</math> at step <math>t</math>, and the previous step’s hidden state <math>{{\mathbf{h}}}_{t-1}</math> as<br />
<br />
<math>{{\mathbf{c}}}_t = {{\mathbf{f}}}_t \circ {{\mathbf{c}}}_{t-1} + {{\mathbf{i}}}_t \circ \tanh<br />
\left({{{\mathbf{W}}}_{x {{\mathbf{c}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{c}}}}} {{\mathbf{h}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{c}}}\right)</math><br />
<br />
where <math>\circ</math> denotes the Hadamard product (elementwise vector multiplication), and the vector <math>{{\mathbf{i}}}_t</math> denotes the so-called ''input'' gate vector of the cell, which is generated by the rule<br />
<br />
<math>{{\mathbf{i}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{i}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{h {{\mathbf{i}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}<br />
{{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{i}}}\right),</math><br />
<br />
and <math>{{\mathbf{f}}}_t</math> is the ''forget gate'' vector, which is given by<br />
<br />
<math>{{\mathbf{f}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{f}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{h {{\mathbf{f}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{f}}}}}<br />
{{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{f}}}\right)</math><br />
<br />
Each <math>{{\mathbf{W}}}</math> matrix and bias vector <math>{{\mathbf{b}}}</math> is a free parameter in the model and must be trained. Since <math>{{\mathbf{f}}}_t</math> multiplies the previous state <math>{{\mathbf{c}}}_{t-1}</math> in a Hadamard product with each element in the range <math>[0,1]</math>, it can be understood to reduce or dampen the effect of <math>{{\mathbf{c}}}_{t-1}</math> relative to the new input <math>{{\mathbf{i}}}_t</math>. The final hidden output state is then<br />
<br />
<math>{{\mathbf{h}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{o}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{o}}}}} {{\mathbf{h}}}_{t-1}<br />
+ {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{o}}}}} {{\mathbf{c}}}_{t} + {{\mathbf{b}}}_{{\mathbf{o}}}\right)\circ \tanh({{\mathbf{c}}}_t)</math><br />
<br />
In all of these equations, <math>\sigma</math> denotes the logistic sigmoid function. Note furthermore that <math>{{\mathbf{i}}}</math>, <math>{{\mathbf{f}}}</math>, <math>{{\mathbf{o}}}</math> and <math>{{\mathbf{c}}}</math> all of the same dimension as the hidden vector <math>h</math>. In addition, the weight matrices from the cell to gate vectors (<span>e.g. </span><math>{{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}</math>) are ''diagonal'', such that each parameter matrix is merely a scaling matrix.<br />
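A single LSTM step following the gate equations above can be sketched directly; since the cell-to-gate weight matrices are diagonal, they are stored as vectors and applied with elementwise products (the parameter-dictionary layout is an implementation choice for this sketch):<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step following the equations above; p maps names to parameters.
    The diagonal cell-to-gate ("peephole") weights w_ci, w_cf, w_co are vectors."""
    i = sigmoid(p["Wxi"] @ x + p["Whi"] @ h_prev + p["wci"] * c_prev + p["bi"])
    f = sigmoid(p["Wxf"] @ x + p["Whf"] @ h_prev + p["wcf"] * c_prev + p["bf"])
    c = f * c_prev + i * np.tanh(p["Wxc"] @ x + p["Whc"] @ h_prev + p["bc"])
    o = sigmoid(p["Wxo"] @ x + p["Who"] @ h_prev + p["wco"] * c + p["bo"])
    return o * np.tanh(c), c                       # (h_t, c_t)

rng = np.random.default_rng(5)
n, m = 3, 4                                        # hidden size, input size
p = {k: rng.standard_normal((n, m)) for k in ("Wxi", "Wxf", "Wxc", "Wxo")}
p.update({k: rng.standard_normal((n, n)) for k in ("Whi", "Whf", "Whc", "Who")})
p.update({k: rng.standard_normal(n) for k in ("wci", "wcf", "wco", "bi", "bf", "bc", "bo")})
h, c = lstm_step(rng.standard_normal(m), np.zeros(n), np.zeros(n), p)
```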
<br />
== Bidirectional RNNs ==<br />
<br />
A bidirectional RNN adds another layer of complexity by computing 2 hidden vectors per layer. Neglecting the <math>n</math> superscripts for the layer index, the ''forward'' hidden vector is determined through the conventional recursion as<br />
<br />
<math>{\overrightarrow{{{\mathbf{h}}}}}_t = {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overrightarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}{\overrightarrow{{{\mathbf{h}}}}}}} {\overrightarrow{{{\mathbf{h}}}}}_{t-1} + {{\mathbf{b_{{\overrightarrow{{{\mathbf{h}}}}}}}}} \right),</math><br />
<br />
while the ''backward'' hidden state is determined recursively from the ''reversed'' sequence <math>({{\mathbf{x}}}_T,\ldots,{{\mathbf{x}}}_1)</math> as<br />
<br />
<math>{\overleftarrow{{{\mathbf{h}}}}}_t = {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overleftarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}{\overleftarrow{{{\mathbf{h}}}}}}} {\overleftarrow{{{\mathbf{h}}}}}_{t+1} + {{\mathbf{b_{{\overleftarrow{{{\mathbf{h}}}}}}}}}\right).</math><br />
<br />
The final output for the single layer state is then an affine transformation of <math>{\overrightarrow{{{\mathbf{h}}}}}_t</math> and <math>{\overleftarrow{{{\mathbf{h}}}}}_t</math> as <math>{{\mathbf{y}}}_t = {{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}y}} {\overrightarrow{{{\mathbf{h}}}}}_t + {{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}y}} {\overleftarrow{{{\mathbf{h}}}}}_t<br />
+ {{{\mathbf{b_{y}}}}}.</math><br />
<br />
The combination of LSTM with bidirectional dependence has been previously used by Graves and Schmidhuber, and is extended to multilayer networks in this paper. The motivation for this is to use dependencies on both prior and posterior vectors in the sequence to predict a given output at any time step. In other words, a forward and backward context is used.<br />
<br />
= Network Training for Phoneme Recognition =<br />
<br />
This section describes the phoneme classification experiments performed with the TIMIT corpus. An overview of the input timeseries audio data preprocessing into frequency domain vectors is given, and the optimization techniques are described.<br />
<br />
== Frequency Domain Processing ==<br />
<br />
Recall that for a real, periodic signal <math>{f(t)}</math>, the Fourier transform<br />
<br />
<math>{F(\omega)}= \int_{-\infty}^{\infty}e^{-i\omega t} {f(t)}dt</math><br />
<br />
can be represented for discrete samples <math>{f_0, f_1, \cdots<br />
f_{N-1}}</math> as<br />
<br />
<math>F_k = \sum_{n=0}^{N-1} f_n \, e^{-\frac{2 \pi i k n}{N}},</math><br />
<br />
where <math>{F_k}</math> are the discrete coefficients of the (amplitude) spectral distribution of the signal <math>f</math> in the frequency domain. This is a particularly powerful representation of audio data, since the modulation produced by the tongue and lips when shifting position while the larynx vibrates induces the frequency changes that make up a phonetic alphabet. A particular example is the spectrogram below. A spectrogram is a heat map representation of a matrix of data in which each pixel has intensity proportional to the magnitude of the matrix entry at that location. In this case, the matrix is a windowed Fourier transform of an audio signal spectrum such that the frequency of the audio signal as a function of time can be observed. This spectrogram shows the frequencies of my voice while singing the first bar of ''Hey Jude''; the bright pixels below 200 Hz show the base note, while the fainter lines at integer multiples of the base notes show resonant harmonics.<br />
<br />
[[File:spect.png | frame | center | Spectrogram of the first bar of Hey Jude, showing the frequency amplitude coefficients changing over time as the intensity of the pixels in the heat map.]]<br />
<br />
== Input Vector Format ==<br />
<br />
For each audio waveform in the TIMIT dataset, the Fourier coefficients were computed with a sliding Discrete Fourier Transform (DFT). The window duration used was 10 ms, corresponding to <math>n_s =<br />
80</math> samples per DFT since each waveform in the corpus was digitally registered with a sampling frequency of <math>f_s = 16</math> kHz, producing 40 unique coefficients at each timestep <math>t</math>, <math>\{c^{[t]}_k\}_{k=1}^{40}</math>. In addition, the first and second time derivatives of the coefficients between adjacent DFT windows were computed (the methodology is unspecified, however most likely this was performed with a numerical central difference technique). Thus, the input vector to the network at step <math>t</math> was the concatenated vector<br />
<br />
<math>{{\mathbf{x}}}_t = [c^{[t]}_1, \frac{d}{dt}c^{[t]}_1,<br />
\frac{d^2}{dt^2}c^{[t]}_1, c_2^{[t]}, \frac{d}{dt}c^{[t]}_2,<br />
\frac{d^2}{dt^2}c^{[t]}_2 \ldots]^T.</math><br />
<br />
Finally, an additional preprocessing step was performed: each input vector was normalized such that the dataset had zero mean and unit variance.<br />
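The preprocessing pipeline described above can be sketched as follows. This is a simplified illustration: the non-overlapping framing, the central-difference derivative scheme, and the global normalization are assumptions, since the paper does not specify these details.<br />

```python
import numpy as np

def frame_features(signal, n_s=80, n_coef=40):
    """Windowed DFT magnitudes plus first/second time derivatives, interleaved
    per coefficient, then normalized to zero mean and unit variance."""
    frames = [signal[i:i + n_s] for i in range(0, len(signal) - n_s + 1, n_s)]
    coefs = np.array([np.abs(np.fft.rfft(f))[:n_coef] for f in frames])
    d1 = np.gradient(coefs, axis=0)                # central-difference d/dt
    d2 = np.gradient(d1, axis=0)                   # second derivative
    feats = np.stack([coefs, d1, d2], axis=2).reshape(len(frames), 3 * n_coef)
    return (feats - feats.mean()) / feats.std()    # zero mean, unit variance

rng = np.random.default_rng(6)
feats = frame_features(rng.standard_normal(800))   # 10 windows of 80 samples
```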
<br />
== RNN Transducer ==<br />
<br />
When building a speech recognition classifier, it is important to note that the input and output sequences are of different lengths (sound data versus phonemes). Additionally, RNNs require segmented input data. One approach to solving both of these problems is to align the output (labels) to the input (sound data), but more often than not an aligned dataset is not available. In this paper, the Connectionist Temporal Classification (CTC) method is used to create a probability distribution between input and output sequences. This is augmented with an RNN that predicts each phoneme given the previous phonemes. The two predictions are then combined by a feed-forward network. The authors call this approach an RNN transducer. From the distributions of the RNN and CTC, a maximum-likelihood decoding for a given input can be computed to find the corresponding output label.<br />
<br />
<math>h(x) = \arg \max_{l \in L^{\leq T}} P(l | x)</math><br />
<br />
Where:<br />
<br />
* <math>h(x)</math>: classifier<br />
* <math>x</math>: input sequence<br />
* <math>l</math>: label<br />
* <math>L</math>: alphabet<br />
* <math>T</math>: maximum sequence length<br />
* <math>P(l | x)</math>: probability distribution of <math>l</math> given <math>x</math><br />
<br />
The value of <math>h(x)</math> cannot be computed directly; it is approximated with methods such as best path decoding and prefix search decoding. The authors chose to use a graph search algorithm called beam search.<br />
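A minimal beam search over per-step conditional distributions looks like the following. The `step_probs` interface is a toy stand-in; the actual decoder scores prefixes with the combined CTC and prediction-network distribution.<br />

```python
import math

def beam_search(step_probs, T, beam_width=3):
    """Keep the beam_width most probable prefixes by log-probability at each of
    T steps, approximating argmax_l P(l | x) without enumerating all sequences."""
    beams = [((), 0.0)]                            # (prefix, log-probability)
    for _ in range(T):
        candidates = []
        for prefix, lp in beams:
            for sym, p in step_probs(prefix).items():
                candidates.append((prefix + (sym,), lp + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]

# Toy distribution: "a" is always most likely, so the best label is all "a"s.
step_probs = lambda prefix: {"a": 0.7, "b": 0.2, "c": 0.1}
best = beam_search(step_probs, T=4)
```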
<br />
== Network Output Layer ==<br />
<br />
Two different network output layers were used, however most experimental results were reported for a simple softmax probability distribution vector over the set of <math>K =<br />
62</math> symbols, corresponding to the 61 phonemes in the corpus and an additional ''null'' symbol indicating that no phoneme distinct from the previous one was detected. This model is referred to as a Connectionist Temporal Classification (CTC) output function. The other (more complicated) output layer was not rigorously compared with a softmax output, and had nearly identical performance; this summary defers a description of this method, a so-called ''RNN transducer'' to the original paper.<br />
<br />
== Network Training Procedure ==<br />
<br />
The parameters in all ANNs were determined using Stochastic Gradient Descent with a fixed update step size (learning rate) of <math>10^{-4}</math> and a Nesterov momentum term of 0.9. The initial parameters were uniformly randomly drawn from <math>[-0.1,0.1]</math>. The optimization procedure was initially run with data instances from the standard 462-speaker training set of the TIMIT corpus. As a stopping criterion for the training, a secondary testing subset of 50 speakers was used, on which the phoneme error rate (PER) was computed in each iteration of the optimization algorithm. The initial training phase for each network was halted once the PER stopped decreasing on this testing subset; using the parameters at this point as the initial weights, the optimization procedure was then re-run with Gaussian noise with zero mean and <math>\sigma = 0.075</math> added element-wise to the parameters for each input vector instance <math>({{\mathbf{x}}}_1,\ldots, {{\mathbf{x}}}_T)</math> as a form of regularization. The second optimization procedure was again halted once the PER stopped decreasing on the testing dataset. Multiple trials in each of these numerical experiments were not performed, and as such, the variability in performance due to the initial values of the parameters in the optimization routine is unknown.<br />
<br />
= TIMIT Corpus Experiments &amp; Results =<br />
<br />
== Numerical Experiments ==<br />
<br />
To investigate the performance of the Bidirectional LSTM architecture as a function of depth, numerical experiments were conducted with networks with <math>N \in \{1,2,3,5\}</math> layers and 250 hidden units per layer. These are denoted in the paper by the network names CTC-<math>N</math>L-250H (where <math>N</math> is the layer depth), and are summarized with the number of free model parameters in the table below.<br />
<br />
{|<br />
!Network Name<br />
!# of parameters<br />
|-<br />
|CTC-1l-250h<br />
|0.8M<br />
|-<br />
|CTC-2l-250h<br />
|2.3M<br />
|-<br />
|CTC-3l-250h<br />
|3.8M<br />
|-<br />
|CTC-5l-250h<br />
|6.8M<br />
|}<br />
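<br />
The parameter counts in the table above can be sanity-checked against the standard LSTM parameterization used in the paper (4 gate/cell blocks per direction, each with an input-to-hidden and a hidden-to-hidden matrix plus a bias, 3 diagonal peephole vectors, and a final softmax layer over the 62 symbols). Assuming 120-dimensional input vectors (40 DFT coefficients with first and second time derivatives) — an assumption for this bookkeeping sketch — the counts reproduce the table to one decimal place:<br />

```python
def lstm_dir_params(n_in, n_h):
    # 4 gate/cell blocks, each with an input->hidden and a hidden->hidden
    # matrix plus a bias vector, and 3 diagonal peephole weight vectors
    return 4 * (n_in * n_h + n_h * n_h + n_h) + 3 * n_h

def bidir_ctc_params(n_layers, n_h, n_in=120, n_out=62):
    total = 0
    for layer in range(n_layers):
        inp = n_in if layer == 0 else 2 * n_h    # lower layer is bidirectional
        total += 2 * lstm_dir_params(inp, n_h)   # forward + backward direction
    return total + 2 * n_h * n_out + n_out       # softmax output layer

counts = {n: round(bidir_ctc_params(n, 250) / 1e6, 1) for n in (1, 2, 3, 5)}
# counts -> {1: 0.8, 2: 2.3, 3: 3.8, 5: 6.8}, matching the table
```

That this simple accounting matches all four rows suggests the table's counts follow directly from the architecture description.<br />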
<br />
Additional experiments included: a 1-layer model with 3.8M weights, a 3-layer bidirectional ANN with <math>\tanh</math> activation functions rather than LSTM cells, and a 3-layer unidirectional LSTM model with 3.8M weights (the same number of free parameters as the bidirectional 3-layer LSTM model). Finally, two experiments were performed with a bidirectional LSTM model with 3 hidden layers of 250 hidden units each and an RNN transducer output function: one used uniformly randomly initialized parameters, and the other used the final (hidden) parameter weights from the CTC-3L-250H model as the initial parameter values in the optimization algorithm. The names of these experiments are summarized below, where TRANS and PRETRANS denote the RNN transducer experiments initialized randomly and with (pretrained) parameters from the CTC-3L-250H model, respectively. The suffixes UNI and TANH denote the unidirectional and <math>\tanh</math> networks, respectively.<br />
<br />
{|<br />
!Network Name<br />
!# of parameters<br />
|-<br />
|CTC-1l-622h<br />
|3.8M<br />
|-<br />
|CTC-3l-421h-uni<br />
|3.8M<br />
|-<br />
|CTC-3l-500h-tanh<br />
|3.7M<br />
|-<br />
|Trans-3l-250h<br />
|4.3M<br />
|-<br />
|PreTrans-3l-250h<br />
|4.3M<br />
|}<br />
<br />
== Results ==<br />
<br />
The percentage phoneme error rates and number of epochs in the SGD optimization procedure for the LSTM experiments on the TIMIT dataset with varying network depth are shown below. The PER decreases monotonically with depth; however, the difference between 3 and 5 layers is negligible, and it is possible that the 0.2 percentage-point gap is within the statistical fluctuations induced by the SGD optimization routine and the initial parameter values. Note that the allocation of the epochs between the initial training phase without noise and the second optimization phase with Gaussian noise added is unspecified in the paper.<br />
<br />
{|<br />
!Network<br />
!# of Parameters<br />
!Epochs<br />
!PER<br />
|-<br />
|CTC-1l-250h<br />
|0.8M<br />
|82<br />
|23.9%<br />
|-<br />
|CTC-2l-250h<br />
|2.3M<br />
|55<br />
|21.0%<br />
|-<br />
|CTC-3l-250h<br />
|3.8M<br />
|124<br />
|18.6%<br />
|-<br />
|CTC-5l-250h<br />
|6.8M<br />
|150<br />
|18.4%<br />
|}<br />
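<br />
The PER figures above are, in the standard formulation, edit distances between the decoded and reference phoneme sequences normalized by the reference length. A minimal sketch is given below; note that the paper's exact scoring procedure (for example, any folding of the 61 phoneme labels into a smaller set before scoring, as is common for TIMIT) may differ.<br />

```python
def phoneme_error_rate(ref, hyp):
    """Levenshtein (edit) distance between the reference and hypothesis
    phoneme sequences, normalized by the (non-empty) reference length."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n] / m

# e.g. one substitution and one deletion against a 5-phoneme reference -> 40% PER
per = phoneme_error_rate(["a", "e", "i", "o", "u"], ["a", "x", "i", "o"])
```
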
<br />
The second set of PER results is shown below. The unidirectional LSTM architecture CTC-3L-421H-UNI achieves an error rate that is greater than that of the CTC-3L-250H model by 1 percentage point. No further comparative experiments between unidirectional and bidirectional models were given, however, and the margin of statistical uncertainty is unknown; thus the 1 percentage-point (absolute) difference may or may not be significant. The TRANS-3L-250H model achieves a nearly identical PER to the CTC softmax model (a 0.3 percentage-point difference); note, however, that it has 0.5M ''more'' parameters due to the additional classification network at the output, and the comparison is hence not entirely fair since the model has a greater dimensionality. The pretrained model PRETRANS-3L-250H also has 4.3M parameters and achieves the best performance, with a 17.7% error rate. Note that the difference in training of these two RNN transducer models lies primarily in their initialization: the PRETRANS model was initialized using the trained weights of the CTC-3L-250H model (for the hidden layers). Thus, the 0.6 percentage-point difference in error rate is the direct result of different starting iterates in the optimization procedure, which must be kept in mind when comparing the models.<br />
<br />
{|<br />
!Network<br />
!# of Parameters<br />
!Epochs<br />
!PER<br />
|-<br />
|CTC-1l-622h<br />
|3.8M<br />
|87<br />
|23.0%<br />
|-<br />
|CTC-3l-500h-tanh<br />
|3.7M<br />
|107<br />
|37.6%<br />
|-<br />
|CTC-3l-421h-uni<br />
|3.8M<br />
|115<br />
|19.6%<br />
|-<br />
|Trans-3l-250h<br />
|4.3M<br />
|112<br />
|18.3%<br />
|-<br />
|'''PreTrans-3l-250h'''<br />
|'''4.3M'''<br />
|'''144'''<br />
|'''17.7%'''<br />
|}<br />
<br />
= Further works =<br />
The first two authors later extended the method so that it can readily be integrated with word-level language models <ref> Graves, A.; Jaitly, N.; Mohamed, A.-R, “Hybrid speech recognition with Deep Bidirectional LSTM," [http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6707742]</ref>. They used a hybrid approach in which frame-level acoustic targets were produced by a forced alignment from a GMM-HMM system. <br />
<br />
= References =<br />
<br />
A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in <span>''Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on''</span>, pp. 6645–6649, IEEE, 2013.<br />
<br />
C. Lopes and F. Perdig<span>ã</span>o, “Phone recognition on the TIMIT database,” <span>''Speech Technologies/Book''</span>, vol. 1, pp. 285–302, 2011.<br />
<br />
A.-r. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” <span>''Audio, Speech, and Language Processing, IEEE Transactions on''</span>, vol. 20, no. 1, pp. 14–22, 2012.<br />
<br />
F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, “Learning precise timing with LSTM recurrent networks,” <span>''The Journal of Machine Learning Research''</span>, vol. 3, pp. 115–143, 2003.<br />
<br />
A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM networks,” in <span>''Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on''</span>, vol. 4, pp. 2047–2052, IEEE, 2005.</div>
<hr />
<div>= Overview =<br />
<br />
This document is a summary of the paper ''Speech recognition with deep recurrent neural networks'' by A. Graves, A.-R. Mohamed, and G. Hinton, which appeared in the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). The first and third authors are Artificial Neural Network (ANN) researchers, while Mohamed works in the field of automatic speech recognition.<br />
<br />
The paper presents the application of bidirectional multilayer Long Short-Term Memory (LSTM) ANNs with 1–5 layers to phoneme recognition on the TIMIT acoustic-phonetic corpus, which is the standard benchmark in the field of acoustic recognition, extending the previous work by Mohamed and Hinton on this topic using Deep Belief Networks. The TIMIT corpus contains audio recordings of 6300 sentences spoken by 630 (American) English speakers from 8 regions with distinct dialects, where each recording has accompanying manually labelled transcriptions of the phonemes in the audio clips alongside timestamp information. The empirical classification accuracies reported in the literature before the publication of this paper are shown in the timeline below (note that in this figure, the accuracy metric is 100% − PER, where PER is the phoneme classification error rate).<br />
<br />
The deep LSTM networks presented with 3 or more layers obtain phoneme classification error rates of 19.6% or less, with one model obtaining 17.7%, which was the best result reported in the literature at the time, outperforming the previous record of 20.7% achieved by Mohamed et al. Furthermore, the error rate decreases monotonically with LSTM network depth for 1–5 layers. While the bidirectional LSTM model performs well on the TIMIT corpus, any potential advantage of bidirectional over unidirectional LSTM network models cannot be determined from this paper, since the performance comparison is across different numbers of iterations taken in the optimization algorithm used to train the models, and multiple trials for statistical validity were not performed.<br />
<br />
<br />
[[File:timit.png | frame | center |Timeline of percentage phoneme recognition accuracy achieved on the core TIMIT corpus, from Lopes and Perdigao, 2011. ]]<br />
<br />
== Motivation ==<br />
Neural networks have been trained for speech recognition problems before, though usually in combination with hidden Markov models. The authors argue that, since speech is an inherently dynamic process, RNNs are a natural choice for this problem. There have been attempts to train RNNs for speech recognition <ref>A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling Un-segmented Sequence Data with Recurrent Neural Networks,” in ICML, Pittsburgh, USA, 2006.</ref> <ref> A. Graves, Supervised sequence labelling with recurrent neural networks, vol. 385, Springer, 2012.</ref> <ref> A. Graves, “Sequence transduction with recurrent neural networks,” in ICML Representation Learning Workshop, 2012.</ref> and RNNs with LSTM for recognizing cursive handwriting <ref> A. Graves, S. Fernandez, M. Liwicki, H. Bunke, and J. Schmidhuber, “Unconstrained Online Handwriting Recognition with Recurrent Neural Networks,” in NIPS, 2008.</ref>, but neither had made a significant impact on speech recognition. The authors drew inspiration from convolutional neural networks, in which multiple layers are stacked on top of one another, and combined this idea of depth with LSTM-based RNNs.<br />
<br />
However, instead of using a conventional RNN, which only considers previous context, a Bidirectional RNN (BRNN) <ref> M. Schuster and K. K. Paliwal, “Bidirectional Recurrent Neural Networks,”IEEE Transactions on Signal Processing, vol. 45, pp. 2673–2681, 1997.</ref> was used to consider both forward and backward contexts. The authors saw no reason not to exploit future context, since whole speech utterances are transcribed at once. Additionally, a BRNN has the benefit of being able to consider the entire forward and backward context, not just some predefined window of it.<br />
<br />
[[File:brnn.png|center|600px]]<br />
<br />
= Deep RNN models considered by Graves et al. =<br />
<br />
In this paper, Graves et al. use deep LSTM network models. We briefly review recurrent neural networks, which form the basis of the more complicated LSTM network, in which the activation <math>\mathcal{H}</math> is replaced by a composite function with additional parameter vectors associated with the ''state'' of each neuron. Finally, a description of ''bidirectional'' ANNs is given, which are used throughout the numerical experiments.<br />
<br />
== Recurrent Neural Networks ==<br />
<br />
Recall that a standard 1-layer recurrent neural network (RNN) computes the hidden vector sequence <math>{\boldsymbol h} = ({{\mathbf{h}}}_1,\ldots,{{\mathbf{h}}}_T)</math> and output vector sequence <math>{{\boldsymbol {{\mathbf{y}}}}}= ({{\mathbf{y}}}_1,\ldots,{{\mathbf{y}}}_T)</math> from an input vector sequence <math>{{\boldsymbol {{\mathbf{x}}}}}= ({{\mathbf{x}}}_1,\ldots,{{\mathbf{x}}}_T)</math> through the following equation where the index is from <math>t=1</math> to <math>T</math>:<br />
<br />
<math>{{\mathbf{h}}}_t = \begin{cases}<br />
{\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{\mathbf{b_{h}}}}}\right) &\quad t = 1\\<br />
{\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{{\mathbf{W}}}_{h h}}}{{\mathbf{h}}}_{t-1} + {{{\mathbf{b_{h}}}}}\right) &\quad \text{else}<br />
\end{cases}</math><br />
<br />
and<br />
<br />
<math>{{\mathbf{y}}}_t = {{{{\mathbf{W}}}_{h y}}}{{\mathbf{h}}}_t + {{{\mathbf{b_{y}}}}}.</math><br />
<br />
The <math>{{\mathbf{W}}}</math> terms are the parameter matrices with subscripts denoting the layer location (<span>e.g. </span><math>{{{{\mathbf{W}}}_{x h}}}</math> is the input-hidden weight matrix), and the offset <math>b</math> terms are bias vectors with appropriate subscripts (<span>e.g. </span><math>{{{\mathbf{b_{h}}}}}</math> is the hidden bias vector). The function <math>{\mathcal{H}}</math> is an elementwise vector function with a range of <math>[0,1]</math> for each component in the hidden layer.<br />
<br />
This paper considers multilayer RNN architectures, with the same hidden layer function used for all <math>N</math> layers. In this model, the hidden vector in the <math>n</math>th layer, <math>{\boldsymbol h}^n</math>, is generated by the rule<br />
<br />
<math>{{\mathbf{h}}}^n_t = {\mathcal{H}}\left({{\mathbf{W}}}_{h^{n-1}h^{n}} {{\mathbf{h}}}^{n-1}_t +<br />
{{\mathbf{W}}}_{h^{n}h^{n}} {{\mathbf{h}}}^n_{t-1} + {{{\mathbf{b_{h}}}}}^n \right),</math><br />
<br />
where <math>{\boldsymbol h}^0 = {{\boldsymbol {{\mathbf{x}}}}}</math>. The final network output vector in the <math>t</math>th step of the output sequence, <math>{{\mathbf{y}}}_t</math>, is<br />
<br />
<math>{{\mathbf{y}}}_t = {{{\mathbf{W}}}_{h^N y}} {{\mathbf{h}}}^N_t + {{{\mathbf{b_{y}}}}}.</math><br />
<br />
This is pictured in the figure below for an arbitrary layer and time step.<br />
[[File:rnn_graves.png | frame | center |Fig 1. Schematic of a Recurrent Neural Network at an arbitrary layer and time step. ]]<br />
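These recursions can be sketched directly in NumPy. The forward pass below is an illustrative sketch only: the sigmoid choice for <math>\mathcal{H}</math>, the layer sizes, and the random weights are assumptions rather than the paper's configuration, and initializing <math>{{\mathbf{h}}}^n_0 = \mathbf{0}</math> reproduces the special <math>t=1</math> case without a separate branch.<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deep_rnn_forward(x_seq, layers, W_hy, b_y):
    """layers: list of (W_in, W_rec, b) per hidden layer.
    Implements h^n_t = H(W_in h^{n-1}_t + W_rec h^n_{t-1} + b^n) with h^0 = x;
    starting from h^n_0 = 0 makes the t = 1 case fall out of the general rule."""
    h_prev = [np.zeros(b.shape[0]) for (_, _, b) in layers]
    ys = []
    for x_t in x_seq:
        inp = x_t
        for n, (W_in, W_rec, b) in enumerate(layers):
            h_prev[n] = sigmoid(W_in @ inp + W_rec @ h_prev[n] + b)
            inp = h_prev[n]                 # feed layer n's state to layer n+1
        ys.append(W_hy @ inp + b_y)         # affine output from the top layer
    return ys

rng = np.random.default_rng(0)
d_x, d_h, d_y, T, N = 4, 6, 3, 5, 2        # illustrative dimensions only
layers = [(rng.normal(scale=0.1, size=(d_h, d_x if n == 0 else d_h)),
           rng.normal(scale=0.1, size=(d_h, d_h)),
           np.zeros(d_h)) for n in range(N)]
W_hy, b_y = rng.normal(scale=0.1, size=(d_y, d_h)), np.zeros(d_y)
ys = deep_rnn_forward([rng.normal(size=d_x) for _ in range(T)], layers, W_hy, b_y)
print(len(ys), ys[0].shape)  # 5 (3,)
```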
<br />
== Long Short-term Memory Architecture ==<br />
<br />
Graves et al. consider a Long Short-Term Memory (LSTM) architecture from Gers et al. This model replaces <math>\mathcal{H}(\cdot)</math> by a composite function that incurs additional parameter matrices, and hence a higher-dimensional model. Each neuron in the network (<span>i.e. </span> row of a parameter matrix <math>{{\mathbf{W}}}</math>) has an associated state vector <math>{{\mathbf{c}}}_t</math> at step <math>t</math>, which is a function of the previous state <math>{{\mathbf{c}}}_{t-1}</math>, the input <math>{{\mathbf{x}}}_t</math> at step <math>t</math>, and the previous step’s hidden state <math>{{\mathbf{h}}}_{t-1}</math> as<br />
<br />
<math>{{\mathbf{c}}}_t = {{\mathbf{f}}}_t \circ {{\mathbf{c}}}_{t-1} + {{\mathbf{i}}}_t \circ \tanh<br />
\left({{{\mathbf{W}}}_{x {{\mathbf{c}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{c}}}}} {{\mathbf{h}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{c}}}\right)</math><br />
<br />
where <math>\circ</math> denotes the Hadamard product (elementwise vector multiplication), and the vector <math>{{\mathbf{i}}}_t</math> denotes the so-called ''input gate'' vector of the cell, which is generated by the rule<br />
<br />
<math>{{\mathbf{i}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{i}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{h {{\mathbf{i}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}<br />
{{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{i}}}\right),</math><br />
<br />
and <math>{{\mathbf{f}}}_t</math> is the ''forget gate'' vector, which is given by<br />
<br />
<math>{{\mathbf{f}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{f}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{h {{\mathbf{f}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{f}}}}}<br />
{{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{f}}}\right)</math><br />
<br />
Each <math>{{\mathbf{W}}}</math> matrix and bias vector <math>{{\mathbf{b}}}</math> is a free parameter in the model and must be trained. Since <math>{{\mathbf{f}}}_t</math> multiplies the previous state <math>{{\mathbf{c}}}_{t-1}</math> in a Hadamard product with each element in the range <math>[0,1]</math>, it can be understood to reduce or dampen the effect of <math>{{\mathbf{c}}}_{t-1}</math> relative to the new gated input. The final hidden output state is then<br />
<br />
<math>{{\mathbf{h}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{o}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{o}}}}} {{\mathbf{h}}}_{t-1}<br />
+ {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{o}}}}} {{\mathbf{c}}}_{t} + {{\mathbf{b}}}_{{\mathbf{o}}}\right)\circ \tanh({{\mathbf{c}}}_t)</math><br />
<br />
In all of these equations, <math>\sigma</math> denotes the logistic sigmoid function. Note furthermore that <math>{{\mathbf{i}}}</math>, <math>{{\mathbf{f}}}</math>, <math>{{\mathbf{o}}}</math> and <math>{{\mathbf{c}}}</math> are all of the same dimension as the hidden vector <math>h</math>. In addition, the weight matrices from the cell to gate vectors (<span>e.g. </span><math>{{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}</math>) are ''diagonal'', such that each such parameter matrix is merely a scaling matrix.<br />
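A single step of these LSTM cell equations can be sketched as follows. The dimensions and random parameters are illustrative assumptions; since the cell-to-gate matrices are diagonal, they are stored here as vectors applied by elementwise product.<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of the LSTM cell equations above; the diagonal cell-to-gate
    weights wci, wcf, wco are vectors applied by elementwise product."""
    i = sigmoid(p["Wxi"] @ x + p["Whi"] @ h_prev + p["wci"] * c_prev + p["bi"])
    f = sigmoid(p["Wxf"] @ x + p["Whf"] @ h_prev + p["wcf"] * c_prev + p["bf"])
    c = f * c_prev + i * np.tanh(p["Wxc"] @ x + p["Whc"] @ h_prev + p["bc"])
    o = sigmoid(p["Wxo"] @ x + p["Who"] @ h_prev + p["wco"] * c + p["bo"])
    return o * np.tanh(c), c    # hidden state h_t and cell state c_t

rng = np.random.default_rng(0)
d_x, d_h = 3, 4                 # illustrative dimensions only
p = {k: rng.normal(scale=0.1, size=(d_h, d_x)) for k in ("Wxi", "Wxf", "Wxc", "Wxo")}
p.update({k: rng.normal(scale=0.1, size=(d_h, d_h)) for k in ("Whi", "Whf", "Whc", "Who")})
p.update({k: rng.normal(scale=0.1, size=d_h) for k in ("wci", "wcf", "wco")})
p.update({k: np.zeros(d_h) for k in ("bi", "bf", "bc", "bo")})
h, c = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), p)
print(h.shape, bool(np.all(np.abs(h) < 1)))  # (4,) True
```

Since <math>{{\mathbf{h}}}_t</math> is a product of a sigmoid gate and <math>\tanh({{\mathbf{c}}}_t)</math>, each of its components is strictly bounded in magnitude by 1, which the final check illustrates.<br />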
<br />
== Bidirectional RNNs ==<br />
<br />
A bidirectional RNN adds another layer of complexity by computing 2 hidden vectors per layer. Neglecting the <math>n</math> superscripts for the layer index, the ''forward'' hidden vector is determined through the conventional recursion as<br />
<br />
<math>{\overrightarrow{{{\mathbf{h}}}}}_t = {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overrightarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}{\overrightarrow{{{\mathbf{h}}}}}}} {\overrightarrow{{{\mathbf{h}}}}}_{t-1} + {{\mathbf{b_{{\overrightarrow{{{\mathbf{h}}}}}}}}} \right),</math><br />
<br />
while the ''backward'' hidden state is determined recursively from the ''reversed'' sequence <math>({{\mathbf{x}}}_T,\ldots,{{\mathbf{x}}}_1)</math> as<br />
<br />
<math>{\overleftarrow{{{\mathbf{h}}}}}_t = {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overleftarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t +<br />
{{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}{\overleftarrow{{{\mathbf{h}}}}}}} {\overleftarrow{{{\mathbf{h}}}}}_{t+1} + {{\mathbf{b_{{\overleftarrow{{{\mathbf{h}}}}}}}}}\right).</math><br />
<br />
The final output for the single layer state is then an affine transformation of <math>{\overrightarrow{{{\mathbf{h}}}}}_t</math> and <math>{\overleftarrow{{{\mathbf{h}}}}}_t</math> as <math>{{\mathbf{y}}}_t = {{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}y}} {\overrightarrow{{{\mathbf{h}}}}}_t + {{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}y}} {\overleftarrow{{{\mathbf{h}}}}}_t<br />
+ {{{\mathbf{b_{y}}}}}.</math><br />
<br />
The combination of LSTM with bidirectional dependence has been previously used by Graves and Schmidhuber, and is extended to multilayer networks in this paper. The motivation for this is to use dependencies on both prior and posterior vectors in the sequence to predict a given output at any time step. In other words, a forward and backward context is used.<br />
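A minimal sketch of one bidirectional layer, assuming a sigmoid <math>\mathcal{H}</math> and illustrative dimensions: the forward recursion runs over <math>({{\mathbf{x}}}_1,\ldots,{{\mathbf{x}}}_T)</math>, the backward one over the reversed sequence, and the two hidden states are combined by an affine map at each step.<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bidirectional_layer(x_seq, Wf_x, Wf_h, bf, Wb_x, Wb_h, bb, Wfy, Wby, by):
    """One bidirectional layer: a forward recursion over x_1..x_T, a backward
    recursion over the reversed sequence, combined affinely per step."""
    T, d = len(x_seq), bf.shape[0]
    h_fwd, h_bwd = [None] * T, [None] * T
    h = np.zeros(d)
    for t in range(T):                       # forward pass
        h = sigmoid(Wf_x @ x_seq[t] + Wf_h @ h + bf)
        h_fwd[t] = h
    h = np.zeros(d)
    for t in reversed(range(T)):             # backward pass over reversed input
        h = sigmoid(Wb_x @ x_seq[t] + Wb_h @ h + bb)
        h_bwd[t] = h
    return [Wfy @ h_fwd[t] + Wby @ h_bwd[t] + by for t in range(T)]

rng = np.random.default_rng(0)
d_x, d_h, d_y, T = 3, 4, 2, 6                # illustrative dimensions only
Wf_x, Wb_x = (rng.normal(scale=0.5, size=(d_h, d_x)) for _ in range(2))
Wf_h, Wb_h = (rng.normal(scale=0.5, size=(d_h, d_h)) for _ in range(2))
bf = bb = np.zeros(d_h)
Wfy, Wby = (rng.normal(scale=0.5, size=(d_y, d_h)) for _ in range(2))
by = np.zeros(d_y)
xs = [rng.normal(size=d_x) for _ in range(T)]
ys = bidirectional_layer(xs, Wf_x, Wf_h, bf, Wb_x, Wb_h, bb, Wfy, Wby, by)
xs2 = xs[:-1] + [xs[-1] + 1.0]               # perturb only the final input
ys2 = bidirectional_layer(xs2, Wf_x, Wf_h, bf, Wb_x, Wb_h, bb, Wfy, Wby, by)
changed = [not np.allclose(a, b) for a, b in zip(ys, ys2)]
print(changed)
```

Perturbing only the final input changes outputs at earlier steps, illustrating that each <math>{{\mathbf{y}}}_t</math> depends on the entire backward context, not only on past inputs.<br />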
<br />
= Network Training for Phoneme Recognition =<br />
<br />
This section describes the phoneme classification experiments performed with the TIMIT corpus. An overview of the input timeseries audio data preprocessing into frequency domain vectors is given, and the optimization techniques are described.<br />
<br />
== Frequency Domain Processing ==<br />
<br />
Recall that for a real signal <math>{f(t)}</math>, the Fourier transform<br />
<br />
<math>{F(\omega)}= \int_{-\infty}^{\infty}e^{-i\omega t} {f(t)}dt</math><br />
<br />
can be computed for discrete samples <math>{f_0, f_1, \cdots, f_{N-1}}</math> via the discrete Fourier transform<br />
<br />
<math>F_k = \sum_{n=0}^{N-1} f_n \cdot e^{-i \frac{2 \pi k n}{N}},</math><br />
<br />
where <math>{F_k}</math> are the discrete coefficients of the (amplitude) spectral distribution of the signal <math>f</math> in the frequency domain. This is a particularly powerful representation of audio data, since the modulation produced by the tongue and lips when shifting position while the larynx vibrates induces the frequency changes that make up a phonetic alphabet. A particular example is the spectrogram below. A spectrogram is a heat map representation of a matrix of data in which each pixel has intensity proportional to the magnitude of the matrix entry at that location. In this case, the matrix is a windowed Fourier transform of an audio signal spectrum such that the frequency of the audio signal as a function of time can be observed. This spectrogram shows the frequencies of my voice while singing the first bar of ''Hey Jude''; the bright pixels below 200 Hz show the base note, while the fainter lines at integer multiples of the base notes show resonant harmonics.<br />
<br />
[[File:spect.png | frame | center | Spectrogram of the first bar of Hey Jude, showing the frequency amplitude coefficients changing over time as the intensity of the pixels in the heat map.]]<br />
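The windowed transform underlying such a spectrogram can be sketched with NumPy's FFT. The 440 Hz test tone, the non-overlapping 10 ms rectangular windows, and the absence of any filter-bank are all illustrative assumptions, not the paper's preprocessing.<br />

```python
import numpy as np

fs = 16000                               # sampling frequency in Hz
t = np.arange(fs) / fs                   # 1 s of sample times
signal = np.sin(2 * np.pi * 440 * t)     # hypothetical 440 Hz test tone

win = fs // 100                          # 10 ms window = 160 samples at 16 kHz
n_win = len(signal) // win
frames = signal[: n_win * win].reshape(n_win, win)   # non-overlapping windows
spec = np.abs(np.fft.rfft(frames, axis=1))           # magnitude spectrum per window
peak_hz = spec[0].argmax() * fs / win    # bin spacing is fs / win = 100 Hz
print(spec.shape, peak_hz)
```

Plotting <code>spec</code> as a heat map over (window index, frequency bin) gives exactly the kind of spectrogram shown above; the strongest bin of each frame sits near the tone's 440 Hz fundamental, within the 100 Hz bin resolution.<br />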
<br />
== Input Vector Format ==<br />
<br />
For each audio waveform in the TIMIT dataset, the Fourier coefficients were computed with a sliding Discrete Fourier Transform (DFT). The window duration used was 10 ms, corresponding to <math>n_s = 160</math> samples per DFT window, since each waveform in the corpus was digitally registered with a sampling frequency of <math>f_s = 16</math> kHz; 40 coefficients were retained at each timestep <math>t</math>, <math>\{c^{[t]}_k\}_{k=1}^{40}</math>. In addition, the first and second time derivatives of the coefficients between adjacent DFT windows were computed (the methodology is unspecified; most likely this was performed with a numerical central difference technique). Thus, the input vector to the network at step <math>t</math> was the concatenated vector<br />
<br />
<math>{{\mathbf{x}}}_t = [c^{[t]}_1, \frac{d}{dt}c^{[t]}_1,<br />
\frac{d^2}{dt^2}c^{[t]}_1, c_2^{[t]}, \frac{d}{dt}c^{[t]}_2,<br />
\frac{d^2}{dt^2}c^{[t]}_2 \ldots]^T.</math><br />
<br />
Finally, an additional preprocessing step was performed: each input vector was normalized such that the dataset had zero mean and unit variance.<br />
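The construction of the input vectors can be sketched as follows. The use of <code>np.gradient</code> (central differences) is an assumption, since the paper does not specify the differencing method; normalization is applied per utterance here rather than over the whole dataset, and grouping the coefficients and derivatives in blocks rather than interleaving them is cosmetic.<br />

```python
import numpy as np

def make_inputs(coeffs):
    """coeffs: (T, K) array of per-window spectral coefficients.
    Returns a (T, 3K) array of inputs [c, dc/dt, d^2c/dt^2] per step."""
    d1 = np.gradient(coeffs, axis=0)   # central differences between adjacent windows
    d2 = np.gradient(d1, axis=0)       # second derivative by repeated differencing
    x = np.concatenate([coeffs, d1, d2], axis=1)
    x = (x - x.mean(axis=0)) / x.std(axis=0)   # zero mean, unit variance
    return x

rng = np.random.default_rng(1)
coeffs = rng.normal(size=(50, 40))     # hypothetical: 50 windows, 40 coefficients
x = make_inputs(coeffs)
print(x.shape)  # (50, 120)
```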
<br />
== RNN Transducer ==<br />
<br />
When building a speech recognition classifier, it is important to note that the input and output sequences are of different lengths (sound data versus phonemes). Additionally, standard RNN training requires pre-segmented input data. One approach to solving both of these problems is to align the output labels to the input sound data, but more often than not an aligned dataset is not available. In this paper, the Connectionist Temporal Classification (CTC) method is used to define a probability distribution over output sequences given the inputs. This is augmented with an RNN that predicts each phoneme given the previous phonemes. The two predictions are then combined by a feed-forward network. The authors call this approach an RNN transducer. From the combined distribution of the RNN and CTC, a maximum-likelihood decoding for a given input can be computed to find the corresponding output label:<br />
<br />
<math>h(x) = \arg \max_{l \in L^{\leq T}} P(l | x)</math><br />
<br />
Where:<br />
<br />
* <math>h(x)</math>: classifier<br />
* <math>x</math>: input sequence<br />
* <math>l</math>: label<br />
* <math>L</math>: alphabet<br />
* <math>T</math>: maximum sequence length<br />
* <math>P(l | x)</math>: probability distribution of <math>l</math> given <math>x</math><br />
<br />
The value of <math>h(x)</math> cannot be computed directly; it is approximated with methods such as best-path or prefix-search decoding. The authors chose to use a graph search algorithm called beam search.<br />
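Best-path decoding, the simplest of these approximations, commits to the most probable symbol at each frame and then collapses the framewise path, whereas beam search instead keeps several candidate prefixes per frame. A toy sketch with a hypothetical three-symbol alphabet:<br />

```python
import numpy as np

def best_path_decode(probs, blank=0):
    """Best-path approximation of h(x): take the argmax symbol at each frame,
    collapse consecutive repeats, then drop blanks. probs: (T, K) rows of
    frame-wise symbol probabilities."""
    path = probs.argmax(axis=1)
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(int(s))
        prev = s
    return out

# hypothetical 5-frame distribution over {blank=0, a=1, b=2}
probs = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.8, 0.1, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.1, 0.1, 0.8]])
print(best_path_decode(probs))  # [1, 2]
```

The repeated frames of symbol 1 collapse to a single emission, and the blank frame separates it from symbol 2; beam search generalizes this by summing probability over all framewise paths that collapse to the same prefix.<br />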
<br />
== Network Output Layer ==<br />
<br />
Two different network output layers were used; however, most experimental results were reported for a simple softmax probability distribution vector over the set of <math>K = 62</math> symbols, corresponding to the 61 phonemes in the corpus and an additional ''null'' symbol indicating that no phoneme distinct from the previous one was detected. This model is referred to as a Connectionist Temporal Classification (CTC) output function. The other (more complicated) output layer, the so-called ''RNN transducer'' outlined above, was not rigorously compared with a softmax output and had nearly identical performance.<br />
<br />
== Network Training Procedure ==<br />
<br />
The parameters in all ANNs were determined using Stochastic Gradient Descent (SGD) with a fixed update step size (learning rate) of <math>10^{-4}</math> and a Nesterov momentum term of 0.9. The initial parameters were uniformly randomly drawn from <math>[-0.1,0.1]</math>. The optimization procedure was initially run with data instances from the standard 462-speaker training set of the TIMIT corpus. As a stopping criterion for the training, a secondary testing subset of 50 speakers was used, on which the phoneme error rate (PER) was computed in each iteration of the optimization algorithm. The initial training phase for each network was halted once the PER stopped decreasing on this testing subset; using the parameters at this point as the initial weights, the optimization procedure was then re-run with Gaussian noise with zero mean and <math>\sigma = 0.075</math> added element-wise to the parameters for each input vector instance <math>({{\mathbf{x}}}_1,\ldots, {{\mathbf{x}}}_T)</math> as a form of regularization. The second optimization procedure was again halted once the PER stopped decreasing on the testing dataset. Multiple trials of each of these numerical experiments were not performed, and as such, the variability in performance due to the initial values of the parameters in the optimization routine is unknown.<br />
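The update rule can be sketched as follows, using the fixed settings quoted above (learning rate <math>10^{-4}</math>, Nesterov momentum 0.9); the toy quadratic loss is an illustrative assumption. The second training phase would re-run the same loop with zero-mean Gaussian noise (<math>\sigma = 0.075</math>) added to the parameters for each input instance.<br />

```python
import numpy as np

def nesterov_sgd_step(w, v, grad_fn, lr=1e-4, mu=0.9):
    """One SGD update with Nesterov momentum: evaluate the gradient at the
    look-ahead point w + mu * v, then fold it into the velocity."""
    g = grad_fn(w + mu * v)
    v = mu * v - lr * g
    return w + v, v

grad = lambda w: w                     # gradient of the toy loss ||w||^2 / 2
w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(2000):
    w, v = nesterov_sgd_step(w, v, grad)
print(np.linalg.norm(w) < 1.0)  # True: the iterate contracts toward the minimum
```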
<br />
= TIMIT Corpus Experiments &amp; Results =<br />
<br />
== Numerical Experiments ==<br />
<br />
To investigate the performance of the Bidirectional LSTM architecture as a function of depth, numerical experiments were conducted with networks with <math>N \in \{1,2,3,5\}</math> layers and 250 hidden units per layer. These are denoted in the paper by the network names CTC-<math>N</math>L-250H (where <math>N</math> is the layer depth), and are summarized with the number of free model parameters in the table below.<br />
<br />
{|<br />
!Network Name<br />
!# of parameters<br />
|-<br />
|CTC-1l-250h<br />
|0.8M<br />
|-<br />
|CTC-2l-250h<br />
|2.3M<br />
|-<br />
|CTC-3l-250h<br />
|3.8M<br />
|-<br />
|CTC-5l-250h<br />
|6.8M<br />
|}<br />
<br />
Additional experiments included: a 1-layer model with 3.8M weights, a 3-layer bidirectional ANN with <math>\tanh</math> activation functions rather than LSTM, and a 3-layer unidirectional LSTM model with 3.8M weights (the same number of free parameters as the bidirectional 3-layer LSTM model). Finally, two experiments were performed with a bidirectional LSTM model with 3 hidden layers of 250 hidden units each and an RNN transducer output function. One of these experiments used uniformly randomly initialized parameters, and the other used the final (hidden) parameter weights from the CTC-3L-250H model as the initial parameter values in the optimization algorithm. The names of these experiments are summarized below, where TRANS and PRETRANS denote the RNN transducer experiments initialized randomly and using (pretrained) parameters from the CTC-3L-250H model, respectively. The suffixes UNI and TANH denote the unidirectional and <math>\tanh</math> networks, respectively.<br />
<br />
{|<br />
!Network Name<br />
!# of parameters<br />
|-<br />
|CTC-1l-622h<br />
|3.8M<br />
|-<br />
|CTC-3l-421h-uni<br />
|3.8M<br />
|-<br />
|CTC-3l-500h-tanh<br />
|3.7M<br />
|-<br />
|Trans-3l-250h<br />
|4.3M<br />
|-<br />
|PreTrans-3l-250h<br />
|4.3M<br />
|}<br />
<br />
== Results ==<br />
<br />
The percentage phoneme error rates and the number of epochs in the SGD optimization procedure for the LSTM experiments on the TIMIT dataset with varying network depth are shown below. The PER can be seen to decrease monotonically with depth; however, there is a negligible difference between 3 and 5 layers, and it is possible that the 0.2% difference is within the statistical fluctuations induced by the SGD optimization routine and the initial parameter values. Note that the allocation of the epochs into either the initial training without noise or the second optimization routine with Gaussian noise added (or both) is unspecified in the paper.<br />
<br />
{|<br />
!Network<br />
!# of Parameters<br />
!Epochs<br />
!PER<br />
|-<br />
|CTC-1l-250h<br />
|0.8M<br />
|82<br />
|23.9%<br />
|-<br />
|CTC-2l-250h<br />
|2.3M<br />
|55<br />
|21.0%<br />
|-<br />
|CTC-3l-250h<br />
|3.8M<br />
|124<br />
|18.6%<br />
|-<br />
|CTC-5l-250h<br />
|6.8M<br />
|150<br />
|18.4%<br />
|}<br />
<br />
The second set of PER results is shown below. The unidirectional LSTM architecture CTC-3L-421H-UNI achieves an error rate that is greater than that of the CTC-3L-250H model by 1 percentage point. No further comparative experiments between unidirectional and bidirectional models were given, however, and the margin of statistical uncertainty is unknown; thus the 1% (absolute) difference may or may not be significant. The TRANS-3L-250H model achieves a nearly identical PER to the CTC softmax model (a 0.3% difference); however, note that it has 0.5M ''more'' parameters due to the additional classification network at the output, and is hence not an entirely fair comparison since it has a greater dimensionality. The pretrained model PRETRANS-3L-250H also has 4.3M parameters and sees the best performance, with a 17.7% error rate. Note that the difference in training of these two RNN transducer models is primarily in their initialization: the PRETRANS model was initialized using the trained weights of the CTC-3L-250H model (for the hidden layers). Thus, this difference in error rate of 0.6% is the direct result of different starting iterates in the optimization procedure, which must be kept in mind when comparing between models.<br />
<br />
{|<br />
!Network<br />
!# of Parameters<br />
!Epochs<br />
!PER<br />
|-<br />
|CTC-1l-622h<br />
|3.8M<br />
|87<br />
|23.0%<br />
|-<br />
|CTC-3l-500h-tanh<br />
|3.7M<br />
|107<br />
|37.6%<br />
|-<br />
|CTC-3l-421h-uni<br />
|3.8M<br />
|115<br />
|19.6%<br />
|-<br />
|Trans-3l-250h<br />
|4.3M<br />
|112<br />
|18.3%<br />
|-<br />
|'''PreTrans-3l-250h'''<br />
|'''4.3M'''<br />
|'''144'''<br />
|'''17.7%'''<br />
|}<br />
<br />
= Further works =<br />
The first two authors later extended the method so that it could be readily integrated with word-level language models<ref> Graves, A.; Jaitly, N.; Mohamed, A.-R., “Hybrid speech recognition with Deep Bidirectional LSTM" [http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6707742]</ref>. They used a hybrid approach in which frame-level acoustic targets are produced by a forced alignment given by a GMM-HMM system. <br />
<br />
= References =<br />
<br />
A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in <span>''Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on''</span>, pp. 6645–6649, IEEE, 2013.<br />
<br />
C. Lopes and F. Perdig<span>ã</span>o, “Phone recognition on the TIMIT database,” <span>''Speech Technologies/Book''</span>, vol. 1, pp. 285–302, 2011.<br />
<br />
A.-r. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” <span>''Audio, Speech, and Language Processing, IEEE Transactions on''</span>, vol. 20, no. 1, pp. 14–22, 2012.<br />
<br />
F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, “Learning precise timing with LSTM recurrent networks,” <span>''The Journal of Machine Learning Research''</span>, vol. 3, pp. 115–143, 2003.<br />
<br />
A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM networks,” in <span>''Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on''</span>, vol. 4, pp. 2047–2052, IEEE, 2005.</div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=f15Stat946PaperSignUp&diff=25979f15Stat946PaperSignUp2015-11-09T18:53:27Z<p>Mgohari2: /* Set B */</p>
<hr />
<div> <br />
=[https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/listofpapers1.pdf List of Papers]=<br />
<br />
= Record your contributions [https://docs.google.com/spreadsheets/d/1A_0ej3S6ns3bBMwWLS4pwA6zDLz_0Ivwujj-d1Gr9eo/edit?usp=sharing here:]=<br />
<br />
Use the following notations:<br />
<br />
S: You have written a summary on the paper<br />
<br />
T: You had technical contribution on a paper (excluding the paper that you present from set A or critique from set B)<br />
<br />
E: You had editorial contribution on a paper (excluding the paper that you present from set A or critique from set B)<br />
<br />
[http://goo.gl/forms/RASFRZXoxJ Your feedback on presentations]<br />
<br />
<br />
=Set A=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="400pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Oct 16 || Pascal Poupart || || Guest Lecturer||||<br />
|-<br />
|Oct 16 || Pascal Poupart || ||Guest Lecturer ||||<br />
|-<br />
|Oct 23 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 23 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 23 ||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Oct 23 || Deepak Rishi || || Parsing natural scenes and natural language with recursive neural networks || [http://www-nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf Paper] || [[Parsing natural scenes and natural language with recursive neural networks | Summary]]<br />
|-<br />
|Oct 30 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 30 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Oct 30 ||Rui Qiao || ||Going deeper with convolutions || [http://arxiv.org/pdf/1409.4842v1.pdf Paper]|| [[GoingDeeperWithConvolutions|Summary]]<br />
|-<br />
|Oct 30 ||Amirreza Lashkari|| 21 ||Overfeat: integrated recognition, localization and detection using convolutional networks. || [http://arxiv.org/pdf/1312.6229v4.pdf Paper]|| [[Overfeat: integrated recognition, localization and detection using convolutional networks|Summary]]<br />
|-<br />
|Makeup Class (TBA) || Peter Blouw|| ||Memory Networks.|| [http://arxiv.org/abs/1410.3916]|| [[Memory Networks|Summary]]<br />
|-<br />
|Nov 6 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Nov 6 || Ali Ghodsi || || Lecturer||||<br />
|-<br />
|Nov 6 || Anthony Caterini ||56 || Human-level control through deep reinforcement learning ||[http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf Paper]|| [[Human-level control through deep reinforcement learning|Summary]]<br />
|-<br />
|Nov 6 || Sean Aubin || ||Learning Hierarchical Features for Scene Labeling ||[http://yann.lecun.com/exdb/publis/pdf/farabet-pami-13.pdf Paper]||[[Learning Hierarchical Features for Scene Labeling|Summary]]<br />
|-<br />
|Nov 13|| Mike Hynes || 12 ||Speech recognition with deep recurrent neural networks || [http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf Paper] || [[Graves et al., Speech recognition with deep recurrent neural networks|Summary]]<br />
|-<br />
|Nov 13 || Tim Tse || || From Machine Learning to Machine Reasoning ||[http://research.microsoft.com/pubs/206768/mlj-2013.pdf Paper] || [[From Machine Learning to Machine Reasoning | Summary ]]<br />
|-<br />
|Nov 13 || Maysum Panju || ||Neural machine translation by jointly learning to align and translate ||[http://arxiv.org/pdf/1409.0473v6.pdf Paper] || [[Neural Machine Translation: Jointly Learning to Align and Translate|Summary]]<br />
|-<br />
|Nov 13 || Abdullah Rashwan || || Deep neural networks for acoustic modeling in speech recognition. ||[http://research.microsoft.com/pubs/171498/HintonDengYuEtAl-SPM2012.pdf paper]|| [[Deep neural networks for acoustic modeling in speech recognition| Summary]]<br />
|-<br />
|Nov 20 || Valerie Platsko || ||Natural language processing (almost) from scratch. ||[http://arxiv.org/pdf/1103.0398.pdf Paper]||<br />
|-<br />
|Nov 20 || Brent Komer || ||Show, Attend and Tell: Neural Image Caption Generation with Visual Attention || [http://arxiv.org/pdf/1502.03044v2.pdf Paper]||[[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention|Summary]]<br />
|-<br />
|Nov 20 || Luyao Ruan || || Dropout: A Simple Way to Prevent Neural Networks from Overfitting || [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf Paper]|| [[dropout | Summary]]<br />
|-<br />
|Nov 20 || Ali Mahdipour || || The human splicing code reveals new insights into the genetic determinants of disease ||[https://www.sciencemag.org/content/347/6218/1254806.full.pdf Paper] ||<br />
|-<br />
|Nov 27 ||Mahmood Gohari || ||Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships ||[http://pubs.acs.org/doi/abs/10.1021/ci500747n.pdf Paper]||<br />
|-<br />
|Nov 27 || Derek Latremouille || ||The Wake-Sleep Algorithm for Unsupervised Neural Networks || [http://www.gatsby.ucl.ac.uk/~dayan/papers/hdfn95.pdf Paper] ||<br />
|-<br />
|Nov 27 ||Xinran Liu || ||ImageNet Classification with Deep Convolutional Neural Networks ||[http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Paper]||[[ImageNet Classification with Deep Convolutional Neural Networks|Summary]]<br />
|-<br />
|Nov 27 ||Ali Sarhadi|| ||Strategies for Training Large Scale Neural Network Language Models||||<br />
|-<br />
|Dec 4 || Chris Choi || || On the difficulty of training recurrent neural networks || [http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf Paper] || [[On the difficulty of training recurrent neural networks | Summary]]<br />
|-<br />
|Dec 4 || Fatemeh Karimi || ||Multiple Object Recognition with Visual Attention||[http://arxiv.org/pdf/1412.7755v2.pdf Paper]||<br />
|-<br />
|Dec 4 || Jan Gosmann || || A fast learning algorithm for deep belief nets || [http://www.mitpressjournals.org/doi/pdf/10.1162/neco.2006.18.7.1527 Paper] || [[A fast learning algorithm for deep belief nets | Summary]]<br />
|-<br />
|Dec 4 || Dylan Drover || || Towards AI-complete question answering: a set of prerequisite toy tasks || [http://arxiv.org/pdf/1502.05698.pdf Paper] ||<br />
|-<br />
|}<br />
|}<br />
<br />
=Set B=<br />
<br />
{| class="wikitable" border="1" cellpadding="3"<br />
|-<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="400pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Anthony Caterini ||15 ||The Manifold Tangent Classifier ||[http://papers.nips.cc/paper/4409-the-manifold-tangent-classifier.pdf Paper]||<br />
|-<br />
|Jan Gosmann || || Neural Turing machines || [http://arxiv.org/abs/1410.5401 Paper] || [[Neural Turing Machines|Summary]]<br />
|-<br />
|Brent Komer || || Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers || [http://arxiv.org/pdf/1202.2160v2.pdf Paper] ||<br />
|-<br />
|Sean Aubin || || Deep Sparse Rectifier Neural Networks || [http://jmlr.csail.mit.edu/proceedings/papers/v15/glorot11a/glorot11a.pdf Paper] || [[Deep Sparse Rectifier Neural Networks|Summary]]<br />
|-<br />
|Peter Blouw|| || Generating text with recurrent neural networks || [http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf Paper] ||<br />
|-<br />
|Tim Tse|| || Question answering with subgraph embeddings || [http://arxiv.org/pdf/1406.3676v3.pdf Paper] ||<br />
|-<br />
|Rui Qiao|| || Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation || [http://arxiv.org/pdf/1406.1078v3.pdf Paper] || [[Learning Phrase Representations|Summary]]<br />
|-<br />
|Fatemeh Karimi|| 23 || Very Deep Convolutional Networks for Large-Scale Image Recognition || [http://arxiv.org/pdf/1409.1556.pdf Paper] || [[Very Deep Convoloutional Networks for Large-Scale Image Recognition|Summary]]<br />
|-<br />
|Amirreza Lashkari|| 43 || Distributed Representations of Words and Phrases and their Compositionality || [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf Paper] || [[Distributed Representations of Words and Phrases and their Compositionality|Summary]]<br />
|-<br />
|Xinran Liu|| || Joint training of a convolutional network and a graphical model for human pose estimation || [http://papers.nips.cc/paper/5573-joint-training-of-a-convolutional-network-and-a-graphical-model-for-human-pose-estimation.pdf Paper] || [[Joint training of a convolutional network and a graphical model for human pose estimation|Summary]]<br />
|-<br />
|Chris Choi|| || Learning Long-Range Vision for Autonomous Off-Road Driving || [http://yann.lecun.com/exdb/publis/pdf/hadsell-jfr-09.pdf Paper] || [[Learning Long-Range Vision for Autonomous Off-Road Driving|Summary]]<br />
|-<br />
|Luyao Ruan|| || Deep Learning of the tissue-regulated splicing code || [http://bioinformatics.oxfordjournals.org/content/30/12/i121.full.pdf+html Paper] || [[Deep Learning of the tissue-regulated splicing code| Summary]]<br />
|-<br />
|Abdullah Rashwan|| || Deep Convolutional Neural Networks For LVCSR || [http://www.cs.toronto.edu/~asamir/papers/icassp13_cnn.pdf paper] || [[Deep Convolutional Neural Networks For LVCSR| Summary]]<br />
|-<br />
|Mahmood Gohari||37 || On using very large target vocabulary for neural machine translation || [http://arxiv.org/pdf/1412.2007v2.pdf Paper] || [[On using very large target vocabulary for neural machine translation| Summary]]<br />
|}</div>

proposal for STAT946 (Deep Learning) final projects Fall 2015 (revision of 2015-11-09)<br />
<hr />
<div>'''Project 0:''' (This is just an example)<br />
<br />
'''Group members:'''first name family name, first name family name, first name family name<br />
<br />
'''Title:''' Sentiment Analysis on Movie Reviews<br />
<br />
''' Description:''' The idea and data for this project is taken from http://www.kaggle.com/c/sentiment-analysis-on-movie-reviews.<br />
Sentiment analysis is the problem of determining whether a given string contains positive or negative sentiment. For example, “A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story” contains negative sentiment, but it is not immediately clear which parts of the sentence make it so.<br />
This competition seeks to implement machine learning algorithms that can determine the sentiment of a movie review.<br />
<br />
'''Project 1:'''<br />
<br />
'''Group members:''' Sean Aubin, Brent Komer<br />
<br />
'''Title:''' Convolutional Neural Networks in SLAM<br />
<br />
''' Description:''' We will try to replicate the results reported in [http://arxiv.org/abs/1411.1509 Convolutional Neural Networks-based Place Recognition] using [http://caffe.berkeleyvision.org/ Caffe] and [http://arxiv.org/abs/1409.4842 GoogLeNet]. As a "stretch" goal, we will try to convert the CNN to a spiking neural network (a technique created by Eric Hunsberger) for greater biological plausibility and easier integration with other cognitive systems using Nengo. This work will help Brent start his PhD work investigating cognitive localisation systems and object manipulation.<br />
<br />
'''Project 2:'''<br />
<br />
'''Group members:''' Xinran Liu, Fatemeh Karimi, Deepak Rishi & Chris Choi<br />
<br />
'''Title:''' Image Classification with Deep Learning<br />
<br />
''' Description:''' Our aim is to participate in the Digit Recognizer Kaggle challenge, where one has to correctly classify the Modified National Institute of Standards and Technology (MNIST) dataset of handwritten numerical digits. For our first approach we propose using a simple feed-forward neural network to form a baseline for comparison. We then plan to experiment with different aspects of a neural network, such as network architecture and activation functions, and to incorporate a wide variety of training methods.<br />
<br />
'''Project 3'''<br />
<br />
'''Group members:''' Ri Wang, Maysum Panju, Mahmood Gohari<br />
<br />
'''Title:''' Machine Translation Using Neural Networks<br />
<br />
'''Description:''' The goal of this project is to translate languages using different types of neural networks and the algorithms described in "Sequence to sequence learning with neural networks" and "Neural machine translation by jointly learning to align and translate". Different vector representations for input sentences (word frequency, Word2Vec, etc.) will be used and all combinations of algorithms will be ranked in terms of accuracy.<br />
Our data will mainly be from [http://www.statmt.org/europarl/ Europarl] and [https://tatoeba.org/eng Tatoeba]. The common target language will be English to allow for easier judgement of translation quality.<br />
<br />
'''Project 4'''<br />
<br />
'''Group members:''' Peter Blouw, Jan Gosmann<br />
<br />
'''Title:''' Using Structured Representations in Memory Networks to Perform Question Answering<br />
<br />
'''Description:''' Memory networks are machine learning systems that combine memory and inference to perform tasks that involve sophisticated reasoning (see [http://arxiv.org/pdf/1410.3916.pdf here] and [http://arxiv.org/pdf/1502.05698v7.pdf here]). Our goal in this project is to first implement a memory network that replicates prior performance on the bAbI question-answering tasks described in [http://arxiv.org/pdf/1502.05698v7.pdf Weston et al. (2015)]. Then, we hope to improve upon this baseline performance by using more sophisticated representations of the sentences that encode questions being posed to the network. Current implementations often use a bag of words encoding, which throws out important syntactic information that is relevant to determining what a particular question is asking. As such, we will explore the use of things like POS tags, n-gram information, and parse trees to augment memory network performance.<br />
<br />
'''Project 5'''<br />
<br />
'''Group members:''' Anthony Caterini, Tim Tse<br />
<br />
'''Title:''' The Allen AI Science Challenge<br />
<br />
'''Description:''' The goal of this project is to create an artificial intelligence model that can answer multiple-choice questions on a grade 8 science exam, with a success rate better than the best 8th graders. This will involve a deep neural network as the underlying model, to help parse the large amount of information needed to answer these questions. The model should also learn, over time, how to make better answers by acquiring more and more data. This is a Kaggle challenge, and the link to the challenge is [https://www.kaggle.com/c/the-allen-ai-science-challenge here]. The data to produce the model will come from the Kaggle website.<br />
<br />
'''Project 6''' <br />
<br />
'''Group members:''' Valerie Platsko<br />
<br />
'''Title:''' Classification for P300-Speller Using Convolutional Neural Networks <br />
<br />
''' Description:''' The goal of this project is to replicate (and possibly extend) the results in [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5492691 Convolutional Neural Networks for P300 Detection with Application to Brain-Computer Interfaces], which used convolutional neural networks to recognize P300 responses in recorded EEG and additionally to correctly recognize attended targets. (In the P300-Speller application, letters flash in rows and columns, so a single P300 response is associated with multiple potential targets.) The data in the paper came from http://www.bbci.de/competition/iii/ (dataset II), and there is an additional P300 Speller dataset available from [http://www.bbci.de/competition/ii/ a previous version of the competition].<br />
<br />
'''Project 7''' <br />
<br />
'''Group members:''' Amirreza Lashkari, Derek Latremouille, Rui Qiao and Luyao Ruan<br />
<br />
'''Title:''' Right Whale Recognition <br />
<br />
''' Description:''' The goal of this project is to design an automated right whale recognition process using a dataset of aerial photographs of individual whales. To do so, a deep neural network will be applied in order to extract features and classify objects (whales, in this problem). This is a Kaggle challenge, and the data is also provided by the challenge (see [https://www.kaggle.com/c/noaa-right-whale-recognition here]).</div>

learning Hierarchical Features for Scene Labeling (revision of 2015-11-06)<br />
<hr />
<div>= Introduction =<br />
<br />
This paper considers the problem of ''scene parsing'', in which every pixel in the image is assigned to a category that delineates a distinct object or region. For instance, an image of a cow on a field can be segmented into an image of a cow and an image of a field, with a clear delineation between the two. An example input image and resultant output is shown below to demonstrate this.<br />
<br />
'''Test input''': The input into the network was a static image such as the one below:<br />
<br />
[[File:cows_in_field.png | 500px ]]<br />
<br />
'''Training data and desired result''': The desired result (which is the same format as the training data given to the network for supervised learning) is an image with large features labelled.<br />
<br />
<gallery widths="500px" heights="400px"><br />
Image:labeled_cows.png|Labeled Result<br />
<br />
</gallery><br />
<br />
[[File:cow_legend.png]]<br />
<br />
One of the difficulties in solving this problem is that traditional convolutional neural networks (CNNs) only take a small region around each pixel into account, which is often not sufficient for labeling it, as the correct label is determined by the context on a larger scale. To tackle this problem, the authors extend the method of sharing weights between spatial locations, as in traditional CNNs, to share weights across multiple scales. This is achieved by generating multiple scaled versions of the input image. Furthermore, the weight sharing across scales leads to the learning of scale-invariant features.<br />
<br />
A multi-scale convolutional network is trained from raw pixels to extract dense feature vectors that encode regions of multiple sizes centered on each pixel for scene labeling. Also a technique is proposed to automatically retrieve an optimal set of components that best explain the scene from a pool of segmentation components.<br />
<br />
= Related work =<br />
<br />
A preliminary work <ref><br />
Grangier, David, Léon Bottou, and Ronan Collobert. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.183.8571&rep=rep1&type=pdf "Deep convolutional networks for scene parsing."] ICML 2009 Deep Learning Workshop. Vol. 3. 2009.<br />
</ref> on using convolutional neural networks for scene parsing showed that CNNs fed with raw pixels could be trained to perform scene parsing with decent accuracy.<br />
<br />
Another previous work <ref><br />
Hannes Schulz and Sven Behnke. [https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2012-160.pdf "Learning Object-Class Segmentation with Convolutional Neural Networks."] 11th European Symposium on Artificial Neural Networks (ESANN). Vol. 3. 2012.<br />
</ref> also uses convolutional neural networks for image segmentation. Pairwise class location filters are used to improve the raw output. The current work uses the image gradient instead, which was found to increase accuracy and better respect image boundaries.<br />
<br />
= Methodology =<br />
<br />
Below we can see a flow of the overall approach.<br />
<br />
[[File:yann_flow.png | 1200px | frame | center |Figure 1. Diagram of the scene parsing system. The raw input image is transformed through a Laplacian pyramid. Each scale is fed to a 3-stage convolutional network, which produces a set of feature maps. The feature maps of all<br />
scales are concatenated, the coarser-scale maps being upsampled to match the size of the finest-scale map. Each feature vector thus represents a large contextual window around each pixel. In parallel, a single segmentation (i.e. superpixels), or a family of segmentations (e.g. a segmentation tree) are computed to exploit the natural contours of the image. The final labeling is produced from the feature vectors and the segmentation(s) using different methods. ]]<br />
<br />
This model consists of two parallel components providing two complementary image representations. In the first component, an image patch is seen as a point in <math>\mathbb R^P</math> and we seek a transform <math>f:\mathbb R^P \to \mathbb R^Q</math> that maps each patch into <math>\mathbb R^Q</math>, a space where it can be classified linearly. With traditional convolutional neural networks this stage usually suffers from two main problems: (1) the window considered rarely contains an object that is centred and scaled, and (2) integrating a large context involves increasing the grid size, and therefore the dimensionality <math>P</math>; it is then necessary to enforce some invariance in the function <math>f</math> itself. This is usually achieved through pooling, but pooling degrades the model's ability to precisely locate and delineate objects. In this paper, <math>f</math> is implemented by a multiscale convolutional neural network, which allows integrating large contexts into local decisions while remaining manageable in terms of parameters/dimensionality. In the second component, the image is seen as an edge-weighted graph, on which one or several oversegmentations can be constructed. The resulting components are spatially accurate and naturally delineate objects, as this representation conserves pixel-level precision. A classifier is then applied to the aggregated feature grid of each node.<br />
<br />
== Pre-processing ==<br />
<br />
Before being put into the Convolutional Neural Network (CNN) multiple scaled versions of the image are generated. The set of these scaled images is called a ''pyramid''. There were three different scale outputs of the image created, in a similar manner shown in the picture below<br />
<br />
[[File:Image_pyramid.png ]]<br />
<br />
The scaling can be done by different transforms; the paper suggests using the Laplacian transform. The Laplacian is the sum of second partial derivatives <math>\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}</math>. A two-dimensional discrete approximation is given by the matrix <math>\left[\begin{array}{ccc}0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0\end{array}\right]</math>.<br />
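As a concrete illustration, the discrete Laplacian above can be applied to progressively downsampled copies of an image to form a simple three-scale pyramid. The Python sketch below is not the authors' code, only a minimal illustration (numpy and scipy are assumed to be available):<br />

```python
import numpy as np
from scipy.ndimage import convolve, zoom

# 3x3 discrete Laplacian kernel from the text
LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=float)

def laplacian_pyramid(img, n_scales=3):
    """Band-pass pyramid: Laplacian response of the image at each of
    n_scales scales, each scale half the size of the previous one."""
    pyramid = []
    current = img.astype(float)
    for _ in range(n_scales):
        pyramid.append(convolve(current, LAPLACIAN, mode='nearest'))
        current = zoom(current, 0.5, order=1)  # downsample by a factor of 2
    return pyramid

img = np.random.rand(64, 64)
pyr = laplacian_pyramid(img)
print([p.shape for p in pyr])  # [(64, 64), (32, 32), (16, 16)]
```

Note that on a constant image the Laplacian response is exactly zero, so each level of this pyramid keeps only local intensity variation, the band-pass behaviour the transform is chosen for.<br />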
<br />
== Network Architecture ==<br />
<br />
The proposed scene parsing architecture has two main components: Multi-scale convolutional representation and Graph-based classification.<br />
<br />
In the first representation, for each scale of the Laplacian pyramid, a typical 3-stage CNN architecture was used (each of the first two stages is composed of three layers: convolution of a kernel with the feature map, a non-linearity, and pooling). The function tanh served as the non-linearity. The kernels used were 7x7 Toeplitz matrices (matrices with constant values along their diagonals). The pooling operation was performed by the 2x2 max-pool operator. The same CNN was applied to all of the differently sized images. Since the parameters were shared between the networks, the ''same'' connection weights were applied to all of the images, thus allowing for the detection of scale-invariant features. The outputs of all CNNs at each scale are upsampled and concatenated to produce a map of feature vectors. The authors believe that the more scales used to jointly train the models, the better the representation becomes for all scales.<br />
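To make the weight sharing across scales concrete, the following numpy/scipy sketch (an illustration, not the authors' implementation) applies a single stage, convolution with a shared 7x7 filter bank, tanh, and 2x2 max pooling, unchanged to two scales of the same image:<br />

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
kernels = 0.1 * rng.standard_normal((4, 7, 7))  # one filter bank, shared by all scales

def stage(img, kernels):
    """One CNN stage: 7x7 convolutions -> tanh non-linearity -> 2x2 max pooling."""
    maps = np.stack([convolve2d(img, k, mode='valid') for k in kernels])
    maps = np.tanh(maps)
    c, h, w = maps.shape
    h, w = h - h % 2, w - w % 2                 # crop to even size for pooling
    # 2x2 max pool via reshaping into non-overlapping blocks
    return maps[:, :h, :w].reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

img = rng.random((64, 64))
coarse = img[::2, ::2]                          # a second, half-resolution scale
print(stage(img, kernels).shape)                # (4, 29, 29)
print(stage(coarse, kernels).shape)             # (4, 13, 13): same weights, smaller map
```

The same `kernels` array processes every scale, which is exactly what makes the learned features scale-invariant.<br />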
<br />
In the second representation, the image is seen as an edge-weighted graph<ref><br />
Shotton, Jamie, et al.[http://www.csd.uwo.ca/~olga/Courses/Fall2013/CS9840/PossibleStudentPapers/eccv06.pdf "Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation." ]Computer Vision–ECCV 2006. Springer Berlin Heidelberg, 2006. 1-15.<br />
</ref><ref><br />
Fulkerson, Brian, Andrea Vedaldi, and Stefano Soatto. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.150.4613&rep=rep1&type=pdf "Class segmentation and object localization with superpixel neighborhoods."] Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009.<br />
</ref>, on which one or several over-segmentations can be constructed and used to group the feature descriptors. This graph segmentation technique was taken from another paper<ref><br />
Felzenszwalb, Pedro F., and Daniel P. Huttenlocher.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.150.4613&rep=rep1&type=pdf "Efficient graph-based image segmentation."] International Journal of Computer Vision 59.2 (2004): 167-181.<br />
</ref>. Three techniques are proposed to produce the final image labelling as discussed below in the Post-Processing section.<br />
<br />
Stochastic gradient descent was used for training the filters. To avoid over-fitting, the training images were augmented via jitter, horizontal flipping, rotations between +8 and -8 degrees, and rescaling between 90 and 110%. The objective function was the ''cross entropy'' loss function, [https://jamesmccaffrey.wordpress.com/2013/11/05/why-you-should-use-cross-entropy-error-instead-of-classification-error-or-mean-squared-error-for-neural-network-classifier-training/ which takes the closeness of a prediction into account in the error]. With respect to the actual training procedure, once the output feature maps for each image in the Laplacian pyramid are concatenated to produce the final map of feature vectors, these feature vectors are passed through a linear classifier that produces a probability distribution over class labels for each pixel location <math>i</math> via the softmax function. In other words, multi-class logistic regression is used to predict class labels for each pixel. Since each pixel in the image has a ground-truth class label, there is a target distribution for each pixel that can be compared to the distribution produced by the linear classifier. This comparison allows one to define the cross-entropy loss function that is used to train the filters in the CNN. In short, the derivative of the cross-entropy error with respect to the input to each class label unit can be backpropagated through the CNN to iteratively minimize the loss function. Once training is complete, taking the argmax of the predicted class distribution for each pixel gives a labelling of the entire scene. However, this labelling lacks spatial coherence, and hence the post-processing techniques described in the next section are used to introduce refinements. <br />
<br />
== Post-Processing ==<br />
<br />
Unlike previous approaches, the emphasis of this scene-labelling method was to rely on a highly accurate pixel labelling system. So, although a variety of approaches were attempted, including superpixels, conditional random fields and gPb, the simple superpixel approach already yielded state-of-the-art results.<br />
<br />
Superpixels are small, coherent regions obtained by oversegmenting the image. To label these regions, a two-layer neural network was used. Given inputs from the map of feature vectors produced by the CNN, the 2-layer network produces a distribution over class labels for each pixel in the superpixel. These distributions are averaged, and the argmax of the resulting average is then chosen as the final label for the superpixel. The picture below shows the general approach.<br />
<br />
[[File:super_pix.png]]<br />
<br />
=== Conditional Random Fields ===<br />
<br />
A standard approach for labelling is training a CRF model on the superpixels. It consists of associating the image with a graph <math>(V,E)</math> where each vertex <math>v \in V</math> is a pixel of the image, and the edges <math>E \subset V \times V</math> connect neighbouring pixels. Let <math>l_i</math> be the labelling for the <math>i</math>th pixel. The CRF energy function combines a term measuring disagreement with the predicted labels <math>d</math> and a term that penalizes local changes in labels. The first term is <math>\Phi(d_i,l_i)</math>, summed over all pixels. The second term is motivated by the idea that, for good segmentations, labels should in general have broad spatial range. As such, a penalty <math>\Psi(l_i,l_j)</math> is applied for each edge <math>e_{ij} \in E</math> whose pixels <math>v_i</math> and <math>v_j</math> carry ''different'' labels. Hence the energy function has the form<br />
<br />
[[File:Paper1p1.png ]]<br />
<br />
where <math>\Phi(d_i,l_i) = \begin{cases}<br />
e^{-\alpha d_i}& (l_i \neq d_i) \\<br />
0 & \text{else}\\<br />
\end{cases},</math><br />
<br />
and<br />
<br />
<math>\Psi(l_i,l_j) = \begin{cases}<br />
e^{-\beta \|\nabla I\|_i}& (l_i \neq l_j) \\<br />
0 & \text{else}\\<br />
\end{cases}</math><br />
<br />
for constants <math>\alpha,\beta,\gamma > 0</math>.<br />
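A small numpy sketch of the two energy terms (illustrative only; all names are hypothetical, and the gradient magnitude is evaluated at one endpoint of each edge for simplicity):<br />

```python
import numpy as np

def crf_energy(labels, pred, scores, grad_mag, alpha=1.0, beta=1.0, gamma=1.0):
    """Energy of a candidate labeling on a 2D pixel grid.
    labels, pred: HxW integer maps (candidate labels l_i and predicted labels d_i);
    scores: HxW classifier confidences; grad_mag: HxW image gradient magnitudes."""
    # Unary term Phi: cost for disagreeing with the classifier's prediction
    unary = np.where(labels != pred, np.exp(-alpha * scores), 0.0).sum()
    # Pairwise term Psi: cost for label changes, discounted across strong image edges
    right = labels[:, 1:] != labels[:, :-1]
    down = labels[1:, :] != labels[:-1, :]
    pairwise = (np.exp(-beta * grad_mag[:, 1:])[right].sum()
                + np.exp(-beta * grad_mag[1:, :])[down].sum())
    return unary + gamma * pairwise

pred = np.zeros((4, 4), dtype=int)
scores = np.ones((4, 4))
grad = np.zeros((4, 4))
print(crf_energy(pred, pred, scores, grad))       # 0.0: a uniform, agreeing labeling costs nothing
noisy = pred.copy(); noisy[2, 2] = 1
print(crf_energy(noisy, pred, scores, grad) > 0)  # True: the flipped pixel is penalized
```

Minimizing this energy therefore trades off fidelity to the classifier against spatial smoothness, with label changes tolerated where the image gradient is large.<br />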
<br />
The entire process of using CRF can be summarized below.<br />
<br />
[[File:Paper2p2.png | 1200px ]]<br />
<br />
= Model =<br />
<br />
'''Scale-invariant, Scene-level feature extraction'''<br />
<br />
Given an input image, a multiscale pyramid of images <math>\ X_s </math>, where <math>s \in \{1,...,N\}</math>, is constructed. The multiscale pyramid is typically pre-processed so that local neighborhoods have zero mean and unit standard deviation. We denote by <math>f_s</math> a classical convolutional network with parameters <math>\theta_s</math>, where the parameters are shared across all scales <math>s</math>. <br />
<br />
For a network <math>f_s</math> with L layers, we have regular convolutional network:<br />
<br />
<math>\ f_s(X_s; \theta_s)=W_LH_{L-1}</math>.<br />
<br />
<math>\ H_L </math> is the vector of hidden units at layer L, where:<br />
<br />
<math>\ H_l=pool(tanh(W_lH_{l-1}+b_l))</math>, <math> b_l </math> is a vector of bias parameter<br />
<br />
Finally, the output of N networks are upsampled and concatenated so as to produce F:<br />
<br />
<math>\ F= [f_1, u(f_2), ... , u(f_N)]</math>, where <math> u</math> is an upsampling function.<br />
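The upsampling and concatenation <math>\ F= [f_1, u(f_2), ... , u(f_N)]</math> can be sketched in a few lines of numpy (an illustration with made-up feature-map sizes; nearest-neighbour upsampling is assumed for <math>u</math>):<br />

```python
import numpy as np

def upsample(feat, factor):
    """Nearest-neighbour upsampling of a (channels, height, width) feature map."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

# hypothetical per-scale network outputs f_1, f_2, f_3 (channels x height x width)
f1 = np.random.rand(8, 32, 32)
f2 = np.random.rand(8, 16, 16)
f3 = np.random.rand(8, 8, 8)

# F = [f_1, u(f_2), u(f_3)]: coarser maps are upsampled to the finest resolution
F = np.concatenate([f1, upsample(f2, 2), upsample(f3, 4)], axis=0)
print(F.shape)  # (24, 32, 32)
```

After concatenation, each spatial location carries a 24-dimensional feature vector describing context at all three scales, which is what the pixelwise classifier consumes.<br />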
<br /> <br />
<br />
''' Learning discriminative scale-invariant features'''<br />
<br />
Ideally a linear classifier should produce the correct categorization for all pixel locations ''i'', from the feature vectors <math>F_{i}</math>. We train the parameters <math>\theta_{s}</math> to achieve this goal, using the multiclass ''cross entropy'' loss function. Let <math>\hat{c_{i}}</math> be the normalized prediction vector from the linear classifier for pixel ''i''. We compute normalized predicted probability distributions over classes <math>\hat{c}_{i,a}</math> using the softmax function:<br />
<br />
<math><br />
\hat{c}_{i,a} = \frac{e^{w^{T}_{a} F_{i}}}{\sum\nolimits_{b \in classes} e^{w^{T}_{b} F_{i}}}<br />
</math><br />
<br />
where <math>w</math> is a temporary weight matrix only used to learn the features. The cross entropy between the predicted class distribution <math>\hat{c}</math> and the target class distribution <math>c</math> penalizes their deviation and is measured by<br />
<br />
<math><br />
L_{cat} = - \sum\limits_{i \in pixels} \sum\limits_{a \in classes} c_{i,a} \ln(\hat{c}_{i,a})<br />
</math><br />
<br />
The true target probability <math>c_{i,a}</math> of class <math>a</math> being present at location <math>i</math> can either be a distribution of classes at location <math>i</math> in a given neighborhood, or a hard target vector: <math>c_{i,a} = 1</math> if pixel <math>i</math> is labeled <math>a</math>, and <math>0</math> otherwise. For training maximally discriminative features, we use hard target vectors in this first stage. Once the parameters <math>\theta_s</math> are trained, the classifier is discarded, and the feature vectors <math>F_{i}</math> are exploited using different strategies, as explained later.<br />
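These formulas can be checked numerically. With a hard target vector, the gradient of the cross entropy with respect to the class scores is simply <math>\hat{c} - c</math>; the numpy sketch below (written with the conventional minus sign, so that the loss is minimized) verifies this against finite differences:<br />

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(z, target):
    """L = -sum_a c_a ln(c_hat_a) for one pixel, with c_hat = softmax(z)."""
    return -np.sum(target * np.log(softmax(z)))

rng = np.random.default_rng(1)
z = rng.standard_normal(5)            # class scores w_a^T F_i for one pixel
c = np.zeros(5); c[2] = 1.0           # hard target vector: pixel i is labeled class 2

# Analytic gradient of the loss w.r.t. the scores is softmax(z) - c ...
analytic = softmax(z) - c
# ... which matches a central finite-difference estimate
eps = 1e-6
numeric = np.array([(cross_entropy(z + eps * np.eye(5)[a], c)
                     - cross_entropy(z - eps * np.eye(5)[a], c)) / (2 * eps)
                    for a in range(5)])
print(np.abs(analytic - numeric).max() < 1e-5)  # True
```

This simple gradient is what gets backpropagated through all scales of the network to train the shared filters.<br />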
<br />
''' Classification '''<br />
<br />
Having <math>\ F</math>, we now want to classify the superpixels.<br />
<br />
<math>\ y_i= W_2tanh(W_1F_i+b_1)</math>, <br />
<br />
<math>\ W_1</math> and <math>\ W_2</math> are trainable parameters of the classifier. <br />
<br />
<math>\ \hat{d_{i,a}}=\frac{e^{y_{i,a}}}{\sum_{b\in classes}{e^{y_{i,b}}}}</math>, <br />
<br />
<math> \hat{d_{i,a}}</math> is the predicted class distribution from the linear classifier for pixel <math>i</math> and class <math>a</math>.<br />
<br />
<math>\ \hat{d_{k,a}}= \frac{1}{s(k)}\sum_{i\in k}{\hat{d_{i,a}}}</math>,<br />
<br />
where <math>\hat{d_k}</math> is the pixelwise distribution at superpixel k, <math> s(k)</math> is the surface of component k. <br />
<br />
In this case, the final labeling for each component <math>k</math> is given by:<br />
<br />
<math>\ l_k=argmax_{a\in classes}{\hat{d_{k,a}}}</math><br />
<br />
= Results =<br />
<br />
The network was tested on the Stanford Background, SIFT Flow and Barcelona datasets.<br />
<br />
Results on the Stanford Background dataset show that super-pixels achieve state-of-the-art accuracy with minimal processing time.<br />
<br />
[[File:stanford_res.png]]<br />
<br />
Since super-pixels were shown to be so effective on the Stanford dataset, they were the only method of image segmentation used for the SIFT Flow and Barcelona datasets. Instead, the way class frequencies were exposed to the network (balanced, denoted by superscript 1, or natural, denoted by superscript 2) was explored, in conjunction with the aforementioned graph-based segmentation method combined with the optimal cover algorithm.<br />
<br />
From the SIFT Flow dataset, it can be seen that graph-based segmentation with the optimal cover method offers a significant advantage.<br />
<br />
[[File:sift_res.png]]<br />
<br />
In the Barcelona dataset, it can be seen that a dataset with many labels is too difficult for the CNN.<br />
<br />
[[File:barcelona_res.png]]<br />
<br />
= Conclusions =<br />
<br />
A wide window of contextual information, achieved through the multiscale network, substantially improves the results and diminishes the role of the post-processing stage. This allows the computationally expensive post-processing to be replaced with a simpler and faster method (e.g., majority vote) to increase efficiency without a relevant loss in classification accuracy. The paper has demonstrated that a feed-forward convolutional network, trained end-to-end and fed with raw pixels, can produce state-of-the-art performance on scene parsing datasets. The model does not rely on engineered features, and uses purely supervised training from fully-labeled images.<br />
<br />
An interesting finding in this paper is that even in the absence of any post-processing, by simply labelling each pixel with the highest-scoring category produced by the convolutional net for that location, the system yields near state-of-the-art pixel-wise accuracy.<br />
<br />
= Future Work =<br />
<br />
Aside from the usual advances to CNN architectures, such as unsupervised pre-training, rectifying non-linearities and local contrast normalization, there would be a significant benefit, especially for datasets with many labels, in having a semantic understanding of the labels. For example, understanding that a window is often part of a building or a car.<br />
<br />
There would also be considerable benefit from improving the metrics used in scene parsing. The current pixel-wise accuracy is a somewhat uninformative measure of the quality of the result. Spotting rare objects is often more important than correctly labeling every boundary pixel of a large region such as the sky. The average per-class accuracy is a step in the right direction, but the authors would prefer a system that correctly spots every object or region while giving only approximate boundaries, over a system that produces accurate boundaries for large regions (sky, road, grass, etc.) but fails to spot small objects.<br />
<br />
Long et al <ref><br />
Long J, et al . [http://arxiv.org/pdf/1411.4038v2.pdf "Fully Convolutional Networks for Semantic Segmentation"]<br />
</ref> used fully convolutional networks, extending classification nets to segmentation and improving the architecture with multi-resolution layer combinations. They compared their algorithm to the Farabet et al. approach and improved pixel accuracy by up to six percent.<br />
<br />
=References=<br />
<references /></div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Hierarchical_Features_for_Scene_Labeling&diff=25908learning Hierarchical Features for Scene Labeling2015-11-06T18:51:29Z<p>Mgohari2: /* Future Work */</p>
<hr />
<div>= Introduction =<br />
<br />
This paper considers the problem of ''scene parsing'', in which every pixel in the image is assigned to a category that delineates a distinct object or region. For instance, an image of a cow on a field can be segmented into an image of a cow and an image of a field, with a clear delineation between the two. An example input image and resultant output is shown below to demonstrate this.<br />
<br />
'''Test input''': The input into the network was a static image such as the one below:<br />
<br />
[[File:cows_in_field.png | 500px ]]<br />
<br />
'''Training data and desired result''': The desired result (which is the same format as the training data given to the network for supervised learning) is an image with large features labelled.<br />
<br />
<gallery widths="500px" heights="400px"><br />
Image:labeled_cows.png|Labeled Result<br />
<br />
</gallery><br />
<br />
[[File:cow_legend.png]]<br />
<br />
One of the difficulties in solving this problem is that traditional convolutional neural networks (CNNs) take only a small region around each pixel into account, which is often not sufficient, since the correct label may be determined by context on a much larger scale. To tackle this problem, the authors extend the method of sharing weights between spatial locations, as in traditional CNNs, to sharing weights across multiple scales. This is achieved by generating multiple scaled versions of the input image. Furthermore, the weight sharing across scales leads to the learning of scale-invariant features.<br />
<br />
A multi-scale convolutional network is trained from raw pixels to extract dense feature vectors that encode regions of multiple sizes centered on each pixel for scene labeling. Also a technique is proposed to automatically retrieve an optimal set of components that best explain the scene from a pool of segmentation components.<br />
<br />
= Related work =<br />
<br />
A preliminary work <ref><br />
Grangier, David, Léon Bottou, and Ronan Collobert. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.183.8571&rep=rep1&type=pdf "Deep convolutional networks for scene parsing."] ICML 2009 Deep Learning Workshop. Vol. 3. 2009.<br />
</ref> on using convolutional neural networks for scene parsing showed that CNNs fed with raw pixels could be trained to perform scene parsing with decent accuracy.<br />
<br />
Another previous work <ref><br />
Hannes Schulz and Sven Behnke. [https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2012-160.pdf "Learning Object-Class Segmentation with Convolutional Neural Networks."] 11th European Symposium on Artificial Neural Networks (ESANN). Vol. 3. 2012.<br />
</ref> also uses convolution neural networks for image segmentation. Pairwise class location filters are used to improve the raw output. The current work uses image gradient instead, which was found to increase accuracy and better respect image boundaries.<br />
<br />
= Methodology =<br />
<br />
Below we can see a flow of the overall approach.<br />
<br />
[[File:yann_flow.png | 1200px | frame | center |Figure 1. Diagram of the scene parsing system. The raw input image is transformed through a Laplacian pyramid. Each scale is fed to a 3-stage convolutional network, which produces a set of feature maps. The feature maps of all<br />
scales are concatenated, the coarser-scale maps being upsampled to match the size of the finest-scale map. Each feature vector thus represents a large contextual window around each pixel. In parallel, a single segmentation (i.e. superpixels), or a family of segmentations (e.g. a segmentation tree) are computed to exploit the natural contours of the image. The final labeling is produced from the feature vectors and the segmentation(s) using different methods. ]]<br />
<br />
This model consists of two parallel components corresponding to two complementary image representations. In the first component, an image patch is seen as a point in <math>\mathbb R^P</math> and we seek a transform <math>f:\mathbb R^P \to \mathbb R^Q</math> that maps each patch into <math>\mathbb R^Q</math>, a space where it can be classified linearly. This stage usually suffers from two main problems with traditional convolutional neural networks: (1) the window considered rarely contains an object that is centred and scaled, and (2) integrating a large context involves increasing the grid size, and therefore the dimensionality <math>P</math>, so it becomes necessary to enforce some invariance in the function <math>f</math> itself. This is usually achieved through pooling, but pooling degrades the model's ability to precisely locate and delineate objects. In this paper, <math>f</math> is implemented by a multiscale convolutional neural network, which allows integrating large contexts into local decisions while remaining manageable in terms of parameters/dimensionality. In the second component, the image is seen as an edge-weighted graph, on which one or several oversegmentations can be constructed. The components are spatially accurate and naturally delineate objects, as this representation preserves pixel-level precision. A classifier is then applied to the aggregated feature grid of each node.<br />
<br />
== Pre-processing ==<br />
<br />
Before being put into the Convolutional Neural Network (CNN), multiple scaled versions of the image are generated. The set of these scaled images is called a ''pyramid''. Three different scales of the image were created, in a manner similar to the picture below.<br />
<br />
[[File:Image_pyramid.png ]]<br />
<br />
The scaling can be done by different transforms; the paper suggests using the Laplacian transform. The Laplacian is the sum of partial second derivatives <math>\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}</math>. A two-dimensional discrete approximation is given by the matrix <math>\left[\begin{array}{ccc}0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0\end{array}\right]</math>.<br />
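<br />
As a minimal sketch (not the authors' code), the discrete Laplacian kernel above can be applied to a grayscale image, and a crude pyramid built by repeated 2x2 average downsampling; all function names here are illustrative.<br />
<br />
```python
# Illustrative sketch: apply the 3x3 discrete Laplacian to a grayscale
# image stored as a list of lists, and build a small pyramid by repeated
# 2x2 average downsampling. Names are made up for this example.

LAPLACIAN = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]

def convolve3x3(img, kernel):
    """Valid-mode 3x3 convolution (no padding); kernel is symmetric here."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(h - 2):
        row = []
        for j in range(w - 2):
            s = sum(kernel[a][b] * img[i + a][j + b]
                    for a in range(3) for b in range(3))
            row.append(s)
        out.append(row)
    return out

def downsample2(img):
    """Halve resolution by averaging non-overlapping 2x2 blocks."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2*i][2*j] + img[2*i][2*j+1]
              + img[2*i+1][2*j] + img[2*i+1][2*j+1]) / 4.0
             for j in range(w)] for i in range(h)]

def pyramid(img, levels=3):
    """List of progressively coarser versions of the image."""
    out = [img]
    for _ in range(levels - 1):
        img = downsample2(img)
        out.append(img)
    return out
```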
<br />
== Network Architecture ==<br />
<br />
The proposed scene parsing architecture has two main components: Multi-scale convolutional representation and Graph-based classification.<br />
<br />
In the first representation, for each scale of the Laplacian pyramid a typical 3-stage CNN architecture was used (each of the first two stages is composed of three layers: convolution of a kernel with the feature map, a non-linearity, and pooling). The function tanh served as the non-linearity. The kernels used were 7x7 Toeplitz matrices (matrices with constant values along their diagonals). The pooling operation was performed by the 2x2 max-pool operator. The same CNN was applied to all of the differently sized images. Since the parameters were shared between the networks, the ''same'' connection weights were applied to all of the images, thus allowing for the detection of scale-invariant features. The outputs of all CNNs at each scale are upsampled and concatenated to produce a map of feature vectors. The authors believe that the more scales used to jointly train the models, the better the representation becomes for all scales.<br />
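<br />
A toy version of one such stage (convolution, tanh non-linearity, 2x2 max pooling) might look as follows; this sketch uses a single channel and a 3x3 kernel for brevity, whereas the real model uses banks of 7x7 filters over many feature maps.<br />
<br />
```python
import math

# Illustrative sketch of one convolutional stage: filter -> tanh -> 2x2
# max pooling, on a single-channel feature map. Not the paper's code.

def conv_valid(img, kernel):
    """Valid-mode square convolution (no padding)."""
    k = len(kernel)
    h, w = len(img), len(img[0])
    return [[sum(kernel[a][b] * img[i + a][j + b]
                 for a in range(k) for b in range(k))
             for j in range(w - k + 1)] for i in range(h - k + 1)]

def tanh_map(fmap):
    """Apply the tanh non-linearity element-wise."""
    return [[math.tanh(v) for v in row] for row in fmap]

def maxpool2(fmap):
    """2x2 max pooling over non-overlapping blocks."""
    h, w = len(fmap) // 2, len(fmap[0]) // 2
    return [[max(fmap[2*i][2*j], fmap[2*i][2*j+1],
                 fmap[2*i+1][2*j], fmap[2*i+1][2*j+1])
             for j in range(w)] for i in range(h)]

def cnn_stage(img, kernel):
    return maxpool2(tanh_map(conv_valid(img, kernel)))
```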
<br />
In the second representation, the image is seen as an edge-weighted graph<ref><br />
Shotton, Jamie, et al.[http://www.csd.uwo.ca/~olga/Courses/Fall2013/CS9840/PossibleStudentPapers/eccv06.pdf "Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation." ]Computer Vision–ECCV 2006. Springer Berlin Heidelberg, 2006. 1-15.<br />
</ref><ref><br />
Fulkerson, Brian, Andrea Vedaldi, and Stefano Soatto. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.150.4613&rep=rep1&type=pdf "Class segmentation and object localization with superpixel neighborhoods."] Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009.<br />
</ref>, on which one or several over-segmentations can be constructed and used to group the feature descriptors. This graph segmentation technique was taken from another paper<ref><br />
Felzenszwalb, Pedro F., and Daniel P. Huttenlocher.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.150.4613&rep=rep1&type=pdf "Efficient graph-based image segmentation."] International Journal of Computer Vision 59.2 (2004): 167-181.<br />
</ref>. Three techniques are proposed to produce the final image labelling as discussed below in the Post-Processing section.<br />
<br />
Stochastic gradient descent was used for training the filters. To avoid over-fitting, the training images were augmented via jitter, horizontal flipping, rotations between +8 and -8 degrees, and rescaling between 90 and 110%. The objective function was the ''cross entropy'' loss function, [https://jamesmccaffrey.wordpress.com/2013/11/05/why-you-should-use-cross-entropy-error-instead-of-classification-error-or-mean-squared-error-for-neural-network-classifier-training/ which is a way to take into account the closeness of a prediction in the error]. With respect to the actual training procedure, once the output feature maps for each image in the Laplacian pyramid are concatenated to produce the final map of feature vectors, these feature vectors are passed through a linear classifier that produces a probability distribution over class labels for each pixel location <math>i</math> via the softmax function. In other words, multi-class logistic regression is used to predict class labels for each pixel. Since each pixel in the image has a ground-truth class label, there is a target distribution for each pixel that can be compared to the distribution produced by the linear classifier. This comparison allows one to define the cross-entropy loss function that is used to train the filters in the CNN. In short, the derivative of the cross-entropy error with respect to the input of each class label unit can be backpropagated through the CNN to iteratively minimize the loss function. Once training is complete, taking the argmax of the predicted class distribution for each pixel gives a labelling of the entire scene. However, this labelling lacks spatial coherence, and hence the post-processing techniques described in the next section are used to introduce refinements. <br />
<br />
== Post-Processing ==<br />
<br />
Unlike previous approaches, the emphasis of this scene-labelling method was to rely on a highly accurate pixel labelling system. Although a variety of approaches were attempted, including superpixels, conditional random fields and gPb, the simple superpixel approach yielded state-of-the-art results.<br />
<br />
Superpixels are small, coherent regions of pixels obtained by oversegmenting the image. To label these regions, a two-layer neural network was used. Given inputs from the map of feature vectors produced by the CNN, the 2-layer network produces a distribution over class labels for each pixel in the superpixel. These distributions are averaged, and the argmax of the resulting average is then chosen as the final label for the superpixel. The picture below shows the general approach.<br />
<br />
[[File:super_pix.png]]<br />
<br />
=== Conditional Random Fields ===<br />
<br />
A standard approach for labelling is training a CRF model on the superpixels. It consists of associating the image with a graph <math>(V,E)</math> where each vertex <math>v \in V</math> is a pixel of the image, and edges <math>E \subset V \times V</math> connect neighbouring pixels. Let <math>l_i</math> be the labelling for the <math>i</math>th pixel. The CRF energy function combines a term that penalizes deviation from the predicted label <math>d</math> with a term that penalizes local changes in labels. The first (data) term is <math>\Phi(d_i,l_i)</math> for all pixels. The second term is motivated by the idea that, for good segmentations, labels should in general have broad spatial extent. As such, a penalty <math>\Psi(l_i,l_j)</math> is applied for each pair of adjacent pixels <math>v_i</math> and <math>v_j</math> having ''different'' labels, for each <math>e_{ij} \in E</math>. Hence the energy function has the form<br />
<br />
[[File:Paper1p1.png ]]<br />
<br />
where <math>\Phi(d_i,l_i) = \begin{cases}<br />
e^{-\alpha d_i}& (l_i \neq d_i) \\<br />
0 & \text{else}\\<br />
\end{cases},</math><br />
<br />
and<br />
<br />
<math>\Psi(l_i,l_j) = \begin{cases}<br />
e^{-\beta \|\nabla I\|_i}& (l_i \neq l_j) \\<br />
0 & \text{else}\\<br />
\end{cases}</math><br />
<br />
for constants <math>\alpha,\beta,\gamma > 0</math>.<br />
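<br />
A toy computation of this energy on a 1-D chain of pixels, following the form of the two terms above, could look like this; the function and variable names are illustrative, not the paper's implementation.<br />
<br />
```python
import math

# Toy CRF energy on a 1-D chain of pixels (illustrative only).
# d[i] is the classifier's predicted label for pixel i, l[i] a candidate
# label, grad[i] a local image-gradient magnitude between pixels i and
# i+1; alpha, beta, gamma are the positive constants from the text.

def crf_energy(l, d, grad, alpha=1.0, beta=1.0, gamma=1.0):
    # Data term: penalize disagreeing with the predicted label.
    unary = sum(math.exp(-alpha * d[i]) if l[i] != d[i] else 0.0
                for i in range(len(l)))
    # Pairwise term: penalize label changes between neighbouring pixels,
    # less so across strong image gradients.
    pairwise = sum(math.exp(-beta * grad[i]) if l[i] != l[i + 1] else 0.0
                   for i in range(len(l) - 1))
    return unary + gamma * pairwise
```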
<br />
The entire process of using CRF can be summarized below.<br />
<br />
[[File:Paper2p2.png | 1200px ]]<br />
<br />
= Model =<br />
<br />
'''Scale-invariant, Scene-level feature extraction'''<br />
<br />
Given an input image, a multiscale pyramid of images <math>\ X_s </math>, where <math>s \in \{1,...,N\}</math>, is constructed. The multiscale pyramid is typically pre-processed so that local neighborhoods have zero mean and unit standard deviation. We denote by <math>f_s</math> a classical convolutional network with parameters <math>\theta_s</math>, where <math>\theta_s</math> is shared across all scales <math>s</math>. <br />
<br />
For a network <math>f_s</math> with L layers, we have regular convolutional network:<br />
<br />
<math>\ f_s(X_s; \theta_s)=W_LH_{L-1}</math>.<br />
<br />
<math>\ H_L </math> is the vector of hidden units at layer L, where:<br />
<br />
<math>\ H_l=pool(tanh(W_lH_{l-1}+b_l))</math>, <math> b_l </math> is a vector of bias parameter<br />
<br />
Finally, the output of N networks are upsampled and concatenated so as to produce F:<br />
<br />
<math>\ F= [f_1, u(f_2), ... , u(f_N)]</math>, where <math> u</math> is an upsampling function.<br />
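<br />
A minimal illustration of this upsample-and-concatenate step, assuming nearest-neighbour upsampling and single-channel maps (everything here is a simplified stand-in for the real feature maps):<br />
<br />
```python
# Sketch of producing F = [f1, u(f2), ..., u(fN)]: nearest-neighbour
# upsampling of coarser maps to the finest resolution, then per-pixel
# concatenation of features. Purely illustrative.

def upsample_nn(fmap, factor):
    """Nearest-neighbour upsampling by an integer factor."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend([wide[:] for _ in range(factor)])
    return out

def concat_features(maps):
    """maps: list of HxW scalar maps at the same resolution ->
    HxW map of feature vectors (one entry per scale)."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[[m[i][j] for m in maps] for j in range(w)] for i in range(h)]
```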
<br /> <br />
<br />
''' Learning discriminative scale-invariant features'''<br />
<br />
Ideally, a linear classifier should produce the correct categorization for all pixel locations ''i'' from the feature vectors <math>F_{i}</math>. We train the parameters <math>\theta_{s}</math> to achieve this goal, using the multiclass ''cross entropy'' loss function. Let <math>\hat{c_{i}}</math> be the normalized prediction vector from the linear classifier for pixel ''i''; its entries <math>\hat{c}_{i,a}</math>, the predicted probabilities over classes, are computed using the softmax function:<br />
<br />
<math><br />
\hat{c}_{i,a} = \frac{e^{w^{T}_{a} F_{i}}}{\sum\nolimits_{b \in classes} e^{w^{T}_{b} F_{i}}}<br />
</math><br />
<br />
where <math>w</math> is a temporary weight matrix only used to learn the features. The cross entropy between the predicted class distribution <math>\hat{c}</math> and the target class distribution <math>c</math> penalizes their deviation and is measured by<br />
<br />
<math><br />
L_{cat} = -\sum\limits_{i \in pixels} \sum\limits_{a \in classes} c_{i,a} \ln(\hat{c}_{i,a})<br />
</math><br />
<br />
The true target probability <math>c_{i,a}</math> of class <math>a</math> being present at location <math>i</math> can either be a distribution of classes at location <math>i</math> in a given neighborhood, or a hard target vector: <math>c_{i,a} = 1</math> if pixel <math>i</math> is labeled <math>a</math>, and <math>0</math> otherwise. For training maximally discriminative features, we use hard target vectors in this first stage. Once the parameters <math>\theta_s</math> are trained, the classifier is discarded, and the feature vectors <math>F_{i}</math> are used with different strategies, as explained later.<br />
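<br />
The softmax and hard-target cross entropy above can be computed as in this small sketch (illustrative only; <code>scores[a]</code> plays the role of <math>w^{T}_{a} F_{i}</math>):<br />
<br />
```python
import math

# Illustrative per-pixel softmax and cross-entropy with a hard target.

def softmax(scores):
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(pred, target_class):
    """Hard target: c_{i,a} = 1 for the true class, 0 otherwise."""
    return -math.log(pred[target_class])
```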
<br />
''' Classification '''<br />
<br />
Having <math>\ F</math>, we now want to classify the superpixels.<br />
<br />
<math>\ y_i= W_2tanh(W_1F_i+b_1)</math>, <br />
<br />
<math>\ W_1</math> and <math>\ W_2</math> are trainable parameters of the classifier. <br />
<br />
<math>\ \hat{d_{i,a}}=\frac{e^{y_{i,a}}}{\sum_{b\in classes}{e^{y_{i,b}}}}</math>, <br />
<br />
<math> \hat{d_{i,a}}</math> is the predicted class distribution from the linear classifier for pixel <math>i</math> and class <math>a</math>.<br />
<br />
<math>\ \hat{d_{k,a}}= \frac{1}{s(k)}\sum_{i\in k}{\hat{d_{i,a}}}</math>,<br />
<br />
where <math>\hat{d_k}</math> is the pixelwise class distribution averaged over superpixel <math>k</math>, and <math> s(k)</math> is the surface area (number of pixels) of component <math>k</math>. <br />
<br />
In this case, the final labeling for each component <math>k</math> is given by:<br />
<br />
<math>\ l_k=argmax_{a\in classes}{\hat{d_{k,a}}}</math><br />
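<br />
The aggregation rule above can be sketched as follows, with a dictionary of per-pixel class distributions standing in for <math>\hat{d_{i,a}}</math>; all names are illustrative.<br />
<br />
```python
# Sketch of the superpixel labeling rule: average the per-pixel class
# distributions over each superpixel, then take the argmax.

def superpixel_label(pixel_dists, superpixel):
    """pixel_dists: {pixel index: list of class probabilities};
    superpixel: list of pixel indices in component k."""
    n_classes = len(next(iter(pixel_dists.values())))
    avg = [0.0] * n_classes
    for i in superpixel:
        for a in range(n_classes):
            avg[a] += pixel_dists[i][a]
    avg = [v / len(superpixel) for v in avg]   # 1/s(k) * sum
    return max(range(n_classes), key=lambda a: avg[a])
```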
<br />
= Results =<br />
<br />
The network was tested on the Stanford Background, SIFT Flow and Barcelona datasets.<br />
<br />
Results on the Stanford Background dataset show that superpixels achieve state-of-the-art results with minimal processing time.<br />
<br />
[[File:stanford_res.png]]<br />
<br />
Since super-pixels were shown to be so effective on the Stanford dataset, they were the only method of image segmentation used for the SIFT Flow and Barcelona datasets. Instead, the class frequencies used during training (balanced, denoted by superscript 1, or natural, denoted by superscript 2) were explored, in conjunction with the aforementioned graph-based segmentation method combined with the optimal cover algorithm.<br />
<br />
From the SIFT Flow dataset, it can be seen that the graph-based segmentation with optimal cover method offers a significant advantage.<br />
<br />
[[File:sift_res.png]]<br />
<br />
In the Barcelona dataset, it can be seen that a dataset with many labels is too difficult for the CNN.<br />
<br />
[[File:barcelona_res.png]]<br />
<br />
= Conclusions =<br />
<br />
A wide window of contextual information, achieved through the multiscale network, largely improves the results and diminishes the role of the post-processing stage. This allows replacing the computationally expensive post-processing with a simpler and faster method (e.g., majority vote), increasing efficiency without a relevant loss in classification accuracy. The paper has demonstrated that a feed-forward convolutional network, trained end-to-end and fed with raw pixels, can produce state-of-the-art performance on scene parsing datasets. The model does not rely on engineered features, and uses purely supervised training from fully-labeled images.<br />
<br />
An interesting finding in this paper is that even in the absence of any post-processing, simply labelling each pixel with the highest-scoring category produced by the convolutional net for that location yields near state-of-the-art pixel-wise accuracy.<br />
<br />
= Future Work =<br />
<br />
Aside from the usual advances to CNN architecture, such as unsupervised pre-training, rectifying non-linearities and local contrast normalization, there would be a significant benefit, especially in datasets with many variables, to have a semantic understanding of the variables. For example, understanding that a window is often part of a building or a car.<br />
<br />
There would also be considerable benefit from improving the metrics used in scene parsing. The current pixel-wise accuracy is a somewhat uninformative measure of the quality of the result. Spotting rare objects is often more important than correctly labeling every boundary pixel of a large region such as the sky. The average per-class accuracy is a step in the right direction, but the authors would prefer a system that correctly spots every object or region, even with only approximate boundaries, over a system that produces accurate boundaries for large regions (sky, road, grass, etc.) but fails to spot small objects.<br />
<br />
Long et al. used fully convolutional networks, extending classification nets to segmentation and improving the architecture with multi-resolution layer combinations. They compared their algorithm to the Farabet et al. approach and improved the pixel accuracy by up to six percent.<br />
<br />
=References=<br />
<references /></div>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=human-level_control_through_deep_reinforcement_learning&diff=25878human-level control through deep reinforcement learning2015-11-05T23:30:35Z<p>Mgohari2: </p>
<hr />
<div>== Introduction ==<br />
<br />
Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. [http://www.gatsby.ucl.ac.uk/~dayan/papers/dw01.pdf "Reinforcement Learning."] Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.<br />
<br />
When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations from high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. The human brain is adept at this type of learning, using systems involving dopamine in the neurons with a similar structure to reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. [http://www.gatsby.ucl.ac.uk/~dayan/papers/sdm97.pdf A neural substrate of prediction and reward.] Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have only been able replicate the success of humans on fully-observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. [http://www.aaai.org/Papers/Symposia/Fall/1993/FS-93-02/FS93-02-003.pdf TD-Gammon, a self-teaching backgammon program, achieves master's play.] AAAI Techinical Report (1993)</ref>.<br />
<br />
In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et. al. [http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf "Human-level control through deep reinforcement learning."] Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.<br />
<br />
== Problem Description ==<br />
<br />
The goal of this research is to create a framework that would be able to excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN network is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network will be used to approximate the so-called optimal action-value function <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right]</math><br />
<br />
which is the maximum cumulative sum of rewards <math>r_t\,</math> discounted by <math>\,\gamma</math> at each timestep <math>t\,</math>. This sum can be achieved from a policy <math>\pi = P\left(a|s\right)</math> after making an observation <math>\,s</math> and taking an action <math>\,a</math> <ref name = "main"></ref>.<br />
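<br />
For a finite reward sequence, this discounted sum can be illustrated directly (a toy sketch, not part of the DQN code):<br />
<br />
```python
# Toy illustration of the discounted return
# r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... that Q* maximizes
# in expectation.

def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))
```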
<br />
During learning, we apply Q-learning updates on samples (or minibatches) of experience <math>(s,a,r,s')\sim U(D)</math>, drawn uniformly at random from the pool of stored samples. The Q-learning update at iteration <math>i</math> uses the following loss function:<br />
<br />
<math><br />
L_i(\theta_i) = \mathbb E_{(s,a,r,s')\sim U(D)}[(r+\gamma \max_{a'} Q(s',a';\theta_i^-)-Q(s,a;\theta_i))^2]<br />
</math><br />
<br />
=== Instability of Neural Networks as Function Estimate ===<br />
<br />
Unfortunately, current methods which use deep networks to estimate <math>Q\,</math>suffer from instability or divergence for the following reasons: <br />
<br />
# Correlation within sequence of observations<br />
# Small updates to <math>Q\,</math>can significantly change the policy, and thus the data distribution<br />
# The action values <math>Q\,</math>are correlated with the target values <math>\,y = r_t + \gamma \max_{a'}Q(s', a')</math><br />
<br />
=== Overcoming Instability ===<br />
<br />
One method of overcoming this instability is through the use of a biologically-inspired mechanism called experience replay. This involves storing <math>e_t = \left(s_t, a_t, r_t, s_{t+1}\right)</math> - known as the "experiences" - at each time step in a dataset <math>D_t = \left(e_1, e_2, \ldots, e_t\right)</math>. Then, during each learning update, minibatch updates are performed, where points are sampled uniformly from <math>\,D_t</math>. In practice, only <math>N</math> experiences are stored, where <math>N</math> is some large, finite number (e.g. <math>N=10^6</math>). Incorporating experience replay into the algorithm removes the correlation between observations and smooths out the data distribution, since it takes an average over the past <math>N</math> states. This makes it much more unlikely that instability or divergence will occur.<br />
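<br />
A minimal sketch of such a buffer, assuming a fixed capacity and uniform sampling (names illustrative, not the authors' implementation):<br />
<br />
```python
import random
from collections import deque

# Minimal experience-replay buffer: store the last N transitions
# (s, a, r, s') and sample uniform minibatches.

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # old experiences fall off

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```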
<br />
This approach has several advantages over standard online Q-learning. First, each step of experience is potentially used in many weight updates, which allows for greater data efficiency. Second, learning directly from consecutive samples is inefficient, owing to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. Third, when learning on-policy the current parameters determine the next data sample that the parameters are trained on. Or in other words, the updates to the network parameters and the selection of data samples used to calculate these updates are not independent processes. This can lead to unwanted feedback loops in the sample-update cycle that force the network to converge on a poor local optimum of the loss function. The use of experience replay solves this problem essentially by introducing a greater degree of independence between the choice of training samples and the current network parameters at a particular iteration in the learning procedure.<br />
<br />
Another method used to combat instability is to use a separate network for generating the targets <math>y_i</math> as opposed to the same network. This is implemented by cloning the network <math>\,Q</math> every <math>\,C</math> iterations, and using this static, cloned network to generate the next <math>\,C</math> target values. Thus, the target network parameters are only updated with the Q-network parameters every <math>\,C</math> steps and are held fixed between individual updates. This additional modification reduces correlation between the target values and the action values, providing more stability to the network. In this paper, <math>\,C = 10^4</math>.<br />
<br />
They also found that clipping the error term to be between -1 and 1 can further improve the stability of the algorithm. This clipping is equivalent to constraining the magnitude of the gradient used to update the network parameters on each iteration of the algorithm described below, since the error term in question comprises part of the derivative of the loss function with respect to the parameters. With such a constraint on the magnitude of the gradient, each parameter update can only have a minimal effect on the network's approximation of the optimal action-value function. Hence, it is not surprising that this clipping technique improves the stability of the algorithm. <br />
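<br />
Putting the target network and error clipping together, a single clipped TD error could be computed as in this sketch; <code>q</code> and <code>q_target</code> are toy stand-ins mapping a state to a list of action values, and the target is set to <math>y = r</math> at terminal states as in the published algorithm.<br />
<br />
```python
# Sketch: form the target y = r + gamma * max_a' Q(s', a'; theta^-) with
# a frozen copy of the Q-function, then clip the TD error to [-1, 1].

def td_error(transition, q, q_target, gamma):
    s, a, r, s_next, done = transition
    y = r if done else r + gamma * max(q_target(s_next))
    err = y - q(s)[a]
    return max(-1.0, min(1.0, err))      # error clipping
```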
<br />
=== Data & Preprocessing ===<br />
<br />
The data used for this experiment is initially frame-by-frame pixel data from the Atari 2600 emulator, along with the game score (and the number of lives in applicable games). The frames are 210 x 160 images with colour, so some preprocessing is performed to simplify the data. The first step to encode a single frame is to take the maximum value for each pixel colour value over the frame being encoded and the previous frame <ref name = "main"></ref>. This removes flickering between frames, as sometimes images are only shown on every even or odd frame. Then, the image is converted to greyscale and downsampled to 84 x 84. This process is applied to the <math>m</math> most recent frames (here <math>m=4</math>), and these 84 x 84 x 4 images are the inputs to the network. To be more specific, the network takes the last four frames as input and outputs the action value of each action.<br />
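<br />
The first two steps (pixel-wise maximum over consecutive frames, then grayscale conversion) can be sketched as follows; the luminance weights are a standard choice, not specified in the paper, and the 84x84 downsampling is omitted for brevity.<br />
<br />
```python
# Illustrative preprocessing: take the pixel-wise maximum over a frame
# and its predecessor (to remove flicker), then convert RGB to a single
# luminance channel. Frames are lists of rows of (r, g, b) tuples.

def frame_max(frame, prev_frame):
    return [[tuple(max(c1, c2) for c1, c2 in zip(p1, p2))
             for p1, p2 in zip(row1, row2)]
            for row1, row2 in zip(frame, prev_frame)]

def to_grayscale(frame):
    # Standard luminance weights; the paper just says "grayscale".
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in frame]
```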
<br />
== Model Architecture ==<br />
<br />
The framework for building an agent that can learn from environmental inputs is an iterative reward-based system. The agent will perform actions and constantly re-assess how its actions have affected its environment. To model this type of behaviour, the agent attempts to build up a model that relates actions to rewards over time. The underlying model relating actions to rewards, <math>Q</math>, is estimated by a deep convolutional network, and is updated at every step in time.<br />
<br />
The structure of the network itself is as follows. There are separate output units for each possible action, and the only input to the network is the state representation. The outputs are the predicted Q-values for each action performed on the input state. The first hidden layer convolves 32 8x8 filters with stride 4, then applies a rectified nonlinear function <ref>Jarrett K. et. al. [http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf What is the best multi-stage architecture for object recognition?] Proc. IEEE Int. Conf. Comput. Vis. 2146-2153 (2009)</ref>. In the next hidden layer, 64 4x4 filters with stride 2 are convolved and again followed by rectified non-linearity. The next layer is the final convolutional layer, with 64 3x3 filters of stride 1, followed by the rectifier. The final hidden layer in the network is fully-connected, with 512 rectifying units. The output layer is a fully-connected linear layer with a single output for each valid action, of which there were between 4 and 18 in any particular game<ref name = "main"></ref>.<br />
<br />
[[File:Network_Architecture.JPG | center]]<br />
<br />
== Training ==<br />
<br />
=== Framework and Additional Setup Details === <br />
<br />
Forty-nine Atari games were considered as experiments. A unique DQN is trained for each game, but the same structure, algorithm, and global parameters (e.g. <math>C</math> or <math>m</math> as above, among others) were used throughout. The value of the global parameters was selected by performing an informal search on a small subset of the 49 games. The goal is to use minimal prior knowledge and perform end-to-end training of these models based on game experience.<br />
<br />
The reward structure for games was slightly changed, clipping negative rewards at -1 and positive rewards at 1, since score varies from game to game. This could be problematic since the agent may not properly prioritize higher-scoring actions, but it also helps stabilize the network and allows it to generalize to more games. For example, the same learning rate can be used across all games. However, the game itself is not otherwise changed. <br />
<br />
There is also a frame-skipping technique employed, in which the agent only performs an action every <math>k^{th}</math> frame to allow the agent to play <math>k</math> times more games, as the network does not have to be trained on the skipped frames (<math>k=4</math> here). Furthermore, I believe this creates a more realistic experience for the agent, as human players would not be able to change their own actions every single frame. <br />
<br />
The agents are trained on 50 million frames of game play, which is about 38 days. The experience replay memory used contains the previous 1 million frames of game play. The RMSProp <ref>Hinton, Geoffrey et al.[http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf Overview of Minibatch Gradient Descent.] University of Toronto.</ref> algorithm, which performs stochastic gradient descent in small batches, is used to train the network.<br />
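<br />
A single RMSProp update for one scalar parameter, following the cited lecture notes, can be sketched as below; the hyperparameter values are illustrative defaults, not the paper's settings.<br />
<br />
```python
import math

# Minimal RMSProp sketch: keep a running average of squared gradients and
# divide the gradient step by its square root.

def rmsprop_step(param, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    cache = decay * cache + (1 - decay) * grad ** 2
    param = param - lr * grad / (math.sqrt(cache) + eps)
    return param, cache
```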
<br />
=== Algorithm Background ===<br />
<br />
==== Markov Decision Process ====<br />
<br />
A Markov Decision Process is described by:<br />
<br />
* A set of states <math>S</math><br />
* A set of actions <math>A</math><br />
* Stochastic transition <math>p(s, a, s')</math>, describing the stochastic system<br />
* Cost function <math>c: S \times A \to \mathbb{R}</math><br />
* Policy <math>\pi: S \to A</math><br />
<br />
The objective is to find an optimal policy <math>\pi^*</math> that minimizes the expected cumulative cost.<br />
<br />
<br />
==== Q Learning ====<br />
<br />
Q-learning is a reinforcement learning technique used to find an optimal policy (a mapping from states to actions) for any Markov Decision Process (MDP). At its core, the Q-learning algorithm incrementally estimates the value of each state-action pair from observed transitions. In its tabular form it is an incremental dynamic programming procedure in which a value for every state-action pair must be computed and stored in a table to find an optimal policy. <br />
<br />
<math>Q_{k+1}(s, a) := (1 - \alpha) Q_k(s, a) + \alpha \left( c(s, a) + \gamma \min_{b} Q_{k}(s', b) \right)</math><br />
<br />
Where:<br />
<br />
* <math>s</math>: state where the transition starts<br />
* <math>a</math>: action applied<br />
* <math>s'</math>: resulting state<br />
* <math>\alpha</math>: learning parameter<br />
* <math>\gamma</math>: discount factor (between 0 (short-term) and 1 (long-term))<br />
<br />
Typically Q-learning is performed online. Riedmiller (2005) <ref>Riedmiller, Martin. [http://www.damas.ift.ulaval.ca/_seminar/filesA07/articleCharles.pdf "Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method."] Machine Learning: ECML 2005. Springer Berlin Heidelberg, 2005. 317-328.</ref> argues that while Q-learning can in principle be implemented directly in a neural network, the online approach tends to produce unstable results. Intuitively this makes sense: if Q-learning were used for control, a single sample point is not indicative of whether the applied action is optimal. For these reasons, Mnih et al. (2015) <ref>Mnih, Volodymyr, et al. [http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf "Human-level control through deep reinforcement learning."] Nature 518.7540 (2015): 529-533.</ref> devised Q-learning updates on minibatches of experience.<br />
In the real world the Markov assumption is often violated, and Q-learning needs to be modified accordingly. One approach for capturing the dynamics of many real-world environments is the Partially Observable Markov Decision Process (POMDP)<ref>Hausknecht M and Stone P. [http://arxiv.org/pdf/1507.06527v3.pdf "Deep Recurrent Q-Learning for Partially Observable MDPs"] arXiv preprint arXiv:1507.06527 (2015).</ref>.<br />
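The tabular update above, with the document's cost-minimizing convention (<math>\min</math> rather than <math>\max</math>), can be sketched as follows; the environment interface and hyperparameter values are illustrative assumptions:<br />

```python
import random
from collections import defaultdict

def q_learning(transition, cost, states, actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, horizon=50):
    """Tabular Q-learning with the update from the text:
    Q(s,a) <- (1-alpha) Q(s,a) + alpha (c(s,a) + gamma * min_b Q(s',b))."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = random.choice(states)
        for _ in range(horizon):
            # epsilon-greedy exploration over costs (minimize, not maximize)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = min(actions, key=lambda b: Q[(s, b)])
            s2 = transition(s, a)
            target = cost(s, a) + gamma * min(Q[(s2, b)] for b in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s2
    return Q
```

For example, on a 5-state chain where every step costs 1 until a zero-cost goal state is reached, the learned greedy policy moves toward the goal.<br />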
<br />
==== Q-Iteration Learning ====<br />
<br />
At each step in time, the agent selects an action <math>a_t\,</math> from the set of legal game actions <math>\mathbb{A}</math>. The agent observes an image <math>x_t \in \mathbb{R}^d</math> from the emulator, along with a reward <math>r_t\,</math>. It is impossible to fully understand the current game situation from a single screen, so a sequence of actions and observations <math>s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t</math> is the input state. <br />
<br />
Recall that if we define <math>R_t = \sum_{t'=t}^T \gamma^{t'-t}r_{t'}</math>, where <math>\gamma\,</math> is the discount factor and <math>\,T</math> is the step in which the game terminates, then <br />
<br />
:<math>Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[R_t| s_t=s, a_t=a, \pi\right]</math><br />
<br />
is the optimal action-value function. <br />
<br />
==== The Bellman Equation in the Loss Framework ====<br />
<br />
The optimal action-value function obeys the Bellman Equation:<br />
<br />
:<math>Q^*\left(s,a\right) = \mathop{\mathbb{E}_{s'}}\left[r + \gamma \max_{a'}Q^*\left(s',a'\right) | s, a\right] </math><br />
<br />
The intuition behind this identity is as follows: if the optimal value <math>Q^*(s',a')\,</math> at the next time step was known for all possible actions <math>a'</math>, then the optimal strategy is to select the action <math>a'</math> maximizing the expected value above <ref name = "main"> </ref>. Using the Bellman Equation as an iterative update formula is impractical, however, since the action-value function is estimated separately for each sequence and cannot generalize.<br />
<br />
It is necessary, in practice, to operate with an approximation of the action-value function. When a neural network with weights <math>\,\theta</math> is used, it is referred to as a Q-Network. A Q-Network is trained by adjusting <math>\,\theta_t</math> to reduce the mean-squared error in the Bellman Equation. The new target values for training are given by <math>y = r + \gamma\max_{a'} Q\left(s', a'; \theta_t^-\right)</math>, where <math>\theta_t^-\,</math> are the parameters from some previous iteration.<br />
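A minimal numpy sketch of this target computation, where the hypothetical `q_target_fn` stands in for the network with the frozen parameters <math>\theta_t^-</math> (terminal transitions, an assumption consistent with the full algorithm below, get <math>y = r</math>):<br />

```python
import numpy as np

def dqn_targets(rewards, next_states, dones, q_target_fn, gamma=0.99):
    """y = r + gamma * max_a' Q(s', a'; theta^-) for each transition in a
    batch; dones is 1.0 for terminal transitions, which get y = r."""
    q_next = q_target_fn(next_states)          # shape (batch, n_actions)
    return rewards + gamma * (1.0 - dones) * q_next.max(axis=1)
```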
<br />
=== The Full Algorithm === <br />
<br />
Now, with all the background behind us, the full Deep Q-Learning with Experience Replay algorithm is presented below:<br />
<br />
<br />
[[File:QLearning_Alg.JPG]]<br />
<br />
<br />
Some notes about the algorithm:<br />
* Replay memory implements the experience replay technique described above: only the last N experience tuples are stored, and updates sample uniformly at random from D. The authors note that a more sophisticated sampling strategy might emphasize the transitions from which the most can be learned, instead of giving equal importance to all transitions in the replay memory.<br />
* An episode is one try in a game<br />
* Correlations between target values and the action-value function <math>Q\,</math> are mitigated by using <math>\hat{Q}</math> for the target values, where <math>\hat{Q}</math> is the target action-value function, which is updated only once every <math>\,C</math> steps.<br />
* The gradient used in the algorithm is defined as <math> \nabla_{\theta_i} L_i(\theta_i) = \mathbb E_{(s,a,r,s')}[(r+\gamma \max_{a'} Q(s',a';\theta_i^-)-Q(s,a;\theta_i)) \nabla_{\theta_i} Q(s,a;\theta_i)] </math>, adjusted to accommodate the use of stochastic gradient descent (meaning that the full expectations are not computed). <br />
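The replay memory D from the notes above can be sketched as a simple bounded buffer (the class name and interface are illustrative; the paper uses a capacity of 1 million transitions):<br />

```python
import random
from collections import deque

class ReplayMemory:
    """Stores the last N experience tuples and samples minibatches
    uniformly at random, as in the replay memory D of the algorithm."""
    def __init__(self, capacity):
        # deque with maxlen drops the oldest tuple once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform sampling without replacement from the stored tuples
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```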
<br />
== Results ==<br />
<br />
=== Evaluation Procedure === <br />
<br />
The trained networks played each game 30 times, up to 5 minutes at a time. The random agent, which is the baseline comparison, chooses a random action every 6 frames (10 Hz). The human player uses the same emulator as the agents, and played under controlled conditions (most notably without sound). The human performance is the average reward from around 20 episodes of the game lasting up to 5 minutes, after 2 hours of practice playing each game. The human performance is set to be 100%, and the random agent has performance set to 0%.<br />
<br />
=== Raw Score Results ===<br />
<br />
The DQN agent outperforms the best existing reinforcement learning methods on 43 of the games without incorporating prior knowledge about Atari games. Furthermore, the agent scores at least 75% of the human score on more than half of the games. Also, DQN performs well across a number of types of games. However, games which involve extended planning strategies still pose a major problem to DQN (e.g. Montezuma's Revenge). These results are visualized in the figure below: <br />
<br />
[[File:Performance.JPG ]]<br />
<br />
=== t-SNE ===<br />
The researchers also explored the representation learned by DQN that underpinned the most successful agent in the context of the game Space Invaders by using a technique developed for visualizing high-dimensional data called 't-SNE'. As expected, the t-SNE algorithm tends to map the DQN representation of perceptually similar states to nearby points. It was also discovered that t-SNE created similar embeddings for DQN representations that were close in terms of expected reward but perceptually dissimilar. This is consistent with the notion that the network can learn representations that support adaptive behaviour from high-dimensional sensory inputs. See the figure below for a depiction of the t-SNE.<br />
<br />
[[File:tSNE.JPG ]]<br />
<br />
=== Results with model components removed ===<br />
<br />
Two important advances in this paper were presented: experience replay and creating a separate network to evaluate the targets. To visualize the impact of these advances, the network was trained with and without both of these concepts, and evaluated on its performance in each case. The results are shown in the table below, in percentage form as above:<br />
<br />
{| class="wikitable"<br />
|-<br />
! Game<br />
! With Replay and Target Q<br />
! With Replay, Without Target Q<br />
! Without Replay, With Target Q<br />
! Without Replay, Without Target Q<br />
|-<br />
| Breakout<br />
| 316.8<br />
| 240.7<br />
| 10.2<br />
| 3.2<br />
|-<br />
| Enduro<br />
| 1006.3<br />
| 831.4<br />
| 141.9<br />
| 29.1<br />
|- <br />
| River Raid<br />
| 7446.6<br />
| 4102.8<br />
| 2867.7<br />
| 1453.0<br />
|-<br />
| Seaquest<br />
| 2894.4<br />
| 822.6<br />
| 1003.0<br />
| 275.8<br />
|-<br />
| Space Invaders<br />
| 1088.9<br />
| 826.3<br />
| 373.2<br />
| 302.0<br />
|}<br />
<br />
Clearly, experience replay and maintaining a secondary network for computing target values are important. From these results, it seems that experience replay is more important on its own, except in Seaquest.<br />
<br />
== Conclusion == <br />
<br />
The framework presented has demonstrated the ability to learn how to play Atari games, given minimal prior knowledge of the game and very basic inputs. Using reinforcement learning with the Q-network architecture was more effective than previous similar attempts, since experience replay and a separate target network were utilized in training. These two modifications removed correlations between sequential inputs, which improved stability in the network. Future work should improve the experience replay algorithm: instead of sampling uniformly from the replay memory, the sampling could be biased towards high-reward events. This may add some instability to the network, but it is certainly worth investigating.<br />
<br />
== Discussion ==<br />
<br />
* Using simulators for reinforcement learning is very similar to evolutionary robotics, where instead of Q-learning variants, evolutionary algorithms are used to evolve neural network topology and weights <ref>Nolfi, Stefano, and Dario Floreano. Evolutionary robotics: The biology, intelligence, and technology of self-organizing machines. MIT press, 2000.</ref>. [https://www.youtube.com/watch?v=qv6UVOQ0F44 MarI/O] by Seth Bling, for example, used a neuroevolutionary algorithm to evolve a neural network to play a level of a Mario video game; although the trained network is very different, the experimental evaluation is very similar to the one described by Mnih et al. (2015). <br />
<br />
* The problem with using simulators for solving control problems is that the results do not translate well to real-world engineering <ref>Sofge, D. A., et al. [http://arxiv.org/pdf/0706.0457.pdf "Challenges and opportunities of evolutionary robotics."] arXiv preprint arXiv:0706.0457 (2007).</ref>. Additionally, Q-iteration assumes that the state of the environment is fully observable.<br />
<br />
* A lot of the games that the system performed poorly on are adventure games that require semantic understanding of what's being seen and the ability to do induction. For example, the complex path finding and puzzle solving of Pitfall. Obviously, Q-learning is incapable of accomplishing this alone and must be embedded in a cognitive system.<br />
<br />
Generally, the popular Q-learning algorithm is known to overestimate action values under certain conditions. The setting introduced here is in some respects a best case for Q-learning, because the deep neural network provides flexible function approximation with a low asymptotic approximation error, and the determinism of the environments prevents the harmful effects of noise. Surprisingly, it has been shown that even in this setting DQN sometimes overestimates the values of actions. Double Q-learning can provide a solution to this issue: double DQN not only yields more accurate value estimates, but also results in much higher scores on several games. This demonstrates that the overestimations of DQN were indeed leading to poorer policies, and that double DQN is beneficial in reducing them.<ref>Hasselt, H. V., et al. [http://arxiv.org/pdf/1509.06461.pdf "Deep reinforcement learning with double Q-learning."] arXiv preprint arXiv:1509.06461 (2015).</ref><br />
<br />
== Bibliography ==<br />
<references /></div>

overfeat: integrated recognition, localization and detection using convolutional networks (Mgohari2, 2015-10-29)
<hr />
<div>= Introduction =<br />
<br />
Recognizing the category of the dominant object in an image is a task to which Convolutional Networks (ConvNets) have been applied for many years. ConvNets have advanced the state of the art on large datasets such as 1000-category ImageNet<br />
<ref name=DeJ><br />
Deng, Jia, ''et al'' [http://www.image-net.org/papers/imagenet_cvpr09.pdf "ImageNet: A Large-Scale Hierarchical Image Database."] in CVPR09, (2009).<br />
</ref>.<br />
<br />
Many image datasets include images with a roughly centered object that fills much of the image. Yet, objects of interest sometimes vary significantly in size and position within the image.<br />
<br />
The first idea in addressing this is to apply a ConvNet at multiple locations in the image, in a sliding window fashion, and over multiple scales <br />
<ref name=MaO><br />
Matan, Ofer, ''et al'' [http://yann.lecun.com/exdb/publis/pdf/matan-92.pdf "IReading handwritten digits: A zip code recognition system."] in IEEE Computer, (1992).<br />
</ref><br />
<ref name=NoS><br />
Nowlan, Steven, ''et al'' [http://research.microsoft.com/pubs/68392/cnnHand.pdf "A convolutional neural network hand tracker."] in IEEE Computer, (1995).<br />
</ref>. Even with this, however, many viewing windows may contain a perfectly identifiable portion of the object (say, the head of a dog), but not the entire object, nor even the center of the object. This leads to decent classification but poor localization and detection.<br />
<br />
The second idea is to train the system to not only produce a distribution over categories for each window, but also to produce a prediction of the location and size of the bounding box containing the object relative to the window. <br />
<br />
The third idea is to accumulate the evidence for each category at each location and size.<br />
<br />
This research shows that training a convolutional network that simultaneously classifies, locates and detects objects in images can boost the classification, detection and localization accuracy of all tasks. The paper proposes a new integrated approach to object detection, recognition, and localization with a single ConvNet. A novel method for localization and detection by accumulating predicted bounding boxes is also introduced. They suggest that by combining many localization predictions, detection can be performed without training on background samples and that it is possible to avoid the time-consuming and complicated bootstrapping training passes. Not training on the background also lets the network focus solely on positive classes for higher accuracy.<br />
<br />
This paper is the first to provide a clear explanation how ConvNets can be used for localization and detection for ImageNet data.<br />
<br />
= Vision Tasks =<br />
This research explores three computer vision tasks in increasing order of difficulty (each task is a sub-task of the next): <br /><br />
(i) classification, (ii) localization, and (iii) detection.<br /><br />
In the classification task, each image is assigned a single label corresponding to the main object in the image. Five guesses are allowed to find the correct answer (because images can also contain multiple unlabeled objects). For the localization task, in addition to classifying five objects in the image, a bounding box for each classified object is returned. The predicted box must match the ground truth by at least 50% (using the PASCAL intersection-over-union criterion), as well as be labeled with the correct class.<br />
Images from the 2013 ImageNet Large Scale Visual Recognition Challenge (ILSVRC2013) are used for this research. The detection task differs from localization in that there can be any number of objects in each image (including zero), and false positives are penalized by the mean average precision measure ([http://en.wikipedia.org/wiki/Information_retrieval#Mean_average_precision mAP]). Figure 1 illustrates the higher difficulty of the detection task.<br />
<br />
<center><br />
[[File:Im_2.PNG | frame | center |Figure 1. This image illustrates the higher difficulty of the detection dataset, which can contain many small objects while the classification and localization images typically contain a single large object. ]]<br />
</center><br />
<br />
= Classification =<br />
<br />
== Model Design and Training ==<br />
<br />
During the ''training'' phase, this model uses the same fixed input size approach proposed by Krizhevsky ''et al''<br />
<ref name=KrA><br />
Krizhevsky, Alex, ''et al'' [http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf "ImageNet Classiﬁcation with Deep Convolutional Neural Networks."] in NIPS (2012).<br />
</ref>.<br />
This model maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average across training cases of the log-probability of the correct label under the prediction distribution. As depicted in Figure 2, this network contains eight layers with weights; the first five are convolutional and the remaining three are fully-connected. The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels. <br />
<br />
Each image is downsampled so that the smallest dimension is 256 pixels. Then five random crops (and their horizontal flips) of size 221x221 pixels are extracted and presented to the network in mini-batches of size 128. The weights in the network are initialized randomly. They are then updated by stochastic gradient descent. Overﬁtting can be reduced by using “DropOut”<br />
<ref name=HiG><br />
Hinton, Geoffrey, ''et al'' [http://arxiv.org/pdf/1207.0580.pdf "Improving neural networks by preventing co-adaptation of feature detectors."] arXiv:1207.0580, (2012). <br />
</ref><br />
to prevent complex co-adaptations on the training data. On each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present. DropOut is employed on the fully connected layers (6th and 7th) in the classifier. For the ''training'' phase, multiple GPUs are used to increase computation speed.<br />
<br />
<center><br />
[[File:Im_1.PNG | frame | center |Figure 2. An illustration of the architecture of this CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264–4096–4096–1000. ]]<br />
</center><br />
<br />
For ''test'' phase, the entire image is explored by densely running the network at each location and at multiple scales. This approach yields significantly more views for voting, which increases robustness while remaining efficient.<br />
For resolution augmentation, 6 scales of input are used, which result in unpooled layer-5 maps of varying resolution. These are then pooled and presented to the classifier using the procedure described below.<br />
<br />
== Multi-Scale Classification ==<br />
<br />
In Krizhevsky's work<br />
<ref name=KrA><br />
Krizhevsky, Alex, ''et al'' [http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf "ImageNet Classiﬁcation with Deep Convolutional Neural Networks."] in NIPS (2012).<br />
</ref>, their architecture can only produce a classification vector every 36 pixels in the input dimension along each axis, which decreases performance since the network windows are not well aligned with the objects.<br />
<br />
To solve this problem, they apply the last subsampling operation at every offset, similar to the approach introduced by Giusti et al.<br />
<ref name=GiC><br />
Giusti A, Cireşan D C, Masci J, ''et al''. <br />
[http://arxiv.org/pdf/1302.1700.pdf "Fast image scanning with deep max-pooling convolutional neural networks."] arXiv:1302.1700, (2013).<br />
</ref><br />
<br />
(a). For a single image, at a given scale, we start with the unpooled layer 5 feature maps.<br /><br />
(b). Each of unpooled maps undergoes a 3x3 max pooling operation (non-overlapping regions), repeated 3x3 times for <math>(\Delta x,\Delta y)</math> pixel offsets of {0, 1, 2}.<br /><br />
(c). This produces a set of pooled feature maps, replicated (3x3) times for different <math>(\Delta x,\Delta y)</math> combinations.<br /><br />
(d). The classifier (layers 6,7,8) has a fixed input size of 5x5 and produces a C-dimensional output vector for each location within the pooled maps. The classifier is applied in sliding window fashion to the pooled maps, yielding C-dimensional output maps (for a given <math>(\Delta x,\Delta y)</math> combination).<br /><br />
(e). The output maps for different <math>(\Delta x,\Delta y)</math> combinations are reshaped into a single 3D output map (two spatial dimensions x C classes).<br />
<br />
<center><br />
[[File:Im_3.PNG | frame | center |Figure 3. 1D illustration (to scale) of output map computation for classification. (a): 20 pixel unpooled layer 5 feature map. (b): max pooling over non-overlapping 3 pixel groups, using offsets of <math>\Delta</math> = {0, 1, 2} pixels (red, green, blue respectively). (c): The resulting 6 pixel pooled maps, for different <math>\Delta</math>. (d): 5 pixel classifier (layers 6,7) is applied in sliding window fashion to pooled maps, yielding 2 pixel by C maps for each <math>\Delta</math>. (e): reshaped into 6 pixel by C output maps. ]]<br />
</center><br />
<br />
These operations can be viewed as shifting the classifier’s viewing window by 1 pixel through pooling layers without subsampling and using skip-kernels in the following layer (where values in the neighborhood are non-adjacent).<br />
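Steps (a) to (c) can be sketched in numpy for the 1D case of Figure 3 (a 20-pixel layer-5 map, 3-pixel pooling); this is an illustrative sketch, not the authors' code:<br />

```python
import numpy as np

def pooled_maps_with_offsets(fmap, pool=3):
    """For a 1D layer-5 feature map, compute non-overlapping max pooling
    at every offset delta in {0, ..., pool-1}, producing one pooled map
    per offset, as in steps (a)-(c)."""
    out = {}
    for delta in range(pool):
        # drop the first `delta` entries, then max-pool non-overlapping groups
        shifted = fmap[delta:]
        n = len(shifted) // pool
        out[delta] = shifted[:n * pool].reshape(n, pool).max(axis=1)
    return out
```

For a 20-pixel input each offset yields a 6-pixel pooled map, matching the figure.<br />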
<br />
The procedure above is repeated for the horizontally flipped version of each image. The final classification is produced by:<br />
(I) Taking the spatial max for each class, at each scale and flip.<br />
(II) Averaging the resulting C-dimensional vectors from different scales and flips.<br />
(III) Taking the top-1 or top-5 elements (depending on the evaluation criterion) from the mean class vector.<br />
<br />
In the feature extraction part (layers 1-5) of this ConvNet, the filters are convolved across the entire image in one pass, since this is more efficient for detecting features at different locations. In the classifier part (layer 6 to the output), however, the exhaustive pooling scheme is applied to obtain fine alignment between the classifier and the representation of the object in the feature map.<br />
<br />
The approach described above, with 6 scales, achieves a top-5 error rate of 13.6%. As might be expected, using fewer scales hurts performance: the single-scale model is worse, with 16.97% top-5 error. The fine stride technique illustrated in Figure 3 brings a relatively small improvement in the single-scale case, but is also of importance for the multi-scale gains shown here.<br />
<br />
= Localization =<br />
<br />
For localization, the classification-trained network is modified: the classifier layers are replaced by a regression network, which is trained to predict object bounding boxes at each spatial location and scale. The regression predictions are then combined together, along with the classification results at each location.<br />
<br />
Classifier and regressor networks are simultaneously run together across all locations and scales. The output of the final softmax layer for a class c at each location provides a score of confidence that an object of class c is present in the corresponding field of view. So, a confidence can be assigned to each bounding box.<br />
<br />
The regression network takes the pooled feature maps from layer 5 as input and the final output layer has 4 units which specify the coordinates for the bounding box edges.<br />
The regression network is trained using an ''l<sub>2</sub>'' loss between the predicted and true bounding box for each example. The final regressor layer is class-specific, having 1000 different versions, one for each class.<br />
<br />
The individual predictions are combined via a greedy merge strategy applied to the regressor bounding boxes, using the following algorithm:<br /><br /><br />
(a) Assign to ''C<sub>s</sub>'' the set of classes in the top ''k'' for each scale ''s'' <math>\in</math> 1 . . . 6, found by taking the maximum detection class outputs across spatial locations for that scale.<br /><br />
(b) Assign to ''B<sub>s</sub>'' the set of bounding boxes predicted by the regressor network for each class in ''C<sub>s</sub>'', across all spatial locations at scale ''s''.<br /><br />
(c) Assign <math>B \leftarrow \cup _{s} B_{s} </math> <br /><br />
(d) Repeat merging until done.<br /><br />
(e) <math> (b_{1}^*,b_{2}^*) = argmin_{b_{1} \neq b_{2} \in B} MatchScore (b_{1},b_{2})</math><br /><br />
(f) If <math> MatchScore (b_{1}^*,b_{2}^*)>t</math> , stop.<br /><br />
(g) Otherwise, set <math> B \leftarrow B \backslash (b_{1}^*,b_{2}^*) \cup BoxMerge (b_{1}^*,b_{2}^*) </math><br /><br />
<br />
In the above, we compute <math>MatchScore</math> using the sum of the distance between centers of the two bounding boxes and the intersection area of the boxes. <math>BoxMerge</math> computes the average of the bounding boxes’ coordinates. The final prediction is given by taking the merged bounding boxes with maximum class scores. This is computed by cumulatively adding the detection class outputs associated with the input windows from which each bounding box was predicted.<br />
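The merge loop of steps (d) to (g) can be sketched as below. The exact way the center distance and intersection area are combined in <math>MatchScore</math> is an assumption for illustration (the paper only states that both quantities are used); <math>BoxMerge</math> averages coordinates as described:<br />

```python
import itertools

def box_center(b):
    x1, y1, x2, y2 = b
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def intersection_area(b1, b2):
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2, y2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    return max(0, x2 - x1) * max(0, y2 - y1)

def match_score(b1, b2):
    # lower = better match: center distance penalizes separation,
    # subtracting the intersection area rewards overlap (one plausible
    # combination; not necessarily the paper's exact formula)
    (cx1, cy1), (cx2, cy2) = box_center(b1), box_center(b2)
    dist = ((cx1 - cx2) ** 2 + (cy1 - cy2) ** 2) ** 0.5
    return dist - intersection_area(b1, b2)

def box_merge(b1, b2):
    # average of the two boxes' coordinates
    return tuple((a + b) / 2 for a, b in zip(b1, b2))

def greedy_merge(boxes, t=0.0):
    """Greedily merge the best-matching pair until no pair scores below t."""
    boxes = [tuple(b) for b in boxes]
    while len(boxes) > 1:
        b1, b2 = min(itertools.combinations(boxes, 2),
                     key=lambda p: match_score(*p))
        if match_score(b1, b2) > t:
            break
        boxes.remove(b1)
        boxes.remove(b2)
        boxes.append(box_merge(b1, b2))
    return boxes
```

Two heavily overlapping boxes are merged into their average, while a distant box is left untouched.<br />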
<br />
This network was applied to the ImageNet 2012 validation set and the 2013 localization competition, using the localization criterion specified for these competitions. This method won the 2013 competition with 29.9% error.<br />
<br />
= Detection =<br />
<br />
Detection training is similar to classification training but in a spatial manner. Multiple locations of an image may be trained simultaneously. Since the model is convolutional, all weights are shared among all locations. The main difference with the localization is the necessity to predict a background class when no object is present. <br />
Traditionally, negative examples are initially taken at random for training. Then the most offending negative errors are added to the training set in bootstrapping passes. Independent bootstrapping passes render training complicated and risk potential mismatches between the negative examples collection and training times. Additionally, the size of bootstrapping passes needs to be tuned to make sure training does not overfit on a small set. To circumvent all these problems, we perform negative training on the fly, by selecting a few interesting negative examples per image such as random ones or most offending ones. This approach is more computationally expensive but renders the procedure much simpler. And since the feature extraction is initially trained with the classification task, the detection fine-tuning is not as long anyway.<br />
<br />
This detection system ranked 3rd with 19.4% mean average precision (mAP) at ILSVRC 2013. In post competition work, with a few modifications, this method achieved a new state of the art with 24.3% mAP. This technique speeds up inference and substantially reduces the number of potential false positives.<br />
<br />
= Conclusion =<br />
<br />
This research presented a multi-scale, sliding window approach that can be used for classification, localization, and detection. The method ranked 4th in classification, 1st in localization and 1st in detection in the 2013 ILSVRC competition, which shows that ConvNets can be effectively used for detection and localization tasks. The scheme proposed here involves substantial modifications to networks designed for classification, but clearly demonstrates that ConvNets are capable of these more challenging tasks. The localization approach won the 2013 ILSVRC competition and significantly outperformed all 2012 and 2013 approaches. The detection model was among the top performers during the competition, and ranks first in post-competition results. This research presented an integrated pipeline that can perform different tasks while sharing a common feature extraction base, entirely learned directly from the pixels.<br />
The OverFeat CNN detector is very scalable, and simulates a sliding window detector in a single forward pass through the network by efficiently reusing convolutional results on each layer. OverFeat converts an image recognition CNN into a "sliding window" detector by providing a larger resolution image and transforming the fully connected layers into convolutional layers. To be more specific, a sliding window detector is one in which a classifier (e.g. a neural network) is first trained on centered images and then applied at every possible location in the target image. The possible locations are generally tried from left to right, top to bottom, in a nested for loop, so it is called a "sliding window" (the window of the classifier is "slid over" the image in search of a match). This is a very slow and inefficient process. A convolution is similar to a sliding window, except all locations are processed (theoretically) in parallel subject to a mathematical formalism.<br />
<br />
= Discussion =<br />
<br />
This approach might still be improved in several ways: <br /><br />
(I) For localization, back-propping is not used through the whole network; doing so is likely to improve performance. <br /><br />
(II) ''l<sub>2</sub>'' loss is used, rather than directly optimizing the intersection-over-union (IOU) <ref name=iou><br />
M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. "The PASCAL VOC2012 challenge results."<br />
</ref> criterion on which performance is measured. Swapping the loss to this should be possible since IOU is still differentiable, provided there is some overlap. <br /><br />
(III) Alternate parameterizations of the bounding box may help to decorrelate the outputs, which will aid network training.<br />
<br />
= Resources =<br />
<br />
The OverFeat model has been made publicly available on [https://github.com/sermanet/OverFeat GitHub]. It contains a C++ implementation, as well as an API for Python and Lua.<br />
<br />
=References=<br />
<references /></div>

goingDeeperWithConvolutions (Mgohari2, 2015-10-29)
<hr />
<div>= Introduction =<br />
In the last three years, due to the advances of deep learning and more concretely convolutional networks [http://white.stanford.edu/teach/index.php/An_Introduction_to_Convolutional_Neural_Networks [an introduction of CNN]] , the quality of image recognition has increased dramatically. The error rates for ILSVRC competition dropped significantly year-by-year [http://image-net.org/challenges/LSVRC/ [LSVRC]]. This paper<ref name=gl><br />
Szegedy, Christian, et al. [http://arxiv.org/pdf/1409.4842.pdf "Going deeper with convolutions."] arXiv preprint arXiv:1409.4842 (2014).<br />
</ref> proposes a new deep convolutional neural network architecture code-named ''Inception''. With the Inception module and a carefully crafted design, the researchers built a 22-layer-deep network called GoogLeNet, which uses 12 times fewer parameters while being significantly more accurate than the winner of ILSVRC 2012.<ref name=im><br />
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. [http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf "Imagenet classification with deep convolutional neural networks."]<br />
Advances in neural information processing systems. 2012.<br />
</ref><br />
<br />
= Motivation =<br />
<br />
The authors' main motivations for the "Inception module" approach to CNN architecture are:<br />
<br />
* Large CNN networks allow for more expressive power, but are also prone to overfitting due to the large number of parameters.<br />
* Uniformly increasing network size increases the demand for computational resources.<br />
* Sparse networks are theoretically possible, but sparse data structures are very inefficient.<br />
<br />
== Why Sparse Matrices are Inefficient == <br />
<br />
Sparse matrices stored in Compressed Row Storage (CRS) format, or any other common sparse matrix data structure, can be expensive to access at specific indices. This is because they use an indirect addressing step for every single scalar operation in a matrix-vector product (source: [http://netlib.org/linalg/html_templates/node91.html#SECTION00931100000000000000 netlib.org manual]). Because of this indirection the values are not stored in one contiguous memory location, so many cache misses occur, which slows down the calculation as data has to be fetched from the slower main memory.<br />
<br />
For example a non-symmetric sparse matrix A in CRS format defined by:<br />
<br />
[[File:crs_sparse_matrix.gif]]<br />
<br />
The CRS format for this matrix is then specified by arrays `{val, col_ind,<br />
row_ptr}` given below:<br />
<br />
[[File:crs_sparse_matrix_details_1.gif]]<br />
<br />
[[File:crs_sparse_matrix_details_2.gif]]<br />
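To make the indirect addressing concrete, here is a minimal sketch of a CRS matrix-vector product. The `{val, col_ind, row_ptr}` arrays below describe a small made-up 3 x 5 matrix, not the matrix in the figures above:<br />

```python
import numpy as np

# CRS arrays for an invented 3x5 sparse matrix (illustration only).
val     = np.array([10.0, -2.0, 3.0, 9.0, 7.0, 8.0, 4.0])
col_ind = np.array([0, 4, 0, 1, 1, 2, 3])  # column index of each stored value
row_ptr = np.array([0, 2, 4, 7])           # row i occupies val[row_ptr[i]:row_ptr[i+1]]
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

y = np.zeros(len(row_ptr) - 1)
for i in range(len(y)):
    for k in range(row_ptr[i], row_ptr[i + 1]):
        # Indirect step: col_ind[k] must be loaded before x can be accessed.
        # This scatters memory accesses and causes the cache misses described above.
        y[i] += val[k] * x[col_ind[k]]
print(y)  # → [ 0. 21. 54.]
```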
<br />
= Background =<br />
<br />
The paper assumes that the reader is familiar with Convolutional Neural Networks (CNNs). This section gives a short introduction [http://deeplearning.net/tutorial/lenet.html based on this tutorial].<br />
<br />
In a CNN the layers are sparsely connected. Each unit in layer <math>m</math> receives inputs only from part of layer <math>m-1</math>. These inputs are usually from spatially neighboring units, and the connection weights for each unit in layer <math>m</math> are shared. This reduces the amount of computation (because the layers are not fully connected) and the number of parameters to learn, so that less training data is required and the network is less prone to overfitting. Because this structure is similar to a convolution, in the sense that the same set of weights is shifted over the input and applied at each location, networks with this structure are called convolutional neural networks.<br />
<br />
[[File:Conv_1D_nn.png]]<br />
<br />
A related approach is ''max-pooling'', a non-linear down-sampling[http://deeplearning.net/tutorial/lenet.html]. Instead of weighting the input values and summing them, as in a standard artificial neural network, the maximum value within the max-pooling window is picked. By picking one value out of <math>n \times n</math> (<math>3 \times 3 = 9</math> in the paper), the amount of computation for the higher layer is reduced. Furthermore, a form of translation invariance is provided, because shifting the max-pooling window by one position will likely still give the same maximum.<br />
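A minimal sketch of non-overlapping max-pooling (shown with a 2 x 2 window and an invented input for brevity; the paper uses 3 x 3, but the idea is identical):<br />

```python
import numpy as np

def max_pool(x, n):
    """Non-overlapping n x n max-pooling: each output is the max of its window."""
    h, w = x.shape[0] // n, x.shape[1] // n
    return x[:h * n, :w * n].reshape(h, n, w, n).max(axis=(1, 3))

x = np.array([[1, 2, 0, 1],
              [4, 3, 1, 0],
              [0, 1, 5, 6],
              [2, 2, 7, 8]], dtype=float)
print(max_pool(x, 2))  # → [[4. 1.]
                       #    [2. 8.]]
```

Note that a small shift of the input often leaves the per-window maxima unchanged, which is the translation invariance mentioned above.<br />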
<br />
= Related work =<br />
<br />
In 2013 Lin et al.<ref name=nin><br />
Min Lin, Qiang Chen and Shuicheng Yan. [http://arxiv.org/pdf/1312.4400v3.pdf Network in Network]<br />
</ref> pointed out that the convolution filter in CNN is a generalized linear model (GLM) for the underlying data patch and the level of abstraction is low with GLM. They suggested replacing GLM with a ”micro network” structure which is a general nonlinear function approximator.<br />
<center><br />
[[File:Nin.png | frame | center |Fig 1. Comparison of linear convolution layer and mlpconv layer ]]<br />
</center><br />
<br />
Also in this paper<ref name=nin></ref>, Lin et al. proposed a new output layer to improve performance. Traditionally, the feature maps of the last convolutional layers are vectorized and fed into fully connected layers followed by a softmax logistic regression layer<ref name=im></ref>. Lin et al. argued that this structure is prone to overfitting, hampering the generalization ability of the overall network. To improve on this they proposed another strategy, called global average pooling: generate one feature map for each category of the classification task in the last mlpconv layer, take the average of each feature map, and feed the resulting vector directly into the softmax layer. Lin et al. claim their strategy has the following advantages over fully connected layers. First, it is more native to the convolution structure, enforcing correspondences between feature maps and categories, so the feature maps can easily be interpreted as category confidence maps. Second, there are no parameters to optimize in global average pooling, so overfitting is avoided at this layer. Furthermore, global average pooling sums out the spatial information, making it more robust to spatial translations of the input.<br />
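A hedged sketch of global average pooling as an output layer; the shapes (10 classes, 6 x 6 maps) and the random feature values are invented for illustration:<br />

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
feature_maps = rng.random((10, 6, 6))          # one 6x6 map per class (invented)
class_scores = feature_maps.mean(axis=(1, 2))  # global average pooling: no parameters
probs = softmax(class_scores)                  # fed directly into softmax
assert probs.shape == (10,) and np.isclose(probs.sum(), 1.0)
```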
<br />
= Inception module =<br />
The Inception architecture in "Going deeper with convolutions", Szegedy, Christian, et al. is based on two main ideas: The approximation of a sparse structure with spatially repeated dense components and using dimension reduction to keep the computational complexity in bounds, but only when required.<br />
<br />
Convolutional filters of different sizes can cover different clusters of information. By finding the optimal local construction and repeating it spatially (assuming translational invariance), they approximate the optimal sparse structure with dense components. For convenience of computation, they choose to use 1 x 1, 3 x 3 and 5 x 5 filters. Additionally, since pooling operations have been essential to the success of other state-of-the-art convolutional networks, they also add pooling layers to their module. Together these make up the naive Inception module.<br />
<center><br />
[[File:i1.png | frame | center |Fig 2. Inception module, naive version ]]<br />
</center><br />
<br />
Stacking these Inception modules on top of each other would lead to an exploding number of outputs. Thus, inspired by "Network in Network" (Lin et al.), they use 1 x 1 convolutions for dimensionality reduction. While this keeps the computational complexity in bounds, it comes with another problem: the low-dimensional embeddings represent the data in a compressed, non-sparse form, but the first idea requires sparsity to make the approximation possible. For that reason, the dimensionality-reduction step should only be applied where required, to preserve sparse representations as much as possible. In practice, the authors used the 1 x 1 convolutions before the expensive 3 x 3 and 5 x 5 convolutions. <br />
<center><br />
[[File:i2.png | frame | center |Fig 3. Inception module with dimension reduction ]]<br />
</center><br />
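A back-of-the-envelope multiply count shows why the 1 x 1 reduction helps. The feature-map size and channel counts below are illustrative values chosen for this sketch, not numbers taken from the paper:<br />

```python
# Multiplies for a 5x5 convolution, with and without a 1x1 reduction,
# on an invented 28x28 feature map (illustrative channel sizes).
H = W = 28
c_in, c_mid, c_out = 192, 32, 96

# Direct 5x5 convolution from c_in channels to c_out channels.
direct = H * W * c_out * (5 * 5 * c_in)

# 1x1 reduction to c_mid channels, then the 5x5 convolution from c_mid.
reduced = H * W * c_mid * c_in + H * W * c_out * (5 * 5 * c_mid)

print(direct, reduced, direct / reduced)  # the reduced path is ~5.6x cheaper here
```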
<br />
===Beneficial aspects of Inception module===<br />
* It allows for increasing the number of units significantly without an uncontrolled blow-up in computational complexity.<br />
* The ubiquitous use of dimension reduction shields the large number of input filters of the last stage from the next layer.<br />
* It aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously.<br />
* It can result in networks that are 2-3X faster than similarly performing networks with non-inception architecture. <br />
<br />
= GoogLeNet =<br />
The structure of GoogLeNet can be seen in the following figure.<br />
[[File:Gle.png | 900px |Fig 4. GoogLeNet ]]<br />
<br />
“#3 x 3 reduce” and “#5 x 5 reduce” stand for the number of 1 x 1 filters in the reduction layer used before the 3 x 3 and 5 x 5 convolutions. <br />
<br />
In order to encourage discrimination in the lower stages of the classifier and increase the gradient signal that gets propagated back, the authors add auxiliary classifiers connected to the output of (4a) and (4d). During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary networks are discarded.<br />
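As a sketch of the discounted auxiliary losses, with invented loss values (only the 0.3 weight comes from the paper):<br />

```python
# Hypothetical loss values for illustration; only the 0.3 discount is from the paper.
main_loss = 1.20
aux_losses = [1.90, 1.75]  # classifiers attached to the outputs of (4a) and (4d)

# During training the auxiliary losses are added with a discount weight of 0.3.
total_loss = main_loss + 0.3 * sum(aux_losses)
print(total_loss)  # → 2.295  (at inference time only the main classifier is kept)
```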
<br />
The networks were trained using the DistBelief<ref><br />
Dean, Jeffrey, et al. [http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf "Large scale distributed deep networks."] Advances in Neural Information Processing Systems. 2012.<br />
</ref> distributed machine learning system, using a modest amount of model and data parallelism. <br />
<br />
=ILSVRC 2014 Classification Challenge Results=<br />
<br />
The ILSVRC 2014 classification challenge involves the task of classifying the image into one of 1000 leaf-node categories in the Imagenet hierarchy [http://www.image-net.org/challenges/LSVRC/2014/index#introduction ILSVRC]. The training data is the subset of Imagenet containing the 1000 [http://image-net.org/challenges/LSVRC/2014/browse-synsets categories] (approximately 1.2 million images), and the validation and testing data consists of 150,000 hand-labelled photographs (50,000 for validation, the remaining for testing). In calculating the error rate for the table below, an image was considered to be correctly classified if its true class was among the top 5 predicted classes.<br />
<br />
Szegedy, Christian, et al. independently trained 7 versions of the same GoogLeNet model (they only differ in sampling methodologies and the random order in which they see input images), and performed ensemble prediction with them. <br />
<br />
{| class="wikitable"<br />
|-<br />
! Team<br />
! Year<br />
! Place<br />
! Error (top-5)<br />
! Uses external data<br />
|-<br />
| SuperVision<br />
| 2012<br />
| 1st<br />
| 16.4%<br />
| no<br />
|-<br />
| SuperVision<br />
| 2012<br />
| 1st<br />
| 15.3%<br />
| Imagenet 22k<br />
|-<br />
| Clarifai<br />
| 2013<br />
| 1st<br />
| 11.7%<br />
| no<br />
|-<br />
| Clarifai<br />
| 2013<br />
| 1st<br />
| 11.2%<br />
| Imagenet 22k<br />
|-<br />
| MSRA<br />
| 2014<br />
| 3rd<br />
| 7.35%<br />
| no<br />
|-<br />
| VGG<br />
| 2014<br />
| 2nd<br />
| 7.32%<br />
| no<br />
|-<br />
| GoogLeNet<br />
| 2014<br />
| 1st<br />
| 6.67%<br />
| no<br />
|}<br />
<br />
= Critiques = <br />
* It is quite interesting how the authors of this paper tried to address computational efficiency by condensing the network into a denser form. What the paper does not mention is whether this computationally efficient dense network has any robustness trade-off. Additionally, it is somewhat paradoxical to claim that a large network with many parameters may overfit; while intuitively correct, techniques such as Dropout <ref>Srivastava, Nitish, et al. [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf "Dropout: A simple way to prevent neural networks from overfitting."] The Journal of Machine Learning Research 15.1 (2014): 1929-1958.</ref> encourage a sparse network representation precisely to avoid overfitting and increase robustness. Further study is needed to investigate whether one should approach CNN training with a dense architecture, a sparse one, or a happy medium between the two.<br />
<br />
= Resources =<br />
<br />
An implementation with the [http://caffe.berkeleyvision.org/ Caffe] architecture for GoogLeNet is available publicly for unrestricted use on [https://github.com/BVLC/caffe/tree/master/models/bvlc_googlenet GitHub]. This includes a model that can be trained as well as a pre-trained model on the ImageNet dataset. Caffe is a deep learning framework designed to make models quick to build and easy to use.<br />
<br />
=References=<br />
<references /></div>

parsing natural scenes and natural language with recursive neural networks (Mgohari2, 2015-10-26)
<hr />
<div>= Introduction = <br />
<br />
<br />
This paper uses Recursive Neural Networks (RNNs) to find a recursive structure that is commonly found in the inputs of different modalities, such as natural scene images or natural language sentences. It is the first deep learning work to learn full scene segmentation, annotation and classification. The same algorithm can be used both to provide a competitive syntactic parser for natural language sentences from the Penn Treebank and to outperform alternative approaches for semantic scene segmentation, annotation and classification. <br />
<br />
For vision applications, the approach differs from previous work in that it uses off-the-shelf vision features of segments obtained from oversegmented images instead of learning features from raw images. In addition, the same network can be applied recursively to achieve classification, instead of building a hierarchy with a convolutional neural network.<br />
<br />
Also, this particular approach to NLP is different in that it handles variable-sized sentences in a natural way and captures the recursive nature of natural language. Furthermore, it jointly learns parsing decisions, categories for each phrase, and phrase feature embeddings that capture the semantics of their constituents. This approach captures syntactic and compositional-semantic information that helps make more accurate parsing decisions and obtain the similarities between segments and entire images. <br />
<br />
= Core Idea =<br />
<br />
The following figure describes the recursive structure that is present in the images and the sentences.<br />
<br />
<center><br />
[[File:Pic1.png | frame | center |Fig 1. Illustration of the RNN Parsing Images and Text ]]<br />
</center><br />
<br />
Images are first over-segmented into regions, which are then mapped to semantic feature vectors using a neural network. These features are used as input to the RNN, which decides whether or not to merge neighbouring regions. The decision is based on a score that is higher if the neighbouring regions share the same class label.<br />
<br />
In total the RNN computes three outputs: <br />
* Score, indicating whether the neighboring regions should be merged or not<br />
* A new semantic feature representation for this larger region<br />
* Class label<br />
<br />
The same procedure is applied to parsing words: the semantic features are given as input to the RNN, which then merges them into phrases in a syntactically and semantically meaningful order.<br />
<br />
= Input Representation =<br />
<br />
Each image is divided into 78 segments, and 119 features (described by Gould et al.<ref><br />
Gould, Stephen, Richard Fulton, and Daphne Koller. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5459211&tag=1 "Decomposing a scene into geometric and semantically consistent regions."] Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009.<br />
</ref>) are extracted from each segment. These features include color and texture features, boosted pixel classifier scores (trained on the labelled training data), as well as appearance and shape features. <br />
<br />
The features of each segment are then transformed semantically by applying a neural network layer with a logistic activation function, as follows:<br />
<br />
''<math>\,a_i=f(W^{sem}F_i + b^{sem})</math>''<br />
<br />
where W is the weight matrix we want to learn, F is the feature vector, b is the bias and f is the activation function. In this version of the experiments, the original sigmoid function <math>f(x)=\tfrac{1}{1 + e^{-x}}</math> was used.<br />
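A minimal sketch of this mapping, using the paper's sizes (119 input features, 100 hidden units) but random placeholder weights, since the learned parameters are of course not available here:<br />

```python
import numpy as np

# a_i = f(W_sem F_i + b_sem) with the logistic sigmoid; weights are random
# placeholders, only the dimensions (119 -> 100) match the paper.
rng = np.random.default_rng(0)
F_i = rng.random(119)                    # off-the-shelf features of one segment
W_sem = rng.normal(0.0, 0.01, (100, 119))
b_sem = np.zeros(100)

a_i = 1.0 / (1.0 + np.exp(-(W_sem @ F_i + b_sem)))  # logistic activation
assert a_i.shape == (100,) and np.all((a_i > 0) & (a_i < 1))
```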
<br />
For the sentences each word is represented by an <math>n</math>-dimensional vector (n=100 in the paper). The values of these vectors are learned to capture co-occurrence statistics and they are stored as columns in a matrix <math>L \in \mathbb{R}^{n \times |V|}</math> where <math>|V|\,</math> is the size of the vocabulary (i.e., the total number of unique words that might occur). To extract the features or semantic representation of a word a binary vector <math>\,e_k</math> with all zeros except for the <math>\,k^{th}</math> index can be used, where <math>\,k</math> corresponds to the word's column index in <math>\,L</math>. Given this vector, the semantic representation of the word is obtained by<br />
<br />
''<math>a_i=Le_k\,</math>''.<br />
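Multiplying <math>L</math> by the one-hot vector <math>e_k</math> simply selects column <math>k</math> of <math>L</math>, as this small sketch with invented sizes shows:<br />

```python
import numpy as np

n, V = 100, 5  # embedding size and a (tiny, invented) vocabulary size
L = np.arange(n * V, dtype=float).reshape(n, V)  # columns are word vectors

k = 3
e_k = np.zeros(V)
e_k[k] = 1.0  # one-hot selector for word k

# L @ e_k is exactly column k of L, so the "lookup" is just column indexing.
assert np.array_equal(L @ e_k, L[:, k])
```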
<br />
= Recursive Neural Networks for Structure Prediction =<br />
<br />
In our discriminative parsing architecture, the goal is to learn a function ''f : X → Y,'' where Y is the set of all possible binary parse trees. <br />
An input x consists of two parts: (i) A set of activation vectors <math>\{a_1 , . . . , a_{N_{segs}}\}</math>, which represent input elements such as image segments or words of a sentence. (ii) A symmetric adjacency matrix ''A'', where ''A(i, j) = 1,'' if segment i neighbors j. This matrix defines which elements can be merged. For sentences, this matrix has a special form with 1’s only on the first diagonal below and above the main diagonal.<br />
<br />
The following figure illustrates what the inputs to the RNN look like and what the correct label is. For images there may be more than one correct binary parse tree, but for sentences there is only one correct tree. A correct tree is one in which segments belonging to the same class are merged into one superclass before being merged with segments from different superclasses.<br />
<center><br />
[[File:pic2.png | frame | center |Fig 2. Illustration of the RNN Training Inputs]]<br />
</center><br />
<br />
The structural loss margin for RNN to predict the tree is defined as follows<br />
<br />
<math>\Delta(x,l,y^{\mathrm{proposed}})=\kappa \sum_{d \in N(y^{\mathrm{proposed}})} 1\{subTree(d) \notin Y (x, l)\}</math><br />
<br />
where the summation is over all non-terminal nodes and <math>\kappa</math> is a parameter. ''Y(x,l)'' is the set of correct trees corresponding to input ''x'' and label ''l''. The indicator <math>1\{\dots\}</math> is one when its condition holds and zero otherwise. The loss increases when a segment merges with another of a different label before merging with all its neighbors of the same label. To express this in more natural terms, any subtree that does not occur in any of the ground-truth trees increases the loss by one.<br />
<br />
Given the training set, the algorithm will search for a function f with small expected loss on unseen inputs, i.e. <br />
[[File:pic3.png]]<br />
where θ are all the parameters needed to compute a score s with an RNN. The score of a tree y is high if the algorithm is confident that the structure of the tree is correct.<br />
<br />
An additional constraint is that the score of the highest-scoring correct tree should exceed the scores of other trees by the margin defined by the structural loss, so that the model outputs as high a score as possible on the correct tree and as low a score as possible on incorrect trees.<br />
This constraint can be expressed as <br />
[[File:pic4.png]]<br />
<br />
With these constraints minimizing the following objective function ''maximizes'' the correct tree’s score and minimizes (up to a margin) the score of the highest scoring but incorrect tree. [[File:pic5.png]]<br />
<br />
For learning the RNN structure, the authors used the activation vectors and adjacency matrix as inputs, together with a greedy approximation, since no efficient dynamic programming algorithm exists for their RNN setting.<br />
<br />
Given an adjacency matrix A, the algorithm finds neighboring segments and adds their activations to a set of potential child node pairs:<br />
<br />
::::<math>\,C = \{ [a_i, a_j]: A(i, j) = 1 \}</math><br />
<br />
So for example, from the image in Fig 2. we would have the following pairs: <br />
<br />
::::<math>\,C = \{[a_1, a_2], [a_1, a_3], [a_2, a_1], [a_2, a_4], [a_3, a_1], [a_3, a_4], [a_4, a_2], [a_4, a_3], [a_4, a_5], [a_5, a_4]\}</math><br />
<br />
These pairs are concatenated and given as inputs to the neural network. Potential parent representations for possible child nodes are calculated with:<br />
<br />
::::<math>\,p(i, j) = f(W[c_i: c_j] + b)</math><br />
<br />
<br />
And the local score with:<br />
<br />
::::<math>\,s(i, j) = W^{score} p(i, j) </math><br />
<br />
Training will aim to increase scores of good segment pairs (with the same label) and decrease scores of pairs with different labels, unless no more good pairs are left.<br /><br /><br />
<br />
Once the scores for all pairs are calculated, three steps are performed:<br />
<br />
1. The highest scoring pair ''<math>\,[a_i, a_j]</math>'' is removed from the set of potential child node pairs ''<math>\,C</math>'', along with any other pair containing either ''<math>\,a_i</math>'' or ''<math>\,a_j</math>''.<br />
<br />
2. The adjacency matrix ''<math>\,A</math>'' is updated with a new row and column reflecting the new segment together with its child segments.<br />
<br />
3. Potential new child pairs are added to ''<math>\,C</math>''.<br />
<br />
Steps 1-3 are repeated until all pairs are merged and only one parent activation is left in the set ''<math>\,C</math>''. The last remaining activation is at the root of the Recursive Neural Network that represents the whole image. <br />
<br />
The quality of the predicted structure is measured simply as the sum of all the local decisions:<br />
<br />
::::<math>s(RNN(\theta,x_i,\widehat y))=\sum_{d\in N(\widehat y)}s_d</math><br />
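The pair scoring and steps 1-3 can be sketched as a greedy loop over a chain of four leaves. This is a hedged sketch, not the authors' code: all weights are random placeholders, the feature size is shrunk to 8, and the adjacency bookkeeping uses Python sets instead of a matrix:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8  # feature size (the paper uses n = 100); weights are random placeholders
W = rng.normal(0.0, 0.1, (n, 2 * n))
b = np.zeros(n)
W_score = rng.normal(0.0, 0.1, n)

def parent(ci, cj):
    """p(i, j) = f(W [c_i; c_j] + b) with the logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-(W @ np.concatenate([ci, cj]) + b)))

acts = {i: rng.random(n) for i in range(4)}               # 4 leaf segments/words
neigh = {frozenset(p) for p in [(0, 1), (1, 2), (2, 3)]}  # chain adjacency

tree_score = 0.0
while len(acts) > 1:
    # Score the candidate set C = {[a_i, a_j] : A(i, j) = 1} and pick the best.
    pairs = [tuple(sorted(pr)) for pr in neigh]
    i, j = max(pairs, key=lambda ij: W_score @ parent(acts[ij[0]], acts[ij[1]]))
    p = parent(acts[i], acts[j])
    tree_score += W_score @ p          # s(RNN) is the sum of the local scores
    # Steps 1-3: drop pairs containing i or j, add the merged node, re-link.
    new = max(acts) + 1
    linked = {k for pr in neigh if pr & {i, j} for k in pr - {i, j}}
    neigh = {pr for pr in neigh if not pr & {i, j}} | {frozenset((k, new)) for k in linked}
    del acts[i], acts[j]
    acts[new] = p

root = next(iter(acts.values()))       # the root activation represents the whole input
assert root.shape == (n,)
```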
<br />
== Category Classifiers in the Tree ==<br />
<br />
One of the main advantages of this approach is that each node of the tree built by the RNN has associated with it a distributed feature representation (the parent vector p). This representation can be improved by adding a simple softmax layer to each RNN parent node to predict class labels:<br />
<br />
[[File:Im_5.PNG ]]<br />
<br />
When minimizing the cross-entropy error of this softmax layer, the error will backpropagate and influence both the RNN parameters and the word representations.<br />
<br />
=== Unsupervised Recursive Autoencoder for Structure Prediction<ref>http://nlp.stanford.edu/pubs/SocherPenningtonHuangNgManning_EMNLP2011.pdf</ref> ===<br />
Instead of using scores (as described above) to predict the tree structure, we can also use the reconstruction error to predict the structure. <br />
<br />
How do we find reconstruction error?<br />
<br />
1. p is the parent representation for children <math>\,[c_1;c_2]</math> (same as before)<br />
<math>\, p=f(W^{(1)}[c_1;c_2]+b^{(1)}) </math><br />
<br />
2. One way of assessing how well this p represents its children is to reconstruct the children in a reconstruction layer: <br />
<math>[c_1^';c_2^']=W^{(2)}p+b^{(2)} </math><br />
3. The reconstruction error is then defined as follows; the goal is to minimize <math>\,E_{rec}([c_1;c_2])</math>.<br />
<math>E_{rec}([c_1;c_2])=\frac{1}{2}||[c_1;c_2]-[c_1^';c_2^']||^2 </math><br />
<br />
How to construct the tree?<br />
<br />
* It first takes the first pair of neighboring vectors <math>\, (c_1;c_2)=(x_1;x_2) </math>, saving the parent node and the resulting reconstruction error. The network then shifts by one position, takes as input the vectors <math> \,(c_1;c_2)=(x_2;x_3) </math>, and obtains <math> \,p,\, E_{rec}</math>. The process repeats until it hits the last pair. <br />
* Select the pair with lowest <math>\,E_{rec}</math>. <br />
<br />
eg. Given sequence <math>\,(x_1,x_2,x_3,x_4)</math>, we get lowest <math>\,E_{rec}</math> by the pair <math>\,(x_3, x_4)</math>. The new sequence then consists of <math>\,(x_1, x_2 , p(3,4))</math>.<br />
<br />
* The process repeats and treats the new vector <math>\,p(3,4)</math> like any other vector.<br />
* The process stops when a deterministic choice remains: collapsing the final two states into one parent. The tree is then recovered.<br />
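The greedy structure search by reconstruction error can be sketched as follows, with random placeholder weights and a shrunk feature size (nothing here is the authors' trained model):<br />

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6  # feature size, shrunk for the sketch; weights are random placeholders
W1 = rng.normal(0.0, 0.1, (n, 2 * n)); b1 = np.zeros(n)      # encoder
W2 = rng.normal(0.0, 0.1, (2 * n, n)); b2 = np.zeros(2 * n)  # decoder

def e_rec(c1, c2):
    """Encode a pair into parent p, decode it back, return (E_rec, p)."""
    c = np.concatenate([c1, c2])
    p = np.tanh(W1 @ c + b1)
    return 0.5 * np.sum((c - (W2 @ p + b2)) ** 2), p

seq = [rng.random(n) for _ in range(4)]  # the sequence (x1, x2, x3, x4)
while len(seq) > 1:
    # Score every neighboring pair and merge the one with the lowest E_rec.
    errs = [e_rec(seq[i], seq[i + 1]) for i in range(len(seq) - 1)]
    i = int(np.argmin([e for e, _ in errs]))
    seq[i:i + 2] = [errs[i][1]]          # replace the pair with its parent p

assert len(seq) == 1 and seq[0].shape == (n,)
```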
<br />
= Learning =<br />
The objective function ''J'' is not differentiable due to the hinge loss. Therefore, we must opt for the subgradient method (Ratliff et al., 2007), which computes a gradient-like quantity called the subgradient. Let <math>\theta = (W^{sem},W,W^{score},W^{label})</math>; then the gradient becomes:<br />
<br />
:<math><br />
\frac{\partial J}{\partial \theta} = \frac{1}{n} \sum_{i}\frac{\partial s(\hat{y}_i)}{\partial \theta} - \frac{\partial s(y_i)}{\partial \theta} + \lambda\theta,<br />
</math><br />
<br />
where <math>s(\hat{y}_i) = s(RNN(\theta,x_i,\hat{y}_{max(\tau(x_i))}))</math> is the score of the highest-scoring tree overall and <math>s(y_i) = s(RNN(\theta,x_i,\hat{y}_{max(Y(x_i,l_i))}))</math> is the score of the highest-scoring correct tree. In order to compute this gradient, we calculate the derivative using backpropagation through structure (Goller & Küchler, 1996). L-BFGS was used over the complete training data to minimize the objective. This may cause problems for non-differentiable functions, but none were observed in practice.<br />
<br />
The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is one of the most powerful iterative methods for solving unconstrained nonlinear optimization problems. The BFGS method approximates Newton's method, a class of hill-climbing optimization techniques that seeks a stationary point of a (preferably twice continuously differentiable) function<ref><br />
https://www.rtmath.net/help/html/9ba786fe-9b40-47fb-a64c-c3a492812581.htm<br />
</ref>.<br />
<br />
L-BFGS is short for Limited-memory BFGS, an iterative method for solving unconstrained nonlinear optimization problems using a limited amount of computer memory. It is thus particularly suited to problems with very large numbers of variables (e.g., >1000)<ref><br />
https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm<br />
</ref>.<br />
<br />
= Results =<br />
<br />
The parameters to tune in this algorithm are ''n'', the size of the hidden layer; ''κ'', the penalization term for incorrect parsing decisions; and ''λ'', the regularization parameter. It was found that the method was quite robust, varying in performance by only a few percent over a range of parameter combinations. The parameter values used were ''n = 100, κ = 0.05 and λ = 0.001''.<br />
<br />
Additional resources that are helpful for replicating and extending this work can be found [http://www.socher.org/index.php/Main/ParsingNaturalScenesAndNaturalLanguageWithRecursiveNeuralNetworks here] on the first author's personal website. This includes the source code for the project, as well as download links for the datasets used.<br />
<br />
== Scene understanding ==<br />
<br />
<br />
For training and testing, the researchers opted for the Stanford Background dataset, whose images can roughly be categorized into three scene types: city, countryside and sea-side. The team labelled the images with these three labels and trained an SVM using the average over all nodes' activations in the tree as features. With an accuracy of 88.1%, this algorithm outperforms the state-of-the-art features for scene categorization, Gist descriptors, which obtained only 84.0%.<br />
The results are summarized in the following figure.<br />
[[File:pic6.png]]<br />
<br />
A single neural network layer followed by a softmax layer is also tested in this paper, which performed about 2% worse than the full RNN model.<br />
<br />
In order to show that the learned feature representations capture important appearance and label information, the researchers visualized nearest-neighbour super segments. The team computed nearest neighbours across all images and all such subtrees. The figure below shows the results: the first image in each row is a random subtree's top node, and the remaining images are the closest subtrees in the dataset in terms of Euclidean distance between the vector representations.<br />
<br />
[[File:pic7.png]]<br />
<br />
== Natural language processing ==<br />
<br />
The method was also tested on natural language processing with the Wall Street Journal section of the Penn Treebank and was evaluated with the F-measure (Manning & Schütze, 1999). While the widely used Berkeley parser was not outperformed, the scores are close (91.63% vs 90.29%). Interestingly, no syntactic information about the child nodes is provided by the parser to the parent nodes; all syntactic information used is encoded in the learned continuous representations.<br />
<br />
Similar to the nearest neighbour scene subtrees, nearest neighbours for multiword phrases were collected. For example, "All the figures are adjusted for seasonal variations" is a close neighbour to "All the numbers are adjusted for seasonal fluctuations".<br />
<br />
= Related work =<br />
Based on this work, the author published another paper to improve semantic representations, using Long Short-Term Memory (LSTM) networks which is a recurrent neural network with a more complex computational unit. It outperforms the existing systems on aspects of semantic relatedness and sentiment classification.<ref><br />
Tai K S, Socher R, Manning C D. Improved semantic representations from tree-structured long short-term memory networks[J]. arXiv preprint arXiv:1503.00075, 2015.<br />
</ref><br />
<br />
=Reference=<br />
<references /></div>

stat946f15/Sequence to sequence learning with neural networks (Mgohari2, 2015-10-23)
<hr />
<div>= Introduction =<br />
The emergence of the Internet and other modern technology has greatly increased people's ability to communicate across vast distances and barriers. However, there still remains the fundamental barrier of language, and as anyone who has attempted to learn a new language can attest, it takes a tremendous amount of work to learn more than one language past childhood. The ability to translate efficiently and quickly between languages would therefore be of great importance. This is an extremely difficult problem, however, as languages can have varying grammar, and context always plays an important role. For example, the word "back" means entirely different things in the following two sentences,<br />
<br />
<blockquote><br />
I am in the back of the car.<br />
</blockquote><br />
<br />
<blockquote><br />
My back hurts.<br />
</blockquote><br />
<br />
Applying Deep Neural Networks (DNNs) to this problem is difficult given that DNNs can only be applied to problems where the input and output vectors are of fixed dimension. This is suitable for applications such as image processing, where the dimension is known ''a priori''; however, in applications such as speech recognition the dimension is not known. Thus, the goal of this paper is to introduce a domain-independent method that learns to map sequences of input vectors to output vectors. Sutskever et al. have approached this problem by applying a multi-layer Long Short-Term Memory (LSTM) architecture<ref name=lstm></ref>, and used this architecture to estimate a conditional probability between input and output sequences. Specifically, they used one LSTM to obtain a large fixed-dimensional representation and another to extract the output sequence from that vector. Given that translations tend to be paraphrases of the source sentences, the translation objective encourages the LSTM to find sentence representations that capture their meaning: sentences with similar meanings are close to each other, while sentences with different meanings are far apart.<br />
<br />
The main result of this work is that on the WMT '14 English-to-French translation task, their model obtained a BLEU (Bilingual Evaluation Understudy) score of 34.81 by extracting translations from an ensemble of 5 LSTMs. This is by far the best result achieved by direct translation with an artificial neural network. Also, the LSTM model did not struggle with long sentences, contrary to the recent experiences of researchers using similar architectures. Their model performed well on long sentences because they reversed the source sentences in the training and test sets. Reversing the sentences is a simple trick, but it is one of the key contributions of their work.<br />
<br />
= Model =<br />
<br />
<br />
=== Theory of Recurrent Neural Networks ===<br />
<br />
Recall that an Artificial Neural Network (ANN) is a nonlinear function <math>\mathcal{N}:\mathbb{R}^{p_0}<br />
\rightarrow \mathbb{R}^{p_N}</math> that computes an output through iterative nonlinear function applications to an input vector <math>\mathbf{z}_0\in<br />
\mathbb{R}^{p_0}</math>. The updates are said to occur in ''layers'', producing the sequence of vectors associated with each layer as <math>\left\{\mathbf{z}_0,\mathbf{z}_1,\dots,\mathbf{z}_N\right\}</math>. Any <math>\mathbf{z}_n</math> for <math>n \in<br />
\left\{1,\ldots,N\right\}</math> is computed recursively from <math>\mathbf{z}_0</math> with the ''weight matrices'' and ''bias vectors'' as per the rule<br />
<br />
<math><br />
\mathbf{z}_n = <br />
\sigma_n\left(\mathbf{W}_{n}\mathbf{z}_{n-1} + \mathbf{b}_{n}\right), \quad 1 \leq n \leq N</math><br />
<br />
where <math>\mathbf{W}_{n} \in \mathbb{R}^{p_n \times p_{n-1}}</math> is the weight matrix and <math>\mathbf{b}_{n} \in \mathbb{R}^{p_n}</math> is the bias vector associated with the <math>n</math>th layer in the network.<br />
<br />
The element-wise vector function <math>\sigma_n\left(\cdot\right)</math> is sigmoid-like for each component in its domain, outputting a value that ranges in <math>[0,1]</math>. Typically, the functions <math>\left\{\sigma_n\left(\cdot\right)\right\}</math> are the same for <math>n < N</math>, but the final output <math>\sigma_N(\cdot)</math> depends on the network architecture—for instance, it may be a softmax function for multi-label classification. Thus the network is completely characterized by its weights and biases as the tuple <math>(\left\{\mathbf{W}_{n}\right\},\left\{\mathbf{b}_{n}\right\})</math>.<br />
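To make the layer rule concrete, the following is a minimal NumPy sketch of the forward pass <math>\mathbf{z}_n = \sigma_n\left(\mathbf{W}_{n}\mathbf{z}_{n-1} + \mathbf{b}_{n}\right)</math>. The layer sizes and the use of the logistic function at every layer are illustrative assumptions, not taken from the paper.<br />

```python
import numpy as np

def sigmoid(x):
    # Element-wise logistic function, mapping each component into [0, 1].
    return 1.0 / (1.0 + np.exp(-x))

def forward(z0, weights, biases):
    """Apply the layer rule z_n = sigma(W_n z_{n-1} + b_n) for n = 1..N."""
    z = z0
    for W, b in zip(weights, biases):
        z = sigmoid(W @ z + b)
    return z

# Toy network with N = 2: p_0 = 3, p_1 = 4, p_2 = 2 (dimensions are illustrative).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [np.zeros(4), np.zeros(2)]
z0 = np.ones(3)
zN = forward(z0, weights, biases)
print(zN.shape)  # (2,)
```

Note how each weight matrix maps <math>\mathbb{R}^{p_{n-1}}</math> to <math>\mathbb{R}^{p_n}</math>, matching the dimensions in the rule above.<br />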
<br />
A sample network for <math>N = 2</math> is depicted in Fig 1 below.<br />
<center><br />
[[File:ann.png | frame | center |Fig 1. Graph of an ANN with <math>N = 2</math>, where the vertices represent the vectors <math>\left\{\mathbf{z}_n\right\}_{n=0}^2</math> in their respective layers. Edges denote computation, where the vector transformations have been overlaid to show sample dimensions of <math>\mathbf{W}_1</math> and <math>\mathbf{W}_2</math>, such that they match the vectors <math>\mathbf{z}_0</math> and <math>\mathbf{z}_1</math>.<br />
]]<br />
</center><br />
<br />
A Recurrent Neural Network is a generalization of an ANN for a ''sequence'' of inputs <math>\left\{\mathbf{z}_0^{[t]}\right\}</math> where <math>t \in<br />
\left\{1,\ldots,T\right\}</math> such that there are ''recurrent'' connections between the intermediary vectors <math>\left\{\mathbf{z}_n\right\}</math> for different so-called ''time steps''. These connections are made to represent conditioning on the previous vectors in the sequence: supposing the sequence were a vectorized representation of the words, an input to the network could be: <math>\left\{\mathbf{z}_0^{[1]},\mathbf{z}_0^{[2]},\mathbf{z}_0^{[3]}\right\} =<br />
\left\{\text{pass}, \text{the}, \text{sauce}\right\}</math>. In a language modelling problem for predictive text, the probability of obtaining <math>\mathbf{z}_0^{[3]}</math> is strongly conditioned on the previous words in the sequence. As such, additional recurrence weight matrices are added to the update rule for <math>1 \leq n \leq N</math> and <math>t > 1</math> to produce the recurrent update rule<br />
<br />
<math><br />
\mathbf{z}_n^{[t]} = <br />
\sigma_n\left(<br />
\mathbf{b}_{n}<br />
+ \mathbf{W}_{n}\mathbf{z}_{n-1}^{[t]} <br />
+ \mathbf{R}_n\mathbf{z}_{n}^{[t-1]}<br />
\right)</math><br />
<br />
where <math>\mathbf{R}_n \in \mathbb{R}^{p_n \times p_n}</math> is the recurrence matrix that relates the <math>n</math>th layer’s output for item <math>t</math> to its previous output for item <math>t-1</math>. The network architecture for a single layer <math>n</math> at step <math>t</math> is pictured in Fig 2 below.<br />
<center><br />
[[File:rnn.png | frame | center |Fig 2. Schematic of an RNN layer <math>n</math> at step <math>t</math> with recurrence on the output of <math>\mathbf{z}_n^{[t-1]}</math>, with the dimensions of the matrices <math>\mathbf{R}_{n}</math> and <math>\mathbf{W}_{n}</math> pictured. ]]<br />
</center><br />
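A minimal sketch of the recurrent update rule above, using <math>\tanh</math> as a stand-in for <math>\sigma_n</math> and a zero initial state (both illustrative assumptions):<br />

```python
import numpy as np

def rnn_layer(inputs, W, R, b):
    """Run one recurrent layer over a sequence:
    z^[t] = sigma(b + W z_in^[t] + R z^[t-1]), with z^[0] taken as zero."""
    z_prev = np.zeros(b.shape[0])
    outputs = []
    for x in inputs:  # inputs: the sequence z_{n-1}^[1..T] from the layer below
        z_prev = np.tanh(b + W @ x + R @ z_prev)
        outputs.append(z_prev)
    return outputs

rng = np.random.default_rng(1)
W = rng.standard_normal((5, 3))   # maps p_{n-1} = 3 to p_n = 5
R = rng.standard_normal((5, 5))   # recurrence on the layer's own previous output
b = np.zeros(5)
seq = [rng.standard_normal(3) for _ in range(4)]
outs = rnn_layer(seq, W, R, b)
print(len(outs), outs[-1].shape)  # 4 (5,)
```

The recurrence matrix <math>\mathbf{R}_n</math> is square because it maps the layer's own previous output back into the same layer.<br />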
<br />
=== RNN Architecture by Graves, 2013 ===<br />
<br />
The RNN update rule used by Sutskever et al. comes from a paper by Graves (2013). The connections between layers are denser in this case. The final layer is fully connected to every preceding layer except for the input <math>\mathbf{z}_0^{[t]}</math>, and follows the update rule<br />
<br />
<math><br />
\mathbf{z}_{N}^{[t]} = \sigma_N\left(<br />
\mathbf{b}_N<br />
+ \displaystyle\sum_{n' = 1}^{N-1} \mathbf{W}_{N,n'}\mathbf{z}_{n'}^{[t]}<br />
\right)</math><br />
<br />
where <math>\mathbf{W}_{N,n'} \in \mathbb{R}^{p_N\times p_{n'}}</math> denotes the weight matrix between layer <math>n'</math> and <math>N</math>.<br />
<br />
Layers 2 through <math>N-1</math> have additional connections to the input <math>\mathbf{z}_0^{[t]}</math> as<br />
<br />
<math><br />
\mathbf{z}_n^{[t]} = \sigma_n\left(<br />
\mathbf{b}_{n}<br />
+ \mathbf{W}_{n}\mathbf{z}_{n-1}^{[t]} <br />
+ \mathbf{W}_{n,0}\mathbf{z}_0^{[t]} <br />
+ \mathbf{R}_n\mathbf{z}_{n}^{[t-1]}<br />
\right),</math><br />
<br />
where, again, <math>\mathbf{W}_{n,n'}</math> must be of size <math>\mathbb{R}^{p_n\times<br />
p_{n'}}</math>. The first layer has the typical RNN input rule as before,<br />
<br />
<math><br />
\mathbf{z}_{1}^{[t]} = \sigma_1\left(<br />
\mathbf{b}_{1}<br />
+ \mathbf{W}_{1}\mathbf{z}_{0}^{[t]} <br />
+ \mathbf{R}_{1}\mathbf{z}_{1}^{[t-1]}<br />
\right).<br />
</math><br />
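The three update rules above can be combined into a single time step. The sketch below is an illustrative implementation: the layer widths, <math>\tanh</math> activations, and zero previous states are assumptions made for the example, not values from the paper.<br />

```python
import numpy as np

def graves_step(z0, z_prev, layers, out):
    """One time step t of the densely connected RNN described above.

    z_prev[n-1] holds z_n^[t-1] for hidden layers n = 1..N-1.
    layers is a list of dicts with keys W (from the layer below), W0 (skip
    connection from the input, absent for layer 1), R (recurrence), and b.
    out holds one weight matrix per hidden layer plus a bias for layer N.
    """
    zs = []
    below = z0
    for n, L in enumerate(layers):
        pre = L["b"] + L["W"] @ below + L["R"] @ z_prev[n]
        if "W0" in L:                    # layers 2..N-1 also see the input
            pre += L["W0"] @ z0
        z = np.tanh(pre)
        zs.append(z)
        below = z
    # Final layer: fully connected to every hidden layer (not the input).
    zN = np.tanh(out["b"] + sum(W @ z for W, z in zip(out["Ws"], zs)))
    return zN, zs

rng = np.random.default_rng(2)
layers = [
    {"W": rng.standard_normal((4, 3)), "R": rng.standard_normal((4, 4)), "b": np.zeros(4)},
    {"W": rng.standard_normal((4, 4)), "W0": rng.standard_normal((4, 3)),
     "R": rng.standard_normal((4, 4)), "b": np.zeros(4)},
]
out = {"Ws": [rng.standard_normal((2, 4)) for _ in layers], "b": np.zeros(2)}
z0 = np.ones(3)
z_prev = [np.zeros(4), np.zeros(4)]
zN, zs = graves_step(z0, z_prev, layers, out)
print(zN.shape)  # (2,)
```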
<br />
=== Long Short-Term Memory Recurrent Neural Network (LSTM) ===<br />
Recurrent neural networks ([http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ introduction to RNN]) are a variation of deep neural networks that are capable of storing information about previous hidden states in special memory layers.<ref name=lstm><br />
Hochreiter, Sepp, and Jürgen Schmidhuber. [http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf "Long short-term memory."] Neural computation 9.8 (1997): 1735-1780.<br />
</ref> ([http://colah.github.io/posts/2015-08-Understanding-LSTMs/ a quick introduction to LSTM]) Unlike feed-forward neural networks, which take a single fixed-length vector as input and produce a fixed-length vector as output, recurrent neural networks can take in a sequence of fixed-length vectors because of their ability to store information and maintain a connection between inputs through this memory layer. By comparison, previous inputs have no impact on the current output in a feed-forward neural network, whereas they can influence the current output in a recurrent neural network. (This paper used the LSTM formulation from Graves<ref name=grave><br />
Graves, Alex. [http://arxiv.org/pdf/1308.0850.pdf "Generating sequences with recurrent neural networks."] arXiv preprint arXiv:1308.0850 (2013).<br />
</ref>)<br />
<br />
<br />
This form of input fits naturally with language translation, since sentences are sequences of words and many problems of representing variable-length sentences as fixed-length vectors can be avoided. However, training recurrent neural networks to learn long time-lag dependencies, where inputs many time steps back can heavily influence the current output, is difficult and generally results in exploding or vanishing gradients. A variation of recurrent neural networks, the long short-term memory (LSTM) network, was used instead in this paper, as LSTMs do not suffer as much from the vanishing gradient problem.<br />
<br />
<br />
The purpose of the LSTM in this case is to estimate the conditional probability of the output sequence, <math>\,(y_1,\cdots,y_{T'})</math>, given the input sequence, <math>\,(x_1,\cdots,x_{T})</math>, where <math>\,T</math> does not have to equal <math>\,T'</math>.<br />
<br />
<br />
Let <math>\,v</math> represent the state of hidden layers after <math>\,(x_1,\cdots,x_{T})</math> have been inputted into the LSTM, i.e. what has been stored in the neural network's memory, then<br />
<br />
:<math>\,p(y_1,\cdots,y_{T'}|x_1,\cdots,x_{T})=\prod_{t=1}^{T'} p(y_t|v,y_1,\cdots,y_{t-1})</math><br />
<br />
For each <math>\,p(y_t|v,y_1,\cdots,y_{t-1})</math>, the LSTM at time step <math>\,t</math>, after <math>\,(x_1,\cdots,x_T,y_1,\cdots,y_{t-1})</math> have been inputted, outputs a relative score for each word in the vocabulary, and the softmax function, <math>\,p_i = \frac{e^{x_i}}{\sum_{j=1}^N e^{x_j}}\,</math>, can be applied to this output vector to generate the corresponding probabilities. From this, we can calculate any <math>\,p(y_1,\cdots,y_{T'}|x_1,\cdots,x_{T})</math> by repeatedly feeding <math>\,y_t</math> back into the LSTM as input to calculate the next set of probabilities.<br />
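The factorization above can be sketched as follows. Here <code>next_scores</code> is a hypothetical stand-in for the decoder LSTM's per-word output scores; only the softmax and the chain-rule accumulation are the point of the example.<br />

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax: subtract the max before exponentiating.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def sequence_log_prob(next_scores, v, target):
    """log p(y_1..y_T' | v) = sum_t log p(y_t | v, y_1..y_{t-1}).
    next_scores(v, prefix) stands in for the decoder's vocabulary scores
    given the encoded source v and the target prefix so far."""
    log_p = 0.0
    prefix = []
    for y in target:
        probs = softmax(next_scores(v, prefix))
        log_p += np.log(probs[y])
        prefix.append(y)
    return log_p

V = 4  # toy vocabulary size

def uniform_scores(v, prefix):
    # Hypothetical decoder that is indifferent: uniform over the vocabulary.
    return np.zeros(V)

lp = sequence_log_prob(uniform_scores, None, [2, 0])
print(lp)  # 2 * log(1/4), about -2.7726
```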
<br />
The objective function used during the training process was:<br />
<br />
:<math>\,\frac{1}{|T_r|}\sum_{(S,T)\in T_r} \log p(T|S)\,</math><br />
<br />
where <math>\,S</math> is the source sentence, <math>\,T</math> is its paired translation, and <math>\,T_r</math> is the training set. This objective maximizes the average log probability of the correct translation <math>\,T</math> given the source sentence <math>\,S</math> over the entire training set. Once training is complete, translations are produced by finding the most likely translation according to the LSTM:<br />
<br />
:<math>\hat{T} = \underset{T}{\operatorname{arg\ max}}\ p(T|S)</math><br />
<br />
<br />It has been shown that Long Short-Term Memory recurrent neural networks can generate both discrete and real-valued sequences with complex, long-range structure using next-step prediction<ref name=grave />.<br />
<br />
=== Input and Output Data Transformation ===<br />
About 12 million English-French sentence pairs were used during training, with a vocabulary of 160,000 words for English and 80,000 for French. Any out-of-vocabulary word was replaced with a special token. Every sentence was terminated with an <EOS> token to indicate the end of the sentence.<br />
<br />
Additionally, input sentences were entered backwards, as the researchers found this significantly increased accuracy. For example, for the sentence "Today I went to lectures.", the input order would be "lectures, to, went, I, Today". They suspect this is due to the reduction of the time lag between corresponding words at the beginning of the source and target sentences.<br />
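The reversal trick only touches the source side of each training pair, as in this small sketch (the placeholder target tokens are illustrative):<br />

```python
def reverse_source(pairs):
    """Reverse each source sentence, leaving the target translation untouched."""
    return [(list(reversed(src)), tgt) for src, tgt in pairs]

# Placeholder target tokens stand in for the French translation.
pairs = [(["Today", "I", "went", "to", "lectures"], ["t1", "t2", "t3"])]
reversed_pairs = reverse_source(pairs)
print(reversed_pairs[0][0])  # ['lectures', 'to', 'went', 'I', 'Today']
```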
<br />
To decode a translation after training, a simple left-to-right beam search is used. The process is as follows: a small number of initial partial translations with the highest probabilities are picked at the start. Each partial translation is then fed back into the LSTM independently, and a new small set of highest-probability words is appended to the end of each. This repeats until the <EOS> token is chosen, at which point the completed sentence is added to the final translation set, which is then ranked and the highest-ranking translation chosen.<br />
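The decoding procedure can be sketched as a generic beam search. The <code>step_probs</code> callback is a hypothetical stand-in for the LSTM's next-word distribution; the beam width and toy vocabulary are illustrative.<br />

```python
import math

def beam_search(step_probs, beam_width, eos, max_len):
    """Left-to-right beam search. step_probs(prefix) returns a dict mapping
    each candidate next word to its probability (a stand-in for the LSTM)."""
    beams = [([], 0.0)]          # (partial translation, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for toks, lp in beams:
            for tok, p in step_probs(toks).items():
                cand = (toks + [tok], lp + math.log(p))
                if tok == eos:
                    finished.append(cand)   # hypothesis is complete
                else:
                    candidates.append(cand)
        if not candidates:
            break
        # Keep only the beam_width highest-scoring partial translations.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    if not finished:
        finished = beams
    return max(finished, key=lambda c: c[1])[0]

EOS = "<EOS>"

def toy_step(prefix):
    # Hypothetical stand-in for the LSTM's next-word distribution.
    if not prefix:
        return {"bonjour": 0.9, "salut": 0.1}
    return {EOS: 1.0}

best = beam_search(toy_step, beam_width=2, eos=EOS, max_len=5)
print(best)  # ['bonjour', '<EOS>']
```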
<br />
= Training and Results =<br />
=== Training Method ===<br />
Two LSTM networks were used overall: one to generate a fixed-dimensional vector representation from the input sequence and another to generate the output sequence from that representation. Each network had 4 layers with 1,000 cells per layer, and <math>\,v</math> can be represented by the 8,000 real numbers stored in the cells' memory after the input sequence has been entered. Stochastic gradient descent with a batch size of 128 and a learning rate of 0.7 was used. Initial parameters were drawn from a uniform distribution between -0.08 and 0.08. The LSTM does not suffer from the vanishing gradient problem, but it can be affected by exploding gradients, which is handled by enforcing a hard constraint on the norm of the gradient.<br />
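The hard constraint on the gradient norm amounts to rescaling the gradient <math>g</math> by <math>5/\|g\|</math> whenever its norm exceeds 5 (in the paper the norm is first averaged over the minibatch). A minimal sketch:<br />

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    """Rescale grad so its norm never exceeds max_norm,
    i.e. g <- max_norm * g / ||g|| whenever ||g|| > max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

print(clip_by_norm(np.array([6.0, 8.0])))  # [3. 4.]  (norm 10 scaled down to 5)
print(clip_by_norm(np.array([0.3, 0.4])))  # [0.3 0.4] (norm 0.5, left unchanged)
```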
<br />
=== Scoring Method ===<br />
Scoring was done using the BLEU (Bilingual Evaluation Understudy) metric. This is an algorithm created for evaluating the quality of machine translated text. This is done by using a modified form of precision to compare a produced translation against a set of reference translations. This metric tends to correlate well with human judgement across a corpus, but performs badly if used to evaluate individual sentences. More information can be found in the [http://www.aclweb.org/anthology/P02-1040.pdf BLEU paper] and the [https://en.wikipedia.org/wiki/BLEU wikipedia article]. These resources both state that the BLEU score is a number between 0 and 1, with closer to 1 corresponding to a better translation. The LSTM paper reports BLEU scores in percentage values.<br />
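For illustration, here is a simplified sentence-level BLEU: the geometric mean of modified (count-clipped) n-gram precisions times a brevity penalty. Real corpus-level BLEU pools counts over all sentences and typically adds smoothing for zero counts; this toy version omits both.<br />

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU, for illustration only."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        if not cand:
            return 0.0
        # Clip each candidate n-gram count by its max count in any reference.
        max_ref = Counter()
        for r_toks in references:
            for g, cnt in ngrams(r_toks, n).items():
                max_ref[g] = max(max_ref[g], cnt)
        matched = sum(min(cnt, max_ref[g]) for g, cnt in cand.items())
        if matched == 0:
            return 0.0          # no smoothing: a zero precision zeroes the score
        log_precisions.append(math.log(matched / sum(cand.values())))
    # Brevity penalty against the closest reference length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)

ref = "the cat sat on the mat".split()
print(bleu(ref, [ref]))                          # 1.0 for a perfect match
print(bleu("the the the the".split(), [ref]))    # 0.0: no candidate bigram matches
```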
<br />
=== Results ===<br />
The resulting LSTM networks outperformed standard Statistical Machine Translation (SMT) with a BLEU score of 34.8 against 33.3, and with certain heuristics or modifications came very close to matching the best performing system. Additionally, they could recognize sentences in active and passive voice as being similar.<br />
<blockquote><br />
Active Voice: I ate an apple.<br />
</blockquote><br />
<blockquote><br />
Passive Voice: The apple was eaten by me.<br />
</blockquote><br />
<br />
An interesting result is the fact that reversing the source sentences (but not the target sentences) improved decoding of long sentences, which in turn increased the BLEU score from 25.9 to 30.6. While the authors do not have a complete explanation, they theorize that the improvement is due to the introduction of many short-term dependencies into the data set: by reversing the source sentences, they minimize the time lag between the end of the source and the start of the target sentence. This reduction in time lag is what the authors believe helps the LSTM establish a link between source and target and better utilize the memory of the network. Note that the ''mean'' time lag does not change. Given the input sequence <math>(x_1, \dots, x_T)</math> and the target sequence <math>(y_1, \dots, y_T)</math>, the sequence of time lags between corresponding words is <math>\Delta t = (T, \dots, T)</math>, with mean <math>\frac{1}{T} \sum_{i=1}^{T} T = T</math>. If, however, the input is reversed, the sequence of time lags of corresponding words is <math>\Delta t = (1, 3, \dots, 2T - 1)</math>, which still has mean <math>\frac{1}{T} \sum_{i=1}^{T} (2i - 1) = \frac{T^2}{T} = T</math>. Thus, half of the time lags are shorter with the reversed input sequence.<br />
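The time-lag arithmetic above can be checked directly. Positions are taken in the concatenated sequence <math>x_1 \dots x_T y_1 \dots y_T</math> (this is illustrative bookkeeping, not code from the paper):<br />

```python
def time_lags(T, reversed_source=False):
    """Lag between source word x_i and its target y_i in the concatenation.
    y_i sits at position T + i; x_i sits at position i, or at T - i + 1
    when the source is reversed."""
    if reversed_source:
        return [(T + i) - (T - i + 1) for i in range(1, T + 1)]  # 2i - 1
    return [(T + i) - i for i in range(1, T + 1)]                # always T

lags_plain = time_lags(6)
lags_rev = time_lags(6, reversed_source=True)
print(lags_plain)         # [6, 6, 6, 6, 6, 6]
print(lags_rev)           # [1, 3, 5, 7, 9, 11]
print(sum(lags_rev) / 6)  # 6.0 -- the mean lag is unchanged
```

Half of the reversed lags (1, 3, 5) are below the original lag of 6, matching the argument above.<br />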
<br />
For example, let "I saw the man" be the source segment and "with the binoculars" the target segment; concatenating them gives "I saw the man with the binoculars". By reversing the source segment ("man the saw I"), the subject "man" is now closer to the target context "binoculars" than when the source is not reversed. Reversing the input sentences leads to more confident predictions in the early parts of the target sentence and less confident predictions in the later parts. It also results in LSTMs with better memory utilization.<br />
<br />
In summary, the LSTM method has proven quite capable of translating long sentences despite the potentially long delay between input time steps. However, it still falls short of [http://www.statmt.org/OSMOSES/sysdesc.pdf Edinburgh's specialised statistical model].<br />
<br />
=== Some developments of LSTM ===<br />
Dealing with very rare words is a challenge for the LSTM. A weakness of the model is its inability to correctly translate out-of-vocabulary words: they are mapped to the unknown-word token and effectively receive no translation. Such untranslated words also impose longer temporal dependencies that decrease the overall efficiency of the network. Luong et al. ([http://www.aclweb.org/anthology/P15-1002.pdf]) suggested a method to address this rare word problem. They assumed that if the origin of an unknown word is known, then the word can be looked up in a post-processing step. This step replaces each unknown word in the system’s output with a translation of its source word. They proposed three strategies to track the source word and translate it using either a dictionary or the identity translation. <br />
<br />
=== Open questions ===<br />
<br />
The results of the paper pose some interesting questions which are not discussed in the paper itself:<br />
<br />
# Instead of reversing the input sequence, the target sequence could be reversed. This would change the time lags between corresponding words in a similar way, but instead of reducing the time lag between the first half of corresponding words, it would be reduced between the last half. This might allow conclusions about whether the improved performance is purely due to the reduced minimal time lag or whether structure in natural language also matters (e.g. whether a short time lag between the first few words is better than a short time lag between the last few words of a sentence).<br />
# For half of the words, the time lag increases to more than the average, so they might make only a minor contribution to the model's performance. It would be interesting to see how much performance is affected by leaving those words out of the input sequence. Or, more generally: how does performance relate to the number of input words used?<br />
<br />
= More Formulations of Recurrent Neural Networks =<br />
The standard RNN is formalized as follows<br />
<br />
:<math>\,h_t=\tanh(W_{hx}x_t+W_{hh}h_{t-1}+b_h)</math><br />
:<math>\,o_t=W_{oh}h_t+b_o</math><br />
<br />
Given a sequence of input vectors <math>\,(x_1,\cdots,x_{T})</math>, the RNN computes a sequence of hidden states <math>\,(h_1,\cdots,h_{T})</math> and a sequence of outputs <math>\,(o_1,\cdots,o_{T})</math> by iterating the above equations. <math>\,W_{hx}</math> is the input-to-hidden weight matrix, <math>\,W_{hh}</math> is the hidden-to-hidden weight matrix, and <math>\,W_{oh}</math> is the hidden-to-output weight matrix. The vectors <math>\,b_{h}</math> and <math>\,b_{o}</math> are the biases. When t=1, the undefined <math>\,W_{hh}h_{t-1}</math> is replaced with a special initial bias vector, <math>\,h_{init}</math>. <br />
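A direct sketch of this formulation, with <math>\,h_{init}</math> substituted for the undefined <math>\,W_{hh}h_0</math> term at <math>t=1</math> (the dimensions are illustrative):<br />

```python
import numpy as np

def rnn_forward(xs, W_hx, W_hh, W_oh, b_h, b_o, h_init):
    """Iterate h_t = tanh(W_hx x_t + W_hh h_{t-1} + b_h), o_t = W_oh h_t + b_o."""
    hs, outs = [], []
    for t, x in enumerate(xs):
        recur = h_init if t == 0 else W_hh @ hs[-1]  # t = 1 uses the initial bias
        h = np.tanh(W_hx @ x + recur + b_h)
        hs.append(h)
        outs.append(W_oh @ h + b_o)
    return hs, outs

rng = np.random.default_rng(3)
W_hx = rng.standard_normal((5, 3))
W_hh = rng.standard_normal((5, 5))
W_oh = rng.standard_normal((2, 5))
b_h, b_o, h_init = np.zeros(5), np.zeros(2), np.zeros(5)
xs = [rng.standard_normal(3) for _ in range(4)]
hs, outs = rnn_forward(xs, W_hx, W_hh, W_oh, b_h, b_o, h_init)
print(len(hs), hs[0].shape, outs[0].shape)  # 4 (5,) (2,)
```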
<br />
It may seem natural to train RNNs with gradient descent, but in reality the gradient decays exponentially as it is backpropagated through time. The relation between the parameters and the dynamics of the RNN is highly unstable, which makes gradient descent ineffective; it is thus argued that RNNs cannot learn long-range temporal dependencies when gradient descent is used for training. A good way to deal with this inability of gradient descent to learn long-range temporal structure in an RNN is the "Long Short-Term Memory" architecture (http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf).<br />
<br />
There are different variants of LSTM<ref name=grave><br />
</ref><ref><br />
Gers, Felix, and Jürgen Schmidhuber. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=861302&tag=1 "Recurrent nets that time and count."] Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on. Vol. 3. IEEE, 2000.<br />
</ref><ref><br />
Cho, Kyunghyun, et al. [http://arxiv.org/pdf/1406.1078v3.pdf "Learning phrase representations using rnn encoder-decoder for statistical machine translation."] arXiv preprint arXiv:1406.1078 (2014).<br />
</ref> other than the original one proposed by Hochreiter et al.<ref name=lstm><br />
</ref> Greff et al. compare the performance of some different popular variants in their work<ref><br />
Greff, Klaus, et al. [http://arxiv.org/pdf/1503.04069.pdf "LSTM: A Search Space Odyssey."] arXiv preprint arXiv:1503.04069 (2015).<br />
</ref> and draw the conclusion that they perform about the same, while Jozefowicz et al. suggest that some architectures can perform better than the LSTM on certain tasks<ref><br />
Jozefowicz, Rafal, Wojciech Zaremba, and Ilya Sutskever. [http://jmlr.org/proceedings/papers/v37/jozefowicz15.pdf "An Empirical Exploration of Recurrent Network Architectures."] Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 2015.<br />
</ref>.<br />
<br />
= Source =<br />
Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning<br />
with neural networks. In Proc. Advances in Neural Information<br />
Processing Systems 27, 3104–3112 (2014).<br />
<references /></div>
<hr />
<div>= Introduction =<br />
The emergence of the Internet and other modern technology has greatly increased people's ability to communicate across vast distances and barriers. However, there still remains the fundamental barrier of languages and as anyone who has attempted to learn a new language can attest, it takes tremendous amount of work to learn more than one language past childhood. The ability to efficiently and quickly translate between languages would then be of great importance. This is an extremely difficult problem however as languages can have varying grammar and context always plays an important role. For example, the word "back" means entirely different things in the following two sentences,<br />
<br />
<blockquote><br />
I am in the back of the car.<br />
</blockquote><br />
<br />
<blockquote><br />
My back hurts.<br />
</blockquote><br />
<br />
Applying Deep Neural Networks (DNNs) to this problem is difficult given that DNNs can only be applied to problems where the inputs and output vectors are of fixed dimensions. This is suitable for applications such as image processing where the dimensions is a known ''a priori'', however in applications such as speech recognition, the dimension is not known. Thus, the goal of this paper is to introduce a domain independent method that learns to map sequences of input vectors to output vectors. Sutskever et al has approached this problem by applying Multi-Layer Long Short-Term Memory (LSTM) architecture, and used this architecture to estimate a conditional probability between input and output sequences. Specifically, they used one LSTM to obtain a large fixed-dimensional representation and another to extract the output sequence from that vector.<br />
<br />
The main result of this work is that on the WMT' 14 English to French translation task, their model obtained a BLEU (Bilingual Evaluation Understudy) score of 34.81 by extracting translations from an ensemble of 5 LTSMs. This is by far the best result achieved by direct translation from an artificial neural network. Also, the LSTM model did not suffer from long sentences, contrary to the recent experiences from researchers using similar architectures. Their model performed well on long sentences because they reversed the source sentences in the training and testing set. Reversing the sentences is a simple trick but it is one of the key contributions of their work.<br />
<br />
= Model =<br />
<br />
<br />
=== Theory of Recurrent Neural Networks ===<br />
<br />
Reall that an Artifical Neural Network (ANN) is a nonlinear function <math>\mathcal{N}:\mathbb{R}^{p_0}<br />
\rightarrow \mathbb{R}</math> that computes an output through iterative nonlinear function applications to an input vector <math>\mathbf{z}_0\in<br />
\mathbb{R}^{p_0}</math>. The updates are said to occur in ''layers'', producing the sequence of vectors associated with each layer as <math>\left\{\mathbf{z}_0,\mathbf{z}_1,\dots,\mathbf{z}_N\right\}</math>. Any <math>\mathbf{z}_n</math> for <math>n \in<br />
\left\{1,\ldots,N\right\}</math> is computed recursively from <math>\mathbf{z}_0</math> with the ''weight matrices'' and ''bias vectors'' as per the rule<br />
<br />
<math><br />
\mathbf{z}_n = <br />
\sigma_n\left(\mathbf{W}_{n}\mathbf{z}_{n-1} + \mathbf{b}_{n}\right), \quad 1 \leq n \leq N</math><br />
<br />
where <math>\mathbf{W}_{n} \in \mathbb{R}^{p_n \times p_{n-1}}</math> is the weight matrix and <math>\mathbf{b}_{n} \in \mathbb{R}^{p_n}</math> is the bias vector associated with the <math>n</math>th layer in the network.<br />
<br />
The element-wise vector function <math>\sigma_n\left(\cdot\right)</math> is sigmoid-like for each component in its domain, outputing a value that ranges in <math>[0,1]</math>. Typically, the functions <math>\left\{\sigma_n\left(\cdot\right)\right\}</math> are the same for <math>n < N</math>, but the final output <math>\sigma_N(\cdot)</math> depends on the network architecture—for instance, it may be a softmax function for multi-label classification. Thus the network is completed characterized by its weights and biases as the tuple <math>(\left\{\mathbf{W}_{n}\right\},\left\{\mathbf{b}_{n}\right\})</math>.<br />
<br />
A sample network for <math>N = 2</math> is depicted below: a graph of an ANN with <math>N = 2</math>, where the vertices represent the vectors <math>\left\{\mathbf{z}_n\right\}_{n=0}^2</math> in their respective layers. Edges denote computation, where the vector transformations have been overlaid to show sample dimensions of <math>\mathbf{W}_1</math> and <math>\mathbf{W}_2</math>, such that they match the vectors <math>\mathbf{z}_0</math> and <math>\mathbf{z}_1</math>.<br />
<center><br />
[[File:ann.png | frame | center |Fig 1. Graph of an ANN with <math>N = 2</math>, where the vertices represent the vectors <math>\left\{\mathbf{z}_n\right\}_{n=0}^2</math> in their respective layers. Edges denote computation, where the vector transformations have been overlaid to show sample dimensions of <math>\mathbf{W}_1</math> and <math>\mathbf{W}_2</math>, such that they match the vectors <math>\mathbf{z}_0</math> and <math>\mathbf{z}_1</math>.<br />
]]<br />
</center><br />
<br />
A Recurrent Neural Network is a generalization of an ANN for a ''sequence'' of inputs <math>\left\{\mathbf{z}_0^{[t]}\right\}</math> where <math>t \in<br />
\left\{1,\ldots,T\right\}</math> such that there are ''recurrent'' connections between the intermediary vectors <math>\left\{\mathbf{z}_n\right\}</math> for different so-called ''time steps''. These connections are made to represent conditioning on the previous vectors in the sequence: supposing the sequence were a vectorized representation of the words, an input to the network could be: <math>\left\{\mathbf{z}_0^{[1]},\mathbf{z}_0^{[2]},\mathbf{z}_0^{[3]}\right\} =<br />
\left\{\text{pass}, \text{the}, \text{sauce}\right\}</math>. In a language modelling problem for predictive text, the probability of obtaining <math>\mathbf{z}_0^{[3]}</math> is strongly conditioned on the previous words in the sequence. As such, additional recurrence weight matrices are added to the update rule for <math>1 \leq n \leq N</math> and <math>t > 1</math> to produce the recurrent update rule<br />
<br />
<math><br />
\mathbf{z}_n^{[t]} = <br />
\sigma_n\left(<br />
\mathbf{b}_{n}<br />
+ \mathbf{W}_{n}\mathbf{z}_{n-1}^{[t]} <br />
+ \mathbf{R}_n\mathbf{z}_{n}^{[t-1]}<br />
\right)</math><br />
<br />
where <math>\mathbf{R}_n \in \mathbb{R}^{p_n \times p_n}</math> is the recurrence matrix that relates the <math>n</math>th layer’s output for item <math>t</math> to its previous output for item <math>t-1</math>. The network architecture for a single layer <math>n</math> at step <math>t</math> is pictured below. This is a schematic of an RNN layer <math>n</math> at step <math>t</math> with recurrence on the output of <math>\mathbf{z}_n^{[t-1]}</math>, with the dimensions of the matrices <math>\mathbf{R}_{n}</math> and <math>\mathbf{W}_{n}</math> pictured.<br />
<center><br />
[[File:rnn.png | frame | center |Fig 2. Schematic of an RNN layer <math>n</math> at step <math>t</math> with recurrence on the output of <math>\mathbf{z}_n^{[t-1]}</math>, with the dimensions of the matrices <math>\mathbf{R}_{n}</math> and <math>\mathbf{W}_{n}</math> pictured. ]]<br />
</center><br />
<br />
=== RNN Architecture by Graves, 2013 ===<br />
<br />
The RNN update rule used by Sutskever et al. comes from a paper by Graves (2013). The connections between layers are denser in this case. The final layer is fully connected to every preceding layer execept for the input <math>\mathbf{z}_0^{[t]}</math>, and follows the update rule<br />
<br />
<math><br />
\mathbf{z}_{N}^{[t]} = \sigma_n\left(<br />
\mathbf{b}_N<br />
+ \displaystyle\sum_{n' = 1}^{N-1} \mathbf{W}_{N,n'}\mathbf{z}_{n'}^{[t]}<br />
\right)</math><br />
<br />
where <math>\mathbf{W}_{N,n'} \in \mathbb{R}^{p_N\times p_{n'}}</math> denotes the weight matrix between layer <math>n'</math> and <math>N</math>.<br />
<br />
The layers 2 through <math>N-1</math> have additional connections to <math>\mathbf{z}_0^{[t]}</math> as<br />
<br />
<math><br />
\mathbf{z}_n^{[t]} = \sigma_n\left(<br />
\mathbf{b}_{n}<br />
+ \mathbf{W}_{n}\mathbf{z}_{n-1}^{[t]} <br />
+ \mathbf{W}_{n,0}\mathbf{z}_0^{[t]} <br />
+ \mathbf{R}_n\mathbf{z}_{n}^{[t-1]}<br />
\right),</math><br />
<br />
where, again, <math>\mathbf{W}_{n,n'}</math> must be of size <math>\mathbb{R}^{p_n\times<br />
p_{n'}}</math>. The first layer has the typical RNN input rule as before,<br />
<br />
<math><br />
\mathbf{z}_{1}^{[t]} = \sigma_1\left(<br />
\mathbf{b}_{1}<br />
+ \mathbf{W}_{1}\mathbf{z}_{0}^{[t]} <br />
+ \mathbf{R}_{1}\mathbf{z}_{1}^{[t-1]}<br />
\right).<br />
</math><br />
<br />
=== Long Short-Term Memory Recurrent Neural Network (LSTM) ===<br />
Recurrent neural networks([http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ introduction to RNN]) are a variation of deep neural networks that are capable of storing information about previous hidden states in special memory layers.<ref name=lstm><br />
Hochreiter, Sepp, and Jürgen Schmidhuber. [http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf "Long short-term memory."] Neural computation 9.8 (1997): 1735-1780.<br />
</ref> ([http://colah.github.io/posts/2015-08-Understanding-LSTMs/ a quick introduction of LSTM]) Unlike feed forward neural networks that take in a single fixed length vector input and output a fixed length vector output, recurrent neural networks can take in a sequence of fixed length vectors as input because of their ability to store information and maintain a connection between inputs through this memory layer. By comparison, previous inputs would have no impact on current output for feed forward neural networks whereas they can impact current input in a recurrent neural network. (This paper used the LSTM formulation from Graves<ref name=grave><br />
Graves, Alex. [http://arxiv.org/pdf/1308.0850.pdf "Generating sequences with recurrent neural networks."] arXiv preprint arXiv:1308.0850 (2013).<br />
</ref>)<br />
<br />
<br />
This form of input fits naturally with language translation, since sentences are sequences of words, and many problems with representing variable-length sentences as fixed-length vectors can be avoided. However, training recurrent neural networks to learn long-time-lag dependencies, where inputs many time steps back can heavily influence the current output, is difficult and generally results in exploding or vanishing gradients. A variation of the recurrent neural network, the long short-term memory network, was used in this paper because it does not suffer as much from the vanishing gradient problem.<br />
<br />
<br />
The purpose of the LSTM in this case is to estimate the conditional probability of the output sequence, <math>\,(y_1,\cdots,y_{T'})</math>, given the input sequence, <math>\,(x_1,\cdots,x_{T})</math>, where <math>\,T</math> does not have to equal <math>\,T'</math>.<br />
<br />
<br />
Let <math>\,v</math> represent the state of the hidden layers after <math>\,(x_1,\cdots,x_{T})</math> has been fed into the LSTM, i.e. what is stored in the network's memory; then<br />
<br />
:<math>\,p(y_1,\cdots,y_{T'}|x_1,\cdots,x_{T})=\prod_{t=1}^{T'} p(y_t|v,y_1,\cdots,y_{t-1})</math><br />
<br />
For each <math>\,p(y_t|v,y_1,\cdots,y_{t-1})</math>, the LSTM at time step <math>\,t</math>, after <math>\,(x_1,\cdots,x_T,y_1,\cdots,y_{t-1})</math> have been fed in, outputs a score for each word in the vocabulary, and the softmax function, <math>\,\frac{e^{z_i}}{\sum_{j=1}^N e^{z_j}}\,</math>, applied to this score vector yields the corresponding probabilities. From this, we can compute any <math>\,p(y_1,\cdots,y_{T'}|x_1,\cdots,x_{T})</math> by repeatedly feeding <math>\,y_t</math> back into the LSTM to obtain the next set of probabilities.<br />
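This chain-rule decomposition can be illustrated with a small sketch: the decoder scores below are made-up numbers standing in for the LSTM's per-step output over a 4-word vocabulary, and the sequence probability is the product of the per-step softmax probabilities.<br />

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over vocabulary scores."""
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

# Made-up decoder scores over a 4-word vocabulary at three steps.
step_scores = np.array([[2.0, 0.5, 0.1, -1.0],
                        [0.3, 1.7, 0.2, 0.0],
                        [0.1, 0.4, 2.2, 0.3]])
emitted = [0, 1, 2]  # indices of the emitted words y_1, y_2, y_3

# log p(y_1, y_2, y_3 | v) is the sum of the per-step log probabilities.
log_p = sum(np.log(softmax(s)[w]) for s, w in zip(step_scores, emitted))
```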
<br />
The objective function used during the training process was:<br />
<br />
:<math>\,\frac{1}{|T_r|}\sum_{(S,T)\in T_r} \log p(T|S)\,</math><br />
<br />
where <math>\,S</math> is the source sentence, <math>\,T</math> is its paired translation and <math>\,T_r</math> is the training set. The objective maximizes the average log probability of a correct translation <math>\,T</math> given the source sentence <math>\,S</math> over the entire training set. Once training is complete, translations are produced by finding the most likely translation according to the LSTM:<br />
<br />
:<math>\hat{T} = \underset{T}{\operatorname{arg\ max}}\ p(T|S)</math><br />
<br />
=== Input and Output Data Transformation ===<br />
About 12 million English-French sentence pairs were used during training, with a vocabulary of 160,000 words for English and 80,000 for French. Any out-of-vocabulary word was replaced with a special token, and an <EOS> token was appended to every sentence to mark the end of the sentence.<br />
<br />
Additionally, input sentences were entered backwards, as the researchers found this significantly increased accuracy. For example, for the sentence "Today I went to lectures.", the input order would be "lectures, to, went, I, Today". They suspect this is due to the reduced time lag between corresponding words at the beginnings of the source and target sentences.<br />
<br />
To decode a translation after training, a simple left-to-right beam search algorithm is used. A small number of candidate translations with the highest probabilities are kept at each step. Each candidate is re-entered into the LSTM, and a small set of highest-probability next words is appended to the end of each candidate. This repeats until the <EOS> token is chosen, at which point the completed translation is added to the final set of candidates, which is ranked so that the highest-scoring translation can be chosen.<br />
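The decoding procedure can be sketched as follows. Here <code>step_fn</code> is a hypothetical stand-in for querying the LSTM with a partial translation and reading off the softmax over next words; the toy model and all probabilities are invented for illustration.<br />

```python
import math

def beam_search(step_fn, beam_size, eos, max_len=20):
    """Left-to-right beam search: keep the beam_size best prefixes,
    extend each with candidate next words, and collect prefixes that
    end in the <EOS> token."""
    beams = [([], 0.0)]  # (prefix, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, lp in beams:
            for tok, p in step_fn(prefix).items():
                candidates.append((prefix + [tok], lp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, lp in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, lp))
        if not beams:
            break
    return max(finished, key=lambda c: c[1])

# Toy next-word model favouring the sequence a, b, <eos>.
def toy_step(prefix):
    table = {0: {"a": 0.9, "b": 0.1},
             1: {"b": 0.8, "<eos>": 0.2},
             2: {"<eos>": 1.0}}
    return table[min(len(prefix), 2)]

best, score = beam_search(toy_step, beam_size=2, eos="<eos>")
```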
<br />
= Training and Results =<br />
=== Training Method ===<br />
Two LSTM networks were used in total: one to encode the input sequence into a fixed-length vector representation and another to generate the output sequence from that representation. Each network had 4 layers with 1000 cells per layer, and <math>\,v</math> is represented by the 8000 real numbers stored in the cells' memories after the input sequence has been entered. Stochastic gradient descent with a batch size of 128 and a learning rate of 0.7 was used, and initial parameters were drawn uniformly between -0.08 and 0.08. While the LSTM does not suffer from the vanishing gradient problem, it can be affected by exploding gradients, which is handled by enforcing a hard constraint on the norm of the gradient.<br />
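The hard constraint on the gradient norm amounts to rescaling the gradient whenever its L2 norm exceeds a threshold; the threshold value below is illustrative rather than the paper's exact setting.<br />

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale the gradient so its L2 norm never exceeds max_norm
    (illustrative threshold; direction is preserved, only the
    magnitude is capped)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = clip_gradient(np.array([30.0, 40.0]))  # norm 50, rescaled down to 5
```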
<br />
=== Scoring Method ===<br />
Scoring was done using the BLEU (Bilingual Evaluation Understudy) metric, an algorithm for evaluating the quality of machine-translated text. It compares a produced translation against a set of reference translations using a modified form of n-gram precision. The metric tends to correlate well with human judgement across a corpus, but performs badly when used to evaluate individual sentences. More information can be found in the [http://www.aclweb.org/anthology/P02-1040.pdf BLEU paper] and the [https://en.wikipedia.org/wiki/BLEU Wikipedia article]. Both resources state that the BLEU score is a number between 0 and 1, with scores closer to 1 corresponding to better translations; the LSTM paper reports BLEU scores as percentages.<br />
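The core ingredient of BLEU, modified (clipped) n-gram precision, can be sketched for a single reference: each candidate n-gram is credited at most as many times as it appears in the reference, which stops a translation from scoring well by repeating a common word. (Full BLEU combines several n-gram orders and a brevity penalty, which this sketch omits.)<br />

```python
from collections import Counter

def modified_precision(candidate, reference, n=1):
    """Clipped n-gram precision against a single reference."""
    ngrams = lambda s: Counter(zip(*(s[i:] for i in range(n))))
    cand, ref = ngrams(candidate), ngrams(reference)
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

cand = "the the the cat".split()
ref = "the cat sat".split()
p1 = modified_precision(cand, ref, n=1)  # 'the' is clipped to 1, so 2/4
```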
<br />
=== Results ===<br />
The resulting LSTM networks outperformed a standard Statistical Machine Translation (SMT) baseline, with a BLEU score of 34.8 against 33.3, and with certain heuristics or modifications came very close to matching the best performing system. Additionally, the model could recognize sentences in active and passive voice as being similar.<br />
<blockquote><br />
Active Voice: I ate an apple.<br />
</blockquote><br />
<blockquote><br />
Passive Voice: The apple was eaten by me.<br />
</blockquote><br />
<br />
An interesting result is the fact that reversing the source sentences (not the test sentences) improved long-sentence decoding, which in turn increased the BLEU score from 25.9 to 30.6. While the authors do not have a complete explanation, they theorize that the improvement is due to the introduction of many short-term dependencies into the data set: by reversing the source sentences they minimize the time lag between the beginning of the source and the beginning of the target sentence. This reduction in time lag is what the authors believe helps the LSTM establish a link between source and target and make better use of its memory. Note that the mean time lag does not change. Given the input sequence <math>(x_1, \dots, x_T)</math> and the target sequence <math>(y_1, \dots, y_T)</math>, the sequence of time lags between corresponding words is <math>\Delta t = (T, \dots, T)</math>, with mean <math>\frac{1}{T} \sum_{i=1}^{T} T = T</math>. If, however, the input is reversed, the sequence of time lags of corresponding words is <math>\Delta t = (1, 3, \dots, 2T - 1)</math>, which still has mean <math>\frac{1}{T} \sum_{i=1}^{T} (2i - 1) = \frac{1}{T} \sum_{i=1}^{T/2} \left(2i + 2(T-i)\right) = T</math> (assuming even <math>T</math>; odd <math>T</math> can be shown similarly). Thus, half of the time lags are shorter with the reversed input sequence.<br />
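The time-lag calculation above can be verified directly: with the source (length <math>T</math>) concatenated before the target, source word <math>i</math> sits at position <math>i</math> (or <math>T - i + 1</math> after reversal) and target word <math>i</math> at position <math>T + i</math>.<br />

```python
def time_lags(T, reversed_source=False):
    """Distances between corresponding source and target words when
    the source sentence precedes the target sentence."""
    if reversed_source:
        # (T + i) - (T - i + 1) = 2i - 1
        return [2 * i - 1 for i in range(1, T + 1)]
    # (T + i) - i = T for every word
    return [T] * T

T = 6
plain, rev = time_lags(T), time_lags(T, reversed_source=True)
mean_plain = sum(plain) / T  # equals T
mean_rev = sum(rev) / T      # also T, but half the lags are now shorter
```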
<br />
For example, let "I saw the man" be the source sentence and "with the binoculars" the target continuation; concatenating both gives "I saw the man with the binoculars". Reversing the source sentence ("man the saw I") brings the subject "man" closer to the context word "binoculars" than in the unreversed order.<br />
<br />
In summary, the LSTM method has proven quite capable of translating long sentences despite the potentially long delay between input time steps. However, it still falls short of [http://www.statmt.org/OSMOSES/sysdesc.pdf Edinburgh's specialised statistical model].<br />
<br />
=== Shortcomings of LSTM ===<br />
Dealing with very rare words is a challenge for the LSTM: it cannot translate out-of-vocabulary words at all, and an unknown word in the input can disrupt the decoding of the words that follow it, reducing the overall quality of the translation. Luong et al. ([http://www.aclweb.org/anthology/P15-1002.pdf]) suggested a method to address the rare word problem. They assumed that if the source of each unknown word is known, the word can be looked up in a post-processing step that replaces each unknown word in the system's output with a translation of its source word. They proposed three strategies to track the source word and translate it using either a dictionary or the identity translation. <br />
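The post-processing idea can be sketched as follows; the function names, the alignment format (output position mapped to source position) and the example sentences are hypothetical, not taken from the cited paper.<br />

```python
def replace_unknowns(output_tokens, alignment, source_tokens, dictionary):
    """Replace each <unk> in the output with the dictionary translation
    of its aligned source word, falling back to the identity
    translation when the word is not in the dictionary."""
    fixed = []
    for i, tok in enumerate(output_tokens):
        if tok == "<unk>":
            src = source_tokens[alignment[i]]
            tok = dictionary.get(src, src)  # identity fallback
        fixed.append(tok)
    return fixed

out = replace_unknowns(["le", "<unk>", "dort"], {1: 1},
                       ["the", "dog", "sleeps"], {"dog": "chien"})
```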
<br />
=== Open questions ===<br />
<br />
The results of the paper pose some interesting questions which are not discussed in the paper itself:<br />
<br />
# Instead of reversing the input sequence the target sequence could be reversed. This would change the time lags between corresponding words in a similar way, but instead of reducing the time lag between the first half of corresponding words, it is reduced between the last half of the words. This might allow conclusions about whether the improved performance is purely due to the reduced minimal time lag or whether structure in natural language is also important (e.g. when a short time lag between the first few words is better than a short time lag between the last few words of sentence).<br />
# For half of the words, the time lag increases to more than the average. Thus, they might make only a minor contribution to the model performance. It could be interesting to see how much the performance is affected by leaving those words out of the input sequence. More generally, one could ask: how does the performance relate to the number of input words used?<br />
<br />
= More Formulations of Recurrent Neural Networks =<br />
The standard RNN is formalized as follows<br />
<br />
:<math>\,h_t=\tanh(W_{hx}x_t+W_{hh}h_{t-1}+b_h)</math><br />
:<math>\,o_t=W_{oh}h_t+b_o</math><br />
<br />
Given a sequence of input vectors <math>\,(x_1,\cdots,x_{T})</math>, the RNN computes a sequence of hidden states <math>\,(h_1,\cdots,h_{T})</math> and a sequence of outputs <math>\,(o_1,\cdots,o_{T})</math> by iterating the above equations. <math>\,W_{hx}</math> is the input-to-hidden weight matrix, <math>\,W_{hh}</math> is the hidden-to-hidden weight matrix, and <math>\,W_{oh}</math> is the hidden-to-output weight matrix. The vectors <math>\,b_{h}</math> and <math>\,b_{o}</math> are biases. When <math>\,t=1</math>, the undefined <math>\,W_{hh}h_{t-1}</math> is replaced with a special initial bias vector, <math>\,h_{init}</math>. <br />
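The two update equations can be iterated directly; in this minimal numpy sketch the weights are random stand-ins for trained parameters, and all sizes are illustrative.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_h, n_out = 3, 5, 2

# Randomly initialised parameters standing in for trained weights.
W_hx = 0.1 * rng.standard_normal((n_h, n_in))
W_hh = 0.1 * rng.standard_normal((n_h, n_h))
W_oh = 0.1 * rng.standard_normal((n_out, n_h))
b_h, b_o = np.zeros(n_h), np.zeros(n_out)
h_init = np.zeros(n_h)  # plays the role of the initial bias vector

def rnn_forward(xs):
    """Iterate h_t = tanh(W_hx x_t + W_hh h_{t-1} + b_h), o_t = W_oh h_t + b_o."""
    h, hs, os = h_init, [], []
    for x in xs:
        h = np.tanh(W_hx @ x + W_hh @ h + b_h)
        hs.append(h)
        os.append(W_oh @ h + b_o)
    return hs, os

hs, os = rnn_forward(rng.standard_normal((4, n_in)))
```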
<br />
It may seem straightforward to train RNNs with gradient descent, but in reality the gradient decays exponentially as it is backpropagated through time. The relationship between the parameters and the dynamics of the RNN is highly unstable, which makes gradient descent ineffective; it has thus been argued that RNNs cannot learn long-range temporal dependencies when trained with gradient descent. A good way to deal with this inability is the long short-term memory architecture (http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf).<br />
<br />
There are different variants of LSTM<ref name=grave><br />
</ref><ref><br />
Gers, Felix, and Jürgen Schmidhuber. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=861302&tag=1 "Recurrent nets that time and count."] Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on. Vol. 3. IEEE, 2000.<br />
</ref><ref><br />
Cho, Kyunghyun, et al. [http://arxiv.org/pdf/1406.1078v3.pdf "Learning phrase representations using rnn encoder-decoder for statistical machine translation."] arXiv preprint arXiv:1406.1078 (2014).<br />
</ref> other than the original one proposed by Hochreiter et al.<ref name=lstm><br />
</ref> Greff et al. compare the performance of some different popular variants in their work<ref><br />
Greff, Klaus, et al. [http://arxiv.org/pdf/1503.04069.pdf "LSTM: A Search Space Odyssey."] arXiv preprint arXiv:1503.04069 (2015).<br />
</ref> and draw the conclusion that they perform about the same, while Jozefowicz et al. suggest that some architectures can perform better than the LSTM on certain tasks<ref><br />
Jozefowicz, Rafal, Wojciech Zaremba, and Ilya Sutskever. [http://jmlr.org/proceedings/papers/v37/jozefowicz15.pdf "An Empirical Exploration of Recurrent Network Architectures."] Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 2015.<br />
</ref>.<br />
<br />
= Source =<br />
Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning<br />
with neural networks. In Proc. Advances in Neural Information<br />
Processing Systems 27, 3104–3112 (2014).<br />
<references /></div>Mgohari2