statwiki - User contributions [US]

proposal for STAT946 (Deep Learning) final projects Fall 2015

2015-12-18T05:15:20Z

Amirlk:

'''Project 0:''' (This is just an example)

'''Group members:''' first name family name, first name family name, first name family name

'''Title:''' Sentiment Analysis on Movie Reviews

''' Description:''' The idea and data for this project are taken from http://www.kaggle.com/c/sentiment-analysis-on-movie-reviews.
Sentiment analysis is the problem of determining whether a given string contains positive or negative sentiment. For example, “A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story” contains negative sentiment, but it is not immediately clear which parts of the sentence make it so.
This competition seeks to implement machine learning algorithms that can determine the sentiment of a movie review.

'''Project 1:'''

'''Group members:''' Sean Aubin, Brent Komer

'''Title:''' Convolution Neural Networks in SLAM

''' Description:''' We will try to replicate the results reported in [http://arxiv.org/abs/1411.1509 Convolutional Neural Networks-based Place Recognition] using [http://caffe.berkeleyvision.org/ Caffe] and [http://arxiv.org/abs/1409.4842 GoogLeNet]. As a "stretch" goal, we will try to convert the CNN to a spiking neural network (a technique created by Eric Hunsberger) for greater biological plausibility and easier integration with other cognitive systems using Nengo. This work will help Brent start his PhD, which investigates cognitive localisation systems and object manipulation.

'''Project 2:'''

'''Group members:''' Xinran Liu, Fatemeh Karimi, Deepak Rishi & Chris Choi

'''Title:''' Image Classification with Deep Learning

''' Description:''' Our aim is to participate in the Digit Recognizer Kaggle Challenge, where one has to correctly classify the Modified National Institute of Standards and Technology (MNIST) dataset of handwritten numerical digits. For our first approach we propose using a simple Feed-Forward Neural Network to form a baseline for comparison. We then plan to experiment with different aspects of a Neural Network, such as network architecture and activation functions, and to incorporate a wide variety of training methods.

'''Project 3'''

'''Group members:''' Ri Wang, Maysum Panju, Mahmood Gohari

'''Title:''' Machine Translation Using Neural Networks

'''Description:''' The goal of this project is to translate languages using different types of neural networks and the algorithms described in "Sequence to sequence learning with neural networks" and "Neural machine translation by jointly learning to align and translate". Different vector representations for input sentences (word frequency, Word2Vec, etc.) will be used and all combinations of algorithms will be ranked in terms of accuracy.
Our data will mainly be from [http://www.statmt.org/europarl/ Europarl] and [https://tatoeba.org/eng Tatoeba]. The common target language will be English to allow for easier judgement of translation quality.

'''Project 4'''

'''Group members:''' Peter Blouw, Jan Gosmann

'''Title:''' Using Structured Representations in Memory Networks to Perform Question Answering

'''Description:''' Memory networks are machine learning systems that combine memory and inference to perform tasks that involve sophisticated reasoning (see [http://arxiv.org/pdf/1410.3916.pdf here] and [http://arxiv.org/pdf/1502.05698v7.pdf here]). Our goal in this project is to first implement a memory network that replicates prior performance on the bAbI question-answering tasks described in [http://arxiv.org/pdf/1502.05698v7.pdf Weston et al. (2015)]. Then, we hope to improve upon this baseline performance by using more sophisticated representations of the sentences that encode questions being posed to the network. Current implementations often use a bag of words encoding, which throws out important syntactic information that is relevant to determining what a particular question is asking. As such, we will explore the use of things like POS tags, n-gram information, and parse trees to augment memory network performance.

'''Project 5'''

'''Group members:''' Tim Tse

'''Title:''' The Allen AI Science Challenge

'''Description:''' The goal of this project is to create an artificial intelligence model that can answer multiple-choice questions on a grade 8 science exam, with a success rate better than the best 8th graders. This will involve a deep neural network as the underlying model, to help parse the large amount of information needed to answer these questions. The model should also learn, over time, to give better answers as it acquires more data. This is a Kaggle challenge, and the link to the challenge is [https://www.kaggle.com/c/the-allen-ai-science-challenge here]. The data to produce the model will come from the Kaggle website.

'''Project 6'''

'''Group members:''' Valerie Platsko

'''Title:''' Classification for P300-Speller Using Convolutional Neural Networks

''' Description:''' The goal of this project is to replicate (and possibly extend) the results in [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5492691 Convolutional Neural Networks for P300 Detection with Application to Brain-Computer Interfaces], which used convolutional neural networks to recognize P300 responses in recorded EEG and additionally to correctly recognize attended targets. (In the P300-Speller application, letters flash in rows and columns, so a single P300 response is associated with multiple potential targets.) The data in the paper came from http://www.bbci.de/competition/iii/ (dataset II), and there is an additional P300 Speller dataset available from [http://www.bbci.de/competition/ii/ a previous version of the competition].

'''Project 7'''

'''Group members:''' Amirreza Lashkari, Derek Latremouille, Rui Qiao and Luyao Ruan

'''Title:''' Bag of Words Meets Bags of Popcorn

''' Description:''' Sentiment analysis is a challenging subject in machine learning. People express their emotions in language that is often obscured by sarcasm, ambiguity, and plays on words, all of which could be very misleading for both humans and computers. In this project, features extracted from IMDB movie reviews are used to classify each review as good or bad with the help of a convolutional neural network and the Doc2Vec model. This is a Kaggle challenge (see [https://www.kaggle.com/c/word2vec-nlp-tutorial here]).

'''Project 8'''

'''Group members:''' Abdullah Rashwan and Priyank Jaini

'''Title:''' Learning the Parameters for Continuous Distribution Sum-Product Networks using Bayesian Moment Matching

'''Description:''' Sum-Product Networks have generated interest due to their ability to do exact inference in linear time with respect to the size of the network. Parameter learning, however, is still a problem. We have proposed an online Bayesian Moment Matching algorithm to learn the parameters for discrete distributions; in this work, we are extending the algorithm to learn the parameters for continuous distributions as well.

'''Project 9'''

'''Group members:''' Anthony Caterini

'''Title:''' Critical Analysis of the Manifold Tangent Classifier

'''Description:''' This project aims to thoroughly analyze the [http://papers.nips.cc/paper/4409-the-manifold-tangent-classifier.pdf Manifold Tangent Classifier]. The goal of this project is to implement the classifier as in the paper, and to attempt to formalize some of the geometric interpretation of the algorithm's formulation.

continuous space language models

2015-12-11T20:42:37Z

Amirlk: /* Conclusion */

= Introduction =
In fields such as speech recognition and machine translation, the task is commonly modelled as follows: given an acoustic signal <math>\,x</math> or a source sentence to be translated <math>\,e</math>, find the sequence of words <math>\,w^*</math> that has the highest probability of occurring given <math>\,x</math> or <math>\,e</math>. For an acoustic signal <math>\,x</math>, this can be written as:

<math>w^* = \underset{w}{\arg\max}\ P(w|x) = \underset{w}{\arg\max}\ P(x|w)P(w)</math>
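
The second equality follows from Bayes' rule. As a short worked step, using the same symbols as above:

<math>\underset{w}{\arg\max}\ P(w|x) = \underset{w}{\arg\max}\ \frac{P(x|w)P(w)}{P(x)} = \underset{w}{\arg\max}\ P(x|w)P(w)</math>

The denominator <math>\,P(x)</math> can be dropped because it does not depend on <math>\,w</math>.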

An acoustic or translation model can then be used for <math>\,P(x|w)</math>, similar to the idea behind LDA and QDA, and it remains to create a language model <math>\,P(w)</math> to estimate the probability of any sequence of words <math>\,w</math>.

This is commonly done with a back-off n-gram model; the purpose of the paper summarized here is to use a neural network to better estimate <math>\,P(w)</math>.

= Back-off n-grams Model =

A sequence of words will be defined as <math>\,w^i_1=(w_1,w_2,\dots,w_i)</math> and the formula for the probability <math>\,P(w)</math> can be rewritten as:

<math>P(w^n_1)=P(w_1,w_2,\dots,w_n)=P(w_1)\prod_{i=2}^n P(w_i|w^{i-1}_1)</math>

It is common to estimate <math>\,P(w_i|w^{i-1}_1)</math> through:

<math>\,P(w_i|w^{i-1}_1)\approx\frac{\mbox{number of occurrences of the sequence}\ (w_1,\dots,w_i)}{\mbox{number of occurrences of the sequence}\ (w_1,\dots,w_{i-1})}</math>

However, it is practically impossible to have a training set large enough to contain every possible sequence of words once sequences become long, so some sequences would be assigned an incorrect probability of 0 simply because they do not appear in the training set. This is known as the data sparseness problem. It is commonly mitigated by conditioning only on the last n-1 words instead of the whole context. However, even for small n, certain sequences could still be missing.

To solve this issue, a technique called back-off n-grams is used and the general formula goes as follows:

<math>\,P(w_i|w^{i-1}_1) = \begin{cases}
\frac{\mbox{number of occurrences of the sequence}\ (w_1,\dots,w_i)}{\mbox{number of occurrences of the sequence}\ (w_1,\dots,w_{i-1})}, & \mbox{if the number of occurrences of}\ (w_1,\dots,w_i)\ \mbox{is greater than some constant K} \\
\alpha P(w_i|w^{i-1}_2), & \mbox{otherwise}
\end{cases}</math>

<math>\,\alpha</math> is typically a discounting factor that is less than 1 to account for the lack of direct data. It usually depends on the word sequence.

The general algorithm is then: if the training set contains the sequence often enough, calculate the probability directly from the counts. Otherwise, apply a discounting factor and calculate the conditional probability with the first word of the context removed. For example, if the word sequence "the dog barked" did not occur in the training set, the formula would be written as:

<math>\,P(\mbox{barked}|\mbox{the,dog}) \approx \alpha P(\mbox{barked}|\mbox{dog})</math>
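
The back-off rule above can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions: the function names <code>ngram_counts</code> and <code>backoff_prob</code> are hypothetical, and a single constant <code>alpha</code> is used, whereas in practice the discounting factor usually depends on the word sequence and is chosen so that the probabilities sum to one.

```python
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def backoff_prob(word, context, counts, total, alpha=0.4, k=0):
    """Estimate P(word | context) with a simplified back-off rule.

    alpha is a fixed discount (an assumption of this sketch) and k is the
    count threshold from the formula above.
    """
    context = tuple(context)
    seq = context + (word,)
    if counts[seq] > k:
        # Seen often enough: use the relative frequency directly.
        denom = counts[context] if context else total
        return counts[seq] / denom
    if not context:
        # Unseen even as a unigram; a real model reserves mass for this case.
        return 0.0
    # Otherwise drop the first context word and apply the discount.
    return alpha * backoff_prob(word, context[1:], counts, total, alpha, k)


tokens = "a dog barked and the dog ran".split()
counts = ngram_counts(tokens, 3)
# "the dog barked" never occurs, so the estimate backs off to "dog barked".
p = backoff_prob("barked", ["the", "dog"], counts, len(tokens))
```

Here "dog barked" occurs once and "dog" occurs twice, so the estimate is <code>alpha * 1/2</code>.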

= Model =

The researchers sought a better model for this probability than the back-off n-gram model. Their approach was to map the sequence of n-1 context words onto a multi-dimensional continuous space using one neural network layer, followed by further layers that estimate the probabilities of all possible next words. The formulas and model are as follows:

For some sequence of n-1 words, encode each word using 1-of-K encoding, i.e. a vector with 1 at the word's index and zero everywhere else. Denote these encoded words by <math>(w_{j-n+1},\dots,w_{j-1})</math>, the n-1 words that precede the j'th word in some larger context.

Let P be a projection matrix common to all n-1 words and let

<math>\,a_i=Pw_{j-n+i},i=1,\dots,n-1</math>

Let H be the weight matrix from the projection layer to the hidden layer. The hidden-layer activation is then:

<math>\,h=\tanh(Ha + b)</math> where <math>\,a</math> is the concatenation of all <math>\,a_i</math> and <math>\,b</math> is a bias vector.

Finally, the output vector would be:

<math>\,o=Vh+k</math> where V is the weight matrix from hidden to output and k is another bias vector. <math>\,o</math> is a vector whose dimension equals the vocabulary size, and the probabilities can be calculated from <math>\,o</math> by applying the softmax function.
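
The forward pass described by these formulas can be sketched in plain Python. The layer sizes, the random weights and the helper names (<code>matvec</code>, <code>softmax</code>, <code>forward</code>, <code>random_matrix</code>) are assumptions made for illustration only; they are not taken from the paper.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(o):
    """Turn output scores into probabilities (shifted by the max for stability)."""
    top = max(o)
    exps = [math.exp(x - top) for x in o]
    total = sum(exps)
    return [e / total for e in exps]


def forward(context_ids, P, H, b, V, k):
    """Return the next-word distribution for the n-1 context word indices."""
    a = []
    for idx in context_ids:
        # P times a 1-of-K vector simply selects column idx of P.
        a.extend(row[idx] for row in P)
    h = [math.tanh(x + bias) for x, bias in zip(matvec(H, a), b)]
    o = [x + bias for x, bias in zip(matvec(V, h), k)]
    return softmax(o)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumed): 6-word vocabulary, 3-d projection, 4 hidden units, n = 3.
rng = random.Random(0)
vocab_size, proj_dim, hidden_dim, n = 6, 3, 4, 3
P = random_matrix(proj_dim, vocab_size, rng)
H = random_matrix(hidden_dim, (n - 1) * proj_dim, rng)
b = [0.0] * hidden_dim
V = random_matrix(vocab_size, hidden_dim, rng)
k = [0.0] * vocab_size
probs = forward([2, 5], P, H, b, V, k)
```

Because the same matrix P is applied to every context position, the projection amounts to a table lookup, which is why no explicit one-hot vectors are built in the sketch.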

= Optimization and Training =
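Training was done with standard back-propagation, minimizing the error function:

<math>\,E=-\sum_{i=1}^N t_i \log p_i + \epsilon\left(\sum_{i,j}h^2_{ij}+\sum_{i,j}v^2_{ij}\right)</math>

<math>\,t_i</math> is the i'th component of the desired output vector (1 for the word that actually follows and 0 otherwise), <math>\,p_i</math> is the corresponding predicted probability, and the sums multiplied by <math>\,\epsilon</math> are regularization terms that prevent overfitting of <math>\,H</math> and <math>\,V</math>. The first term is the cross-entropy between the target and predicted distributions, so minimizing <math>\,E</math> increases the probability assigned to the observed word.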
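
As a small worked example, the error for a single training example can be computed as below. The sketch assumes a 1-of-K target and takes the first term as a negative log-likelihood (so that lower is better); the function name <code>regularized_error</code> and the toy numbers are hypothetical.

```python
import math


def regularized_error(probs, target_index, H, V, epsilon):
    """Cross-entropy for one example plus an L2 penalty on H and V."""
    # With a 1-of-K target, the sum over t_i * log(p_i) keeps one term only.
    cross_entropy = -math.log(probs[target_index])
    penalty = sum(x * x for row in H for x in row)
    penalty += sum(x * x for row in V for x in row)
    return cross_entropy + epsilon * penalty


# Toy values: the correct word received probability 0.5.
error = regularized_error([0.25, 0.5, 0.25], 1, [[1.0, 2.0]], [[3.0]], 0.1)
```

With these numbers the cross-entropy is <code>log 2</code> and the penalty is <code>0.1 * (1 + 4 + 9)</code>.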

The researchers used stochastic gradient descent to avoid summing the error over millions of examples for every update, which sped up training.

One issue with this model was that calculating language model probabilities took much longer than with a traditional back-off n-gram model, which reduced its suitability for real-time prediction. To address this, several optimization techniques were used.

===Lattice rescoring===

It is common to keep track of additional candidate solutions, not just the most likely one, in a lattice structure, i.e. a tree-like structure where branches can merge and each path represents a possible solution. For example, the paper shows the following lattice for a tri-gram model, i.e. one that predicts the third word from the first two:

[[File:Lattice.PNG]]

Any branches whose nodes have the same preceding words can be merged. For example, "a,problem" was merged in the middle of the lattice because the tri-gram model would estimate the same probability at that point for both branches. Similarly, "that_is,not" and "there_is,not" cannot be merged because the preceding two words used for prediction are different.

After this structure is created with a traditional back-off n-grams model, the neural network is then used to re-score the lattice and the re-scored lattice is used to make predictions.

===Short List===

In any language, there is usually a small set of commonly used words that accounts for almost all written or spoken text. The short-list idea is that, rather than calculating a probability for every word including the rarest, the neural network calculates probabilities only for a small subset of the most common words. This way, the output vector can be shrunk significantly from <math>\,\mbox{N}</math> to some much smaller size <math>\,\mbox{S}</math>.

If any rare words do occur, their probabilities are calculated using the traditional back-off n-grams model. The formula then goes as follows from the paper:

[[File:shortlist.PNG]]

Where L is the event that <math>\,w_t</math> is in the short-list.
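
The formula itself is only available as an image here, so the following Python sketch rests on an assumption about its form: the neural network distribution over the short-list is scaled by the probability mass that the back-off model gives to the short-list, and words outside the short-list keep their back-off probability. The function name and the toy numbers are hypothetical.

```python
def shortlist_prob(word, nn_probs, backoff_probs, shortlist):
    """Combine a short-list neural model with a back-off model (assumed form).

    nn_probs:      distribution over short-list words only (sums to 1)
    backoff_probs: distribution over the full vocabulary (sums to 1)
    """
    if word in shortlist:
        # Mass the back-off model assigns to the short-list for this context.
        shortlist_mass = sum(backoff_probs[w] for w in shortlist)
        return nn_probs[word] * shortlist_mass
    # Rare words fall back to the traditional model.
    return backoff_probs[word]


backoff_probs = {"the": 0.5, "dog": 0.3, "aardvark": 0.2}
nn_probs = {"the": 0.6, "dog": 0.4}
shortlist = {"the", "dog"}
combined = {w: shortlist_prob(w, nn_probs, backoff_probs, shortlist)
            for w in backoff_probs}
```

Under this assumed form the combined values still sum to one, because the short-list words share exactly the mass that the back-off model had reserved for them.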

===Sorting and Bunch===

The neural network predicts the probabilities of all next words given some context. Suppose the probabilities of two different sequences are required, where sequence 1 is <math>\,w=(w_1,\dots,w_{i-1},w_i)</math> and sequence 2 is <math>\,w'=(w_1,\dots,w_{i-1},w'_i)</math>, so that they differ only in the last word. Then only a single forward pass through the neural network is required, because the output vector for the context <math>\,(w_1,\dots,w_{i-1})</math> contains the probabilities of both <math>\,w_i</math> and <math>\,w'_i</math> being next. It is therefore efficient to sort the requests and merge all sequences that share the same context.

Modern computers are also highly optimized for linear algebra, so it is more efficient to run multiple examples through the matrix equations at the same time. The researchers called this bunching, and simple testing showed that it decreased processing time by a factor of 10 when using 128 examples at once compared to 1.
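
A minimal sketch of the two ideas, with hypothetical function names: requests are first grouped by context so that each distinct context needs only one forward pass, and the distinct contexts are then split into bunches that could be pushed through the network together.

```python
from collections import defaultdict


def group_by_context(ngrams):
    """Map each distinct context to the list of next words requested for it."""
    groups = defaultdict(list)
    for ngram in ngrams:
        groups[tuple(ngram[:-1])].append(ngram[-1])
    return dict(groups)


def make_bunches(contexts, bunch_size):
    """Split the contexts into consecutive bunches of at most bunch_size."""
    contexts = list(contexts)
    return [contexts[i:i + bunch_size]
            for i in range(0, len(contexts), bunch_size)]


requests = [
    ("a", "problem", "is"),
    ("a", "problem", "was"),
    ("that_is", "not", "a"),
]
groups = group_by_context(requests)          # two contexts for three requests
bunches = make_bunches(groups.keys(), 128)   # one bunch of two contexts
```

In a real implementation each bunch would be stacked into a matrix so that the layer computations become matrix-matrix products; that step is left out of this sketch.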

= Training and Usage =

The researchers used numerous optimization techniques during training and their results were summarized in the paper as follows:

[[File:fast_training.PNG]]

Since the model is trained to predict from the last n-1 words, at certain points (such as the start of a sentence) fewer than n-1 words are available and adjustments must be made. The researchers considered two possibilities: using traditional models for these shorter n-grams, or padding the context with a filler word up to n-1 words. After some testing, they found that requests for probabilities of these shorter n-grams were infrequent, and they decided to use the traditional back-off n-gram model for these cases.

= Results =

In general the results were quite good. When this hybrid of the neural network and back-off n-grams was used in combination with a number of acoustic speech recognition models, perplexity (lower is better) decreased by about 10% in a number of cases compared with a traditional back-off n-gram model alone. Some of the results are summarized as follows:

[[File:results1.PNG]]

[[File:results2.PNG]]
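
For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-probability that the model assigns to each word of a test text. A minimal sketch, with made-up probabilities rather than values from the paper:

```python
import math


def perplexity(word_probs):
    """Perplexity of a model given the probability it gave to each test word."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# A model that gives every word probability 1/4 is as uncertain as a
# uniform choice among four words.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```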

= Conclusion =

This paper described the theory and an experimental evaluation of a new approach to language modeling for large vocabulary continuous speech recognition, based on the idea of projecting the words onto a continuous space and performing the probability estimation in this space. The method is fast enough that the neural network language model can be used in a real-time speech recognizer. The necessary capacity of the neural network is an important issue. Three possibilities were explored: increasing the size of the hidden layer, training several networks and interpolating them together, and using large projection layers. The neural network language model is able to cover different speaking styles, ranging from rather well formed speech with few errors (broadcast news) to very relaxed speaking with many errors in syntax and semantics (meetings and conversations). It is claimed that the combination of the developed neural network and a back-off language model can be considered a serious alternative to the commonly used back-off language models alone.

= Source =
Schwenk, H. Continuous space language models. Computer Speech & Language 21, 492–518 (2007).

continuous space language models

2015-12-11T20:39:03Z

Amirlk: /* Conclusion */

= Introduction =
In fields such as speech recognition and machine translation, the task is commonly modelled as follows: given an acoustic signal <math>\,x</math> or a source sentence to be translated <math>\,e</math>, find the sequence of words <math>\,w^*</math> that has the highest probability of occurring given <math>\,x</math> or <math>\,e</math>. For an acoustic signal <math>\,x</math>, this can be written as:

<math>w^* = \underset{w}{\arg\max}\ P(w|x) = \underset{w}{\arg\max}\ P(x|w)P(w)</math>

An acoustic or translation model can then be used for <math>\,P(x|w)</math>, similar to the idea behind LDA and QDA, and it remains to create a language model <math>\,P(w)</math> to estimate the probability of any sequence of words <math>\,w</math>.

This is commonly done with a back-off n-gram model; the purpose of the paper summarized here is to use a neural network to better estimate <math>\,P(w)</math>.

= Back-off n-grams Model =

A sequence of words will be defined as <math>\,w^i_1=(w_1,w_2,\dots,w_i)</math> and the formula for the probability <math>\,P(w)</math> can be rewritten as:

<math>P(w^n_1)=P(w_1,w_2,\dots,w_n)=P(w_1)\prod_{i=2}^n P(w_i|w^{i-1}_1)</math>

It is common to estimate <math>\,P(w_i|w^{i-1}_1)</math> through:

<math>\,P(w_i|w^{i-1}_1)\approx\frac{\mbox{number of occurrences of the sequence}\ (w_1,\dots,w_i)}{\mbox{number of occurrences of the sequence}\ (w_1,\dots,w_{i-1})}</math>

However, it is practically impossible to have a training set large enough to contain every possible sequence of words once sequences become long, so some sequences would be assigned an incorrect probability of 0 simply because they do not appear in the training set. This is known as the data sparseness problem. It is commonly mitigated by conditioning only on the last n-1 words instead of the whole context. However, even for small n, certain sequences could still be missing.

To solve this issue, a technique called back-off n-grams is used and the general formula goes as follows:

<math>\,P(w_i|w^{i-1}_1) = \begin{cases}
\frac{\mbox{number of occurrences of the sequence}\ (w_1,\dots,w_i)}{\mbox{number of occurrences of the sequence}\ (w_1,\dots,w_{i-1})}, & \mbox{if the number of occurrences of}\ (w_1,\dots,w_i)\ \mbox{is greater than some constant K} \\
\alpha P(w_i|w^{i-1}_2), & \mbox{otherwise}
\end{cases}</math>

<math>\,\alpha</math> is typically a discounting factor that is less than 1 to account for the lack of direct data. It usually depends on the word sequence.

The general algorithm is then: if the training set contains the sequence often enough, calculate the probability directly from the counts. Otherwise, apply a discounting factor and calculate the conditional probability with the first word of the context removed. For example, if the word sequence "the dog barked" did not occur in the training set, the formula would be written as:

<math>\,P(\mbox{barked}|\mbox{the,dog}) \approx \alpha P(\mbox{barked}|\mbox{dog})</math>

= Model =

The researchers sought a better model for this probability than the back-off n-gram model. Their approach was to map the sequence of n-1 context words onto a multi-dimensional continuous space using one neural network layer, followed by further layers that estimate the probabilities of all possible next words. The formulas and model are as follows:

For some sequence of n-1 words, encode each word using 1-of-K encoding, i.e. a vector with 1 at the word's index and zero everywhere else. Denote these encoded words by <math>(w_{j-n+1},\dots,w_{j-1})</math>, the n-1 words that precede the j'th word in some larger context.

Let P be a projection matrix common to all n-1 words and let

<math>\,a_i=Pw_{j-n+i},i=1,\dots,n-1</math>

Let H be the weight matrix from the projection layer to the hidden layer. The hidden-layer activation is then:

<math>\,h=\tanh(Ha + b)</math> where <math>\,a</math> is the concatenation of all <math>\,a_i</math> and <math>\,b</math> is a bias vector.

Finally, the output vector would be:

<math>\,o=Vh+k</math> where V is the weight matrix from hidden to output and k is another bias vector. <math>\,o</math> is a vector whose dimension equals the vocabulary size, and the probabilities can be calculated from <math>\,o</math> by applying the softmax function.

= Optimization and Training =
Training was done with standard back-propagation, minimizing the error function:

<math>\,E=\sum_{i=1}^N t_i \log p_i + \epsilon(\sum_{i,j}h^2_{ij}+\sum_{i,j}v^2_{ij})</math>

<math>\,t_i</math> is the desired output for word <math>\,i</math> (1 for the actual next word and 0 otherwise), and the summations multiplied by <math>\,\epsilon</math> are regularization terms to prevent overfitting of <math>\,H</math> and <math>\,V</math>.
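
As a rough illustration, the error for a single training example might be computed as below. The sketch assumes the usual convention that the log-likelihood term is negated so that smaller values are better, and that the target is a 1-of-K vector, so the sum over <math>\,t_i \log p_i</math> reduces to the log-probability of the target word. The function name and the numbers are hypothetical.

```python
import math

def regularized_error(p, target, H, V, eps=1e-4):
    """Negative log-probability of the target word plus an L2 penalty.

    p      -- predicted probabilities, one per vocabulary word
    target -- index of the correct next word (1-of-K target)
    H, V   -- weight matrices as lists of rows
    eps    -- regularization strength (assumed value)
    """
    nll = -math.log(p[target])
    penalty = sum(w * w for row in H for w in row)
    penalty += sum(w * w for row in V for w in row)
    return nll + eps * penalty

E = regularized_error([0.25, 0.5, 0.25], 1, [[1.0, 2.0]], [[3.0]], eps=0.1)
```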

The researchers used stochastic gradient descent to avoid summing the error over millions of examples for each update, which sped up training.

One issue with this model was that computing language model probabilities took much longer than with a traditional back-off n-grams model, which reduced its suitability for real-time prediction. To solve this issue, several optimization techniques were used.

===Lattice rescoring===

It is common to keep track of additional possible solutions, rather than only the most likely one, in a lattice structure, i.e. a tree-like structure where branches can merge and each path represents a possible solution. For example, using a tri-gram model (predicting the third word from the first two), the paper shows the following lattice structure:

[[File:Lattice.PNG]]

Any branches whose nodes carry the same words can be merged. For example, "a,problem" was merged in the middle of the lattice because the tri-gram model would estimate the same probability at that point for both branches. In contrast, "that_is,not" and "there_is,not" cannot be merged because the two preceding words used for prediction are different.

After this structure is created with a traditional back-off n-grams model, the neural network is then used to re-score the lattice and the re-scored lattice is used to make predictions.

===Short List===

In any language, there is usually a small set of commonly used words that form almost all of written or spoken thought. The short-list idea is that rather than calculating every single probability for even the rarest words, the neural network only calculates a small subset of the most common words. This way, the output vector can be significantly shrunk from <math>\,\mbox{N}</math> to some much smaller number <math>\,\mbox{S}</math>.

If any rare words do occur, their probabilities are calculated using the traditional back-off n-grams model. The formula then goes as follows from the paper:

[[File:shortlist.PNG]]

Where L is the event that <math>\,w_t</math> is in the short-list.
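
A minimal sketch of the short-list idea is given below. It assumes the combination takes the commonly used form in which the neural network's probability for a short-list word is rescaled by the back-off probability mass of the whole short-list, while all other words keep their back-off probability. The function name and the toy numbers are hypothetical.

```python
def shortlist_prob(word, nn_probs, backoff_probs, shortlist):
    """Combine neural network and back-off estimates for one context.

    nn_probs      -- network probabilities over the short-list only
    backoff_probs -- back-off probabilities over the full vocabulary
    shortlist     -- set of the most frequent words
    """
    if word in shortlist:
        # Back-off probability mass of the whole short-list for this context
        mass = sum(backoff_probs[w] for w in shortlist)
        return nn_probs[word] * mass
    return backoff_probs[word]

# Toy numbers, invented for illustration
backoff_probs = {"the": 0.5, "dog": 0.3, "aardvark": 0.2}
nn_probs = {"the": 0.6, "dog": 0.4}
shortlist = {"the", "dog"}

combined = {w: shortlist_prob(w, nn_probs, backoff_probs, shortlist)
            for w in backoff_probs}
```

With this rescaling the combined probabilities still sum to 1 over the full vocabulary.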

===Sorting and Bunch===

The neural network predicts the probabilities of all words given some context. Suppose the probabilities of two different sequences are required, where sequence 1 is <math>\,w=(w_1,\dots,w_{i-1},w_i)</math> and sequence 2 is <math>\,w'=(w_1,\dots,w_{i-1},w'_i)</math>, so that they differ only in the last word. Then only a single forward pass through the neural network is required, because the output vector for the context <math>\,(w_1,\dots,w_{i-1})</math> contains the probabilities of both <math>\,w_i</math> and <math>\,w'_i</math> being next. It is therefore efficient to merge all sequences that share the same context.

Modern day computers are also very optimized for linear algebra and it is more efficient to run multiple examples at the same time through the matrix equations. The researchers called this bunching and simple testing showed that this decreased processing time by a factor of 10 when using 128 examples at once compared to 1.
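
The grouping and bunching steps can be sketched as follows. The function names and the example requests are hypothetical, and the actual batched matrix computation is left out; the sketch only shows how requests sharing a context collapse into one forward pass and how the distinct contexts are then split into bunches.

```python
from collections import defaultdict

def group_by_context(requests):
    """Map each distinct context to the list of words requested for it."""
    groups = defaultdict(list)
    for context, word in requests:
        groups[tuple(context)].append(word)
    return dict(groups)

def bunches(contexts, size):
    """Yield successive bunches of at most `size` contexts."""
    for i in range(0, len(contexts), size):
        yield contexts[i:i + size]

requests = [
    (("a", "problem"), "is"),
    (("a", "problem"), "was"),   # same context: no extra forward pass needed
    (("is", "not"), "a"),
]
groups = group_by_context(requests)
batched = list(bunches(sorted(groups), 128))
```

In this example three requests require only two forward passes, and both contexts fit into a single bunch.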

= Training and Usage =

The researchers used numerous optimization techniques during training and their results were summarized in the paper as follows:

[[File:fast_training.PNG]]

Since the model is trained to predict from the last n-1 words, at certain points fewer than n-1 words will be available and adjustments must be made. The researchers considered two possibilities: using traditional models for these shorter n-grams, or padding the context with a filler word up to n-1 words. After some testing, they found that requests for short n-gram probabilities were rare, and they decided to use the traditional back-off n-gram model for these cases.

= Results =

In general the results were quite good. When this hybrid of neural network and back-off n-grams was used in combination with a number of acoustic speech recognition models, perplexity (lower is better) decreased by about 10% in a number of cases compared with a traditional back-off n-grams model alone. Some of the results are summarized as follows:

[[File:results1.PNG]]

[[File:results2.PNG]]
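
For reference, perplexity is the exponential of the average negative log-probability that the model assigns to each word of a test text. A small sketch, with a hypothetical function name and invented probabilities:

```python
import math

def perplexity(word_probs):
    """Perplexity from the probabilities a model assigned to each test word."""
    avg_neg_log = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(avg_neg_log)

# A model that assigns probability 1/4 to every word has perplexity 4
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

A model that assigns higher probability to the words that actually occur therefore obtains a lower perplexity.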

= Conclusion =

This paper described the theory and an experimental evaluation of a new approach to language modeling for large vocabulary continuous speech recognition, based on the idea of projecting the words onto a continuous space and performing the probability estimation in this space. The method is fast enough that the neural network language model can be used in a real-time speech recognizer. The necessary capacity of the neural network is an important issue. Three possibilities were explored: increasing the size of the hidden layer, training several networks and interpolating them together, and using large projection layers.

= Source =
Schwenk, H. Continuous space language models. Computer Speech Lang. 21, 492–518 (2007).

continuous space language models

2015-12-11T20:36:31Z

Amirlk: /* Conclusion */

= Introduction =
In certain fields of study such as speech recognition or machine translation, for some acoustic signal <math>\,x</math> or the source sentence to be translated <math>\,e</math>, it is common to model these problems as finding the sequence of words <math>\,w^*</math> that has the highest probability of occurring given <math>\,x</math> or <math>\,e</math>. This can be written as:

<math>w^* = arg\ \underset {w}{max} P(w|x) = arg\ \underset{w}{max} P(x|w)P(w)</math>

An acoustic or translation model can then be used for <math>\,P(x|w)</math>, similar to the idea behind LDA and QDA, and it remains to create a language model <math>\,P(w)</math> to estimate the probability of any sequence of words <math>\,w</math>.

This is commonly done through the back-off n-grams model and the purpose behind this research paper is to use a neural network to better estimate <math>\,P(w)</math>.

= Back-off n-grams Model =

A sequence of words will be defined as <math>\,w^i_1=(w_1,w_2,\dots,w_i)</math> and the formula for the probability <math>\,P(w)</math> can be rewritten as:

<math>P(w^n_1)=P(w_1,w_2,\dots,w_n)=P(w_1)\prod_{i=2}^n P(w_i|w^{i-1}_1)</math>

It is common to estimate <math>\,P(w_i|w^{i-1}_1)</math> through:

<math>\,P(w_i|w^{i-1}_1)\approx\frac{\mbox{number of occurrence of the sequence} (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence} (w_1,\dots,w_{i-1})}</math>

However, it is practically impossible to have a training set large enough to contain every possible sequence of words if the sequence is long enough, and some sequences would have an incorrect probability of 0 simply because they are not in the training set. This is known as the data sparseness problem. This problem is commonly addressed by considering only the last n-1 words instead of the whole context. However, even for small n, certain sequences could still be missing.

To solve this issue, a technique called back-off n-grams is used and the general formula goes as follows:

<math>\,P(w_i|w^{i-1}_1) = \begin{cases}
\frac{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_{i-1})}, & \mbox{if number of occurrence of}\ (w_1,\dots,w_i)\ \mbox{is greater than some constant K} \\
\alpha P(w_i|w^{i-1}_2), & \mbox{otherwise}
\end{cases}</math>

<math>\,\alpha</math> is typically a discounting factor that is less than 1 to account for the lack of direct data. It usually depends on the word sequence.

The general algorithm is then as follows: if the data set contains the sequence often enough, calculate the probability directly. Otherwise, apply a discounting factor and calculate the conditional probability with the first word in the sequence removed. For example, if the word sequence "the dog barked" did not exist in the training set, the formula would be written as:

<math>\,P(\mbox{barked}|\mbox{the,dog}) \approx \alpha P(\mbox{barked}|\mbox{dog})</math>

= Model =

The researchers sought a better model for this probability than the back-off n-grams model. Their approach was to map the sequence of n-1 words onto a multi-dimensional continuous space using one neural network layer, followed by another layer that estimates the probabilities of all possible next words. The formulas and model are as follows:

For some sequence of n-1 words, encode each word using 1-of-K encoding, i.e. 1 at the index of the word and zero everywhere else. Label the 1-of-K encodings by <math>(w_{j-n+1},\dots,w_{j-1})</math> for the n-1 words preceding the j'th word in some larger context.

Let P be a projection matrix common to all n-1 words and let

<math>\,a_i=Pw_{j-n+i},i=1,\dots,n-1</math>

Let H be the weight matrix from the projection layer to the hidden layer. The state of the hidden layer is then:

<math>\,h=\tanh(Ha + b)</math> where <math>\,a</math> is the concatenation of all <math>\,a_i</math> and <math>\,b</math> is a bias vector.

Finally, the output vector would be:

<math>\,o=Vh+k</math> where V is the weight matrix from hidden to output and k is another bias vector. <math>\,o</math> would be a vector with same dimensions as the total vocabulary size and the probabilities can be calculated from <math>\,o</math> by applying the softmax function.

= Optimization and Training =
Training was done with standard back-propagation, minimizing the error function:

<math>\,E=\sum_{i=1}^N t_i \log p_i + \epsilon(\sum_{i,j}h^2_{ij}+\sum_{i,j}v^2_{ij})</math>

<math>\,t_i</math> is the desired output for word <math>\,i</math> (1 for the actual next word and 0 otherwise), and the summations multiplied by <math>\,\epsilon</math> are regularization terms to prevent overfitting of <math>\,H</math> and <math>\,V</math>.

The researchers used stochastic gradient descent to avoid summing the error over millions of examples for each update, which sped up training.

One issue with this model was that computing language model probabilities took much longer than with a traditional back-off n-grams model, which reduced its suitability for real-time prediction. To solve this issue, several optimization techniques were used.

===Lattice rescoring===

It is common to keep track of additional possible solutions, rather than only the most likely one, in a lattice structure, i.e. a tree-like structure where branches can merge and each path represents a possible solution. For example, using a tri-gram model (predicting the third word from the first two), the paper shows the following lattice structure:

[[File:Lattice.PNG]]

Any branches whose nodes carry the same words can be merged. For example, "a,problem" was merged in the middle of the lattice because the tri-gram model would estimate the same probability at that point for both branches. In contrast, "that_is,not" and "there_is,not" cannot be merged because the two preceding words used for prediction are different.

After this structure is created with a traditional back-off n-grams model, the neural network is then used to re-score the lattice and the re-scored lattice is used to make predictions.

===Short List===

In any language, there is usually a small set of commonly used words that form almost all of written or spoken thought. The short-list idea is that rather than calculating every single probability for even the rarest words, the neural network only calculates a small subset of the most common words. This way, the output vector can be significantly shrunk from <math>\,\mbox{N}</math> to some much smaller number <math>\,\mbox{S}</math>.

If any rare words do occur, their probabilities are calculated using the traditional back-off n-grams model. The formula then goes as follows from the paper:

[[File:shortlist.PNG]]

Where L is the event that <math>\,w_t</math> is in the short-list.

===Sorting and Bunch===

The neural network predicts the probabilities of all words given some context. Suppose the probabilities of two different sequences are required, where sequence 1 is <math>\,w=(w_1,\dots,w_{i-1},w_i)</math> and sequence 2 is <math>\,w'=(w_1,\dots,w_{i-1},w'_i)</math>, so that they differ only in the last word. Then only a single forward pass through the neural network is required, because the output vector for the context <math>\,(w_1,\dots,w_{i-1})</math> contains the probabilities of both <math>\,w_i</math> and <math>\,w'_i</math> being next. It is therefore efficient to merge all sequences that share the same context.

Modern day computers are also very optimized for linear algebra and it is more efficient to run multiple examples at the same time through the matrix equations. The researchers called this bunching and simple testing showed that this decreased processing time by a factor of 10 when using 128 examples at once compared to 1.

= Training and Usage =

The researchers used numerous optimization techniques during training and their results were summarized in the paper as follows:

[[File:fast_training.PNG]]

Since the model is trained to predict from the last n-1 words, at certain points fewer than n-1 words will be available and adjustments must be made. The researchers considered two possibilities: using traditional models for these shorter n-grams, or padding the context with a filler word up to n-1 words. After some testing, they found that requests for short n-gram probabilities were rare, and they decided to use the traditional back-off n-gram model for these cases.

= Results =

In general the results were quite good. When this hybrid of neural network and back-off n-grams was used in combination with a number of acoustic speech recognition models, perplexity (lower is better) decreased by about 10% in a number of cases compared with a traditional back-off n-grams model alone. Some of the results are summarized as follows:

[[File:results1.PNG]]

[[File:results2.PNG]]

= Conclusion =

This paper described the theory and an experimental evaluation of a new approach to language modeling for large vocabulary continuous speech recognition, based on the idea of projecting the words onto a continuous space and performing the probability estimation in this space.

= Source =
Schwenk, H. Continuous space language models. Computer Speech Lang. 21, 492–518 (2007).

continuous space language models

2015-12-11T20:34:43Z

Amirlk:

= Introduction =
In certain fields of study such as speech recognition or machine translation, for some acoustic signal <math>\,x</math> or the source sentence to be translated <math>\,e</math>, it is common to model these problems as finding the sequence of words <math>\,w^*</math> that has the highest probability of occurring given <math>\,x</math> or <math>\,e</math>. This can be written as:

<math>w^* = arg\ \underset {w}{max} P(w|x) = arg\ \underset{w}{max} P(x|w)P(w)</math>

An acoustic or translation model can then be used for <math>\,P(x|w)</math>, similar to the idea behind LDA and QDA, and it remains to create a language model <math>\,P(w)</math> to estimate the probability of any sequence of words <math>\,w</math>.

This is commonly done through the back-off n-grams model and the purpose behind this research paper is to use a neural network to better estimate <math>\,P(w)</math>.

= Back-off n-grams Model =

A sequence of words will be defined as <math>\,w^i_1=(w_1,w_2,\dots,w_i)</math> and the formula for the probability <math>\,P(w)</math> can be rewritten as:

<math>P(w^n_1)=P(w_1,w_2,\dots,w_n)=P(w_1)\prod_{i=2}^n P(w_i|w^{i-1}_1)</math>

It is common to estimate <math>\,P(w_i|w^{i-1}_1)</math> through:

<math>\,P(w_i|w^{i-1}_1)\approx\frac{\mbox{number of occurrence of the sequence} (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence} (w_1,\dots,w_{i-1})}</math>

However, it is practically impossible to have a training set large enough to contain every possible sequence of words if the sequence is long enough, and some sequences would have an incorrect probability of 0 simply because they are not in the training set. This is known as the data sparseness problem. This problem is commonly addressed by considering only the last n-1 words instead of the whole context. However, even for small n, certain sequences could still be missing.

To solve this issue, a technique called back-off n-grams is used and the general formula goes as follows:

<math>\,P(w_i|w^{i-1}_1) = \begin{cases}
\frac{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_{i-1})}, & \mbox{if number of occurrence of}\ (w_1,\dots,w_i)\ \mbox{is greater than some constant K} \\
\alpha P(w_i|w^{i-1}_2), & \mbox{otherwise}
\end{cases}</math>

<math>\,\alpha</math> is typically a discounting factor that is less than 1 to account for the lack of direct data. It usually depends on the word sequence.

The general algorithm is then as follows: if the data set contains the sequence often enough, calculate the probability directly. Otherwise, apply a discounting factor and calculate the conditional probability with the first word in the sequence removed. For example, if the word sequence "the dog barked" did not exist in the training set, the formula would be written as:

<math>\,P(\mbox{barked}|\mbox{the,dog}) \approx \alpha P(\mbox{barked}|\mbox{dog})</math>

= Model =

The researchers sought a better model for this probability than the back-off n-grams model. Their approach was to map the sequence of n-1 words onto a multi-dimensional continuous space using one neural network layer, followed by another layer that estimates the probabilities of all possible next words. The formulas and model are as follows:

For some sequence of n-1 words, encode each word using 1-of-K encoding, i.e. 1 at the index of the word and zero everywhere else. Label the 1-of-K encodings by <math>(w_{j-n+1},\dots,w_{j-1})</math> for the n-1 words preceding the j'th word in some larger context.

Let P be a projection matrix common to all n-1 words and let

<math>\,a_i=Pw_{j-n+i},i=1,\dots,n-1</math>

Let H be the weight matrix from the projection layer to the hidden layer. The state of the hidden layer is then:

<math>\,h=\tanh(Ha + b)</math> where <math>\,a</math> is the concatenation of all <math>\,a_i</math> and <math>\,b</math> is a bias vector.

Finally, the output vector would be:

<math>\,o=Vh+k</math> where V is the weight matrix from hidden to output and k is another bias vector. <math>\,o</math> would be a vector with same dimensions as the total vocabulary size and the probabilities can be calculated from <math>\,o</math> by applying the softmax function.

= Optimization and Training =
Training was done with standard back-propagation, minimizing the error function:

<math>\,E=\sum_{i=1}^N t_i \log p_i + \epsilon(\sum_{i,j}h^2_{ij}+\sum_{i,j}v^2_{ij})</math>

<math>\,t_i</math> is the desired output for word <math>\,i</math> (1 for the actual next word and 0 otherwise), and the summations multiplied by <math>\,\epsilon</math> are regularization terms to prevent overfitting of <math>\,H</math> and <math>\,V</math>.

The researchers used stochastic gradient descent to avoid summing the error over millions of examples for each update, which sped up training.

One issue with this model was that computing language model probabilities took much longer than with a traditional back-off n-grams model, which reduced its suitability for real-time prediction. To solve this issue, several optimization techniques were used.

===Lattice rescoring===

It is common to keep track of additional possible solutions, rather than only the most likely one, in a lattice structure, i.e. a tree-like structure where branches can merge and each path represents a possible solution. For example, using a tri-gram model (predicting the third word from the first two), the paper shows the following lattice structure:

[[File:Lattice.PNG]]

Any branches whose nodes carry the same words can be merged. For example, "a,problem" was merged in the middle of the lattice because the tri-gram model would estimate the same probability at that point for both branches. In contrast, "that_is,not" and "there_is,not" cannot be merged because the two preceding words used for prediction are different.

After this structure is created with a traditional back-off n-grams model, the neural network is then used to re-score the lattice and the re-scored lattice is used to make predictions.

===Short List===

In any language, there is usually a small set of commonly used words that form almost all of written or spoken thought. The short-list idea is that rather than calculating every single probability for even the rarest words, the neural network only calculates a small subset of the most common words. This way, the output vector can be significantly shrunk from <math>\,\mbox{N}</math> to some much smaller number <math>\,\mbox{S}</math>.

If any rare words do occur, their probabilities are calculated using the traditional back-off n-grams model. The formula then goes as follows from the paper:

[[File:shortlist.PNG]]

Where L is the event that <math>\,w_t</math> is in the short-list.

===Sorting and Bunch===

The neural network predicts the probabilities of all words given some context. Suppose the probabilities of two different sequences are required, where sequence 1 is <math>\,w=(w_1,\dots,w_{i-1},w_i)</math> and sequence 2 is <math>\,w'=(w_1,\dots,w_{i-1},w'_i)</math>, so that they differ only in the last word. Then only a single forward pass through the neural network is required, because the output vector for the context <math>\,(w_1,\dots,w_{i-1})</math> contains the probabilities of both <math>\,w_i</math> and <math>\,w'_i</math> being next. It is therefore efficient to merge all sequences that share the same context.

Modern day computers are also very optimized for linear algebra and it is more efficient to run multiple examples at the same time through the matrix equations. The researchers called this bunching and simple testing showed that this decreased processing time by a factor of 10 when using 128 examples at once compared to 1.

= Training and Usage =

The researchers used numerous optimization techniques during training and their results were summarized in the paper as follows:

[[File:fast_training.PNG]]

Since the model is trained to predict from the last n-1 words, at certain points fewer than n-1 words will be available and adjustments must be made. The researchers considered two possibilities: using traditional models for these shorter n-grams, or padding the context with a filler word up to n-1 words. After some testing, they found that requests for short n-gram probabilities were rare, and they decided to use the traditional back-off n-gram model for these cases.

= Results =

In general the results were quite good. When this hybrid of neural network and back-off n-grams was used in combination with a number of acoustic speech recognition models, perplexity (lower is better) decreased by about 10% in a number of cases compared with a traditional back-off n-grams model alone. Some of the results are summarized as follows:

[[File:results1.PNG]]

[[File:results2.PNG]]

= Conclusion =

= Source =
Schwenk, H. Continuous space language models. Computer Speech Lang. 21, 492–518 (2007).

scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers Machines

2015-12-11T20:21:45Z

Amirlk: /* Model */

= Introduction =

This paper<ref>
Farabet, Clement, et al. [http://arxiv.org/pdf/1202.2160v2.pdf "Scene parsing with multiscale feature learning, purity trees, and optimal covers."] arXiv preprint arXiv:1202.2160 (2012).
</ref> presents an approach to full scene labelling (FSL). This is the task of giving a label to each pixel in an image corresponding to which category of object it belongs to. FSL involves solving the problems of detection, segmentation, recognition, and contextual integration simultaneously. One of the main obstacles of FSL is that the information required for labelling a particular pixel could come from very distant pixels as well as their labels. This distance often depends on the particular label as well (e.g. the presence of a wheel might mean there is a vehicle nearby, while an object like the sky or water could span the entire image, and figuring out to which class a particular blue pixel belongs could be challenging).

= Overview =

The proposed method for FSL works by first computing a tree of segments from a graph of pixel dissimilarities. A set of dense feature vectors is then computed, encoding regions of multiple sizes centered on each pixel. Feature vectors are aggregated and fed to a classifier which estimates the distribution of object categories in a segment. A subset of tree nodes that cover the image are selected to maximize the average "purity" of the class distributions (i.e. maximizing the likelihood that each segment will contain a single object). The convolutional network feature extractor is trained end-to-end from raw pixels, so there is no need for engineered features.

There are five main ingredients to this new method for FSL:

# Trainable, dense, multi-scale feature extraction
# Segmentation tree
# Regionwise feature aggregation
# Class histogram estimation
# Optimal purity cover

The three main contributions of this paper are:

# Using a multi-scale convolutional net to learn good features for region classification
# Using a class purity criterion to decide if a segment contains a single object, as opposed to several objects, or part of an object
# An efficient procedure to obtain a cover that optimizes the overall class purity of a segmentation

= Previous Work =

Most previous methods of FSL rely on MRFs, CRFs, or other types of graphical models to ensure consistency in the labeling and to account for context. This is typically done using a pre-segmentation into super-pixels or other segment candidates. Features and categories are then extracted from individual segments and combinations of neighboring segments.

Using trees allows the use of fast inference algorithms based on graph cuts or other methods. In this paper, an innovative method based on finding a set of tree nodes that cover the images while minimizing some criterion is used.

= Model =

This model relies on two complementary image representations. In the first representation, the image is seen as a point in a high-dimensional space, and we seek to find a transform <math>f: \mathbb{R}^P \rightarrow \mathbb{R}^Q</math> that maps these images into a space in which each pixel can be assigned a label using a simple linear classifier. In the second representation, the image is seen as an edge-weighted graph, on which a hierarchy of segmentations/clusterings can be constructed. This representation yields a natural abstraction of the original pixel grid, and provides a hierarchy of observation levels for all the objects in the image. The full model is shown in the diagram below. It is an end-to-end trainable model for scene parsing.

[[File:SceneModelDiagram.png]]

== Pre-processing ==

Before being fed into the Convolutional Neural Network (CNN), multiple scaled versions of the image are generated. The set of these scaled images is called a ''pyramid''. Three differently scaled versions of the image were created, in a manner similar to that shown in the picture below:

[[File:Image_pyramid.png]]

The scaling can be done by different transforms; the paper suggests using the Laplacian transform. The Laplacian is the sum of partial second derivatives <math>\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}</math>. A two-dimensional discrete approximation is given by the matrix <math>\left[\begin{array}{ccc}0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0\end{array}\right]</math>.
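
To illustrate the discrete approximation, the sketch below applies the 3x3 matrix above to the interior pixels of a small grayscale image stored as a list of rows. The function name and the test images are hypothetical. It shows only the Laplacian operator itself, not the full pyramid construction used in the paper.

```python
# 3x3 discrete approximation of the Laplacian
KERNEL = [[0, 1, 0],
          [1, -4, 1],
          [0, 1, 0]]

def laplacian(img):
    """Apply the discrete Laplacian to every interior pixel of img."""
    h, w = len(img), len(img[0])
    out = [[0] * (w - 2) for _ in range(h - 2)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y - 1][x - 1] = sum(
                KERNEL[dy][dx] * img[y + dy - 1][x + dx - 1]
                for dy in range(3)
                for dx in range(3)
            )
    return out

# A constant image has zero Laplacian everywhere
flat = [[7] * 4 for _ in range(4)]
# f(x, y) = x^2 has second derivative 2 in x and 0 in y
quad = [[x * x for x in range(5)] for _ in range(5)]

flat_out = laplacian(flat)
quad_out = laplacian(quad)
```

The operator responds to local changes in intensity, so uniform regions map to zero while edges and curvature produce non-zero values.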

== Network Architecture ==

More holistic tasks, such as full-scene understanding (pixel-wise labeling, or any dense feature estimation) require the system to model complex interactions at the scale of complete images, not simply within a patch. In this problem the dimensionality becomes unmanageable: for a typical image of 256×256 pixels, a naive neural network would require millions of parameters, and a naive convolutional network would require filters that are unreasonably large to view enough context. The multiscale convolutional network overcomes these limitations by extending the concept of weight replication to the scale space. The more scales used to jointly train the models, the better the representation becomes for all scales. Using the same function to extract features at each scale is justified because the image content is scale invariant in principle. The authors noted that they observed worse performance when the weight sharing was removed.

== Post-Processing ==

In this model the sampling is done using an elastic max-pooling function, which remaps input patterns of arbitrary size into a fixed G×G grid (in this case a 5×5 grid was used). This grid can be seen as a highly invariant representation that encodes spatial relations between an object’s attributes/parts. This representation is denoted <math>O_k</math> and is shown in the diagram below. With this encoding, elongated or ill-shaped objects are nicely handled. The dominant features are also used to represent the object, and when combined with background subtraction, these features represent good basis functions to recognize the underlying object. These features are then associated with the corresponding areas of the tree segmentation of the image (generated by creating a minimum spanning tree from the dissimilarity graph of neighboring pixels) for optimal cover calculation.

[[File:SceneGridFeatures.png]]

One of the important features of this model is its method for optimal cover, which is detailed in the diagram below. The leaf nodes represent pixels in the image, and a subset of tree nodes is selected whose aggregate children span the entire image. The nodes are selected to minimize the average "impurity" of the class distribution (i.e. the entropy). The cover attempts to find an overall consistent segmentation, where each selected node corresponds to a particular class labelling for itself and all of its unselected children.

[[File:SceneOptimalCover.png]]
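
One simple reading of the optimal cover procedure is sketched below: for every leaf, walk up to the root and keep the node on that path whose class distribution has the lowest entropy; the selected nodes together form the cover. This is an illustrative sketch under that assumption, with a hypothetical tree, class distributions, and function names; the paper should be consulted for the exact procedure.

```python
import math

def entropy(dist):
    """Entropy of a class distribution (the 'impurity' of a segment)."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def optimal_cover(parent, dists, leaves):
    """For each leaf, select the lowest-entropy node on its path to the root."""
    cover = set()
    for leaf in leaves:
        best, node = leaf, leaf
        while node is not None:
            if entropy(dists[node]) < entropy(dists[best]):
                best = node
            node = parent[node]
        cover.add(best)
    return cover

# Hypothetical tree: root 0 has children 1 and 2; leaves are 3, 4, 5, 6
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
dists = {
    0: [0.5, 0.5],   # root mixes both classes
    1: [1.0, 0.0],   # pure segment
    2: [0.6, 0.4],   # mixed segment
    3: [0.9, 0.1],
    4: [0.9, 0.1],
    5: [0.0, 1.0],
    6: [1.0, 0.0],
}
cover = optimal_cover(parent, dists, leaves=[3, 4, 5, 6])
```

In this toy tree, leaves 3 and 4 are both covered by the pure node 1, while leaves 5 and 6 are kept as separate segments because their parent is mixed.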

== Training ==

Training is done in a two-step process. First, the low-level feature extractor <math>f_s</math> is trained to produce features that are maximally discriminative. Then, the classifier <math>c</math> is trained to predict the distribution of classes in a component. The feature vectors are obtained by concatenating the network outputs for different scales of the multiscale pyramid. To train the feature extractor, the loss function
<math>L_{\mathrm{cat}} = - \sum_{i \in \mathrm{pixels}, a \in \mathrm{classes}} c_{i,a} \ln(\hat{c}_{i,a})</math>
is used, where <math>c_i</math> is the true (classification) target vector and <math>\hat{c}_i</math> the prediction from a linear classifier (which is only used in this step and will be discarded later).

After training the parameters for the feature extraction, the parameters of the actual classifier are trained by minimizing the Kullback-Leibler divergence (KL-divergence) between the true distribution of labels in each component and the prediction from the classifier. The KL-divergence is a measure of the difference between two probability distributions.
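
For two discrete distributions, the KL-divergence can be computed as in the sketch below. The function name, the small constant guarding against division by zero, and the example distributions are assumptions made for illustration.

```python
import math

def kl_divergence(true_dist, predicted, eps=1e-12):
    """KL(true || predicted) for two discrete class distributions."""
    return sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(true_dist, predicted)
        if p > 0  # terms with p = 0 contribute nothing
    )

same = kl_divergence([0.5, 0.5], [0.5, 0.5])
diff = kl_divergence([1.0, 0.0], [0.5, 0.5])
```

The divergence is zero when the prediction matches the true label distribution and grows as the two distributions move apart.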

= Experiments =

For all experiments, a 2-stage convolutional network was used. The input is a 3-channel image, and it is transformed into a 16-dimensional feature map, using a bank of 16 7x7 filters followed by tanh units. This feature map is then pooled using a 2x2 max-pooling layer. The second layer transforms the 16-dimensional feature map into a 64-dimensional feature map, with each component being produced by a combination of 8 7x7 filters (for an effective total of 512 filters), followed by tanh units. This map is also pooled using a 2x2 max-pooling layer. This 64-dimensional feature map is transformed into a 256-dimensional feature map by using a combination of 16 7x7 filters (2048 filters).

The network is applied to a locally normalized Laplacian pyramid constructed on the input image. The pyramid contains three rescaled versions of the input: 320x240, 160x120, and 80x60. All of the inputs are properly padded and the outputs of each of the three networks are upsampled and concatenated to produce a 768-dimensional feature vector map (256x3). The network is trained on all three scales in parallel.

A simple grid search was used to find the best learning rate and regularization parameters (weight decay). A holdout of 10% of the training data was used as a validation set during the parameter search. For both datasets, jitter was used to artificially expand the size of the training data, so that the learned features would not overfit irrelevant biases present in the data. This jitter included horizontal flipping and rotations between -8 and 8 degrees.

The hierarchy used to find the optimal cover is constructed on the raw image gradient, based on a standard volume criterion<ref>
F. Meyer and L. Najman. [http://onlinelibrary.wiley.com/doi/10.1002/9781118600788.ch9/summary "Segmentation, minimum spanning tree and hierarchies."] In L. Najman and H. Talbot, editors, Mathematical Morphology: from theory to application, chapter 9, pages 229–261. ISTE-Wiley, London, 2010.
</ref><ref>
J. Cousty and L. Najman. [http://link.springer.com/chapter/10.1007/978-3-642-21569-8_24 "Incremental algorithm for hierarchical minimum spanning forests and saliency of watershed cuts."] In 10th International Symposium on Mathematical Morphology (ISMM’11), LNCS, 2011.
</ref>, completed by removing non-informative small components (fewer than 100 pixels). Traditionally, segmentation methods use a partition of segments (i.e. finding an optimal cut in the tree) rather than a cover. A number of graph cut methods were tried, but the results were systematically worse than the optimal cover method.

Two sampling methods for learning the multiscale features were tried on each dataset. One uses the natural frequencies of each class in the dataset, while the other balances them so that an equal number of each class is shown to the network. The results from each of these methods varied with the dataset used and are reported in the tables below. The authors only included the results for the frequency balancing method for the Stanford Background dataset as it consistently gave better results, but it could still be useful to have the results from the other method to help guide future work. Training with balanced frequencies allows better discrimination of small objects, and although it tends to have lower overall pixel-wise accuracy, it performs better from a recognition point of view. This observation can be seen in the tables below. The per-pixel accuracy for frequency balancing in the Barcelona dataset is quite poor, which the authors attribute to the fact that the dataset has a large number of classes with very few training examples, leading to overfitting when trying to model them in this manner.
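
The difference between the two sampling schemes can be sketched as follows. This is only an illustration: it assumes one class label per training sample and implements balancing by weighting each sample inversely to its class count, which is one common way to do it and not necessarily the authors' procedure. The function name and the toy labels are hypothetical.

```python
import random
from collections import Counter

def sample_indices(labels, n, balanced, seed=0):
    """Draw n training indices, using natural or balanced class frequencies."""
    rng = random.Random(seed)
    if balanced:
        counts = Counter(labels)
        # Each class receives the same total weight, so classes are drawn equally often.
        weights = [1.0 / counts[y] for y in labels]
    else:
        # Uniform over samples, so frequent classes dominate.
        weights = [1.0] * len(labels)
    return rng.choices(range(len(labels)), weights=weights, k=n)

labels = ["sky"] * 90 + ["car"] * 10
natural = [labels[i] for i in sample_indices(labels, 10000, balanced=False)]
balanced = [labels[i] for i in sample_indices(labels, 10000, balanced=True)]
car_share_natural = natural.count("car") / len(natural)
car_share_balanced = balanced.count("car") / len(balanced)
```

With natural frequencies the rare class makes up roughly a tenth of the draws, while with balancing it makes up roughly half, which is why small or rare objects are discriminated better at the cost of overall pixel-wise accuracy.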

= Results =

[[File:SceneResultTableStanford.png]]

[[File:SceneResultTableSIFT.png]]

[[File:SceneResultTableBarcelona.png]]

[[File:SceneResultPictures.png]]

=References=
<references />

scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers Machines

2015-12-11T20:20:54Z

Amirlk: /* Model */

= Introduction =

This paper<ref>
Farabet, Clement, et al. [http://arxiv.org/pdf/1202.2160v2.pdf "Scene parsing with multiscale feature learning, purity trees, and optimal covers."] arXiv preprint arXiv:1202.2160 (2012).
</ref> presents an approach to full scene labelling (FSL). This is the task of giving a label to each pixel in an image corresponding to which category of object it belongs to. FSL involves solving the problems of detection, segmentation, recognition, and contextual integration simultaneously. One of the main obstacles of FSL is that the information required for labelling a particular pixel could come from very distant pixels as well as their labels. This distance often depends on the particular label as well (e.g. the presence of a wheel might mean there is a vehicle nearby, while an object like the sky or water could span the entire image, and figuring out to which class a particular blue pixel belongs could be challenging).

= Overview =

The proposed method for FSL works by first computing a tree of segments from a graph of pixel dissimilarities. A set of dense feature vectors is then computed, encoding regions of multiple sizes centered on each pixel. Feature vectors are aggregated and fed to a classifier which estimates the distribution of object categories in a segment. A subset of tree nodes that cover the image are selected to maximize the average "purity" of the class distributions (i.e. maximizing the likelihood that each segment will contain a single object). The convolutional network feature extractor is trained end-to-end from raw pixels, so there is no need for engineered features.

There are five main ingredients to this new method for FSL:

# Trainable, dense, multi-scale feature extraction
# Segmentation tree
# Regionwise feature aggregation
# Class histogram estimation
# Optimal purity cover

The three main contributions of this paper are:

# Using a multi-scale convolutional net to learn good features for region classification
# Using a class purity criterion to decide if a segment contains a single object, as opposed to several objects, or part of an object
# An efficient procedure to obtain a cover that optimizes the overall class purity of a segmentation

= Previous Work =

Most previous methods of FSL rely on MRFs, CRFs, or other types of graphical models to ensure consistency in the labeling and to account for context. This is typically done using a pre-segmentation into super-pixels or other segment candidates. Features and categories are then extracted from individual segments and combinations of neighboring segments.

Using trees allows the use of fast inference algorithms based on graph cuts or other methods. In this paper, an innovative method based on finding a set of tree nodes that cover the images while minimizing some criterion is used.

= Model =

This model relies on two complementary image representations. In the first representation, the image is seen as a point in a high-dimensional space, and we seek to find a transform <math>f: \mathbb{R}^P \rightarrow \mathbb{R}^Q</math> that maps these images into a space in which each pixel can be assigned a label using a simple linear classifier. In the second representation, the image is seen as an edge-weighted graph, on which a hierarchy of segmentations/clusterings can be constructed. This representation yields a natural abstraction of the original pixel grid, and provides a hierarchy of observation levels for all the objects in the image.

The full model is shown in the diagram below. It is an end-to-end trainable model for scene parsing.

[[File:SceneModelDiagram.png]]

== Pre-processing ==

Before being fed into the convolutional neural network (CNN), multiple scaled versions of the image are generated. The set of these scaled images is called a ''pyramid''. Three differently scaled versions of the image were created, in a manner similar to that shown in the picture below.

[[File:Image_pyramid.png ]]

The scaling can be done by different transforms; the paper suggests using the Laplacian transform. The Laplacian is the sum of partial second derivatives <math>\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}</math>. A two-dimensional discrete approximation is given by the matrix <math>\left[\begin{array}{ccc}0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0\end{array}\right]</math>.
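
As a quick check of the discrete approximation, the 3x3 matrix can be applied to a sampled function whose Laplacian is known. The sketch below uses <math>f(x, y) = x^2 + y^2</math>, for which <math>\nabla^2 f = 4</math> everywhere. The helper name <code>apply_kernel</code> and the choice to skip border pixels are assumptions made for this example.

```python
LAPLACIAN_KERNEL = [[0, 1, 0],
                    [1, -4, 1],
                    [0, 1, 0]]

def apply_kernel(image, kernel):
    """Apply a 3x3 kernel at interior pixels only; the output is (H-2) x (W-2)."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            row.append(sum(
                kernel[dy + 1][dx + 1] * image[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ))
        out.append(row)
    return out

# Sample f(x, y) = x^2 + y^2 on a 5x5 grid; its continuous Laplacian is 4.
image = [[x * x + y * y for x in range(5)] for y in range(5)]
response = apply_kernel(image, LAPLACIAN_KERNEL)
```

Every interior response equals 4, matching the analytic value, because the second differences of a quadratic are exact.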

== Network Architecture ==

More holistic tasks, such as full-scene understanding (pixel-wise labeling, or any dense feature estimation) require the system to model complex interactions at the scale of complete images, not simply within a patch. In this problem the dimensionality becomes unmanageable: for a typical image of 256×256 pixels, a naive neural network would require millions of parameters, and a naive convolutional network would require filters that are unreasonably large to view enough context. The multiscale convolutional network overcomes these limitations by extending the concept of weight replication to the scale space. The more scales used to jointly train the models, the better the representation becomes for all scales. Using the same function to extract features at each scale is justified because the image content is scale invariant in principle. The authors noted that they observed worse performance when the weight sharing was removed.

== Post-Processing ==

In this model the sampling is done using an elastic max-pooling function, which remaps input patterns of arbitrary size into a fixed G×G grid (in this case a 5x5 grid was used). This grid can be seen as a highly invariant representation that encodes spatial relations between an object’s attributes/parts. This representation is denoted <math>O_k</math> and is shown in the diagram below. With this encoding, elongated or ill-shaped objects are nicely handled. The dominant features are also used to represent the object, and when combined with background subtraction, these features represent good basis functions to recognize the underlying object. These features are then associated with the corresponding areas of the tree segmentation of the image (generated by creating a minimum spanning tree from the dissimilarity graph of neighboring pixels) for optimal cover calculation.

[[File:SceneGridFeatures.png]]
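
A minimal sketch of the elastic max-pooling idea for a single feature channel is given below. It assumes the segment is described by a non-empty boolean mask, that the grid is stretched over the segment's bounding box, and that grid cells containing no segment pixel are filled with zero. These details, and the function name, are assumptions for illustration rather than the paper's exact procedure.

```python
def elastic_max_pool(features, mask, grid=5):
    """Max-pool the masked region of a 2D feature map into a grid x grid array.

    features: 2D list of numbers; mask: 2D list of bools marking the segment.
    Pixels outside the mask are ignored (a simple form of background subtraction).
    """
    ys = [y for y, row in enumerate(mask) for inside in row if inside]
    xs = [x for row in mask for x, inside in enumerate(row) if inside]
    y0, x0 = min(ys), min(xs)
    height = max(ys) - y0 + 1
    width = max(xs) - x0 + 1
    out = [[None] * grid for _ in range(grid)]
    for y, row in enumerate(mask):
        for x, inside in enumerate(row):
            if not inside:
                continue
            # Map the pixel's position in the bounding box to a grid cell.
            gy = (y - y0) * grid // height
            gx = (x - x0) * grid // width
            value = features[y][x]
            if out[gy][gx] is None or value > out[gy][gx]:
                out[gy][gx] = value
    # Empty cells get 0 (an assumption).
    return [[0 if v is None else v for v in row] for row in out]

# 4x4 feature map with values 0..15; the segment is everything except the bottom-right pixel.
features = [[4 * y + x for x in range(4)] for y in range(4)]
mask = [[not (y == 3 and x == 3) for x in range(4)] for y in range(4)]
pooled = elastic_max_pool(features, mask, grid=2)
```

Whatever the size or shape of the segment, the output always has the same G×G shape, which is what allows segments of arbitrary extent to be fed to a fixed-size classifier.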

One of the important features of this model is its method for optimal cover, which is detailed in the diagram below. The leaf nodes represent pixels in the image, and a subset of tree nodes is selected whose aggregate children span the entire image. The nodes are selected to minimize the average "impurity" of the class distribution (i.e. the entropy). The cover attempts to find an overall consistent segmentation, where each selected node corresponds to a particular class labelling for itself and all of its unselected children.

[[File:SceneOptimalCover.png]]

== Training ==

Training is done in a two-step process. First, the low-level feature extractor <math>f_s</math> is trained to produce features that are maximally discriminative. Then, the classifier <math>c</math> is trained to predict the distribution of classes in a component. The feature vectors are obtained by concatenating the network outputs for different scales of the multiscale pyramid. To train these features, the loss function
<math>L_{\mathrm{cat}} = - \sum_{i \in \mathrm{pixels}, a \in \mathrm{classes}} c_{i,a} \ln(\hat{c}_{i,a})</math>
is used, where <math>c_i</math> is the true (classification) target vector and <math>\hat{c}_i</math> the prediction from a linear classifier (which is only used in this step and will be discarded later).

After the feature extraction parameters have been trained, the parameters of the actual classifier are trained by minimizing the Kullback-Leibler divergence (KL-divergence) between the true distribution of labels in each component and the prediction from the classifier. The KL-divergence is a measure of the difference between two probability distributions.

= Experiments =

For all experiments, a 2-stage convolutional network was used. The input is a 3-channel image, and it is transformed into a 16-dimensional feature map, using a bank of 16 7x7 filters followed by tanh units. This feature map is then pooled using a 2x2 max-pooling layer. The second layer transforms the 16-dimensional feature map into a 64-dimensional feature map, with each component being produced by a combination of 8 7x7 filters (for an effective total of 512 filters), followed by tanh units. This map is also pooled using a 2x2 max-pooling layer. This 64-dimensional feature map is transformed into a 256-dimensional feature map by using a combination of 16 7x7 filters (2048 filters).

The network is applied to a locally normalized Laplacian pyramid constructed on the input image. The pyramid contains three rescaled versions of the input: 320x240, 160x120, and 80x60. All of the inputs are properly padded and the outputs of each of the three networks are upsampled and concatenated to produce a 768-dimensional feature vector map (256x3). The network is trained on all three scales in parallel.

A simple grid search was used to find the best learning rate and regularization parameters (weight decay). A holdout of 10% of the training data was used as a validation set during the parameter search. For both datasets, jitter was used to artificially expand the size of the training data, so that the learned features would not overfit irrelevant biases present in the data. This jitter included horizontal flipping and rotations between -8 and 8 degrees.

The hierarchy used to find the optimal cover is constructed on the raw image gradient, based on a standard volume criterion<ref>
F. Meyer and L. Najman. [http://onlinelibrary.wiley.com/doi/10.1002/9781118600788.ch9/summary "Segmentation, minimum spanning tree and hierarchies."] In L. Najman and H. Talbot, editors, Mathematical Morphology: from theory to application, chapter 9, pages 229–261. ISTE-Wiley, London, 2010.
</ref><ref>
J. Cousty and L. Najman. [http://link.springer.com/chapter/10.1007/978-3-642-21569-8_24 "Incremental algorithm for hierarchical minimum spanning forests and saliency of watershed cuts."] In 10th International Symposium on Mathematical Morphology (ISMM’11), LNCS, 2011.
</ref>, completed by removing non-informative small components (fewer than 100 pixels). Traditionally, segmentation methods use a partition of segments (i.e. finding an optimal cut in the tree) rather than a cover. A number of graph cut methods were tried, but the results were systematically worse than the optimal cover method.

Two sampling methods for learning the multiscale features were tried on each dataset. One uses the natural frequencies of each class in the dataset, while the other balances them so that an equal number of each class is shown to the network. The results from each of these methods varied with the dataset used and are reported in the tables below. The authors only included the results for the frequency balancing method for the Stanford Background dataset as it consistently gave better results, but it could still be useful to have the results from the other method to help guide future work. Training with balanced frequencies allows better discrimination of small objects, and although it tends to have lower overall pixel-wise accuracy, it performs better from a recognition point of view. This observation can be seen in the tables below. The per-pixel accuracy for frequency balancing in the Barcelona dataset is quite poor, which the authors attribute to the fact that the dataset has a large number of classes with very few training examples, leading to overfitting when trying to model them in this manner.

= Results =

[[File:SceneResultTableStanford.png]]

[[File:SceneResultTableSIFT.png]]

[[File:SceneResultTableBarcelona.png]]

[[File:SceneResultPictures.png]]

=References=
<references />

scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers Machines

2015-12-11T20:19:36Z

Amirlk: /* Model */

= Introduction =

This paper<ref>
Farabet, Clement, et al. [http://arxiv.org/pdf/1202.2160v2.pdf "Scene parsing with multiscale feature learning, purity trees, and optimal covers."] arXiv preprint arXiv:1202.2160 (2012).
</ref> presents an approach to full scene labelling (FSL). This is the task of giving a label to each pixel in an image corresponding to which category of object it belongs to. FSL involves solving the problems of detection, segmentation, recognition, and contextual integration simultaneously. One of the main obstacles of FSL is that the information required for labelling a particular pixel could come from very distant pixels as well as their labels. This distance often depends on the particular label as well (e.g. the presence of a wheel might mean there is a vehicle nearby, while an object like the sky or water could span the entire image, and figuring out to which class a particular blue pixel belongs could be challenging).

= Overview =

The proposed method for FSL works by first computing a tree of segments from a graph of pixel dissimilarities. A set of dense feature vectors is then computed, encoding regions of multiple sizes centered on each pixel. Feature vectors are aggregated and fed to a classifier which estimates the distribution of object categories in a segment. A subset of tree nodes that cover the image are selected to maximize the average "purity" of the class distributions (i.e. maximizing the likelihood that each segment will contain a single object). The convolutional network feature extractor is trained end-to-end from raw pixels, so there is no need for engineered features.

There are five main ingredients to this new method for FSL:

# Trainable, dense, multi-scale feature extraction
# Segmentation tree
# Regionwise feature aggregation
# Class histogram estimation
# Optimal purity cover

The three main contributions of this paper are:

# Using a multi-scale convolutional net to learn good features for region classification
# Using a class purity criterion to decide if a segment contains a single object, as opposed to several objects, or part of an object
# An efficient procedure to obtain a cover that optimizes the overall class purity of a segmentation

= Previous Work =

Most previous methods of FSL rely on MRFs, CRFs, or other types of graphical models to ensure consistency in the labeling and to account for context. This is typically done using a pre-segmentation into super-pixels or other segment candidates. Features and categories are then extracted from individual segments and combinations of neighboring segments.

Using trees allows the use of fast inference algorithms based on graph cuts or other methods. In this paper, an innovative method based on finding a set of tree nodes that cover the images while minimizing some criterion is used.

= Model =

This model relies on two complementary image representations. In the first representation, the image is seen as a point in a high-dimensional space, and we seek to find a transform <math>f: \mathbb{R}^P \rightarrow \mathbb{R}^Q</math> that maps these images into a space in which each pixel can be assigned a label using a simple linear classifier.

The full model is shown in the diagram below. It is an end-to-end trainable model for scene parsing.

[[File:SceneModelDiagram.png]]

== Pre-processing ==

Before being fed into the convolutional neural network (CNN), multiple scaled versions of the image are generated. The set of these scaled images is called a ''pyramid''. Three differently scaled versions of the image were created, in a manner similar to that shown in the picture below.

[[File:Image_pyramid.png ]]

The scaling can be done by different transforms; the paper suggests using the Laplacian transform. The Laplacian is the sum of partial second derivatives <math>\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}</math>. A two-dimensional discrete approximation is given by the matrix <math>\left[\begin{array}{ccc}0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0\end{array}\right]</math>.

== Network Architecture ==

More holistic tasks, such as full-scene understanding (pixel-wise labeling, or any dense feature estimation) require the system to model complex interactions at the scale of complete images, not simply within a patch. In this problem the dimensionality becomes unmanageable: for a typical image of 256×256 pixels, a naive neural network would require millions of parameters, and a naive convolutional network would require filters that are unreasonably large to view enough context. The multiscale convolutional network overcomes these limitations by extending the concept of weight replication to the scale space. The more scales used to jointly train the models, the better the representation becomes for all scales. Using the same function to extract features at each scale is justified because the image content is scale invariant in principle. The authors noted that they observed worse performance when the weight sharing was removed.

== Post-Processing ==

In this model the sampling is done using an elastic max-pooling function, which remaps input patterns of arbitrary size into a fixed G×G grid (in this case a 5x5 grid was used). This grid can be seen as a highly invariant representation that encodes spatial relations between an object’s attributes/parts. This representation is denoted <math>O_k</math> and is shown in the diagram below. With this encoding, elongated or ill-shaped objects are nicely handled. The dominant features are also used to represent the object, and when combined with background subtraction, these features represent good basis functions to recognize the underlying object. These features are then associated with the corresponding areas of the tree segmentation of the image (generated by creating a minimum spanning tree from the dissimilarity graph of neighboring pixels) for optimal cover calculation.

[[File:SceneGridFeatures.png]]

One of the important features of this model is its method for optimal cover, which is detailed in the diagram below. The leaf nodes represent pixels in the image, and a subset of tree nodes is selected whose aggregate children span the entire image. The nodes are selected to minimize the average "impurity" of the class distribution (i.e. the entropy). The cover attempts to find an overall consistent segmentation, where each selected node corresponds to a particular class labelling for itself and all of its unselected children.

[[File:SceneOptimalCover.png]]

== Training ==

Training is done in a two-step process. First, the low-level feature extractor <math>f_s</math> is trained to produce features that are maximally discriminative. Then, the classifier <math>c</math> is trained to predict the distribution of classes in a component. The feature vectors are obtained by concatenating the network outputs for different scales of the multiscale pyramid. To train these features, the loss function
<math>L_{\mathrm{cat}} = - \sum_{i \in \mathrm{pixels}, a \in \mathrm{classes}} c_{i,a} \ln(\hat{c}_{i,a})</math>
is used, where <math>c_i</math> is the true (classification) target vector and <math>\hat{c}_i</math> the prediction from a linear classifier (which is only used in this step and will be discarded later).

After the feature extraction parameters have been trained, the parameters of the actual classifier are trained by minimizing the Kullback-Leibler divergence (KL-divergence) between the true distribution of labels in each component and the prediction from the classifier. The KL-divergence is a measure of the difference between two probability distributions.

= Experiments =

For all experiments, a 2-stage convolutional network was used. The input is a 3-channel image, and it is transformed into a 16-dimensional feature map, using a bank of 16 7x7 filters followed by tanh units. This feature map is then pooled using a 2x2 max-pooling layer. The second layer transforms the 16-dimensional feature map into a 64-dimensional feature map, with each component being produced by a combination of 8 7x7 filters (for an effective total of 512 filters), followed by tanh units. This map is also pooled using a 2x2 max-pooling layer. This 64-dimensional feature map is transformed into a 256-dimensional feature map by using a combination of 16 7x7 filters (2048 filters).

The network is applied to a locally normalized Laplacian pyramid constructed on the input image. The pyramid contains three rescaled versions of the input: 320x240, 160x120, and 80x60. All of the inputs are properly padded and the outputs of each of the three networks are upsampled and concatenated to produce a 768-dimensional feature vector map (256x3). The network is trained on all three scales in parallel.

A simple grid search was used to find the best learning rate and regularization parameters (weight decay). A holdout of 10% of the training data was used as a validation set during the parameter search. For both datasets, jitter was used to artificially expand the size of the training data, so that the learned features would not overfit irrelevant biases present in the data. This jitter included horizontal flipping and rotations between -8 and 8 degrees.

The hierarchy used to find the optimal cover is constructed on the raw image gradient, based on a standard volume criterion<ref>
F. Meyer and L. Najman. [http://onlinelibrary.wiley.com/doi/10.1002/9781118600788.ch9/summary "Segmentation, minimum spanning tree and hierarchies."] In L. Najman and H. Talbot, editors, Mathematical Morphology: from theory to application, chapter 9, pages 229–261. ISTE-Wiley, London, 2010.
</ref><ref>
J. Cousty and L. Najman. [http://link.springer.com/chapter/10.1007/978-3-642-21569-8_24 "Incremental algorithm for hierarchical minimum spanning forests and saliency of watershed cuts."] In 10th International Symposium on Mathematical Morphology (ISMM’11), LNCS, 2011.
</ref>, completed by removing non-informative small components (fewer than 100 pixels). Traditionally, segmentation methods use a partition of segments (i.e. finding an optimal cut in the tree) rather than a cover. A number of graph cut methods were tried, but the results were systematically worse than the optimal cover method.

Two sampling methods for learning the multiscale features were tried on each dataset. One uses the natural frequencies of each class in the dataset, while the other balances them so that an equal number of each class is shown to the network. The results from each of these methods varied with the dataset used and are reported in the tables below. The authors only included the results for the frequency balancing method for the Stanford Background dataset as it consistently gave better results, but it could still be useful to have the results from the other method to help guide future work. Training with balanced frequencies allows better discrimination of small objects, and although it tends to have lower overall pixel-wise accuracy, it performs better from a recognition point of view. This observation can be seen in the tables below. The per-pixel accuracy for frequency balancing in the Barcelona dataset is quite poor, which the authors attribute to the fact that the dataset has a large number of classes with very few training examples, leading to overfitting when trying to model them in this manner.

= Results =

[[File:SceneResultTableStanford.png]]

[[File:SceneResultTableSIFT.png]]

[[File:SceneResultTableBarcelona.png]]

[[File:SceneResultPictures.png]]

=References=
<references />

scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers Machines

2015-12-11T20:19:22Z

Amirlk: /* Model */

= Introduction =

This paper<ref>
Farabet, Clement, et al. [http://arxiv.org/pdf/1202.2160v2.pdf "Scene parsing with multiscale feature learning, purity trees, and optimal covers."] arXiv preprint arXiv:1202.2160 (2012).
</ref> presents an approach to full scene labelling (FSL). This is the task of giving a label to each pixel in an image corresponding to which category of object it belongs to. FSL involves solving the problems of detection, segmentation, recognition, and contextual integration simultaneously. One of the main obstacles of FSL is that the information required for labelling a particular pixel could come from very distant pixels as well as their labels. This distance often depends on the particular label as well (e.g. the presence of a wheel might mean there is a vehicle nearby, while an object like the sky or water could span the entire image, and figuring out to which class a particular blue pixel belongs could be challenging).

= Overview =

The proposed method for FSL works by first computing a tree of segments from a graph of pixel dissimilarities. A set of dense feature vectors is then computed, encoding regions of multiple sizes centered on each pixel. Feature vectors are aggregated and fed to a classifier which estimates the distribution of object categories in a segment. A subset of tree nodes that cover the image are selected to maximize the average "purity" of the class distributions (i.e. maximizing the likelihood that each segment will contain a single object). The convolutional network feature extractor is trained end-to-end from raw pixels, so there is no need for engineered features.

There are five main ingredients to this new method for FSL:

# Trainable, dense, multi-scale feature extraction
# Segmentation tree
# Regionwise feature aggregation
# Class histogram estimation
# Optimal purity cover

The three main contributions of this paper are:

# Using a multi-scale convolutional net to learn good features for region classification
# Using a class purity criterion to decide if a segment contains a single object, as opposed to several objects, or part of an object
# An efficient procedure to obtain a cover that optimizes the overall class purity of a segmentation

= Previous Work =

Most previous methods of FSL rely on MRFs, CRFs, or other types of graphical models to ensure consistency in the labeling and to account for context. This is typically done using a pre-segmentation into super-pixels or other segment candidates. Features and categories are then extracted from individual segments and combinations of neighboring segments.

Using trees allows the use of fast inference algorithms based on graph cuts or other methods. In this paper, an innovative method based on finding a set of tree nodes that cover the images while minimizing some criterion is used.

= Model =

This model relies on two complementary image representations. In the first representation, the image is seen as a point in a high-dimensional space, and we seek to find a transform that maps these images into a space in which each pixel can be assigned a label using a simple linear classifier.

The full model is shown in the diagram below. It is an end-to-end trainable model for scene parsing.

[[File:SceneModelDiagram.png]]

== Pre-processing ==

Before being fed into the convolutional neural network (CNN), multiple scaled versions of the image are generated. The set of these scaled images is called a ''pyramid''. Three differently scaled versions of the image were created, in a manner similar to that shown in the picture below.

[[File:Image_pyramid.png ]]

The scaling can be done by different transforms; the paper suggests using the Laplacian transform. The Laplacian is the sum of partial second derivatives <math>\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}</math>. A two-dimensional discrete approximation is given by the matrix <math>\left[\begin{array}{ccc}0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0\end{array}\right]</math>.

== Network Architecture ==

More holistic tasks, such as full-scene understanding (pixel-wise labeling, or any dense feature estimation) require the system to model complex interactions at the scale of complete images, not simply within a patch. In this problem the dimensionality becomes unmanageable: for a typical image of 256×256 pixels, a naive neural network would require millions of parameters, and a naive convolutional network would require filters that are unreasonably large to view enough context. The multiscale convolutional network overcomes these limitations by extending the concept of weight replication to the scale space. The more scales used to jointly train the models, the better the representation becomes for all scales. Using the same function to extract features at each scale is justified because the image content is scale invariant in principle. The authors noted that they observed worse performance when the weight sharing was removed.

== Post-Processing ==

In this model the sampling is done using an elastic max-pooling function, which remaps input patterns of arbitrary size into a fixed G×G grid (in this case a 5×5 grid was used). This grid can be seen as a highly invariant representation that encodes spatial relations between an object's attributes or parts. This representation is denoted <math>O_k</math> and is shown in the diagram below. With this encoding, elongated or ill-shaped objects are handled well. The dominant features are also used to represent the object, and when combined with background subtraction, these features form a good basis for recognizing the underlying object. The features are then associated with the corresponding areas of the tree segmentation of the image (generated by creating a minimum spanning tree from the dissimilarity graph of neighboring pixels) for the optimal cover calculation.

[[File:SceneGridFeatures.png]]
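The remapping to a fixed grid can be sketched as follows. This is a minimal illustration assuming a single-channel map stored as nested lists and evenly spaced cell boundaries; the name <code>elastic_max_pool</code> is hypothetical, and the paper's exact cell layout may differ.

```python
def elastic_max_pool(feature_map, grid=5):
    """Max-pool an H x W map (list of lists) into a grid x grid map.

    Cell boundaries are spread evenly over the input, so any input with
    H >= grid and W >= grid gives the same output size.
    """
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for gy in range(grid):
        y0, y1 = gy * h // grid, (gy + 1) * h // grid
        row = []
        for gx in range(grid):
            x0, x1 = gx * w // grid, (gx + 1) * w // grid
            row.append(max(feature_map[y][x]
                           for y in range(y0, y1) for x in range(x0, x1)))
        pooled.append(row)
    return pooled

# A 7 x 12 input and a 20 x 9 input both end up as 5 x 5 grids.
wide = [[y * 100 + x for x in range(12)] for y in range(7)]
tall = [[y * 100 + x for x in range(9)] for y in range(20)]
print(len(elastic_max_pool(wide)), len(elastic_max_pool(tall)[0]))  # 5 5
```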

One of the important features of this model is its method for finding the optimal cover, which is detailed in the diagram below. The leaf nodes represent pixels in the image, and a subset of tree nodes is selected whose aggregate children span the entire image. The nodes are selected to minimize the average "impurity" of the class distribution (i.e. the entropy). The cover attempts to find an overall consistent segmentation, where each selected node corresponds to a particular class labelling for itself and all of its unselected children.

[[File:SceneOptimalCover.png]]
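One simple way to realize such a cover, assumed here for illustration only, is to let every leaf pick the node with the lowest entropy among itself and its ancestors. The tree encoding (a <code>parent</code> dictionary), the toy distributions and the function names below are all hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy of a class distribution (list of probabilities)."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def optimal_cover(parent, dists, leaves):
    """For every leaf pick the ancestor (or itself) with the lowest entropy.

    parent: dict node -> parent node (the root maps to None)
    dists:  dict node -> predicted class distribution for that node
    Returns a dict leaf -> chosen node.
    """
    choice = {}
    for leaf in leaves:
        best, node = leaf, leaf
        while node is not None:
            if entropy(dists[node]) < entropy(dists[best]):
                best = node
            node = parent[node]
        choice[leaf] = best
    return choice

# Toy tree: root R has children A and B; A covers p1, p2 and B covers p3, p4.
parent = {"R": None, "A": "R", "B": "R",
          "p1": "A", "p2": "A", "p3": "B", "p4": "B"}
dists = {"R": [0.5, 0.5], "A": [0.95, 0.05], "B": [0.1, 0.9],
         "p1": [0.6, 0.4], "p2": [0.7, 0.3], "p3": [0.5, 0.5], "p4": [0.02, 0.98]}
cover = optimal_cover(parent, dists, ["p1", "p2", "p3", "p4"])
print(cover)
```

In this toy case the pure node A is chosen for both of its pixels, while p4 keeps its own, purer prediction instead of inheriting B.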

== Training ==

Training is done in a two-step process. First, the low-level feature extractor <math>f_s</math> is trained to produce features that are maximally discriminative. Then, the classifier <math>c</math> is trained to predict the distribution of classes in a component. The feature vectors are obtained by concatenating the network outputs for the different scales of the multiscale pyramid. To train the feature extractor, the loss function
<math>L_{\mathrm{cat}} = - \sum_{i \in \mathrm{pixels}, a \in \mathrm{classes}} c_{i,a} \ln(\hat{c}_{i,a})</math>
is used, where <math>c_i</math> is the true (classification) target vector and <math>\hat{c}_i</math> is the prediction from a linear classifier (which is only used in this step and is discarded later).

After the parameters of the feature extractor have been trained, the parameters of the actual classifier are trained by minimizing the Kullback-Leibler divergence (KL-divergence) between the true distribution of labels in each component and the prediction from the classifier. The KL-divergence is a measure of the difference between two probability distributions.
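The two training criteria can be written out for a single pixel or component as in the sketch below. The function names are illustrative, and the direction of the KL-divergence (true distribution first) is an assumption based on the wording above.

```python
import math

def cross_entropy(target, predicted):
    """L_cat for one pixel: -sum_a c_a * ln(c_hat_a)."""
    return -sum(c * math.log(p) for c, p in zip(target, predicted) if c > 0)

def kl_divergence(true_dist, predicted):
    """KL(true || predicted) = sum_a d_a * ln(d_a / d_hat_a)."""
    return sum(d * math.log(d / p) for d, p in zip(true_dist, predicted) if d > 0)

# A one-hot pixel target against a soft prediction over three classes.
print(cross_entropy([0, 1, 0], [0.2, 0.5, 0.3]))
# The class histogram of a component against the classifier's prediction.
print(kl_divergence([0.5, 0.5], [0.25, 0.75]))
```

The KL-divergence is zero only when the two distributions agree, which is why it suits targets that are class histograms rather than single labels.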

= Experiments =

For all experiments, a 2-stage convolutional network was used. The input is a 3-channel image, and it is transformed into a 16-dimensional feature map, using a bank of 16 7x7 filters followed by tanh units. This feature map is then pooled using a 2x2 max-pooling layer. The second layer transforms the 16-dimensional feature map into a 64-dimensional feature map, with each component being produced by a combination of 8 7x7 filters (for an effective total of 512 filters), followed by tanh units. This map is also pooled using a 2x2 max-pooling layer. This 64-dimensional feature map is transformed into a 256-dimensional feature map by using a combination of 16 7x7 filters (2048 filters).

The network is applied to a locally normalized Laplacian pyramid constructed on the input image. The pyramid contains three rescaled versions of the input: 320x240, 160x120, and 80x60. All of the inputs are properly padded and the outputs of each of the three networks are upsampled and concatenated to produce a 768-dimensional feature vector map (256x3). The network is trained on all three scales in parallel.
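The upsample-and-concatenate step can be sketched with toy feature maps. Nearest-neighbour upsampling, the tiny spatial sizes and the helper names are assumptions for this example; only the 3 × 256 = 768 channel count comes from the text.

```python
def upsample(feature_map, factor):
    """Nearest-neighbour upsampling of an H x W x C map (nested lists)."""
    return [[list(feature_map[y // factor][x // factor])
             for x in range(len(feature_map[0]) * factor)]
            for y in range(len(feature_map) * factor)]

def concat_scales(maps, factors):
    """Upsample each scale to a common size and join the channel vectors."""
    up = [upsample(m, f) for m, f in zip(maps, factors)]
    h, w = len(up[0]), len(up[0][0])
    return [[sum((u[y][x] for u in up), []) for x in range(w)] for y in range(h)]

def toy_map(size, channels=256, value=0.0):
    """A size x size map whose pixels all hold the same channel vector."""
    return [[[value] * channels for _ in range(size)] for _ in range(size)]

# Three scales, each half the resolution of the previous one.
maps = [toy_map(4, value=1.0), toy_map(2, value=2.0), toy_map(1, value=3.0)]
features = concat_scales(maps, factors=[1, 2, 4])
print(len(features), len(features[0]), len(features[0][0]))  # 4 4 768
```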

A simple grid search was used to find the best learning rate and regularization parameters (weight decay). A holdout of 10% of the training data was used as a validation set during the parameter search. For both datasets, jitter was used to artificially expand the size of the training data, to keep the features from overfitting to irrelevant biases present in the data. This jitter included horizontal flipping and rotations between -8 and 8 degrees.

The hierarchy used to find the optimal cover is constructed on the raw image gradient, based on a standard volume criterion<ref>
F. Meyer and L. Najman. [http://onlinelibrary.wiley.com/doi/10.1002/9781118600788.ch9/summary "Segmentation, minimum spanning tree and hierarchies."] In L. Najman and H. Talbot, editors, Mathematical Morphology: from theory to application, chapter 9, pages 229–261. ISTE-Wiley, London, 2010.
</ref><ref>
J. Cousty and L. Najman. [http://link.springer.com/chapter/10.1007/978-3-642-21569-8_24 "Incremental algorithm for hierarchical minimum spanning forests and saliency of watershed cuts."] In 10th International Symposium on Mathematical Morphology (ISMM’11), LNCS, 2011.
</ref>, completed by removing non-informative small components (less than 100 pixels). Traditionally segmentation methods use a partition of segments (i.e. finding an optimal cut in the tree) rather than a cover. A number of graph cut methods were tried, but the results were systematically worse than the optimal cover method.

Two sampling methods for learning the multiscale features were tried on each dataset. One uses the natural frequencies of each class in the dataset, while the other balances them so that an equal number of examples of each class is shown to the network. The results from each of these methods varied with the dataset used and are reported in the tables below. The authors only included the results for the frequency balancing method for the Stanford Background dataset as it consistently gave better results, but it could still be useful to have the results from the other method to help guide future work. Training with balanced frequencies allows better discrimination of small objects, and although it tends to have lower overall pixel-wise accuracy, it performs better from a recognition point of view. This observation can be seen in the tables below. The per-pixel accuracy for frequency balancing on the Barcelona dataset is quite poor, which the authors attribute to the fact that the dataset has a large number of classes with very few training examples, leading to overfitting when trying to model them in this manner.
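The difference between the two sampling schemes can be illustrated with a toy label list. The function names and the 90/10 class split are made up for this sketch and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict

def sample_natural(labels, n, rng):
    """Draw n pixel indices uniformly, so classes keep their natural frequencies."""
    return [rng.randrange(len(labels)) for _ in range(n)]

def sample_balanced(labels, n, rng):
    """Draw n pixel indices so that each class is equally likely."""
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    classes = sorted(by_class)
    return [rng.choice(by_class[rng.choice(classes)]) for _ in range(n)]

labels = ["sky"] * 90 + ["car"] * 10  # hypothetical, heavily skewed dataset
rng = random.Random(0)
natural = Counter(labels[i] for i in sample_natural(labels, 10000, rng))
balanced = Counter(labels[i] for i in sample_balanced(labels, 10000, rng))
print(natural, balanced)
```

With balancing, the rare class is drawn about half of the time, so its few examples are repeated often. That is consistent with the overfitting the authors report on datasets with many rare classes.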

= Results =

[[File:SceneResultTableStanford.png]]

[[File:SceneResultTableSIFT.png]]

[[File:SceneResultTableBarcelona.png]]

[[File:SceneResultPictures.png]]

=References=
<references />

deep Learning of the tissue-regulated splicing code

2015-12-11T17:56:35Z

Amirlk: /* Training the model */

= Introduction =

Alternative splicing (AS) is a regulated process during gene expression that enables the same gene to give rise to splicing isoforms containing different combinations of exons, which leads to different protein products. Furthermore, AS is often tissue dependent. This paper mainly focuses on using a Deep Neural Network (DNN) to predict splicing outcomes, and compares its performance to that of a previously trained Bayesian Neural Network<ref>https://www.cs.cmu.edu/afs/cs/academic/class/15782-f06/slides/bayesian.pdf</ref> (BNN) and of Multinomial Logistic Regression<ref>https://en.wikipedia.org/wiki/Multinomial_logistic_regression</ref> (MLR).

A major difference in the DNN is that each tissue type is treated as an input, whereas in the previous BNN each tissue type was considered a different output of the neural network. Moreover, in previous work the splicing code inferred the direction of change of the percentage of transcripts with an exon spliced in (PSI). This paper instead performs absolute PSI prediction for each tissue individually, without averaging across tissues, and also predicts the difference in PSI (<math>\Delta</math>PSI) between pairs of tissues. Unlike a regular deep neural network, this model is trained on these two prediction tasks simultaneously.

= Model =

The dataset consists of 11019 mouse alternative exons profiled from RNA-Seq<ref>https://en.wikipedia.org/wiki/RNA-Seq</ref> data. Five tissue types are available: brain, heart, kidney, liver and testis.

The DNN is fully connected, with multiple layers of non-linear hidden units. The mathematical expression of the model is given below:

::::::: <math>{a_v}^l = f(\sum_{m}^{M^{l-1}}{\theta_{v,m}^{l}a_m^{l-1}})</math>
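:::::::where <math>a_v^l</math> is the output of hidden unit <math>v</math> in layer <math>l</math>, obtained by applying the activation function <math>f</math> to the weighted sum of the outputs from the previous layer, and <math>\theta_{v,m}^{l}</math> are the weights between layers <math>l-1</math> and <math>l</math>.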

::::::: <math>f_{RELU}(z)=\max(0,z)</math>
::::::: The RELU unit was used for all hidden units except for the first hidden layer, which uses TANH units.

::::::: <math>h_k=\frac{\exp(\sum_m{\theta_{k,m}^{last}a_m^{last}})}{\sum_{k'}{\exp(\sum_{m}{\theta_{k',m}^{last}a_m^{last}})}}</math>
::::::: This is the softmax function of the last layer.

The cost function minimized during training is <math>E=-\sum_n\sum_{k=1}^{C}{y_{n,k}\log(h_{n,k})}</math>, where <math>n</math> denotes the training example, <math>k</math> indexes the <math>C</math> classes, <math>y_{n,k}</math> is the target and <math>h_{n,k}</math> is the softmax output for class <math>k</math>.
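The forward pass and cost can be sketched in plain Python as follows. The layer sizes and weight values are arbitrary, bias terms are omitted because the formulas above do not show any, and all function names are illustrative.

```python
import math

def dense(weights, inputs):
    """Weighted sums sum_m theta[v][m] * a[m] for every unit v."""
    return [sum(w * a for w, a in zip(row, inputs)) for row in weights]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(x, layers, out_weights):
    """layers: list of weight matrices; the first uses tanh, the rest ReLU."""
    a = x
    for i, theta in enumerate(layers):
        z = dense(theta, a)
        a = [math.tanh(v) for v in z] if i == 0 else [max(0.0, v) for v in z]
    return softmax(dense(out_weights, a))

def cross_entropy(y, h):
    """E for one example: -sum_k y_k * log(h_k)."""
    return -sum(yk * math.log(hk) for yk, hk in zip(y, h) if yk > 0)

x = [1.0, -1.0]
layers = [
    [[0.5, -0.5], [0.1, 0.2]],   # first hidden layer (tanh)
    [[1.0, -1.0], [-1.0, 1.0]],  # second hidden layer (ReLU)
]
out_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three output classes
h = forward(x, layers, out_weights)
loss = cross_entropy([0, 1, 0], h)
print(h, loss)
```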

The identities of the two tissues are then appended to the vector of outputs of the first hidden layer, together forming the input to the second hidden layer. Each identity is a 1-of-5 binary variable in this case (demonstrated in Fig. 1). The first training target contains three classes, labeled ''low'', ''medium'' and ''high'' (LMH code). The second task describes the <math>\Delta PSI</math> between two tissues for a particular exon. The three classes for this task are ''decreased inclusion'', ''no change'' and ''increased inclusion'' (DNI code). Both the LMH and DNI codes are trained jointly, reusing the same hidden representations learned by the model. The DNN was trained using backpropagation with dropout, with a different learning rate for each of the two tasks.

[[File: Modell.png]]

= Training the model =

The first hidden layer was trained as an autoencoder to reduce the dimensionality of the features in an unsupervised manner. This method of pretraining has been used in deep architectures to initialize learning near a good local minimum. In the second stage of training, the weights from the input layer to the first hidden layer are fixed, and 10 additional inputs corresponding to the two tissues are appended. Each tissue is represented as a binary vector; for example, [0 1 0 0 0] denotes the second tissue out of five possible types. The weights of the remaining hidden layers of the DNN are then trained together in a supervised manner with backpropagation.
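The way the two tissue codes extend the first hidden layer's output can be sketched as below. The tissue ordering and the helper names are assumptions for this example.

```python
# Order is an assumption; the text only says there are five tissue types.
TISSUES = ["brain", "heart", "kidney", "liver", "testis"]

def one_hot(tissue):
    """1-of-5 binary code for a tissue."""
    return [1 if t == tissue else 0 for t in TISSUES]

def second_layer_input(hidden1, tissue_a, tissue_b):
    """Append two 1-of-5 codes (10 values) to the first hidden layer's outputs."""
    return list(hidden1) + one_hot(tissue_a) + one_hot(tissue_b)

print(one_hot("heart"))  # [0, 1, 0, 0, 0]
print(second_layer_input([0.1, -0.4, 0.7], "brain", "liver"))
```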

The DNN weights were initialized with small random values sampled from a zero-mean Gaussian distribution. Learning was performed on mini-batches using stochastic gradient descent with momentum and dropout. A small L1 weight penalty was included in the cost function. The model's weights were updated after each mini-batch. The learning rate <math>\epsilon</math> was decreased with epochs, and a momentum term <math>\mu</math> was included that starts out at 0.5, increases to 0.99, and then stays fixed. The model parameters <math>\theta</math> were updated as follows:

::: <math> \, \theta_e = \theta_{e-1} + \Delta \theta_e </math>

::: <math> \Delta\theta_e = \mu_e\Delta\theta_{e-1} - (1-\mu_e)\epsilon_e\nabla E(\theta_e) </math>
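The update rule can be sketched as follows. The gradient is evaluated at the current parameters before the step, and the linear ramp for the momentum (along with its length) is an assumption, since the text only gives the start and end values.

```python
def momentum_step(theta, delta_prev, grad, lr, mu):
    """One update: delta = mu*delta_prev - (1-mu)*lr*grad; theta += delta."""
    delta = [mu * d - (1.0 - mu) * lr * g for d, g in zip(delta_prev, grad)]
    theta = [t + d for t, d in zip(theta, delta)]
    return theta, delta

def momentum_schedule(epoch, ramp_epochs=10, start=0.5, end=0.99):
    """Linear ramp from start to end, then constant (ramp shape is assumed)."""
    frac = min(1.0, epoch / ramp_epochs)
    return start + frac * (end - start)

# Toy problem: E(theta) = theta^2, so the gradient is 2 * theta.
theta, delta = [1.0], [0.0]
for epoch in range(5):
    grad = [2.0 * t for t in theta]
    theta, delta = momentum_step(theta, delta, grad, lr=0.1,
                                 mu=momentum_schedule(epoch))
print(theta)
```

The factor <code>(1 - mu)</code> scales the gradient down as the momentum grows, so late in training each new gradient only nudges the accumulated direction.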

In addition, the data were filtered before training by excluding examples whose total number of RNA-Seq junction reads was below 10. This removed 45.8% of the training examples.

Both the LMH and DNI codes are trained together. Because the two tasks may learn at different rates, a different learning rate is used for each of them. This prevents one task from overfitting too soon and negatively affecting the performance of the other task before the complete model is fully trained.

The targets consist of (i) PSI for each of the two tissues and (ii) <math> \Delta PSI </math> between the two tissues. As a result, when both inputs specify the same tissue, the model should predict ''no change'' for <math> \Delta PSI </math>. Also, if the tissues are swapped in the input, a previous ''increased inclusion'' label should become ''decreased inclusion''. The training examples are constructed with some redundancy (i.e., in some of the training examples the two tissues are identical) so that the model learns this without it having to be explicitly specified.
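The two consistency properties described above can be made concrete with a small sketch. The dictionary layout, field names and helper functions are hypothetical and only illustrate how the targets relate under swapping and for identical tissues.

```python
SWAPPED_DNI = {"decreased": "increased", "no change": "no change",
               "increased": "decreased"}

def swap_tissues(example):
    """Return the same exon with the two tissues exchanged."""
    return {
        "exon": example["exon"],
        "tissues": (example["tissues"][1], example["tissues"][0]),
        "lmh": (example["lmh"][1], example["lmh"][0]),
        "dni": SWAPPED_DNI[example["dni"]],
    }

def identical_pair(exon, tissue, lmh):
    """A redundant example: both tissues equal, so the DNI target is 'no change'."""
    return {"exon": exon, "tissues": (tissue, tissue), "lmh": (lmh, lmh),
            "dni": "no change"}

example = {"exon": "exon_1", "tissues": ("brain", "liver"),
           "lmh": ("high", "low"), "dni": "decreased"}
print(swap_tissues(example))
print(identical_pair("exon_1", "brain", "high"))
```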

The training batches were biased such that earlier batches contain 4/5 examples with high tissue variability and 1/5 examples with low tissue variability. After the high-variability examples have all been used, the batches are drawn randomly from the remaining low-variability examples. The stated purpose is to give examples with high tissue variability greater importance, while avoiding overfitting by having them early in the training.
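A minimal sketch of this batch construction is shown below. The function name, the batch size and the toy pools are assumptions; only the 4/5 to 1/5 ratio and the switch to the remaining examples come from the text.

```python
import random

def biased_batches(high, low, batch_size, rng):
    """Yield batches with 4/5 high-variability and 1/5 low-variability examples
    until the high-variability pool runs out, then draw only from what is left."""
    high, low = list(high), list(low)
    rng.shuffle(high)
    rng.shuffle(low)
    n_high = batch_size * 4 // 5
    n_low = batch_size - n_high
    while len(high) >= n_high and len(low) >= n_low:
        batch = [high.pop() for _ in range(n_high)] + [low.pop() for _ in range(n_low)]
        rng.shuffle(batch)
        yield batch
    rest = high + low
    rng.shuffle(rest)
    for i in range(0, len(rest), batch_size):
        yield rest[i:i + batch_size]

rng = random.Random(0)
high = [("high", i) for i in range(40)]
low = [("low", i) for i in range(100)]
batches = list(biased_batches(high, low, batch_size=10, rng=rng))
print(len(batches), sum(kind == "high" for kind, _ in batches[0]))  # 14 8
```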

= Performance comparison =

The performance of the model was assessed using the area under the Receiver Operating Characteristic curve (AUC). The paper compared three methods on the same footing: the DNN, the BNN and MLR.
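For reference, the AUC of a binary classifier can be computed directly from its pairwise-ranking interpretation, as in the sketch below. Treating each class as a one-vs-rest binary problem for a three-class code is an assumption here, and the scores are made up.

```python
def auc(scores, labels):
    """Probability that a random positive is scored above a random negative
    (ties count one half); equal to the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Predicted probability of one class versus whether the exon is in that class.
print(auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))  # 0.75
```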

The results for the LMH code are shown in the table below. Table 1a reports AUC for PSI predictions from the LMH code on all tissues, while Table 1b reports AUC evaluated on the subset of events that exhibit large tissue variability. In 1a, the performance of the DNN in the ''low'' and ''high'' categories is comparable to that of the BNN, but the DNN outperforms it in the ''medium'' category. In 1b, the DNN significantly outperformed the BNN and MLR. In both comparisons, MLR performed poorly.

[[File: LMH.png]]

Next, we look at how well the different methods can predict <math>\Delta PSI</math> (DNI code). The DNN predicts the LMH code and the DNI code at the same time, whereas the BNN can only predict the LMH code. Thus, for a fair comparison, the authors trained an MLR on the predicted outputs for each tissue pair from the BNN and similarly trained an MLR on the LMH outputs of the DNN. Table 2 shows that both the DNN and DNN+MLR outperformed BNN+MLR and MLR.

[[File: DNI.png]]

'''Why did DNN outperform?'''

1. The use of tissue types as an input feature, which strictly requires the model's hidden representations to be in a form that can be well modulated by the information specifying the different tissue types for splicing pattern prediction.

2. The model is described by thousands of hidden units and multiple layers of non-linearity. In contrast, BNN only has 30 hidden units, which may not be sufficient.

3. A hyperparameter search is performed to optimize the DNN.

4. The use of dropout, which contributed a ~1-6% improvement in the LMH code for different tissues, and ~2-7% in the DNI code, compared with training without dropout.

5. Training was biased toward the tissue-specific events (by construction of minibatches).

= Conclusion =

This work shows that DNNs can also be applied to sparse biological datasets. Furthermore, the input features can be analyzed in terms of the model's predictions to gain some insight into the inferred tissue-regulated splicing code. This architecture can easily be extended to the case of more data from different sources.

= References =

<references />

joint training of a convolutional network and a graphical model for human pose estimation

2015-12-11T05:09:17Z

Amirlk: /* Higher-Level Spatial-Model */

== Introduction ==

Human body pose estimation, or specifically the localization of human joints in monocular RGB images, remains a very challenging task in computer vision. Recent approaches to this problem fall into two broad categories: traditional deformable part models and deep-learning based discriminative models. Traditional models rely on the aggregation of hand-crafted low-level features and then use a standard classifier or a higher-level generative model to detect the pose, which requires the features to be sensitive enough and invariant to deformations. Deep learning approaches learn an empirical set of low- and high-level features that are more tolerant to variations. However, it is difficult to incorporate prior knowledge about the structure of the human body into them.

This paper proposes a new hybrid architecture that consists of a deep Convolutional Network Part-Detector and a part-based Spatial-Model. This combination and joint training significantly outperforms existing state-of-the-art models on the task of human body pose recognition.

== Model ==
=== Convolutional Network Part-Detector ===

They combine an efficient ConvNet architecture with multi-resolution and overlapping receptive fields, which is shown in the figure below.

[[File:architecture1.PNG | center]]

Traditionally, in image processing tasks such as these, a Laplacian Pyramid<ref>
[https://en.wikipedia.org/wiki/Pyramid_(image_processing)#Gaussian_pyramid "Pyramid (image processing)"]
</ref> of three resolution banks is used to provide each bank with non-overlapping spectral content. Then the Local Contrast Normalization (LCN<ref>
Collobert R, Kavukcuoglu K, Farabet C.[http://infoscience.epfl.ch/record/192376/files/Collobert_NIPSWORKSHOP_2011.pdf Torch7: A matlab-like environment for machine learning] BigLearn, NIPS Workshop. 2011 (EPFL-CONF-192376).
</ref>) is applied to those input images. However, in this model, only a full-image stage and a half-resolution stage were used, allowing for a simpler architecture and faster training.

Although a sliding-window architecture is usually used for this type of task, it has the downside of creating redundant convolutions. Instead, in this network, a ConvNet architecture with overlapping receptive fields is used for each resolution bank to produce a heat-map as output, which gives a per-pixel likelihood for key joint locations on the human skeleton.

The convolution results (feature maps) of the low-resolution bank are upscaled and interleaved with those of the high-resolution bank. These dense feature maps are then processed through convolution stages at each pixel, which is equivalent to a fully-connected network model but more efficient.

Supervised training of the network is performed using batched Stochastic Gradient Descent (SGD) with Nesterov Momentum. They use a Mean Squared Error (MSE) criterion to minimize the distance between the predicted output and a target heat-map. At training time they also perform random perturbations of the input images (randomly flipping and scaling the images) to increase generalization performance.
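The training criterion described above can be sketched in NumPy. This is only an illustration under stated assumptions: the target heat-map is taken to be a 2D Gaussian centred on the joint, the "parameters" being optimized are the heat-map pixels themselves (a stand-in for the network weights), and the function names, learning rate and momentum value are hypothetical.

```python
import numpy as np

def gaussian_heatmap(height, width, center_row, center_col, sigma=1.5):
    """Target heat-map: a 2D Gaussian centred on the joint (an assumed choice)."""
    rows = np.arange(height)[:, None]
    cols = np.arange(width)[None, :]
    sq_dist = (rows - center_row) ** 2 + (cols - center_col) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def mse(predicted, target):
    """Mean Squared Error between a predicted and a target heat-map."""
    return np.mean((predicted - target) ** 2)

def nesterov_step(params, velocity, grad_fn, lr, momentum=0.9):
    """One SGD step with Nesterov momentum: the gradient is taken at the look-ahead point."""
    lookahead = params + momentum * velocity
    velocity = momentum * velocity - lr * grad_fn(lookahead)
    return params + velocity, velocity

target = gaussian_heatmap(16, 16, center_row=5, center_col=9)
predicted = np.zeros_like(target)   # stand-in for the Part-Detector output
velocity = np.zeros_like(target)

def grad_fn(p):
    # Gradient of the MSE with respect to the predicted heat-map.
    return 2.0 * (p - target) / p.size

initial_loss = mse(predicted, target)
for _ in range(200):
    predicted, velocity = nesterov_step(predicted, velocity, grad_fn, lr=10.0)
final_loss = mse(predicted, target)
```

In a real network the gradient would be back-propagated to the convolution weights rather than applied to the heat-map directly.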

=== Higher-Level Spatial-Model ===

They use a higher-level Spatial-Model to get rid of false positive outliers and anatomically incorrect poses predicted by the Part-Detector, constraining joint inter-connectivity and enforcing global pose consistency.

They formulate the Spatial-Model as an MRF-like model over the distribution of spatial locations for each body part. After the unary potentials for each body part location are provided by the Part-Detector, the pair-wise potentials in the graph are computed using convolutional priors, which model the conditional distribution of the location of one body part to another. For instance, the final marginal likelihood for a body part A can be calculated as:

<math>\bar{p}_{A}=\frac{1}{Z}\prod_{v\in V}^{ }\left ( p_{A|v}*p_{v}+b_{v\rightarrow A} \right )</math>

Where <math>v</math> is the joint location, <math>p_{A|v}</math> is the conditional prior which is the likelihood of the body part A occurring in pixel location (i, j) when joint <math>v</math> is located at the center pixel, <math>b_{v\rightarrow A}</math> is a bias term used to describe the background probability for the message from joint <math>v</math> to A, and Z is the partition function. The learned pair-wise distributions are purely uniform when any pairwise edge should be removed from the graph structure. The above equation is analogous to a single round of sum-product belief propagation. Convergence to a global optimum is not guaranteed given that this spatial model is not tree structured. However, the inferred solution is sufficiently accurate for all poses in datasets used in this research.

For their practical implementation, they treat the distributions above as energies to avoid the evaluation of Z in the equation before. Their final model is

<math>\bar{e}_{A}=\mathrm{exp}\left ( \sum_{v\in V}^{ }\left [ \mathrm{log}\left ( \mathrm{SoftPlus}\left ( e_{A|v} \right )*\mathrm{ReLU}\left ( e_{v} \right )+\mathrm{SoftPlus}\left ( b_{v\rightarrow A} \right ) \right ) \right ] \right )</math>

<math>\mathrm{where:SoftPlus}\left ( x \right )=\frac{1}{\beta }\mathrm{log}\left ( 1+\mathrm{exp}\left ( \beta x \right ) \right ), 0.5\leq \beta \leq 2</math>

<math>\mathrm{ReLU}\left ( x \right )=\mathrm{max}\left ( x,\epsilon \right ), 0< \epsilon \leq 0.01</math>

This model replaces the outer multiplication of final marginal likelihood with a log space addition to improve numerical stability and to prevent coupling of the convolution output gradients (the addition in log space means that the partial derivative of the loss function with respect to the convolution output is not dependent on the output of any other stages). With this modified formulation, the equation can be trained by using back-propagation and SGD. The network-based implementation of the equation is shown below.

[[File:architecture2.PNG | center]]
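As a rough illustration of the energy formulation above, the following NumPy sketch evaluates <math>\bar{e}_{A}</math> for a toy example. The joint names, array sizes, the values of <math>\beta</math> and <math>\epsilon</math>, and the helper names are assumptions made for the example; the convolution is a plain direct implementation, not the FFT-based one used in the paper.

```python
import numpy as np

def softplus(x, beta=1.0):
    # SoftPlus(x) = (1/beta) * log(1 + exp(beta * x)), computed stably via logaddexp.
    return np.logaddexp(0.0, beta * x) / beta

def relu_eps(x, eps=0.01):
    # ReLU(x) = max(x, eps), so the result is strictly positive.
    return np.maximum(x, eps)

def conv2d_same(image, kernel):
    """Direct 2D convolution with zero padding; output has the image's shape (odd kernel sizes)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="constant")
    flipped = kernel[::-1, ::-1]
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * flipped)
    return out

def spatial_model_energy(unary, pairwise, bias, beta=1.0, eps=0.01):
    """exp( sum_v log( SoftPlus(e_{A|v}) * ReLU(e_v) + SoftPlus(b_{v->A}) ) )."""
    log_sum = 0.0
    for v in unary:
        message = conv2d_same(relu_eps(unary[v], eps), softplus(pairwise[v], beta))
        log_sum = log_sum + np.log(message + softplus(bias[v], beta))
    return np.exp(log_sum)

rng = np.random.default_rng(0)
joints = ["face", "shoulder"]                               # hypothetical joint set V
unary = {v: rng.normal(size=(8, 8)) for v in joints}        # e_v from the Part-Detector
pairwise = {v: rng.normal(size=(5, 5)) for v in joints}     # e_{A|v} convolution kernels
bias = {v: rng.normal() for v in joints}                    # b_{v->A}
e_bar = spatial_model_energy(unary, pairwise, bias)
```

Because SoftPlus and the modified ReLU are strictly positive, the argument of the logarithm is always positive, and the result equals the product form of the marginal up to the normalization constant Z.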

The convolution kernels used in this step are quite large, so they apply GPU-based FFT convolutions, as introduced by Mathieu et al.<ref>
Mathieu M, Henaff M, LeCun Y.[http://arxiv.org/pdf/1312.5851.pdf Fast training of convolutional networks through ffts] arXiv preprint arXiv:1312.5851, 2013.
</ref>. The convolution weights are initialized using the empirical histogram of joint displacements created from the training examples. Moreover, during training they randomly flip and scale the heat-map inputs to improve generalization performance.
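The idea behind FFT convolution can be sketched on the CPU with NumPy: a convolution in the spatial domain becomes an element-wise product in the frequency domain, which pays off when kernels are large. This is a minimal illustration, not the GPU implementation referenced above; the function names and array sizes are chosen for the example.

```python
import numpy as np

def conv2d_full_direct(image, kernel):
    """Full linear 2D convolution computed directly in the spatial domain."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih + kh - 1, iw + kw - 1))
    for i in range(kh):
        for j in range(kw):
            # Each kernel tap adds a shifted, scaled copy of the image.
            out[i:i + ih, j:j + iw] += kernel[i, j] * image
    return out

def conv2d_full_fft(image, kernel):
    """The same convolution via zero-padded FFTs and an element-wise product."""
    shape = (image.shape[0] + kernel.shape[0] - 1,
             image.shape[1] + kernel.shape[1] - 1)
    spectrum = np.fft.rfft2(image, s=shape) * np.fft.rfft2(kernel, s=shape)
    return np.fft.irfft2(spectrum, s=shape)

rng = np.random.default_rng(0)
heatmap = rng.normal(size=(32, 32))
large_kernel = rng.normal(size=(15, 15))
direct = conv2d_full_direct(heatmap, large_kernel)
via_fft = conv2d_full_fft(heatmap, large_kernel)
```

Zero-padding both inputs to the full output size is what turns the circular convolution computed by the FFT into the desired linear convolution.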

=== Unified Model ===

They first train the Part-Detector separately and store the heat-map outputs, then use these heat-maps to train a Spatial-Model. Finally, they combine the trained Part-Detector and Spatial-Models and back-propagate through the entire network, which further improves performance.
Because the Spatial-Model is able to effectively reduce the output dimension of possible heat-map activations, the Part-Detector can use available learning capacity to better localize the precise target activation.

== Results ==

They evaluated their architecture on the FLIC and extended-LSP datasets. The FLIC dataset is comprised of 5003 images from Hollywood movies with actors in predominantly front-facing standing up poses, while the extended-LSP dataset contains a wider variety of poses of athletes playing sport. They also proposed a new dataset called FLIC-plus<ref>
[http://cims.nyu.edu/~tompson/flic_plus.htm "FLIC-plus Dataset"]
</ref> which is fairer than the FLIC-full dataset.

Their model’s performance on the FLIC test-set for the elbow and wrist joints is shown below. It’s trained by using both the FLIC and FLIC-plus training sets.

[[File:result1.PNG | center]]

Performance on the LSP dataset is shown here.

[[File:result2.PNG | center]]

Since the LSP dataset covers a larger range of possible poses, their Spatial-Model is less effective, and the accuracy on this dataset is lower than on FLIC. They believe that increasing the size of the training set will improve performance for these difficult cases.

== Bibliography ==
<references />

joint training of a convolutional network and a graphical model for human pose estimation

2015-12-11T05:04:05Z

Amirlk: /* Higher-Level Spatial-Model */

== Introduction ==

Human body pose estimation, or specifically the localization of human joints in monocular RGB images, remains a very challenging task in computer vision. Recent approaches to this problem fall into two broad categories: traditional deformable part models and deep-learning based discriminative models. Traditional models rely on the aggregation of hand-crafted low-level features and then use a standard classifier or a higher-level generative model to detect the pose, which requires the features to be sensitive enough and invariant to deformations. Deep learning approaches learn an empirical set of low- and high-level features which are more tolerant to variations. However, it is difficult to incorporate prior knowledge about the structure of the human body into them.

This paper proposes a new hybrid architecture that consists of a deep Convolutional Network Part-Detector and a part-based Spatial-Model. This combination and joint training significantly outperforms existing state-of-the-art models on the task of human body pose recognition.

== Model ==
=== Convolutional Network Part-Detector ===

They combine an efficient ConvNet architecture with multi-resolution and overlapping receptive fields, which is shown in the figure below.

[[File:architecture1.PNG | center]]

Traditionally, in image processing tasks such as these, a Laplacian Pyramid<ref>
[https://en.wikipedia.org/wiki/Pyramid_(image_processing)#Gaussian_pyramid "Pyramid (image processing)"]
</ref> of three resolution banks is used to provide each bank with non-overlapping spectral content. Then the Local Contrast Normalization (LCN<ref>
Collobert R, Kavukcuoglu K, Farabet C.[http://infoscience.epfl.ch/record/192376/files/Collobert_NIPSWORKSHOP_2011.pdf Torch7: A matlab-like environment for machine learning] BigLearn, NIPS Workshop. 2011 (EPFL-CONF-192376).
</ref>) is applied to those input images. However, in this model, only a full-image stage and a half-resolution stage were used, allowing for a simpler architecture and faster training.

Although a sliding-window architecture is usually used for this type of task, it has the downside of creating redundant convolutions. Instead, in this network, a ConvNet architecture with overlapping receptive fields is used for each resolution bank to produce a heat-map as output, which gives a per-pixel likelihood for key joint locations on the human skeleton.

The convolution results (feature maps) of the low-resolution bank are upscaled and interleaved with those of the high-resolution bank. These dense feature maps are then processed through convolution stages at each pixel, which is equivalent to a fully-connected network model but more efficient.

Supervised training of the network is performed using batched Stochastic Gradient Descent (SGD) with Nesterov Momentum. They use a Mean Squared Error (MSE) criterion to minimize the distance between the predicted output and a target heat-map. At training time they also perform random perturbations of the input images (randomly flipping and scaling the images) to increase generalization performance.

=== Higher-Level Spatial-Model ===

They use a higher-level Spatial-Model to get rid of false positive outliers and anatomically incorrect poses predicted by the Part-Detector, constraining joint inter-connectivity and enforcing global pose consistency.

They formulate the Spatial-Model as an MRF-like model over the distribution of spatial locations for each body part. After the unary potentials for each body part location are provided by the Part-Detector, the pair-wise potentials in the graph are computed using convolutional priors, which model the conditional distribution of the location of one body part to another. For instance, the final marginal likelihood for a body part A can be calculated as:

<math>\bar{p}_{A}=\frac{1}{Z}\prod_{v\in V}^{ }\left ( p_{A|v}*p_{v}+b_{v\rightarrow A} \right )</math>

Where <math>v</math> is the joint location, <math>p_{A|v}</math> is the conditional prior which is the likelihood of the body part A occurring in pixel location (i, j) when joint <math>v</math> is located at the center pixel, <math>b_{v\rightarrow A}</math> is a bias term used to describe the background probability for the message from joint <math>v</math> to A, and Z is the partition function. The learned pair-wise distributions are purely uniform when any pairwise edge should be removed from the graph structure. The above equation is analogous to a single round of sum-product belief propagation. Convergence to a global optimum is not guaranteed given that this spatial model is not tree structured. However, the inferred solution is sufficiently accurate for all poses in datasets used in this research.

For their practical implementation, they treat the distributions above as energies to avoid the evaluation of Z in the equation before. Their final model is

<math>\bar{e}_{A}=\mathrm{exp}\left ( \sum_{v\in V}^{ }\left [ \mathrm{log}\left ( \mathrm{SoftPlus}\left ( e_{A|v} \right )*\mathrm{ReLU}\left ( e_{v} \right )+\mathrm{SoftPlus}\left ( b_{v\rightarrow A} \right ) \right ) \right ] \right )</math>

<math>\mathrm{where:SoftPlus}\left ( x \right )=\frac{1}{\beta }\mathrm{log}\left ( 1+\mathrm{exp}\left ( \beta x \right ) \right ), 0.5\leq \beta \leq 2</math>

<math>\mathrm{ReLU}\left ( x \right )=\mathrm{max}\left ( x,\epsilon \right ), 0< \epsilon \leq 0.01</math>

With this modified formulation, the equation can be trained by using back-propagation and SGD. The network-based implementation of the equation is shown below.

[[File:architecture2.PNG | center]]

The convolution kernels used in this step are quite large, so they apply GPU-based FFT convolutions, as introduced by Mathieu et al.<ref>
Mathieu M, Henaff M, LeCun Y.[http://arxiv.org/pdf/1312.5851.pdf Fast training of convolutional networks through ffts] arXiv preprint arXiv:1312.5851, 2013.
</ref>. The convolution weights are initialized using the empirical histogram of joint displacements created from the training examples. Moreover, during training they randomly flip and scale the heat-map inputs to improve generalization performance.

=== Unified Model ===

They first train the Part-Detector separately and store the heat-map outputs, then use these heat-maps to train a Spatial-Model. Finally, they combine the trained Part-Detector and Spatial-Models and back-propagate through the entire network, which further improves performance.
Because the Spatial-Model is able to effectively reduce the output dimension of possible heat-map activations, the Part-Detector can use available learning capacity to better localize the precise target activation.

== Results ==

They evaluated their architecture on the FLIC and extended-LSP datasets. The FLIC dataset is comprised of 5003 images from Hollywood movies with actors in predominantly front-facing standing up poses, while the extended-LSP dataset contains a wider variety of poses of athletes playing sport. They also proposed a new dataset called FLIC-plus<ref>
[http://cims.nyu.edu/~tompson/flic_plus.htm "FLIC-plus Dataset"]
</ref> which is fairer than the FLIC-full dataset.

Their model’s performance on the FLIC test-set for the elbow and wrist joints is shown below. It’s trained by using both the FLIC and FLIC-plus training sets.

[[File:result1.PNG | center]]

Performance on the LSP dataset is shown here.

[[File:result2.PNG | center]]

Since the LSP dataset covers a larger range of possible poses, their Spatial-Model is less effective, and the accuracy on this dataset is lower than on FLIC. They believe that increasing the size of the training set will improve performance for these difficult cases.

== Bibliography ==
<references />

joint training of a convolutional network and a graphical model for human pose estimation

2015-12-11T04:55:12Z

Amirlk: /* Unified Model */

== Introduction ==

Human body pose estimation, or specifically the localization of human joints in monocular RGB images, remains a very challenging task in computer vision. Recent approaches to this problem fall into two broad categories: traditional deformable part models and deep-learning based discriminative models. Traditional models rely on the aggregation of hand-crafted low-level features and then use a standard classifier or a higher-level generative model to detect the pose, which requires the features to be sensitive enough and invariant to deformations. Deep learning approaches learn an empirical set of low- and high-level features which are more tolerant to variations. However, it is difficult to incorporate prior knowledge about the structure of the human body into them.

This paper proposes a new hybrid architecture that consists of a deep Convolutional Network Part-Detector and a part-based Spatial-Model. This combination and joint training significantly outperforms existing state-of-the-art models on the task of human body pose recognition.

== Model ==
=== Convolutional Network Part-Detector ===

They combine an efficient ConvNet architecture with multi-resolution and overlapping receptive fields, which is shown in the figure below.

[[File:architecture1.PNG | center]]

Traditionally, in image processing tasks such as these, a Laplacian Pyramid<ref>
[https://en.wikipedia.org/wiki/Pyramid_(image_processing)#Gaussian_pyramid "Pyramid (image processing)"]
</ref> of three resolution banks is used to provide each bank with non-overlapping spectral content. Then the Local Contrast Normalization (LCN<ref>
Collobert R, Kavukcuoglu K, Farabet C.[http://infoscience.epfl.ch/record/192376/files/Collobert_NIPSWORKSHOP_2011.pdf Torch7: A matlab-like environment for machine learning] BigLearn, NIPS Workshop. 2011 (EPFL-CONF-192376).
</ref>) is applied to those input images. However, in this model, only a full-image stage and a half-resolution stage were used, allowing for a simpler architecture and faster training.

Although a sliding-window architecture is usually used for this type of task, it has the downside of creating redundant convolutions. Instead, in this network, a ConvNet architecture with overlapping receptive fields is used for each resolution bank to produce a heat-map as output, which gives a per-pixel likelihood for key joint locations on the human skeleton.

The convolution results (feature maps) of the low-resolution bank are upscaled and interleaved with those of the high-resolution bank. These dense feature maps are then processed through convolution stages at each pixel, which is equivalent to a fully-connected network model but more efficient.

Supervised training of the network is performed using batched Stochastic Gradient Descent (SGD) with Nesterov Momentum. They use a Mean Squared Error (MSE) criterion to minimize the distance between the predicted output and a target heat-map. At training time they also perform random perturbations of the input images (randomly flipping and scaling the images) to increase generalization performance.

=== Higher-Level Spatial-Model ===

They use a higher-level Spatial-Model to get rid of false positive outliers and anatomically incorrect poses predicted by the Part-Detector, constraining joint inter-connectivity and enforcing global pose consistency.

They formulate the Spatial-Model as an MRF-like model over the distribution of spatial locations for each body part. After the unary potentials for each body part location are provided by the Part-Detector, the pair-wise potentials in the graph are computed using convolutional priors, which model the conditional distribution of the location of one body part to another. For instance, the final marginal likelihood for a body part A can be calculated as:

<math>\bar{p}_{A}=\frac{1}{Z}\prod_{v\in V}^{ }\left ( p_{A|v}*p_{v}+b_{v\rightarrow A} \right )</math>

Where <math>v</math> is the joint location, <math>p_{A|v}</math> is the conditional prior which is the likelihood of the body part A occurring in pixel location (i, j) when joint <math>v</math> is located at the center pixel, <math>b_{v\rightarrow A}</math> is a bias term used to describe the background probability for the message from joint <math>v</math> to A, and Z is the partition function. The learned pair-wise distributions are purely uniform when any pairwise edge should be removed from the graph structure.

For their practical implementation they treat the distributions above as energies to avoid the evaluation of Z in the equation before. Their final model is

<math>\bar{e}_{A}=\mathrm{exp}\left ( \sum_{v\in V}^{ }\left [ \mathrm{log}\left ( \mathrm{SoftPlus}\left ( e_{A|v} \right )*\mathrm{ReLU}\left ( e_{v} \right )+\mathrm{SoftPlus}\left ( b_{v\rightarrow A} \right ) \right ) \right ] \right )</math>

<math>\mathrm{where:SoftPlus}\left ( x \right )=\frac{1}{\beta }\mathrm{log}\left ( 1+\mathrm{exp}\left ( \beta x \right ) \right ), 0.5\leq \beta \leq 2</math>

<math>\mathrm{ReLU}\left ( x \right )=\mathrm{max}\left ( x,\epsilon \right ), 0< \epsilon \leq 0.01</math>

With this modified formulation, the equation can be trained by using back-propagation and SGD. The network-based implementation of the equation is shown below.

[[File:architecture2.PNG | center]]

The convolution kernels used in this step are quite large, so they apply GPU-based FFT convolutions, as introduced by Mathieu et al.<ref>
Mathieu M, Henaff M, LeCun Y.[http://arxiv.org/pdf/1312.5851.pdf Fast training of convolutional networks through ffts] arXiv preprint arXiv:1312.5851, 2013.
</ref>. The convolution weights are initialized using the empirical histogram of joint displacements created from the training examples. Moreover, during training they randomly flip and scale the heat-map inputs to improve generalization performance.

=== Unified Model ===

They first train the Part-Detector separately and store the heat-map outputs, then use these heat-maps to train a Spatial-Model. Finally, they combine the trained Part-Detector and Spatial-Models and back-propagate through the entire network, which further improves performance.
Because the Spatial-Model is able to effectively reduce the output dimension of possible heat-map activations, the Part-Detector can use available learning capacity to better localize the precise target activation.

== Results ==

They evaluated their architecture on the FLIC and extended-LSP datasets. The FLIC dataset is comprised of 5003 images from Hollywood movies with actors in predominantly front-facing standing up poses, while the extended-LSP dataset contains a wider variety of poses of athletes playing sport. They also proposed a new dataset called FLIC-plus<ref>
[http://cims.nyu.edu/~tompson/flic_plus.htm "FLIC-plus Dataset"]
</ref> which is fairer than the FLIC-full dataset.

Their model’s performance on the FLIC test-set for the elbow and wrist joints is shown below. It’s trained by using both the FLIC and FLIC-plus training sets.

[[File:result1.PNG | center]]

Performance on the LSP dataset is shown here.

[[File:result2.PNG | center]]

Since the LSP dataset covers a larger range of possible poses, their Spatial-Model is less effective, and the accuracy on this dataset is lower than on FLIC. They believe that increasing the size of the training set will improve performance for these difficult cases.

== Bibliography ==
<references />

from Machine Learning to Machine Reasoning

2015-12-09T03:16:34Z

Amirlk: /* Probabilistic Models */

== Introduction ==
Learning and reasoning are both essential abilities associated with intelligence. Consequently, machine learning and machine reasoning have received considerable attention given the short history of computer science. The statistical nature of machine learning is now understood but the ideas behind machine reasoning are much more elusive. Converting ordinary data into a set of logical rules proves to be very challenging: searching the discrete space of symbolic formulas leads to combinatorial explosion <ref>Lighthill, J. [http://www.math.snu.ac.kr/~hichoi/infomath/Articles/Lighthill%20Report.pdf "Artificial intelligence: a general survey."] In Artificial intelligence: a paper symposium. Science Research Council.</ref>. Algorithms for probabilistic inference <ref>Pearl, J. [http://bayes.cs.ucla.edu/BOOK-2K/neuberg-review.pdf "Causality: models, reasoning, and inference."] Cambridge: Cambridge University Press.</ref> still suffer from unfavourable computational properties <ref>Roth, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.6074&rep=rep1&type=pdf "On the hardness of approximate reasoning"] Artificial Intelligence, 82, 273–302.</ref>. Tractable inference algorithms do exist, but they come at the price of reduced expressive capabilities in both logical and probabilistic inference.

Humans display neither of these limitations.

The ability to reason is not the same as the ability to make logical inferences. The way that humans reason provides evidence to suggest the existence of a middle layer, already a form of reasoning, but not yet formal or logical. Informal logic is attractive because we hope to avoid the computational complexity that is associated with combinatorial searches in the vast space of discrete logic propositions.

This paper shows how deep learning and multi-task learning can be leveraged as a rudimentary form of reasoning to help solve a task of interest.

This approach is explored along a number of auxiliary tasks.

== Auxiliary Tasks ==

The usefulness of auxiliary tasks was examined in the context of two problems: face-based identification and natural language processing. Both examples show how an easier task (for instance, determining whether two faces are different) can be used to boost performance on a harder task (for instance, identifying faces) using inference.

'''Face-based Identification'''

Identifying a person from face images is challenging. It remains expensive to collect and label millions of images representing the face of each subject with a good variety of positions and contexts. However, it is easier to collect training data for the slightly different task of telling whether two faces in images represent the same person or not: two faces in the same picture are likely to belong to two different people; two faces in successive video frames are likely to belong to the same person. These two tasks have much in common: image analysis primitives, feature extractors, and part recognizers trained on the auxiliary task can help solve the original task.

The figure below illustrates a transfer learning strategy involving three trainable models. The preprocessor P computes a compact face representation of the image, the comparator D compares the representations of two faces, and the classifier C labels the face. We first assemble two instances of the preprocessor P and one comparator D and train this model with abundant labels for the auxiliary task. Then we assemble another instance of P with the classifier C and train the resulting model using a limited number of labelled examples from the original task.

[[File:figure1.JPG | center]]
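The way the three modules are assembled can be sketched in Python. This is only an illustration of the composition and of the sharing of P between the two tasks; no training is performed, and the class names, layer sizes and the particular forms chosen for P, D and C (a single tanh layer, a distance-based score, a softmax layer) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

class Preprocessor:
    """P: maps a flattened face image to a compact representation."""
    def __init__(self, input_dim, repr_dim):
        self.weights = rng.normal(scale=0.1, size=(repr_dim, input_dim))

    def __call__(self, image):
        return np.tanh(self.weights @ image)

class Comparator:
    """D: scores how likely two representations belong to the same person."""
    def __call__(self, repr_a, repr_b):
        return 1.0 / (1.0 + np.sum((repr_a - repr_b) ** 2))

class Classifier:
    """C: maps a representation to a distribution over known identities."""
    def __init__(self, repr_dim, num_people):
        self.weights = rng.normal(scale=0.1, size=(num_people, repr_dim))

    def __call__(self, representation):
        logits = self.weights @ representation
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()

P = Preprocessor(input_dim=64, repr_dim=8)
D = Comparator()
C = Classifier(repr_dim=8, num_people=3)

def auxiliary_model(image_a, image_b):
    # Auxiliary task: the same P instance processes both faces, then D compares them.
    return D(P(image_a), P(image_b))

def original_model(image):
    # Original task: the P trained on the auxiliary task is reused in front of C.
    return C(P(image))

face_a, face_b = rng.normal(size=64), rng.normal(size=64)
same_score = auxiliary_model(face_a, face_a)
diff_score = auxiliary_model(face_a, face_b)
identity_probs = original_model(face_a)
```

Because both assemblies hold a reference to the same P object, any update made to its weights while training on the auxiliary task is automatically available to the original task.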

'''Natural Language Processing'''

The auxiliary task in this case (left diagram of the figure below) is identifying whether a sentence is correct or not. This creates an embedding for words in a 50-dimensional space. This embedding can then be used on the primary problem (right diagram of the figure below) of producing tags for the words. Note the "W" modules shared between the tasks.

[[File:word_transfer.png | center]]

== Reasoning Revisited ==
Little attention has been paid to the rules that describe how to assemble trainable models that perform specific tasks. However, these composition rules play an extremely important role, as they describe algebraic manipulations that let us combine previously acquired knowledge in order to create a model that addresses a new task.

We now draw a bold parallel: "algebraic manipulation of previously acquired knowledge in order to answer a new question" is a plausible definition of the word "reasoning".

Composition rules can be described with very different levels of sophistication. For instance, graph transformer networks (depicted in the figure below) <ref>Bottou, L., LeCun, Y., & Bengio, Y. [http://www.iro.umontreal.ca/~lisa/pointeurs/bottou-lecun-bengio-97.pdf "Global training of document processing systems using graph transformer networks."] In Proc. of computer vision and pattern recognition (pp. 489–493). New York: IEEE Press.</ref> construct specific recognition and training models for each input image using graph transduction algorithms. The specification of the graph transducers then should be viewed as a description of the composition rules.

[[File:figure5.JPG | center]]

== Probabilistic Models ==
Graphical models describe the factorization of joint probability distributions into elementary conditional distributions with specific independence assumptions. The probabilistic rules then induce an algebraic structure on the space of conditional probability distributions, describing relations in an arbitrary set of random variables. Many refinements have been devised to make the parametrization more explicit. The plate notation<ref name=BuW>
Buntine, Wray L [http://arxiv.org/pdf/cs/9412102.pdf "Operations for learning with graphical models"] in The Journal of Artificial Intelligence Research, (1994).
</ref> compactly represents large graphical models with repeated structures that usually share parameters. More recent works propose considerably richer languages to describe large graphical probabilistic models. Such high order languages for describing probabilistic models are expressions of the composition rules described in the previous section.

== Reasoning Systems ==
We are no longer fitting a simple statistical model to data; instead, we are dealing with a more complex object consisting of (a) an algebraic space of models, and (b) composition rules that establish a correspondence between the space of models and the space of questions of interest. We call such an object a "reasoning system".

Reasoning systems vary in expressive power, predictive ability, and computational requirements. A few examples include:
*''First order logic reasoning'' - Consider a space of models composed of functions that predict the truth value of a first order logic formula as a function of its free variables. This space is highly constrained by algebraic structure and hence, if we know some of these functions, we can apply logical inference to deduce or constrain other functions. First order logic is highly expressive because the bulk of mathematics can be formalized as first order logic statements <ref>Hilbert, D., & Ackermann, W.[https://www.math.uwaterloo.ca/~snburris/htdocs/scav/hilbert/hilbert.html "Grundzüge der theoretischen Logik."] Berlin: Springer.</ref>. However, it is not sufficient for expressing natural language: every first order logic formula can be expressed in natural language but the converse is not true. Finally, first order logic usually leads to computationally expensive algorithms.

*''Probabilistic reasoning'' - Consider a space of models formed by all the conditional probability distributions associated with a set of predefined random variables. These conditional distributions are highly constrained by algebraic structure and hence, we can apply Bayesian inference to form deductions. Probability models are computationally less expensive, but this comes at the price of lower expressive power: probability theory can be described by first order logic but the converse is not true.

*''Causal reasoning'' - The events "it is raining" and "people carry open umbrellas" are highly correlated and predictive: if people carry open umbrellas, then it is likely that it is raining. This does not, however, tell you the consequences of an intervention: banning umbrellas will not stop the rain.

*''Newtonian Mechanics'' - Classical mechanics is an example of the great predictive power of causal reasoning. Newton's three laws of motion make very accurate predictions about the motion of bodies in our universe.

*''Spatial reasoning'' - The change in a visual scene that results from a change in one's viewpoint is also subject to algebraic constraints.

*''Social reasoning'' - Changes of viewpoints also play a very important role in social interactions.

*''Non-falsifiable reasoning'' - Examples of non-falsifiable reasoning include mythology and astrology. Just like non-falsifiable statistical models, non-falsifiable reasoning systems are unlikely to have useful predictive capabilities.

It is desirable to map the universe of reasoning systems, but unfortunately, we cannot expect such theoretical advances on schedule. We can, however, nourish our intuitions by empirically exploring the capabilities of algebraic structures designed for specific applicative domains.

The replication of essential human cognitive processes such as scene analysis, language understanding, and social interactions forms an important class of applications. These processes probably include a form of logical reasoning because we are able to explain our conclusions with logical arguments. However, the actual processes happen without conscious involvement, suggesting that the full complexity of logical reasoning is not required.

The following sections describe more specific ideas investigating reasoning systems suitable for natural language processing and vision tasks.

== Association and Dissociation ==
We consider again a collection of trainable modules. The word embedding module W computes a continuous representation for each word of the dictionary. The association module A is a trainable function that takes two vectors in the representation space and produces a single vector in the same space, which is supposed to represent the association of the two inputs. Given a sentence segment composed of ''n'' words, the figure below shows how ''n-1'' applications of the association module reduce the sentence segment to a single vector. We would like this vector to be a representation of the meaning of this sentence and each intermediate result to represent the meaning of the corresponding sentence fragment.

[[File:figure6.JPG | center]]

There are many ways of bracketing the same sentence, and each bracketing can give the sentence a different meaning. The figure below, for example, corresponds to the bracketing of the sentence "''((the cat) (sat (on (the mat))))''". In order to determine which bracketing splits the sentence into the most meaningful fragments, we introduce a new scoring module R which takes in a sentence fragment and measures how meaningful that fragment is.

[[File:figure7.JPG | center]]

The idea is to apply this R module to every intermediate result and sum all of the scores to get a global score. The task, then, is to find a bracketing that maximizes this score. There is also the challenge of training these modules to achieve the desired function. The figure below illustrates a model inspired by Collobert et al.<ref>Collobert, R., & Weston, J. [https://aclweb.org/anthology/P/P07/P07-1071.pdf "Fast semantic extraction using a novel neural network architecture."] In Proc. 45th annual meeting of the association of computational linguistics (ACL) (pp. 560–567).</ref><ref>Collobert, R. [http://ronan.collobert.com/pub/matos/2011_parsing_aistats.pdf "Deep learning for efficient discriminative parsing."] In Proc. artificial intelligence and statistics (AISTAT).</ref> This is a stochastic gradient descent method: during each iteration, a short sentence is randomly selected from a large corpus and bracketed as shown in the figure. An arbitrary word is then replaced by a random word from the vocabulary. The parameters of all the modules are then adjusted using a simple gradient descent step.

[[File:figure8.JPG | center]]

In order to investigate how well the system maps words to the representation space, all two-word sequences of the 500 most common words were constructed and mapped into the representation space. The figure below shows the closest neighbors in the representation space of some of these sequences.

[[File:figure9.JPG | center]]

The dissociation module D is the opposite of the association module, that is, a trainable function that computes two representation space vectors from a single vector. When its input is a meaningful output of the association module, its output should be the two inputs of the association module. Stacking one instance of the association module and one instance of the dissociation module is equivalent to an auto-encoder.

The association and dissociation modules can be seen as similar to the <code>cons</code>, <code>car</code>, and <code>cdr</code> primitives of the Lisp programming language. These primitives are used to construct a new object from two individual objects (<code>cons</code>, "association") or to extract the individual objects (<code>car</code> and <code>cdr</code>, "dissociation") from a constructed object. However, there is an important difference. The representation in Lisp is discrete, whereas the representation here is in a continuous vector space. This will limit the depth of structures that can be constructed (because of limited numerical precision), while at the same time it makes other vectors in numerical proximity of a representation also meaningful. This latter property makes search algorithms more efficient as it is possible to follow a gradient (instead of performing discrete jumps). Note that the presented idea of association and dissociation in a vector space is very similar to what is known as Vector Symbolic Architectures.<ref>
[http://arxiv.org/abs/cs/0412059 Gayler, Ross W. "Vector symbolic architectures answer Jackendoff's challenges for cognitive neuroscience." arXiv preprint cs/0412059 (2004).]
</ref>

[[File:figure10.JPG | center]]

Association and dissociation modules are not limited to natural language processing tasks. A number of state-of-the-art systems for scene categorization and object recognition use a combination of strong local features, such as SIFT or HOG features, consolidated along a pyramidal structure. A similar pyramidal structure has been associated with the visual cortex. Pyramidal structures work poorly as image segmentation tools. Take, for example, the figure below, which shows that a large convolutional neural network provides good object recognition accuracies but coarse segmentation.

The use of association-dissociation modules of the sort described in this section has been given a more general treatment in recent work on recursive neural networks, which similarly apply a single function to a sequence of inputs in a pairwise fashion to build up distributed representations of data (e.g. natural language sentences or segmented images).<ref>
[http://www.socher.org/uploads/Main/SocherHuvalManningNg_EMNLP2012.pdf Socher, R. et al. "Semantic Compositionality through Recursive Matrix-Vector Spaces" EMNLP (2012).]
</ref><ref>
[http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf Socher, R. et al. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank" EMNLP (2013).]
</ref>. A standard recurrent network can also be thought of as a special case of this approach in which the recursive application always proceeds left to right through the input sequence (i.e. there is no branching in the tree produced by unfolding the recursion through time).

[[File:figure11.JPG | center]]

Finally, we envision modules that convert image representations into sentence representations and conversely. Given an image, we could parse the image and convert the final image representation into a sentence representation. Conversely, given a sentence, we could produce a sketch of the associated image by similar means.

== Universal Parser ==
The figure below shows a model of short-term memory (STM) capable of two possible actions: (1) inserting a new representation vector into the short-term memory and (2) applying the association module A to two representation vectors taken from the short-term memory and replacing them with the combined representation vector. Each application of the association module is scored using the saliency scoring module R. The algorithm terminates when the STM contains a single representation vector and there are no more representation vectors to insert.

[[File:figure12.JPG | center]]

The algorithm design choices determine which data structure is most appropriate for implementing the STM. In the English language, sentences are created by words separated by spaces and therefore it is attractive to implement the STM as a stack and construct a shift/reduce parser.

== More Modules ==
The previous sections discussed the association and dissociation modules. Here, we discuss a few more modules that perform predefined transformations on natural language sentences; modules that implement specific visual reasoning primitives; and modules that bridge the representations of sentences and the representations of images.

*Operator grammars <ref>Harris, Z. S. [https://books.google.ca/books/about/Mathematical_structures_of_language.html?id=qsbuAAAAMAAJ&redir_esc=y "Mathematical structures of language."] Volume 21 of Interscience tracts in pure and applied mathematics.</ref> provide a mathematical description of natural languages based on transformation operators.
*There is also a natural framework for such enhancements in the case of vision. Modules working on the representation vectors can model the consequences of various interventions.

== Representation Space ==
The previous models use functions operating on a low dimensional vector space, but modules with similar algebraic properties could be defined on a different set of representation spaces. Such choices have a considerable impact on the computational and practical aspects of the training algorithms.
*In order to provide sufficient capabilities, the trainable functions must often be designed with nonlinear parameterizations. The training algorithms are simple extensions of the multilayer network training procedures, using back-propagation and stochastic gradient descent.
*Sparse vectors in much higher dimensional spaces are attractive because they provide the opportunity to rely more on trainable modules with linear parameterization.
*The representation space can also be a space of probability distributions defined on a vector of discrete random variables.

== Conclusions ==
The research directions outlined in this paper are intended to advance the practical and conceptual understanding of the relationship between machine learning and machine reasoning. Instead of trying to bridge the gap between machine learning and "all-purpose" inference mechanisms, we can algebraically enrich the set of manipulations applicable to a training system and build reasoning abilities from the ground up.

== Bibliography ==
<references />

from Machine Learning to Machine Reasoning

2015-12-09T03:15:11Z

Amirlk: /* Probabilistic Models */

== Introduction ==
Learning and reasoning are both essential abilities associated with intelligence. Consequently, machine learning and machine reasoning have received considerable attention given the short history of computer science. The statistical nature of machine learning is now understood but the ideas behind machine reasoning are much more elusive. Converting ordinary data into a set of logical rules proves to be very challenging: searching the discrete space of symbolic formulas leads to combinatorial explosion <ref>Lighthill, J. [http://www.math.snu.ac.kr/~hichoi/infomath/Articles/Lighthill%20Report.pdf "Artificial intelligence: a general survey."] In Artificial intelligence: a paper symposium. Science Research Council.</ref>. Algorithms for probabilistic inference <ref>Pearl, J. [http://bayes.cs.ucla.edu/BOOK-2K/neuberg-review.pdf "Causality: models, reasoning, and inference."] Cambridge: Cambridge University Press.</ref> still suffer from unfavourable computational properties <ref>Roth, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.6074&rep=rep1&type=pdf "On the hardness of approximate reasoning"] Artificial Intelligence, 82, 273–302.</ref>. Tractable inference algorithms do exist, but they come at the price of reduced expressive capabilities, in both logical and probabilistic inference.

Humans display neither of these limitations.

The ability to reason is not the same as the ability to make logical inferences. The way that humans reason provides evidence to suggest the existence of a middle layer, already a form of reasoning, but not yet formal or logical. Informal logic is attractive because we hope to avoid the computational complexity that is associated with combinatorial searches in the vast space of discrete logic propositions.

This paper shows how deep learning and multi-task learning can be leveraged as a rudimentary form of reasoning to help solve a task of interest.

This approach is explored along a number of auxiliary tasks.

== Auxiliary Tasks ==

The usefulness of auxiliary tasks was examined within the contexts of two problems: face-based identification and natural language processing. Both examples show how an easier task (such as determining whether two faces are different) can be used to boost performance on a harder task (such as identifying faces) using inference.

'''Face-based Identification'''

Identifying a person from face images is challenging. It remains expensive to collect and label millions of images representing the face of each subject with a good variety of positions and contexts. However, it is easier to collect training data for a slightly different task of telling whether two faces in images represent the same person or not: two faces in the same picture are likely to belong to two different people; two faces in successive video frames are likely to belong to the same person. These two tasks have much in common: image analysis primitives, feature extractors, and part recognizers trained on the auxiliary task can help solve the original task.

The figure below illustrates a transfer learning strategy involving three trainable models. The preprocessor P computes a compact face representation from the image, the comparator D compares the representations of two images, and the classifier C produces the person label for an image representation. We first assemble two preprocessors P and one comparator D and train this model with abundant labels for the auxiliary task. Then we assemble another instance of P with the classifier C and train the resulting model using a limited number of labelled examples from the original task.
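
The assembly of P with D, and then of the same P with C, can be sketched as follows. This is only a toy illustration, not the implementation from the paper: the linear preprocessor, the squared-distance comparator, the nearest-prototype classifier, and all names and numbers are assumptions, and training is omitted entirely.

```python
class Preprocessor:
    """P: maps an image (here a plain list of floats) to a compact representation."""

    def __init__(self, weights):
        self.weights = weights  # hypothetical: one weight row per output feature

    def __call__(self, image):
        return [sum(w * x for w, x in zip(row, image)) for row in self.weights]


def comparator(rep_a, rep_b):
    """D (toy): squared distance between two representations; small means 'same person'."""
    return sum((a - b) ** 2 for a, b in zip(rep_a, rep_b))


def classifier(rep, prototypes):
    """C (toy): label of the prototype representation closest to rep."""
    return min(prototypes, key=lambda name: comparator(rep, prototypes[name]))


def auxiliary_model(P, image_a, image_b):
    # Auxiliary task: two applications of the same P (shared weights), followed by D.
    return comparator(P(image_a), P(image_b))


def original_model(P, image, prototypes):
    # Original task: the same P instance is reused, followed by C.
    return classifier(P(image), prototypes)


# Hypothetical data: 3-pixel "images" and 2-dimensional representations.
P = Preprocessor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
prototypes = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
same = auxiliary_model(P, [1.0, 0.0, 0.3], [1.0, 0.0, 0.9])
label = original_model(P, [0.9, 0.1, 0.5], prototypes)
```

The point of the sketch is structural: whatever P learns from the abundant auxiliary labels is carried over unchanged into the model for the original task, because both models hold the same P.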

[[File:figure1.JPG | center]]

'''Natural Language Processing'''

The auxiliary task in this case (left diagram of the figure below) is identifying whether a sentence is correct or not. This creates an embedding for words in a 50-dimensional space. This embedding can then be used on the primary problem (right diagram of the figure below) of producing tags for the words. Note the word embedding modules "W" shared between the tasks.

[[File:word_transfer.png | center]]

== Reasoning Revisited ==
Little attention has been paid to the rules that describe how to assemble trainable models that perform specific tasks. However, these composition rules play an extremely important role as they describe algebraic manipulations that let us combine previously acquired knowledge in order to create a model that addresses a new task.

We now draw a bold parallel: "algebraic manipulation of previously acquired knowledge in order to answer a new question" is a plausible definition of the word "reasoning".

Composition rules can be described with very different levels of sophistication. For instance, graph transformer networks (depicted in the figure below) <ref>Bottou, L., LeCun, Y., & Bengio, Y. [http://www.iro.umontreal.ca/~lisa/pointeurs/bottou-lecun-bengio-97.pdf "Global training of document processing systems using graph transformer networks."] In Proc. of computer vision and pattern recognition (pp. 489–493). New York: IEEE Press.</ref> construct specific recognition and training models for each input image using graph transduction algorithms. The specification of the graph transducers then should be viewed as a description of the composition rules.

[[File:figure5.JPG | center]]

== Probabilistic Models ==
Graphical models describe the factorization of joint probability distributions into elementary conditional distributions with specific independence assumptions. The probabilistic rules then induce an algebraic structure on the space of conditional probability distributions, describing relations in an arbitrary set of random variables. Many refinements have been devised to make the parametrization more explicit. The plate notation<ref name=BuW>
Buntine, Wray L. [http://arxiv.org/pdf/cs/9412102.pdf "Operations for learning with graphical models"] in The Journal of Artificial Intelligence Research, (1994).
</ref> compactly represents large graphical models with repeated structures that usually share parameters. More recent works propose considerably richer languages to describe large graphical probabilistic models. Such high order languages for describing probabilistic models are expressions of the composition rules described in the previous section.

== Reasoning Systems ==
We are no longer fitting a simple statistical model to data and instead, we are dealing with a more complex model consisting of (a) an algebraic space of models, and (b) composition rules that establish a correspondence between the space of models and the space of questions of interest. We call such an object a "reasoning system".

Reasoning systems vary in expressive power, predictive abilities and computational requirements. A few examples include:
*''First order logic reasoning'' - Consider a space of models composed of functions that predict the truth value of a first order logic formula as a function of its free variables. This space is highly constrained by algebraic structure and hence, if we know some of these functions, we can apply logical inference to deduce or constrain other functions. First order logic is highly expressive because the bulk of mathematics can be formalized as first order logic statements <ref>Hilbert, D., & Ackermann, W. [https://www.math.uwaterloo.ca/~snburris/htdocs/scav/hilbert/hilbert.html "Grundzüge der theoretischen Logik."] Berlin: Springer.</ref>. However, first order logic is not sufficient to express natural language: every first order logic formula can be expressed in natural language but the converse is not true. Finally, first order logic usually leads to computationally expensive algorithms.

*''Probabilistic reasoning'' - Consider a space of models formed by all the conditional probability distributions associated with a set of predefined random variables. These conditional distributions are highly constrained by algebraic structure and hence, we can apply Bayesian inference to form deductions. Probability models are computationally cheaper but this comes at the price of lower expressive power: probability theory can be described by first order logic but the converse is not true.

*''Causal reasoning'' - The events "it is raining" and "people carry open umbrellas" are highly correlated and predictive: if people carry open umbrellas, then it is likely that it is raining. This does not, however, tell you the consequences of an intervention: banning umbrellas will not stop the rain.

*''Newtonian Mechanics'' - Classical mechanics is an example of the great predictive power of causal reasoning. Newton's three laws of motion make very accurate predictions about the motion of bodies in our universe.

*''Spatial reasoning'' - The change in a visual scene caused by a change in one's viewpoint is also subject to algebraic constraints.

*''Social reasoning'' - Changes of viewpoints also play a very important role in social interactions.

*''Non-falsifiable reasoning'' - Examples of non-falsifiable reasoning include mythology and astrology. Just like non-falsifiable statistical models, non-falsifiable reasoning systems are unlikely to have useful predictive capabilities.
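
The difference between observing and intervening in the causal reasoning example above can be illustrated with a small simulation. This is a sketch under stated assumptions: the probabilities (0.3, 0.9, 0.05), the function names, and the sample size are all made up for illustration.

```python
import random


def simulate(n, ban_umbrellas, seed=0):
    """Sample n days; return a list of (rain, umbrella) booleans."""
    rng = random.Random(seed)
    days = []
    for _ in range(n):
        rain = rng.random() < 0.3  # assumed: rain never depends on umbrellas
        if ban_umbrellas:
            umbrella = False  # intervention: umbrellas are forced closed
        else:
            # assumed: people mostly open umbrellas when it rains
            umbrella = rng.random() < (0.9 if rain else 0.05)
        days.append((rain, umbrella))
    return days


def rain_rate(days):
    """Fraction of the given days on which it rained."""
    return sum(rain for rain, _ in days) / len(days)


observed = simulate(20000, ban_umbrellas=False)
with_umbrella = [day for day in observed if day[1]]
p_rain_given_umbrella = rain_rate(with_umbrella)  # observation: umbrellas predict rain
p_rain_under_ban = rain_rate(simulate(20000, ban_umbrellas=True))  # intervention
```

In this toy world, seeing open umbrellas makes rain very likely, while banning umbrellas leaves the frequency of rain at its base rate: the correlation is predictive but says nothing about the effect of the intervention.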

It is desirable to map the universe of reasoning systems, but unfortunately, we cannot expect such theoretical advances on schedule. We can, however, nourish our intuitions by empirically exploring the capabilities of algebraic structures designed for specific applicative domains.

The replication of essential human cognitive processes such as scene analysis, language understanding, and social interactions forms an important class of applications. These processes probably include a form of logical reasoning because we are able to explain our conclusions with logical arguments. However, the actual processes happen without conscious involvement, suggesting that the full complexity of logical reasoning is not required.

The following sections describe more specific ideas investigating reasoning systems suitable for natural language processing and vision tasks.

== Association and Dissociation ==
We consider again a collection of trainable modules. The word embedding module W computes a continuous representation for each word of the dictionary. The association module A is a trainable function that takes two vectors in the representation space and produces a single vector in the same space, which is supposed to represent the association of the two inputs. Given a sentence segment composed of ''n'' words, the figure below shows how ''n-1'' applications of the association module reduce the sentence segment to a single vector. We would like this vector to be a representation of the meaning of this sentence and each intermediate result to represent the meaning of the corresponding sentence fragment.

[[File:figure6.JPG | center]]

There are many ways of bracketing the same sentence, and each bracketing can give the sentence a different meaning. The figure below, for example, corresponds to the bracketing of the sentence "''((the cat) (sat (on (the mat))))''". In order to determine which bracketing splits the sentence into the most meaningful fragments, we introduce a new scoring module R which takes in a sentence fragment and measures how meaningful that fragment is.

[[File:figure7.JPG | center]]

The idea is to apply this R module to every intermediate result and sum all of the scores to get a global score. The task, then, is to find a bracketing that maximizes this score. There is also the challenge of training these modules to achieve the desired function. The figure below illustrates a model inspired by Collobert et al.<ref>Collobert, R., & Weston, J. [https://aclweb.org/anthology/P/P07/P07-1071.pdf "Fast semantic extraction using a novel neural network architecture."] In Proc. 45th annual meeting of the association of computational linguistics (ACL) (pp. 560–567).</ref><ref>Collobert, R. [http://ronan.collobert.com/pub/matos/2011_parsing_aistats.pdf "Deep learning for efficient discriminative parsing."] In Proc. artificial intelligence and statistics (AISTAT).</ref> This is a stochastic gradient descent method: during each iteration, a short sentence is randomly selected from a large corpus and bracketed as shown in the figure. An arbitrary word is then replaced by a random word from the vocabulary. The parameters of all the modules are then adjusted using a simple gradient descent step.
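
The search for the best-scoring bracketing can be sketched as below. The functions <code>associate</code> and <code>saliency</code> are fixed toy stand-ins for the trainable modules A and R, the embeddings are made-up numbers, and the exhaustive enumeration is an assumption chosen for clarity (it is exponential in the sentence length and is not claimed to be the search used in the paper).

```python
import math


def associate(u, v):
    """Toy stand-in for the trainable association module A."""
    return [math.tanh(0.5 * a + 0.8 * b) for a, b in zip(u, v)]


def saliency(v):
    """Toy stand-in for the trainable scoring module R."""
    return sum(v)


def bracketings(words, embed):
    """Yield (tree, vector, score) for every binary bracketing of words."""
    if len(words) == 1:
        # A single word is not an intermediate result, so it adds no score.
        yield words[0], embed[words[0]], 0.0
        return
    for split in range(1, len(words)):
        for left_tree, left_vec, left_score in bracketings(words[:split], embed):
            for right_tree, right_vec, right_score in bracketings(words[split:], embed):
                vec = associate(left_vec, right_vec)
                # Global score: sum of R over every intermediate result.
                yield (left_tree, right_tree), vec, left_score + right_score + saliency(vec)


def best_bracketing(words, embed):
    """Return the (tree, vector, score) triple with the highest global score."""
    return max(bracketings(words, embed), key=lambda item: item[2])


def leaves(tree):
    """Read the words of a bracketing back from left to right."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])


# Hypothetical 2-dimensional word embeddings (the output of module W).
embed = {"the": [0.1, -0.2], "cat": [0.7, 0.3], "sat": [-0.4, 0.9], "mat": [0.2, 0.6]}
sentence = ["the", "cat", "sat"]
tree, vector, score = best_bracketing(sentence, embed)
```

Each bracketing is returned as a nested tuple, so a result such as <code>(("the", "cat"), "sat")</code> corresponds to the bracketing ''((the cat) sat)''.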

[[File:figure8.JPG | center]]

In order to investigate how well the system maps words to the representation space, all two-word sequences of the 500 most common words were constructed and mapped into the representation space. The figure below shows the closest neighbors in the representation space of some of these sequences.

[[File:figure9.JPG | center]]

The dissociation module D is the opposite of the association module, that is, a trainable function that computes two representation space vectors from a single vector. When its input is a meaningful output of the association module, its output should be the two inputs of the association module. Stacking one instance of the association module and one instance of the dissociation module is equivalent to an auto-encoder.
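
One way to make the auto-encoder reading concrete is a reconstruction objective. The following is only a sketch under assumptions: the squared-error loss and the symbols <math>x</math>, <math>y</math> (the two input vectors) and <math>\hat{x}</math>, <math>\hat{y}</math> (their reconstructions) are introduced here for illustration and do not come from the summarized paper.

<math>(\hat{x}, \hat{y}) = D(A(x, y)), \qquad \ell(x, y) = \lVert \hat{x} - x \rVert^2 + \lVert \hat{y} - y \rVert^2</math>

Driving <math>\ell</math> towards zero on meaningful pairs would make D recover the two inputs of A, which is the behaviour described above.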

The association and dissociation modules can be seen as similar to the <code>cons</code>, <code>car</code>, and <code>cdr</code> primitives of the Lisp programming language. These primitives are used to construct a new object from two individual objects (<code>cons</code>, "association") or to extract the individual objects (<code>car</code> and <code>cdr</code>, "dissociation") from a constructed object. However, there is an important difference. The representation in Lisp is discrete, whereas the representation here is in a continuous vector space. This will limit the depth of structures that can be constructed (because of limited numerical precision), while at the same time it makes other vectors in numerical proximity of a representation also meaningful. This latter property makes search algorithms more efficient as it is possible to follow a gradient (instead of performing discrete jumps). Note that the presented idea of association and dissociation in a vector space is very similar to what is known as Vector Symbolic Architectures.<ref>
[http://arxiv.org/abs/cs/0412059 Gayler, Ross W. "Vector symbolic architectures answer Jackendoff's challenges for cognitive neuroscience." arXiv preprint cs/0412059 (2004).]
</ref>

[[File:figure10.JPG | center]]

Association and dissociation modules are not limited to natural language processing tasks. A number of state-of-the-art systems for scene categorization and object recognition use a combination of strong local features, such as SIFT or HOG features, consolidated along a pyramidal structure. A similar pyramidal structure has been associated with the visual cortex. Pyramidal structures work poorly as image segmentation tools. Take, for example, the figure below, which shows that a large convolutional neural network provides good object recognition accuracies but coarse segmentation.

The use of association-dissociation modules of the sort described in this section has been given a more general treatment in recent work on recursive neural networks, which similarly apply a single function to a sequence of inputs in a pairwise fashion to build up distributed representations of data (e.g. natural language sentences or segmented images).<ref>
[http://www.socher.org/uploads/Main/SocherHuvalManningNg_EMNLP2012.pdf Socher, R. et al. "Semantic Compositionality through Recursive Matrix-Vector Spaces" EMNLP (2012).]
</ref><ref>
[http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf Socher, R. et al. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank" EMNLP (2013).]
</ref>. A standard recurrent network can also be thought of as a special case of this approach in which the recursive application always proceeds left to right through the input sequence (i.e. there is no branching in the tree produced by unfolding the recursion through time).

[[File:figure11.JPG | center]]

Finally, we envision modules that convert image representations into sentence representations and conversely. Given an image, we could parse the image and convert the final image representation into a sentence representation. Conversely, given a sentence, we could produce a sketch of the associated image by similar means.

== Universal Parser ==
The figure below shows a model of short-term memory (STM) capable of two possible actions: (1) inserting a new representation vector into the short-term memory and (2) applying the association module A to two representation vectors taken from the short-term memory and replacing them with the combined representation vector. Each application of the association module is scored using the saliency scoring module R. The algorithm terminates when the STM contains a single representation vector and there are no more representation vectors to insert.

[[File:figure12.JPG | center]]

The algorithm design choices determine which data structure is most appropriate for implementing the STM. In the English language, sentences are created by words separated by spaces and therefore it is attractive to implement the STM as a stack and construct a shift/reduce parser.
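
A stack-based STM with the two actions described above can be sketched as a greedy shift/reduce loop. Everything here beyond the two actions and the termination condition is an assumption: <code>associate</code> and <code>saliency</code> are toy stand-ins for the modules A and R, and the rule "reduce when the score exceeds <code>threshold</code>" is a hypothetical control policy, not one given in the paper.

```python
import math


def associate(u, v):
    """Toy stand-in for the trainable association module A."""
    return [math.tanh(a + b) for a, b in zip(u, v)]


def saliency(v):
    """Toy stand-in for the saliency scoring module R."""
    return sum(v)


def shift_reduce(vectors, threshold=0.0):
    """Greedy shift/reduce parse; returns (final_vector, total_score, actions)."""
    if not vectors:
        raise ValueError("need at least one representation vector")
    stack, queue = [], list(vectors)  # the stack plays the role of the STM
    total, actions = 0.0, []
    # Stop when the STM holds a single vector and nothing is left to insert.
    while queue or len(stack) > 1:
        can_reduce = len(stack) >= 2
        if can_reduce:
            candidate = associate(stack[-2], stack[-1])
            score = saliency(candidate)
        # Reduce when forced (no input left) or when the merge scores well.
        if can_reduce and (not queue or score > threshold):
            stack[-2:] = [candidate]  # replace the two vectors by their association
            total += score
            actions.append("reduce")
        else:
            stack.append(queue.pop(0))  # insert the next vector into the STM
            actions.append("shift")
    return stack[0], total, actions


# Hypothetical word representations for a three-word segment.
vectors = [[0.1, -0.2], [0.7, 0.3], [-0.4, 0.9]]
final, total, actions = shift_reduce(vectors)
```

For ''n'' input vectors the loop always performs ''n'' shifts and ''n-1'' reductions, which matches the ''n-1'' applications of the association module described earlier; only the order of the actions, and hence the bracketing, depends on the scores.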

== More Modules ==
The previous sections discussed the association and dissociation modules. Here, we discuss a few more modules that perform predefined transformations on natural language sentences; modules that implement specific visual reasoning primitives; and modules that bridge the representations of sentences and the representations of images.

*Operator grammars <ref>Harris, Z. S. [https://books.google.ca/books/about/Mathematical_structures_of_language.html?id=qsbuAAAAMAAJ&redir_esc=y "Mathematical structures of language."] Volume 21 of Interscience tracts in pure and applied mathematics.</ref> provide a mathematical description of natural languages based on transformation operators.
*There is also a natural framework for such enhancements in the case of vision. Modules working on the representation vectors can model the consequences of various interventions.

== Representation Space ==
The previous models use functions operating on a low dimensional vector space, but modules with similar algebraic properties could be defined on a different set of representation spaces. Such choices have a considerable impact on the computational and practical aspects of the training algorithms.
*In order to provide sufficient capabilities, the trainable functions must often be designed with nonlinear parameterizations. The training algorithms are simple extensions of the multilayer network training procedures, using back-propagation and stochastic gradient descent.
*Sparse vectors in much higher dimensional spaces are attractive because they provide the opportunity to rely more on trainable modules with linear parameterization.
*The representation space can also be a space of probability distributions defined on a vector of discrete random variables.

== Conclusions ==
The research directions outlined in this paper are intended to advance the practical and conceptual understanding of the relationship between machine learning and machine reasoning. Instead of trying to bridge the gap between machine learning and "all-purpose" inference mechanisms, we can algebraically enrich the set of manipulations applicable to a training system and build reasoning abilities from the ground up.

== Bibliography ==
<references />

from Machine Learning to Machine Reasoning

2015-12-09T03:11:53Z

Amirlk: /* Probabilistic Models */

== Introduction ==
Learning and reasoning are both essential abilities associated with intelligence. Consequently, machine learning and machine reasoning have received considerable attention given the short history of computer science. The statistical nature of machine learning is now understood but the ideas behind machine reasoning are much more elusive. Converting ordinary data into a set of logical rules proves to be very challenging: searching the discrete space of symbolic formulas leads to combinatorial explosion <ref>Lighthill, J. [http://www.math.snu.ac.kr/~hichoi/infomath/Articles/Lighthill%20Report.pdf "Artificial intelligence: a general survey."] In Artificial intelligence: a paper symposium. Science Research Council.</ref>. Algorithms for probabilistic inference <ref>Pearl, J. [http://bayes.cs.ucla.edu/BOOK-2K/neuberg-review.pdf "Causality: models, reasoning, and inference."] Cambridge: Cambridge University Press.</ref> still suffer from unfavourable computational properties <ref>Roth, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.6074&rep=rep1&type=pdf "On the hardness of approximate reasoning"] Artificial Intelligence, 82, 273–302.</ref>. Tractable inference algorithms do exist, but they come at the price of reduced expressive capabilities, in both logical and probabilistic inference.

Humans display neither of these limitations.

The ability to reason is not the same as the ability to make logical inferences. The way that humans reason provides evidence to suggest the existence of a middle layer, already a form of reasoning, but not yet formal or logical. Informal logic is attractive because we hope to avoid the computational complexity that is associated with combinatorial searches in the vast space of discrete logic propositions.

This paper shows how deep learning and multi-task learning can be leveraged as a rudimentary form of reasoning to help solve a task of interest.

This approach is explored along a number of auxiliary tasks.

== Auxiliary Tasks ==

The usefulness of auxiliary tasks was examined in the context of two problems: face-based identification and natural language processing. Both examples show how an easier task (for instance, determining whether two faces are different) can be used to boost performance on a harder task (for instance, identifying faces).

'''Face-based Identification'''

Identifying a person from face images is challenging. It remains expensive to collect and label millions of images representing the face of each subject with a good variety of positions and contexts. However, it is easier to collect training data for the slightly different task of telling whether two faces in images represent the same person or not: two faces in the same picture are likely to belong to two different people; two faces in successive video frames are likely to belong to the same person. These two tasks have much in common: image analysis primitives, feature extractors, and part recognizers trained on the auxiliary task can help solve the original task.

The figure below illustrates a transfer learning strategy involving three trainable models. The preprocessor P computes a compact face representation of the image, the comparator D compares the representations of two images, and the classifier C labels the face from its representation. We first assemble two instances of the preprocessor P and one comparator D and train this model with abundant labels for the auxiliary task. Then we assemble another instance of P with the classifier C and train the resulting model using a limited number of labelled examples from the original task.

[[File:figure1.JPG | center]]
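The wiring of the three modules can be sketched in a few lines of Python. This is only an illustration of how one preprocessor P is shared between the two tasks; the layer sizes, the random weights, and the use of a fixed squared distance for the comparator D are assumptions made here (in the paper D is itself trainable), and no training is performed.

```python
import random

def make_linear(n_in, n_out, rng):
    """Weight matrix (list of rows) with small random entries."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def apply_linear(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

class Preprocessor:
    """P: maps a raw image vector to a compact representation."""
    def __init__(self, n_pixels, n_features, rng):
        self.weights = make_linear(n_pixels, n_features, rng)

    def __call__(self, image):
        return apply_linear(self.weights, image)

def comparator(rep_a, rep_b):
    """D (assumed form): squared distance between two representations."""
    return sum((a - b) ** 2 for a, b in zip(rep_a, rep_b))

class Classifier:
    """C: maps a representation to the index of the best-scoring person."""
    def __init__(self, n_features, n_people, rng):
        self.weights = make_linear(n_features, n_people, rng)

    def __call__(self, rep):
        scores = apply_linear(self.weights, rep)
        return scores.index(max(scores))

rng = random.Random(0)
P = Preprocessor(n_pixels=16, n_features=4, rng=rng)
C = Classifier(n_features=4, n_people=3, rng=rng)

img_a = [rng.random() for _ in range(16)]
img_b = [rng.random() for _ in range(16)]

# Auxiliary task: the same P object processes both images (shared weights).
aux_output = comparator(P(img_a), P(img_b))
# Original task: the same P feeds the classifier C.
person = C(P(img_a))
```

Because both assemblies hold a reference to the same <code>P</code>, anything P learns on the auxiliary task is automatically available to the original task.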

'''Natural Language Processing'''

The auxiliary task in this case (left diagram of the figure below) is identifying whether a sentence is correct or not. This creates an embedding for words in a 50-dimensional space. This embedding can then be used on the primary problem (right diagram of the figure below) of producing tags for the words. Note the word embedding modules "W" shared between the tasks.

[[File:word_transfer.png | center]]

== Reasoning Revisited ==
Little attention has been paid to the rules that describe how to assemble trainable models that perform specific tasks. However, these composition rules play an extremely important role, as they describe algebraic manipulations that let us combine previously acquired knowledge in order to create a model that addresses a new task.

We now draw a bold parallel: "algebraic manipulation of previously acquired knowledge in order to answer a new question" is a plausible definition of the word "reasoning".

Composition rules can be described with very different levels of sophistication. For instance, graph transformer networks (depicted in the figure below) <ref>Bottou, L., LeCun, Y., & Bengio, Y. [http://www.iro.umontreal.ca/~lisa/pointeurs/bottou-lecun-bengio-97.pdf "Global training of document processing systems using graph transformer networks."] In Proc. of computer vision and pattern recognition (pp. 489–493). New York: IEEE Press.</ref> construct specific recognition and training models for each input image using graph transduction algorithms. The specification of the graph transducers then should be viewed as a description of the composition rules.

[[File:figure5.JPG | center]]

== Probabilistic Models ==
Graphical models describe the factorization of joint probability distributions into elementary conditional distributions with specific independence assumptions. The probabilistic rules then induce an algebraic structure on the space of conditional probability distributions, describing relations in an arbitrary set of random variables. Many refinements have been devised to make the parametrization more explicit. The plate notation (Buntine 1994) compactly represents large graphical models with repeated structures that usually share parameters. More recent works propose considerably richer languages to describe large graphical probabilistic models. Such high order languages for describing probabilistic models are expressions of the composition rules described in the previous section.
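As a reminder of what such a factorization looks like, a directed graphical model over variables <math>x_1, \dots, x_n</math> writes the joint distribution as a product of one conditional distribution per variable given its parents in the graph. The notation <math>\mathrm{pa}(x_i)</math> for the parents of <math>x_i</math> is introduced here for illustration and is not taken from the paper.

```latex
p(x_1, \dots, x_n) \;=\; \prod_{i=1}^{n} p\bigl(x_i \mid \mathrm{pa}(x_i)\bigr)
```

For a two-variable graph with a single edge from <math>x_1</math> to <math>x_2</math>, this reduces to <math>p(x_1, x_2) = p(x_1)\, p(x_2 \mid x_1)</math>.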

== Reasoning Systems ==
We are no longer fitting a simple statistical model to data; instead, we are dealing with a more complex model consisting of (a) an algebraic space of models, and (b) composition rules that establish a correspondence between the space of models and the space of questions of interest. We call such an object a "reasoning system".

Reasoning systems vary in expressive power, predictive abilities and computational requirements. A few examples include:
*''First order logic reasoning'' - Consider a space of models composed of functions that predict the truth value of a first order logic formula as a function of its free variables. This space is highly constrained by algebraic structure and hence, if we know some of these functions, we can apply logical inference to deduce or constrain other functions. First order logic is highly expressive because the bulk of mathematics can be formalized as first order logic statements <ref>Hilbert, D., & Ackermann, W. [https://www.math.uwaterloo.ca/~snburris/htdocs/scav/hilbert/hilbert.html "Grundzüge der theoretischen Logik."] Berlin: Springer.</ref>. However, this is not sufficient for expressing natural language: every first order logic formula can be expressed in natural language but the converse is not true. Finally, first order logic usually leads to computationally expensive algorithms.

*''Probabilistic reasoning'' - Consider a space of models formed by all the conditional probability distributions associated with a set of predefined random variables. These conditional distributions are highly constrained by algebraic structure and hence, we can apply Bayesian inference to form deductions. Probability models are computationally less expensive, but this comes at the price of lower expressive power: probability theory can be described by first order logic but the converse is not true.

*''Causal reasoning'' - The events "it is raining" and "people carry open umbrellas" are highly correlated and predictive: if people carry open umbrellas, then it is likely that it is raining. This does not, however, tell you the consequences of an intervention: banning umbrellas will not stop the rain.

*''Newtonian Mechanics'' - Classical mechanics is an example of the great predictive powers of causal reasoning. Newton's three laws of motion make very accurate predictions on the motion of bodies in our universe.

*''Spatial reasoning'' - A change in visual scene with respect to one's change in viewpoint is also subject to algebraic constraints.

*''Social reasoning'' - Changes of viewpoints also play a very important role in social interactions.

*''Non-falsifiable reasoning'' - Examples of non-falsifiable reasoning include mythology and astrology. Just like non-falsifiable statistical models, non-falsifiable reasoning systems are unlikely to have useful predictive capabilities.
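The difference between probabilistic and causal reasoning in the umbrella example can be made concrete with a small calculation. The probabilities below are invented for illustration only; the point is that observing umbrellas changes our belief about rain (Bayes' rule), whereas forcing umbrellas away leaves the probability of rain untouched.

```python
# Hypothetical numbers, chosen only to illustrate the argument.
p_rain = 0.2
p_umbrella_given_rain = 0.9
p_umbrella_given_dry = 0.05

# Observation: Bayes' rule gives the belief in rain after seeing umbrellas.
p_umbrella = (p_umbrella_given_rain * p_rain
              + p_umbrella_given_dry * (1 - p_rain))
p_rain_given_umbrella = p_umbrella_given_rain * p_rain / p_umbrella

# Observation: belief in rain after seeing that nobody carries an umbrella.
p_rain_given_no_umbrella = ((1 - p_umbrella_given_rain) * p_rain
                            / (1 - p_umbrella))

# Intervention: banning umbrellas removes the influence of rain on umbrellas,
# but rain has no cause in this model, so its probability is unchanged.
p_rain_after_ban = p_rain
```

With these numbers, seeing umbrellas raises the probability of rain from 0.2 to roughly 0.82 and seeing none lowers it to roughly 0.03, while a ban leaves it at 0.2.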

It is desirable to map the universe of reasoning systems, but unfortunately, we cannot expect such theoretical advances on schedule. We can, however, nourish our intuitions by empirically exploring the capabilities of algebraic structures designed for specific applicative domains.

The replication of essential human cognitive processes such as scene analysis, language understanding, and social interactions forms an important class of applications. These processes probably include a form of logical reasoning because we are able to explain our conclusions with logical arguments. However, the actual processes happen without conscious involvement, suggesting that the full complexity of logical reasoning is not required.

The following sections describe more specific ideas investigating reasoning systems suitable for natural language processing and vision tasks.

== Association and Dissociation ==
We consider again a collection of trainable modules. The word embedding module W computes a continuous representation for each word of the dictionary. The association module is a trainable function that takes two vectors in the representation space and produces a single vector in the same space, which is supposed to represent the association of the two inputs. Given a sentence segment composed of ''n'' words, the figure below shows how ''n-1'' applications of the association module reduce the sentence segment to a single vector. We would like this vector to be a representation of the meaning of this sentence and each intermediate result to represent the meaning of the corresponding sentence fragment.

[[File:figure6.JPG | center]]
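The reduction of an ''n''-word segment by ''n-1'' applications of the association module can be sketched as a recursion over a bracketing. Everything concrete below is an assumption made for illustration: the 4-dimensional space, the random untrained weights, and the single tanh layer used as the association function.

```python
import math
import random

DIM = 4  # assumed size of the representation space
rng = random.Random(0)

vocab = ["the", "cat", "sat", "on", "mat"]
# W: one (untrained, random) representation vector per word.
W = {word: [rng.uniform(-1, 1) for _ in range(DIM)] for word in vocab}
# Parameters of the association module: maps 2*DIM inputs to DIM outputs.
A_weights = [[rng.uniform(-0.5, 0.5) for _ in range(2 * DIM)]
             for _ in range(DIM)]

def associate(left, right):
    """Combine two representation vectors into one of the same size."""
    joined = left + right  # list concatenation
    return [math.tanh(sum(w * x for w, x in zip(row, joined)))
            for row in A_weights]

def encode(tree, counter):
    """Reduce a bracketing (nested pairs of words) to a single vector."""
    if isinstance(tree, str):
        return W[tree]
    left, right = tree
    counter[0] += 1  # one application of the association module
    return associate(encode(left, counter), encode(right, counter))

tree = (("the", "cat"), ("sat", ("on", ("the", "mat"))))
calls = [0]
sentence_vector = encode(tree, calls)
```

The six-word segment is reduced with five calls to <code>associate</code>, and every intermediate return value of <code>encode</code> is the representation of one sentence fragment.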

There are many ways of bracketing the same sentence, each giving the sentence a different meaning. The figure below, for example, corresponds to the bracketing "''((the cat) (sat (on (the mat))))''". In order to determine which bracketing splits the sentence into the most meaningful fragments, we introduce a new scoring module R which takes in a sentence fragment and measures how meaningful that fragment is.

[[File:figure7.JPG | center]]

The idea is to apply this R module to every intermediate result and sum all of the scores to get a global score. The task, then, is to find a bracketing that maximizes this score. There is also the challenge of training these modules to achieve the desired function. The figure below illustrates a model inspired by Collobert et al.<ref>Collobert, R., & Weston, J. [https://aclweb.org/anthology/P/P07/P07-1071.pdf "Fast semantic extraction using a novel neural network architecture."] In Proc. 45th annual meeting of the association of computational linguistics (ACL) (pp. 560–567).</ref><ref>Collobert, R. [http://ronan.collobert.com/pub/matos/2011_parsing_aistats.pdf "Deep learning for efficient discriminative parsing."] In Proc. artificial intelligence and statistics (AISTAT).</ref> Training uses stochastic gradient descent: during each iteration, a short sentence is randomly selected from a large corpus and bracketed as shown in the figure. An arbitrary word is then replaced by a random word from the vocabulary. The parameters of all the modules are then adjusted using a simple gradient descent step.

[[File:figure8.JPG | center]]
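The corruption step of this procedure, and one plausible training signal built on it, can be sketched as follows. The margin-based ranking loss is an assumption made here for illustration (the summary above only says that a gradient step is taken); it is positive whenever the corrupted sentence is not scored sufficiently below the genuine one. The function names are hypothetical.

```python
import random

def corrupt(sentence, vocabulary, rng):
    """Replace one randomly chosen word by a random vocabulary word."""
    position = rng.randrange(len(sentence))
    corrupted = list(sentence)
    corrupted[position] = rng.choice(vocabulary)
    return corrupted

def ranking_loss(score_original, score_corrupted, margin=1.0):
    """Assumed hinge-style loss: zero once the genuine sentence
    outscores the corrupted one by at least `margin`."""
    return max(0.0, margin - score_original + score_corrupted)

rng = random.Random(0)
vocabulary = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
sentence = ["the", "cat", "sat", "on", "the", "mat"]
corrupted = corrupt(sentence, vocabulary, rng)
```

In a full implementation the two scores would come from summing R over the intermediate results of each bracketing, and the gradient of the loss would be propagated into W, the association module and R.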

In order to investigate how well the system maps words to the representation space, all two-word sequences of the 500 most common words were constructed and mapped into the representation space. The figure below shows the closest neighbors in the representation space of some of these sequences.

[[File:figure9.JPG | center]]

The dissociation module D is the opposite of the association module, that is, a trainable function that computes two representation space vectors from a single vector. When its input is a meaningful output of the association module, its output should be the two inputs of the association module. Stacking one instance of the association module and one instance of the dissociation module is equivalent to an auto-encoder.

The association and dissociation modules can be seen as similar to the <code>cons</code>, <code>car</code>, and <code>cdr</code> primitives of the Lisp programming language. These primitives are used to construct a new object from two individual objects (<code>cons</code>, "association") or to extract the individual objects (<code>car</code> and <code>cdr</code>, "dissociation") from a constructed object. However, there is an important difference. The representation in Lisp is discrete, whereas the representation here is in a continuous vector space. This will limit the depth of structures that can be constructed (because of limited numerical precision), while at the same time it makes other vectors in numerical proximity of a representation also meaningful. This latter property makes search algorithms more efficient as it is possible to follow a gradient (instead of performing discrete jumps). Note that the presented idea of association and dissociation in a vector space is very similar to what is known as Vector Symbolic Architectures.<ref>
[http://arxiv.org/abs/cs/0412059 Gayler, Ross W. "Vector symbolic architectures answer Jackendoff's challenges for cognitive neuroscience." arXiv preprint cs/0412059 (2004).]
</ref>

[[File:figure10.JPG | center]]
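For readers unfamiliar with Lisp, the discrete side of this analogy can be written in a few lines of Python, using tuples as a stand-in for Lisp pairs. The discrete primitives invert each other exactly; the trainable association and dissociation modules are only expected to do so approximately.

```python
def cons(first, second):
    """Build a pair from two objects (the discrete "association")."""
    return (first, second)

def car(pair):
    """Extract the first object of a pair."""
    return pair[0]

def cdr(pair):
    """Extract the second object of a pair."""
    return pair[1]

# The bracketing ((the cat) (sat (on (the mat)))) built from pairs.
tree = cons(cons("the", "cat"),
            cons("sat", cons("on", cons("the", "mat"))))

subject = car(tree)        # the fragment (the cat)
predicate = cdr(tree)      # the fragment (sat (on (the mat)))
```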

Association and dissociation modules are not limited to natural language processing tasks. A number of state-of-the-art systems for scene categorization and object recognition use a combination of strong local features, such as SIFT or HOG features, consolidated along a pyramidal structure. A similar pyramidal structure has been associated with the visual cortex. Pyramidal structures work poorly as image segmentation tools. Take, for example, the figure below, which shows that a large convolutional neural network provides good object recognition accuracy but coarse segmentation.

Association-dissociation modules of the sort described in this section have been given a more general treatment in recent work on recursive neural networks, which similarly apply a single function to a sequence of inputs in a pairwise fashion to build up distributed representations of data (e.g. natural language sentences or segmented images).<ref>
[http://www.socher.org/uploads/Main/SocherHuvalManningNg_EMNLP2012.pdf Socher, R. et al. "Semantic Compositionality through Recursive Matrix-Vector Spaces" EMNLP (2012).]
</ref><ref>
[http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf Socher, R. et al. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank" EMNLP (2013).]
</ref>. A standard recurrent network can also be thought of as a special case of this approach in which the recursive application always proceeds left to right through the input sequence (i.e. there is no branching in the tree produced by unfolding the recursion through time).

[[File:figure11.JPG | center]]

Finally, we envision modules that convert image representations into sentence representations and conversely. Given an image, we could parse the image and convert the final image representation into a sentence representation. Conversely, given a sentence, we could produce a sketch of the associated image by similar means.

== Universal Parser ==
The figure below shows a model of short-term memory (STM) capable of two possible actions: (1) inserting a new representation vector into the short-term memory and (2) applying the association module A to two representation vectors taken from the short-term memory and replacing them by the combined representation vector. Each application of the association module is scored using the saliency scoring module R. The algorithm terminates when the STM contains a single representation vector and there are no more representation vectors to insert.

[[File:figure12.JPG | center]]

The algorithm design choices determine which data structure is most appropriate for implementing the STM. In the English language, sentences are created by words separated by spaces and therefore it is attractive to implement the STM as a stack and construct a shift/reduce parser.
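A stack-based version of this procedure can be sketched as a greedy shift/reduce loop. This is only one possible control strategy, assumed here for illustration: it reduces the two topmost stack entries whenever their association scores above a threshold, and otherwise shifts the next word. The toy <code>embed</code>, <code>associate</code> and <code>score</code> functions stand in for the trainable modules W, A and R and are hypothetical.

```python
def parse(words, embed, associate, score, threshold=0.0):
    """Greedy shift/reduce parse of a non-empty word sequence.

    Returns the final representation and the sum of the scores
    of all applications of the association module.
    """
    stack = []            # the short-term memory
    queue = list(words)   # representations still to insert
    total = 0.0
    while queue or len(stack) > 1:
        can_reduce = len(stack) >= 2
        if can_reduce:
            merged = associate(stack[-2], stack[-1])
            merged_score = score(merged)
        if can_reduce and (not queue or merged_score > threshold):
            stack[-2:] = [merged]      # action (2): associate and replace
            total += merged_score
        else:
            stack.append(embed(queue.pop(0)))  # action (1): insert
    return stack[0], total

# Toy stand-ins: representations are nested tuples, and R is a lookup
# of fragments declared meaningful by hand.
MEANINGFUL = {("the", "cat"), ("the", "mat"), ("on", ("the", "mat"))}
embed = lambda word: word
associate = lambda left, right: (left, right)
score = lambda fragment: 1.0 if fragment in MEANINGFUL else -1.0

tree, total = parse(["the", "cat", "sat", "on", "the", "mat"],
                    embed, associate, score)
```

With these toy modules the loop recovers the bracketing <code>((the cat) (sat (on (the mat))))</code>. Each iteration either shortens the queue or the stack, so the loop stops exactly when the STM holds a single vector and nothing is left to insert.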

== More Modules ==
The previous sections discussed the association and dissociation modules. Here, we discuss a few more modules that perform predefined transformations on natural language sentences; modules that implement specific visual reasoning primitives; and modules that bridge the representations of sentences and the representations of images.

*Operator grammars <ref>Harris, Z. S. [https://books.google.ca/books/about/Mathematical_structures_of_language.html?id=qsbuAAAAMAAJ&redir_esc=y "Mathematical structures of language."] Volume 21 of Interscience tracts in pure and applied mathematics.</ref> provide a mathematical description of natural languages based on transformation operators.
*There is also a natural framework for such enhancements in the case of vision. Modules working on the representation vectors can model the consequences of various interventions.

== Representation Space ==
The previous models use functions operating on a low dimensional vector space, but modules with similar algebraic properties could be defined on different representation spaces. Such choices have a considerable impact on the computational and practical aspects of the training algorithms.
*In order to provide sufficient capabilities, the trainable functions must often be designed with nonlinear parameterizations. The algorithms are simple extensions of the multilayer network training procedures, using back-propagation and stochastic gradient descent.
*Sparse vectors in much higher dimensional spaces are attractive because they provide the opportunity to rely more on trainable modules with linear parameterization.
*The representation space can also be a space of probability distributions defined on a vector of discrete random variables.

== Conclusions ==
The research directions outlined in this paper are intended to advance the practical and conceptual understanding of the relationship between machine learning and machine reasoning. Instead of trying to bridge the gap between machine learning and "all-purpose" inference mechanisms, we can algebraically enrich the set of manipulations applicable to a training system and build reasoning abilities from the ground up.

== Bibliography ==
<references />

from Machine Learning to Machine Reasoning

2015-12-09T03:09:36Z

Amirlk: /* Probabilistic Models */

== Introduction ==
Learning and reasoning are both essential abilities associated with intelligence. Consequently, machine learning and machine reasoning have received considerable attention given the short history of computer science. The statistical nature of machine learning is now understood, but the ideas behind machine reasoning are much more elusive. Converting ordinary data into a set of logical rules proves to be very challenging: searching the discrete space of symbolic formulas leads to combinatorial explosion <ref>Lighthill, J. [http://www.math.snu.ac.kr/~hichoi/infomath/Articles/Lighthill%20Report.pdf "Artificial intelligence: a general survey."] In Artificial intelligence: a paper symposium. Science Research Council.</ref>. Algorithms for probabilistic inference <ref>Pearl, J. [http://bayes.cs.ucla.edu/BOOK-2K/neuberg-review.pdf "Causality: models, reasoning, and inference."] Cambridge: Cambridge University Press.</ref> still suffer from unfavourable computational properties <ref>Roth, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.6074&rep=rep1&type=pdf "On the hardness of approximate reasoning"] Artificial Intelligence, 82, 273–302.</ref>. Tractable inference algorithms do exist, but they come at the price of reduced expressive power in both logical and probabilistic inference.

Humans display neither of these limitations.

The ability to reason is not the same as the ability to make logical inferences. The way that humans reason provides evidence to suggest the existence of a middle layer, already a form of reasoning, but not yet formal or logical. Informal logic is attractive because we hope to avoid the computational complexity that is associated with combinatorial searches in the vast space of discrete logic propositions.

This paper shows how deep learning and multi-task learning can be leveraged as a rudimentary form of reasoning to help solve a task of interest.

This approach is explored along a number of auxiliary tasks.

== Auxiliary Tasks ==

The usefulness of auxiliary tasks was examined in the context of two problems: face-based identification and natural language processing. Both examples show how an easier task (for instance, determining whether two faces are different) can be used to boost performance on a harder task (for instance, identifying faces).

'''Face-based Identification'''

Identifying a person from face images is challenging. It remains expensive to collect and label millions of images representing the face of each subject with a good variety of positions and contexts. However, it is easier to collect training data for the slightly different task of telling whether two faces in images represent the same person or not: two faces in the same picture are likely to belong to two different people; two faces in successive video frames are likely to belong to the same person. These two tasks have much in common: image analysis primitives, feature extractors, and part recognizers trained on the auxiliary task can help solve the original task.

The figure below illustrates a transfer learning strategy involving three trainable models. The preprocessor P computes a compact face representation of the image, the comparator D compares the representations of two images, and the classifier C labels the face from its representation. We first assemble two instances of the preprocessor P and one comparator D and train this model with abundant labels for the auxiliary task. Then we assemble another instance of P with the classifier C and train the resulting model using a limited number of labelled examples from the original task.

[[File:figure1.JPG | center]]

'''Natural Language Processing'''

The auxiliary task in this case (left diagram of the figure below) is identifying whether a sentence is correct or not. This creates an embedding for words in a 50-dimensional space. This embedding can then be used on the primary problem (right diagram of the figure below) of producing tags for the words. Note the word embedding modules "W" shared between the tasks.

[[File:word_transfer.png | center]]

== Reasoning Revisited ==
Little attention has been paid to the rules that describe how to assemble trainable models that perform specific tasks. However, these composition rules play an extremely important role, as they describe algebraic manipulations that let us combine previously acquired knowledge in order to create a model that addresses a new task.

We now draw a bold parallel: "algebraic manipulation of previously acquired knowledge in order to answer a new question" is a plausible definition of the word "reasoning".

Composition rules can be described with very different levels of sophistication. For instance, graph transformer networks (depicted in the figure below) <ref>Bottou, L., LeCun, Y., & Bengio, Y. [http://www.iro.umontreal.ca/~lisa/pointeurs/bottou-lecun-bengio-97.pdf "Global training of document processing systems using graph transformer networks."] In Proc. of computer vision and pattern recognition (pp. 489–493). New York: IEEE Press.</ref> construct specific recognition and training models for each input image using graph transduction algorithms. The specification of the graph transducers then should be viewed as a description of the composition rules.

[[File:figure5.JPG | center]]

== Probabilistic Models ==
Graphical models describe the factorization of joint probability distributions into elementary conditional distributions with specific independence assumptions. The probabilistic rules then induce an algebraic structure on the space of conditional probability distributions, describing relations in an arbitrary set of random variables.

Many refinements have been devised to make the parametrization more explicit. The plate notation (Buntine 1994) compactly represents large graphical models with repeated structures that usually share parameters.

== Reasoning Systems ==
We are no longer fitting a simple statistical model to data; instead, we are dealing with a more complex model consisting of (a) an algebraic space of models, and (b) composition rules that establish a correspondence between the space of models and the space of questions of interest. We call such an object a "reasoning system".

Reasoning systems vary in expressive power, predictive abilities and computational requirements. A few examples include:
*''First order logic reasoning'' - Consider a space of models composed of functions that predict the truth value of a first order logic formula as a function of its free variables. This space is highly constrained by algebraic structure and hence, if we know some of these functions, we can apply logical inference to deduce or constrain other functions. First order logic is highly expressive because the bulk of mathematics can be formalized as first order logic statements <ref>Hilbert, D., & Ackermann, W. [https://www.math.uwaterloo.ca/~snburris/htdocs/scav/hilbert/hilbert.html "Grundzüge der theoretischen Logik."] Berlin: Springer.</ref>. However, this is not sufficient for expressing natural language: every first order logic formula can be expressed in natural language but the converse is not true. Finally, first order logic usually leads to computationally expensive algorithms.

*''Probabilistic reasoning'' - Consider a space of models formed by all the conditional probability distributions associated with a set of predefined random variables. These conditional distributions are highly constrained by algebraic structure and hence, we can apply Bayesian inference to form deductions. Probability models are computationally less expensive, but this comes at the price of lower expressive power: probability theory can be described by first order logic but the converse is not true.

*''Causal reasoning'' - The events "it is raining" and "people carry open umbrellas" are highly correlated and predictive: if people carry open umbrellas, then it is likely that it is raining. This does not, however, tell you the consequences of an intervention: banning umbrellas will not stop the rain.

*''Newtonian Mechanics'' - Classical mechanics is an example of the great predictive powers of causal reasoning. Newton's three laws of motion make very accurate predictions on the motion of bodies in our universe.

*''Spatial reasoning'' - A change in visual scene with respect to one's change in viewpoint is also subject to algebraic constraints.

*''Social reasoning'' - Changes of viewpoints also play a very important role in social interactions.

*''Non-falsifiable reasoning'' - Examples of non-falsifiable reasoning include mythology and astrology. Just like non-falsifiable statistical models, non-falsifiable reasoning systems are unlikely to have useful predictive capabilities.

It is desirable to map the universe of reasoning systems, but unfortunately, we cannot expect such theoretical advances on schedule. We can, however, nourish our intuitions by empirically exploring the capabilities of algebraic structures designed for specific applicative domains.

The replication of essential human cognitive processes such as scene analysis, language understanding, and social interactions forms an important class of applications. These processes probably include a form of logical reasoning because we are able to explain our conclusions with logical arguments. However, the actual processes happen without conscious involvement, suggesting that the full complexity of logical reasoning is not required.

The following sections describe more specific ideas investigating reasoning systems suitable for natural language processing and vision tasks.

== Association and Dissociation ==
We consider again a collection of trainable modules. The word embedding module W computes a continuous representation for each word of the dictionary. The association module is a trainable function that takes two vectors in the representation space and produces a single vector in the same space, which is supposed to represent the association of the two inputs. Given a sentence segment composed of ''n'' words, the figure below shows how ''n-1'' applications of the association module reduce the sentence segment to a single vector. We would like this vector to be a representation of the meaning of this sentence and each intermediate result to represent the meaning of the corresponding sentence fragment.

[[File:figure6.JPG | center]]

There are many ways of bracketing the same sentence, each giving the sentence a different meaning. The figure below, for example, corresponds to the bracketing "''((the cat) (sat (on (the mat))))''". In order to determine which bracketing splits the sentence into the most meaningful fragments, we introduce a new scoring module R which takes in a sentence fragment and measures how meaningful that fragment is.

[[File:figure7.JPG | center]]

The idea is to apply this R module to every intermediate result and sum all of the scores to get a global score. The task, then, is to find a bracketing that maximizes this score. There is also the challenge of training these modules to achieve the desired function. The figure below illustrates a model inspired by Collobert et al.<ref>Collobert, R., & Weston, J. [https://aclweb.org/anthology/P/P07/P07-1071.pdf "Fast semantic extraction using a novel neural network architecture."] In Proc. 45th annual meeting of the association of computational linguistics (ACL) (pp. 560–567).</ref><ref>Collobert, R. [http://ronan.collobert.com/pub/matos/2011_parsing_aistats.pdf "Deep learning for efficient discriminative parsing."] In Proc. artificial intelligence and statistics (AISTAT).</ref> Training uses stochastic gradient descent: during each iteration, a short sentence is randomly selected from a large corpus and bracketed as shown in the figure. An arbitrary word is then replaced by a random word from the vocabulary. The parameters of all the modules are then adjusted using a simple gradient descent step.

[[File:figure8.JPG | center]]

In order to investigate how well the system maps words to the representation space, all two-word sequences of the 500 most common words were constructed and mapped into the representation space. The figure below shows the closest neighbors in the representation space of some of these sequences.

[[File:figure9.JPG | center]]

The dissociation module D is the opposite of the association module, that is, a trainable function that computes two representation space vectors from a single vector. When its input is a meaningful output of the association module, its output should be the two inputs of the association module. Stacking one instance of the association module and one instance of the dissociation module is equivalent to an auto-encoder.

The association and dissociation modules can be seen as similar to the <code>cons</code>, <code>car</code>, and <code>cdr</code> primitives of the Lisp programming language. These primitives are used to construct a new object from two individual objects (<code>cons</code>, "association") or to extract the individual objects (<code>car</code> and <code>cdr</code>, "dissociation") from a constructed object. However, there is an important difference. The representation in Lisp is discrete, whereas the representation here is in a continuous vector space. This will limit the depth of structures that can be constructed (because of limited numerical precision), while at the same time it makes other vectors in numerical proximity of a representation also meaningful. This latter property makes search algorithms more efficient as it is possible to follow a gradient (instead of performing discrete jumps). Note that the presented idea of association and dissociation in a vector space is very similar to what is known as Vector Symbolic Architectures.<ref>
[http://arxiv.org/abs/cs/0412059 Gayler, Ross W. "Vector symbolic architectures answer Jackendoff's challenges for cognitive neuroscience." arXiv preprint cs/0412059 (2004).]
</ref>

[[File:figure10.JPG | center]]

Association and dissociation modules are not limited to natural language processing tasks. A number of state-of-the-art systems for scene categorization and object recognition use a combination of strong local features, such as SIFT or HOG features, consolidated along a pyramidal structure. A similar pyramidal structure has been associated with the visual cortex. Pyramidal structures work poorly as image segmentation tools. Take, for example, the figure below, which shows that a large convolutional neural network provides good object recognition accuracy but coarse segmentation.

Association-dissociation modules of the sort described in this section have been given a more general treatment in recent work on recursive neural networks, which similarly apply a single function to a sequence of inputs in a pairwise fashion to build up distributed representations of data (e.g. natural language sentences or segmented images).<ref>
[http://www.socher.org/uploads/Main/SocherHuvalManningNg_EMNLP2012.pdf Socher, R. et al. "Semantic Compositionality through Recursive Matrix-Vector Spaces" EMNLP (2012).]
</ref><ref>
[http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf Socher, R. et al. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank" EMNLP (2013).]
</ref> A standard recurrent network can also be thought of as a special case of this approach in which the recursive application always proceeds left to right through the input sequence (i.e. there is no branching in the tree produced by unfolding the recursion through time).

[[File:figure11.JPG | center]]

Finally, we envision modules that convert image representations into sentence representations and conversely. Given an image, we could parse the image and convert the final image representation into a sentence representation. Conversely, given a sentence, we could produce a sketch of the associated image by similar means.

== Universal Parser ==
The figure below shows a model of short-term memory (STM) capable of two possible actions: (1) inserting a new representation vector into the short-term memory and (2) applying the association module A to two representation vectors taken from the short-term memory and replacing them with the combined representation vector. Each application of the association module is scored using the saliency scoring module R. The algorithm terminates when the STM contains a single representation vector and there are no more representation vectors to insert.

[[File:figure12.JPG | center]]

The algorithm design choices determine which data structure is most appropriate for implementing the STM. In the English language, sentences are created by words separated by spaces and therefore it is attractive to implement the STM as a stack and construct a shift/reduce parser.
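The stack-based variant can be sketched as follows. This is a minimal illustration under stated assumptions, not the parser of the paper: the modules returned by <code>make_modules</code> are untrained random stand-ins for A and R, and the rule "reduce whenever the saliency of the merged vector exceeds <code>threshold</code>" is a hypothetical policy chosen only to make the control flow visible.

```python
import numpy as np

def make_modules(dim, seed=0):
    """Return untrained stand-ins for the association module A and the
    saliency scoring module R (random weights, for illustration only)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / np.sqrt(2 * dim), size=(dim, 2 * dim))
    r = rng.normal(size=dim)

    def associate(a, b):
        # Two representation vectors in, one vector of the same size out.
        return np.tanh(W @ np.concatenate([a, b]))

    def saliency(v):
        return float(r @ v)

    return associate, saliency

def shift_reduce_parse(vectors, associate, saliency, threshold=0.0):
    """Greedy shift/reduce parse over a stack used as short-term memory.

    Returns (tree, vector), where tree is a nested tuple of input positions.
    """
    if len(vectors) == 0:
        raise ValueError("need at least one input vector")
    queue = list(enumerate(vectors))[::-1]  # pop() yields the next input
    stack = []                              # entries are (tree, vector)
    while queue or len(stack) > 1:
        merged = None
        if len(stack) >= 2:
            (left_tree, left_vec), (right_tree, right_vec) = stack[-2], stack[-1]
            merged = ((left_tree, right_tree), associate(left_vec, right_vec))
        must_reduce = not queue  # nothing left to insert
        if merged is not None and (must_reduce or saliency(merged[1]) > threshold):
            stack[-2:] = [merged]           # action (2): reduce
        else:
            stack.append(queue.pop())       # action (1): shift
    return stack[0]

associate, saliency = make_modules(dim=8)
word_rng = np.random.default_rng(1)
words = [word_rng.normal(size=8) for _ in range(4)]
tree, sentence_vector = shift_reduce_parse(words, associate, saliency)
```

The loop stops under the same condition as the algorithm described above: the stack holds a single vector and no inputs remain. With trained modules, the bracketing in <code>tree</code> would reflect the learned saliency scores rather than random ones.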

== More Modules ==
The previous sections discussed the association and dissociation modules. Here, we discuss a few more modules that perform predefined transformations on natural language sentences; modules that implement specific visual reasoning primitives; and modules that bridge the representations of sentences and the representations of images.

*Operator grammars <ref>Harris, Z. S. [https://books.google.ca/books/about/Mathematical_structures_of_language.html?id=qsbuAAAAMAAJ&redir_esc=y "Mathematical structures of language."] Volume 21 of Interscience tracts in pure and applied mathematics.</ref> provide a mathematical description of natural languages based on transformation operators.
*There is also a natural framework for such enhancements in the case of vision. Modules working on the representation vectors can model the consequences of various interventions.

== Representation Space ==
Previous models have functions operating on a low-dimensional vector space, but modules with similar algebraic properties could be defined on a different set of representation spaces. Such choices have a considerable impact on the computational and practical aspects of the training algorithms.
*With dense vectors of relatively low dimension, the trainable functions must often be designed with nonlinear parameterizations in order to provide sufficient capabilities. The training algorithms are simple extensions of the multilayer network training procedures, using back-propagation and stochastic gradient descent.
*Sparse vectors in much higher dimensional spaces are attractive because they provide the opportunity to rely more on trainable modules with linear parameterization.
*The representation space can also be a space of probability distributions defined on a vector of discrete random variables.

== Conclusions ==
The research directions outlined in this paper are intended to advance the practical and conceptual understanding of the relationship between machine learning and machine reasoning. Instead of trying to bridge the gap between machine learning and "all-purpose" inference mechanisms, we can algebraically enrich the set of manipulations applicable to a training system and build reasoning abilities from the ground up.

== Bibliography ==
<references />

proposal for STAT946 (Deep Learning) final projects Fall 2015

2015-12-03T02:49:15Z

Amirlk:

'''Project 0:''' (This is just an example)

'''Group members:'''first name family name, first name family name, first name family name

'''Title:''' Sentiment Analysis on Movie Reviews

''' Description:''' The idea and data for this project are taken from http://www.kaggle.com/c/sentiment-analysis-on-movie-reviews.
Sentiment analysis is the problem of determining whether a given string contains positive or negative sentiment. For example, “A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story” contains negative sentiment, but it is not immediately clear which parts of the sentence make it so.
This competition seeks to implement machine learning algorithms that can determine the sentiment of a movie review.

'''Project 1:'''

'''Group members:''' Sean Aubin, Brent Komer

'''Title:''' Convolution Neural Networks in SLAM

''' Description:''' We will try to replicate the results reported in [http://arxiv.org/abs/1411.1509 Convolutional Neural Networks-based Place Recognition] using [http://caffe.berkeleyvision.org/ Caffe] and [http://arxiv.org/abs/1409.4842 GoogLeNet]. As a "stretch" goal, we will try to convert the CNN to a spiking neural network (a technique created by Eric Hunsberger) for greater biological plausibility and easier integration with other cognitive systems using Nengo. This work will help Brent with starting his PhD investigating cognitive localisation systems and object manipulation.

'''Project 2:'''

'''Group members:''' Xinran Liu, Fatemeh Karimi, Deepak Rishi & Chris Choi

'''Title:''' Image Classification with Deep Learning

''' Description:''' Our aim is to participate in the Digit Recognizer Kaggle Challenge, where one has to correctly classify the Modified National Institute of Standards and Technology (MNIST) dataset of handwritten numerical digits. For our first approach we propose using a simple Feed-Forward Neural Network to form a baseline for comparison. We then plan on experimenting with different aspects of a Neural Network, such as network architecture and activation functions, and incorporating a wide variety of training methods.

'''Project 3'''

'''Group members:''' Ri Wang, Maysum Panju, Mahmood Gohari

'''Title:''' Machine Translation Using Neural Networks

'''Description:''' The goal of this project is to translate languages using different types of neural networks and the algorithms described in "Sequence to sequence learning with neural networks" and "Neural machine translation by jointly learning to align and translate". Different vector representations for input sentences (word frequency, Word2Vec, etc.) will be used and all combinations of algorithms will be ranked in terms of accuracy.
Our data will mainly be from [http://www.statmt.org/europarl/ Europarl] and [https://tatoeba.org/eng Tatoeba]. The common target language will be English to allow for easier judgement of translation quality.

'''Project 4'''

'''Group members:''' Peter Blouw, Jan Gosmann

'''Title:''' Using Structured Representations in Memory Networks to Perform Question Answering

'''Description:''' Memory networks are machine learning systems that combine memory and inference to perform tasks that involve sophisticated reasoning (see [http://arxiv.org/pdf/1410.3916.pdf here] and [http://arxiv.org/pdf/1502.05698v7.pdf here]). Our goal in this project is to first implement a memory network that replicates prior performance on the bAbI question-answering tasks described in [http://arxiv.org/pdf/1502.05698v7.pdf Weston et al. (2015)]. Then, we hope to improve upon this baseline performance by using more sophisticated representations of the sentences that encode questions being posed to the network. Current implementations often use a bag-of-words encoding, which throws out important syntactic information that is relevant to determining what a particular question is asking. As such, we will explore the use of POS tags, n-gram information, and parse trees to augment memory network performance.

'''Project 5'''

'''Group members:''' Anthony Caterini, Tim Tse

'''Title:''' The Allen AI Science Challenge

'''Description:''' The goal of this project is to create an artificial intelligence model that can answer multiple-choice questions on a grade 8 science exam, with a success rate better than the best 8th graders. This will involve a deep neural network as the underlying model, to help parse the large amount of information needed to answer these questions. The model should also learn, over time, how to make better answers by acquiring more and more data. This is a Kaggle challenge, and the link to the challenge is [https://www.kaggle.com/c/the-allen-ai-science-challenge here]. The data to produce the model will come from the Kaggle website.

'''Project 6'''

'''Group members:''' Valerie Platsko

'''Title:''' Classification for P300-Speller Using Convolutional Neural Networks

''' Description:''' The goal of this project is to replicate (and possibly extend) the results in [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5492691 Convolutional Neural Networks for P300 Detection with Application to Brain-Computer Interfaces], which used convolutional neural networks to recognize P300 responses in recorded EEG and additionally to correctly recognize attended targets. (In the P300-Speller application, letters flash in rows and columns, so a single P300 response is associated with multiple potential targets.) The data in the paper came from http://www.bbci.de/competition/iii/ (dataset II), and there is an additional P300 Speller dataset available from [http://www.bbci.de/competition/ii/ a previous version of the competition].

'''Project 7'''

'''Group members:''' Amirreza Lashkari, Derek Latremouille, Rui Qiao and Luyao Ruan

'''Title:''' What's Cooking?

''' Description:''' Although the best way to distinguish different types of cuisine is to smell and taste, our goal is to predict the type of a cuisine according to its ingredients. Since the data is text-based, different methods will first be used to transform the data into a form appropriate for various classification techniques. Different deep neural network algorithms will then be implemented and we will compare their accuracy and complexity. This is a Kaggle challenge (see [https://www.kaggle.com/c/whats-cooking here]).

'''Project 8'''

'''Group members:''' Abdullah Rashwan and Priyank Jaini

'''Title:''' Learning the Parameters for Continuous Distribution Sum-Product Networks using Bayesian Moment Matching

'''Description:''' Sum-Product Networks have generated interest due to their ability to do exact inference in linear time with respect to the size of the network. Parameter learning, however, is still a problem. We have proposed an online Bayesian Moment Matching algorithm to learn the parameters for discrete distributions. In this work, we are extending the algorithm to learn the parameters for continuous distributions as well.

proposal for STAT946 (Deep Learning) final projects Fall 2015

2015-11-28T21:46:48Z

Amirlk:

'''Project 7'''

'''Group members:''' Amirreza Lashkari, Derek Latremouille, Rui Qiao and Luyao Ruan

'''Title:''' Digit Recognizer

''' Description:''' The goal in this competition is to take an image of a handwritten single digit, and determine what that digit is. To do so, a deep neural network will be applied in order to extract features and classify images. The data for this competition were taken from the MNIST dataset. The MNIST ("Modified National Institute of Standards and Technology") dataset is a classic within the Machine Learning community that has been extensively studied. This is a Kaggle challenge (see [https://www.kaggle.com/c/digit-recognizer here]).


memory Networks

2015-11-27T00:29:42Z

Amirlk: /* Extensions to the Basic Implementation */

= Introduction =

Most supervised machine learning models are designed to approximate a function that maps input data to a desirable output (e.g. a class label for an image or a translation of a sentence from one language to another). In this sense, such models perform inference using a 'fixed' memory in the form of a set of parameters learned during training. For example, the memory of a recurrent neural network is constituted largely by the weights on the recurrent connections to its hidden layer (along with the layer's activities). As is well known, this form of memory is inherently limited given the fixed dimensionality of the weights in question. It is largely for this reason that recurrent nets have difficulty learning long-range dependencies in sequential data. Note that learning such dependencies requires ''remembering'' items in a sequence for a large number of time steps.

For an interesting class of problems, it is essential for a model to be able to learn long-term dependencies, and to more generally be able to learn to perform inferences using an arbitrarily large memory. Question-answering tasks are paradigmatic of this class of problems, since performing well on such tasks requires remembering all of the information that constitutes a possible answer to the questions being posed. In principle, a recurrent network such as an LSTM could learn to perform QA tasks, but in practice, the amount of information that can be retained by the weights and the hidden states in the LSTM is simply insufficient.

Given this need for a model architecture that combines inference and memory in a sophisticated manner, the authors of this paper propose what they refer to as a "Memory Network". In brief, a memory network is a model that learns to read and write data to an arbitrarily large long-term memory, while also using the data in this memory to perform inferences. The rest of this summary describes the components of a memory network in greater detail, along with some experiments describing its application to a question answering task involving short stories. Below is an example illustrating the model's ability to answer simple questions after being presented with short, multi-sentence stories.

[[File:QA_example.png | frame | centre | Example answers (in red) using a memory network for question answering. ]]

= Model Architecture =

A memory network is composed of a memory <math>\ m</math> (in the form of a collection of vectors or strings, indexed individually as <math>\ m_i</math>), and four possibly learned functions <math>\ I</math>, <math>\ G</math>, <math>\ O</math>, and <math>\ R</math>. The functions are defined as follows:
*<math>\ I</math> maps a natural language expression onto an 'input' feature representation (e.g., a real-valued vector). The input can either be a fact to be added to the memory <math>\ m</math> (e.g. 'John is at the university'), or a question for which an answer is being sought (e.g. 'Where is John?').
*<math>\ G</math> updates the contents of the memory <math>\ m</math> on the basis of an input. The updating can involve simply writing the input to a new memory location, or it can involve the modification or compression of existing memories to perform a kind of generalization on the state of the memory.
*<math>\ O</math> produces an 'output' feature representation given a new input and the current state of the memory. The input and output feature representations reside in the same embedding space.
*<math>\ R</math> produces a response given an output feature representation. This response is usually a word or a sentence, but in principle it could also be an action of some kind (e.g. the movement of a robot).

To give a quick overview of how the model operates, an input ''x'' is first mapped to a feature representation <math>\ I(x)</math>. Then, for all memory indices ''i'', the following update is applied: <math>\ m_i = G(m_i, I(x), m) </math>. This means that each memory is updated on the basis of the input ''x'' and the current state of the memory <math>\ m</math>. In the case where each input is simply written to memory, <math>\ G</math> might function to simply select an index that is currently unused and write <math>\ I(x)</math> to the memory location corresponding to this index. Next, an output feature representation is computed as <math>\ o=O(I(x), m)</math>, and a response, <math>\ r</math>, is computed directly from this feature representation as <math>\ r=R(o)</math>. <math>\ O</math> can be interpreted as retrieving a small selection of memories that are relevant to producing a good response, and <math>\ R</math> actually produces the response given the feature representation produced from the relevant memories by <math>\ O</math>.
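The flow of data through the four components can be sketched in a few lines. In this sketch nothing is learned: the functions <code>I</code>, <code>G</code>, <code>O</code>, and <code>R</code> are hypothetical toy stand-ins (identity, append, word-overlap retrieval, and "last word of the retrieved memory"), chosen only to show how the pieces connect. Splitting the interface into <code>tell</code> and <code>ask</code> is likewise an assumption of the sketch.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

class MemoryNetwork:
    """Wires together the four components I, G, O and R."""

    def __init__(self, I, G, O, R):
        self.I, self.G, self.O, self.R = I, G, O, R
        self.memory = []

    def tell(self, x):
        # A fact: map it to features, then let G update the memory.
        self.G(self.memory, self.I(x))

    def ask(self, x):
        # A question: o = O(I(x), m), then r = R(o).
        o = self.O(self.I(x), self.memory)
        return self.R(o)

# Toy components (assumptions for illustration only).
def I(x):
    return x                              # features are the raw text

def G(memory, features):
    memory.append(features)               # write to the next free slot

def O(features, memory):
    query = set(tokenize(features))
    best = max(memory, key=lambda m: len(query & set(tokenize(m))))
    return [features, best]               # the input plus one supporting memory

def R(o):
    return tokenize(o[-1])[-1]            # last word of the supporting memory

net = MemoryNetwork(I, G, O, R)
net.tell("John is at the university")
net.tell("Mary is in the kitchen")
answer = net.ask("Where is John?")
```

Here <code>G</code> only appends, which corresponds to the simplest case described above. A version that modifies or compresses existing memories would need the full per-slot update <math>\ m_i = G(m_i, I(x), m) </math>.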

= A Basic Implementation =

In a simple version of the memory network, input text is just written to memory in unaltered form. Or in other words, <math>\ I(x) </math> simply returns ''x'', and <math>\ G </math> writes this text to a new memory slot <math>\ m_{N+1} </math> if <math>\ N </math> is the number of currently filled slots. The memory is accordingly an array of strings, and the inclusion of a new string does nothing to modify existing strings.

Given as much, most of the work being done by the model is performed by the functions <math>\ O </math> and <math>\ R </math>. The job of <math>\ O </math> is to produce an output feature representation by selecting <math>\ k </math> supporting memories from <math>\ m </math> on the basis of the input ''x''. In the experiments described in this paper, <math>\ k </math> is set to either 1 or 2. In the case that <math>\ k=1 </math>, the function <math>\ O </math> behaves as follows:

:<math>\ o_1 = O_1(x, m) = \operatorname{argmax}_{i = 1, \ldots, N} S_O(x, m_i) </math>

where <math>\ S_O </math> is a function that scores a candidate memory for its compatibility with ''x''. Essentially, one 'supporting' memory is selected from <math>\ m </math> as being most likely to contain the information needed to answer the question posed in <math>\ x </math>. In this case, the output is the list <math>\ [x, m_{o_1}] </math>, containing the input question and one supporting memory. Alternatively, in the case that <math>\ k=2 </math>, a second supporting memory is selected on the basis of the input and the first supporting memory, as follows:

:<math>\ o_2 = O_2(x, m) = \operatorname{argmax}_{i = 1, \ldots, N} S_O([x, m_{o_1}], m_i) </math>

Now, the overall output is the list <math>\ [x, m_{o_1}, m_{o_2}] </math>. (These lists are translated into feature representations as described below.) Finally, the result of <math>\ O </math> is used to produce a response in the form of a single word via <math>\ R </math> as follows:

:<math>\ r = \operatorname{argmax}_{w \in W} S_R([x, m_{o_1}, m_{o_2}], w) </math>

In short, a response is produced by scoring each word in a set of candidate words against the representation produced by the combination of the input and the two supporting memories. The highest scoring candidate word is then chosen as the model's output. The learned portions of <math>\ O </math> and <math>\ R </math> are the parameters of the functions <math>\ S_O </math> and <math>\ S_R </math>, which perform embeddings of the raw text constituting each function argument, and then return the dot product of these two embeddings as a score. Formally, the function <math>\ S_O </math> can be defined as follows; <math>\ S_R </math> is defined analogously:

:<math>\ S_O(x, y) = \Phi_x(x)^T U^T U \Phi_y(y) </math>

In this equation, <math>\ U </math> is an <math>\ n \times D </math> matrix, where ''n'' is the dimension of the embedding space and ''D'' is the number of features used to represent each function argument. <math>\ \Phi_x</math> and <math>\ \Phi_y </math> are functions that map each argument (a string or a list of strings) into the feature space. In the implementations considered in this paper, the feature space makes use of a bag-of-words representation, such that there are 3 binary features for each word in the model's vocabulary. The first feature corresponds to the presence of the word in the input ''x'', the second feature corresponds to the presence of the word in the first supporting memory when it is being used to select a second supporting memory, and the third feature corresponds to the presence of the word in a candidate memory being scored (i.e. either the first or second supporting memory retrieved by the model). Having these different features allows the model to learn distinct representations for the same word depending on whether the word is present in an input question or in a string stored in memory.

Intuitively, it helps to think of the columns of <math>\ U </math> as containing distributed representations of each word in the vocabulary (specifically, there are 3 representations and hence 3 columns devoted to each word). The binary feature representation <math>\ \Phi_x(x)</math> maps the text in ''x'' onto a binary feature vector, where 1's in the vector indicate the presence of a particular word in ''x'', and 0's indicate the absence of this word. Note that different elements of the vector will be set to 1 depending on whether the word occurs in the input ''x'' or in a supporting memory (i.e. when ''x'' is a list containing the input and a supporting memory). The matrix-vector multiplications in the above equation effectively extract and sum the distributed representations corresponding to each of the inputs, ''x'' and ''y''. Thus, a single distributed representation is produced for each input, and the resulting score is the dot product of these two vectors (which in turn is the cosine of the angle between the vectors scaled by the product of the vector norms). In the case where ''x'' is the input query and ''y'' is a candidate memory, a high dot product indicates that the model considers the candidate in question to be very relevant to answering the input query. In the case where ''x'' is the output of <math>\ O</math> and ''y'' is a candidate response word, a high dot product indicates that the model considers the response word to be an appropriate answer given the output feature representation produced by <math>\ O</math>. Distinct embedding matrices <math>\ U_O </math> and <math>\ U_R </math> are used to compute the output feature representation and the response.
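A small numerical sketch may help make the feature layout concrete. It covers only the memory-scoring function <math>\ S_O </math>. The tiny vocabulary, the embedding size, the block ordering inside the feature vector, and the randomly initialised <code>U</code> are all assumptions for illustration; a trained model would learn <code>U</code> and use the full vocabulary.

```python
import re
import numpy as np

# Hypothetical toy vocabulary; each word owns 3 feature slots (3 columns of U).
VOCAB = ["john", "mary", "is", "at", "in", "the", "university", "kitchen", "where"]
INDEX = {word: i for i, word in enumerate(VOCAB)}
V = len(VOCAB)
D = 3 * V          # number of binary features
N_EMBED = 16       # embedding dimension n (arbitrary choice)

def tokens(text):
    """Lowercased in-vocabulary words of a string."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w in INDEX]

def phi_x(inputs):
    """Features for [question] or [question, first supporting memory].

    Block 0 marks words of the question, block 1 marks words of the
    first supporting memory."""
    if not 1 <= len(inputs) <= 2:
        raise ValueError("expected [question] or [question, memory]")
    vec = np.zeros(D)
    for block, text in enumerate(inputs):
        for w in tokens(text):
            vec[block * V + INDEX[w]] = 1.0
    return vec

def phi_y(candidate):
    """Features for a candidate memory: block 2 marks its words."""
    vec = np.zeros(D)
    for w in tokens(candidate):
        vec[2 * V + INDEX[w]] = 1.0
    return vec

def score(U, inputs, candidate):
    """S_O(x, y) = Phi_x(x)^T U^T U Phi_y(y), as a dot product of embeddings."""
    return float((U @ phi_x(inputs)) @ (U @ phi_y(candidate)))

def select_two(U, question, memory):
    """Indices (o1, o2) of the two supporting memories, as in the k = 2 case."""
    slots = range(len(memory))
    o1 = max(slots, key=lambda i: score(U, [question], memory[i]))
    o2 = max(slots, key=lambda i: score(U, [question, memory[o1]], memory[i]))
    return o1, o2

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(N_EMBED, D))   # untrained embedding matrix
memory = ["John is at the university", "Mary is in the kitchen"]
question = "Where is John?"
o1, o2 = select_two(U, question, memory)
```

Because <code>U</code> is untrained here, the selected indices carry no meaning. The point of the sketch is only that each argument is reduced to a sum of word embeddings before the dot product is taken.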

The goal of learning is to find embedding matrices in which the representations produced for queries, supporting memories, and responses are spatially related such that representations of relevant supporting memories are close to the representations of a query, and such that representations of individual words are close to the output feature representations of the questions they answer. The method used to perform this learning is described in the next section.

= The Training Procedure =

Learning is conducted in a supervised manner; the correct responses and supporting memories for each query are provided during training. The following margin-ranking loss function is used in tandem with stochastic gradient descent to learn the parameters of <math>\ U_O </math> and <math>\ U_R </math>, given an input ''x'', a desired response ''r'', and desired supporting memories, <math>\ m_{o_1}</math> and <math>\ m_{o_2}</math>:

:<math> \sum_{f \neq m_{o_1}} \max(0, \gamma + S_O(x, f) - S_O(x, m_{o_1})) + \sum_{f' \neq m_{o_2}} \max(0, \gamma + S_O([x, m_{o_1}], f') - S_O([x, m_{o_1}], m_{o_2})) + </math>
:<math> \sum_{r' \neq r} \max(0, \gamma + S_R([x, m_{o_1}, m_{o_2}], r') - S_R([x, m_{o_1}, m_{o_2}], r)) </math>

where <math>\ f</math>, <math>\ f'</math> and <math>\ r'</math> correspond to incorrect candidates for the first supporting memory, the second supporting memory, and the output response, and <math> \gamma</math> corresponds to the margin. Intuitively, each term in the sum penalizes the current parameters for every incorrect memory or response that is assigned a score within the margin of the score of the correct memory or response. In other words, if the score of a correct candidate memory or response is higher than the score of every incorrect candidate by at least <math> \gamma </math>, the cost is 0. Otherwise, the cost is the sum, over the incorrect candidates that violate the margin, of the differences between the incorrect score (plus <math> \gamma </math>) and the correct score. This is just the standard hinge loss applied to ranking. Weston et al. speed up gradient descent by sampling incorrect candidates instead of using all incorrect candidates in the calculation of the gradient for each training example.
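As a worked illustration of the loss, the sketch below evaluates one of the three sums for given scores. The function names and the example scores are made up for illustration; in the model the scores would come from <math>\ S_O </math> and <math>\ S_R </math>.

```python
import numpy as np

def margin_ranking_term(correct_score, incorrect_scores, gamma):
    """One sum of the loss: sum over incorrect candidates of
    max(0, gamma + incorrect_score - correct_score)."""
    incorrect_scores = np.asarray(incorrect_scores, dtype=float)
    return float(np.maximum(0.0, gamma + incorrect_scores - correct_score).sum())

def memory_network_loss(first, second, response, gamma):
    """Total loss for one training example.

    Each argument is a (correct_score, incorrect_scores) pair for the first
    supporting memory, the second supporting memory, and the response."""
    return sum(margin_ranking_term(c, wrong, gamma)
               for c, wrong in (first, second, response))

# Made-up scores: the correct candidate scores 2.0 and the margin is 1.0.
# 0.5 is outside the margin (cost 0), 1.5 violates it by 0.5,
# and 3.0 outranks the correct candidate (cost 2.0).
example_term = margin_ranking_term(2.0, [0.5, 1.5, 3.0], gamma=1.0)

# A correct candidate that beats every alternative by the margin costs nothing.
zero_term = margin_ranking_term(5.0, [1.0, 2.0], gamma=1.0)
```

Sampling incorrect candidates, as the authors do, amounts to passing a random subset of the incorrect scores instead of the full list.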

= Extensions to the Basic Implementation =

Some limitations of the basic implementation are that it can only output single word responses, can only accept strings (rather than sequences) as input, and cannot use its memory in efficient or otherwise interesting ways. The authors propose a series of extensions to the basic implementation described in the previous section that are designed to overcome these limitations. First, they propose a segmenting function that learns when to segment an input sequence into discrete chunks that get written to individual memory slots. The segmenter is modeled similarly to other components, as an embedding model of the form:

<math>
seg(c)=W^T_{seg}U_s\Phi_{seg}(c)
</math>

where <math>W_{seg}</math> is a vector (effectively the parameters of a linear classifier in embedding space), and <math>c</math> is the sequence of input words represented as a bag of words using a separate dictionary. If <math>seg(c) > \gamma</math>, where <math>\gamma</math> is the margin, then this sequence is recognized as a segment.

Second, they propose the use of hashing to avoid scoring a prohibitively large number of candidate memories. Each input corresponding to a query is hashed into some number of buckets, and only candidates within these buckets are scored during the selection of supporting memories. Hashing is done either by making a bucket per word in the model's vocabulary, or by clustering the learned word embeddings and creating a bucket per cluster.

The most important extension proposed by the authors involves incorporating information about the time at which a memory was written into the scoring function <math>\ S_O </math>. The model needs to be able to make use of such information to correctly answer questions such as "Where was John before the university?" (assuming the model has been told some story about John). To handle temporal information, the feature space is extended to include features that indicate the relative time between when two items were written to memory. Formally, this yields the following revised scoring function:

:<math>\ S_{O_t}(x, y, y') = \Phi_x(x)^T U^T U (\Phi_y(y)-\Phi_y(y')+\Phi_t(x,y,y'))</math>

The novelty here lies in the feature mapping function <math> \Phi_t </math>, which takes an input and two candidate supporting memories, and returns a binary feature vector as before, but with the addition of three features that indicate whether <math>x</math> is older than <math>y</math>, whether <math>x</math> is older than <math>y'</math>, and whether <math>y</math> is older than <math>y'</math>. The model loops over all candidate memories, comparing candidates <math>y</math> and <math>y'</math>. If <math> S_{O_t}(x, y, y') </math> is greater than 0, then <math>y</math> is preferred over <math>y'</math>; otherwise, <math>y'</math> is preferred. If <math>y'</math> is preferred, <math>y</math> is replaced by <math>y'</math> and the loop continues to the next candidate memory (i.e. the new <math>y'</math>). Once the loop finishes iterating over the entire memory, the winning candidate <math>y</math> is chosen as the supporting memory.

Some further extensions concern allowing the model to deal with words not included in its vocabulary, and to more effectively take advantage of exact word matches between input queries and candidate supporting memories.

Embedding models cannot efficiently use exact word matches due to the low dimensionality <math>n</math>. One solution is to score a pair <math>x,y</math> with <math>\ \Phi_x(x)^T U^TU\Phi_y(y)+\lambda\Phi_x(x)^T\Phi_y(y) </math> instead. That is, the "bag of words" matching score is added to the learned embedding score (with a mixing parameter <math>\lambda</math>). A related alternative is to stay in the <math>n</math>-dimensional embedding space, but to extend the feature representation <math>D</math> with matching features, e.g., one per word. A matching feature indicates whether a word occurs in both <math>x</math> and <math>y</math>. That is, we score with <math>\ \Phi_x(x)^T U^TU\Phi_y(y,x)</math>, where <math>\ \Phi_y</math> is built conditionally on <math>x</math>: if some of the words in <math>y</math> match the words in <math>x</math>, those matching features are set to 1. Unseen words can be modeled similarly by using matching features on their context words. This then gives a feature space of <math>D = 8|W|</math>.
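The first of these two options can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names <code>bow_match</code> and <code>mixed_score</code> are hypothetical, the embedding score is passed in as a plain number, the value of the mixing parameter is arbitrary, and the matching term is computed as the number of distinct shared words (which is what the dot product of two binary bag-of-words vectors over a shared vocabulary equals).

```python
def bow_match(x_words, y_words):
    # Phi_x(x)^T Phi_y(y) for binary bag-of-words vectors over one shared
    # vocabulary: the number of distinct words present in both texts.
    return len(set(x_words) & set(y_words))


def mixed_score(embedding_score, x_words, y_words, lam=0.5):
    # Learned embedding score plus lambda times the exact-match score.
    # lam = 0.5 is an arbitrary illustrative value, not taken from the paper.
    return embedding_score + lam * bow_match(x_words, y_words)


query = ["where", "is", "john"]
candidate = ["john", "is", "here"]
total = mixed_score(0.2, query, candidate, lam=0.5)
```

Here two words ("is" and "john") are shared, so the matching term contributes twice the mixing weight on top of the embedding score.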

= Related work =

There are two general approaches to performing question answering that have been developed in the literature. The first makes use of a technique known as semantic parsing to map a query expressed in natural language onto a representation in some formal language that directly extracts information from some external memory such as a knowledge base.<ref>J. Berant, A. Chou, R. Frostig, and P. Liang. [http://cs.stanford.edu/~pliang/papers/freebase-emnlp2013.pdf "Semantic parsing on Freebase from question-answer pairs."]. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref>P. Liang, M. Jordan, and D. Klein. [http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00127 "Learning dependency-based compositional semantics"]. In Computational Linguistics, 39.2, p. 389-446. </ref> The second makes use of embedding methods to represent queries and candidate answers (typically extracted from a knowledge base) as high-dimensional vectors. Learning involves producing embeddings that place query vectors close to the vectors that correspond to their answers.<ref>Bordes, A., S. Chopra, and J. Weston. [http://www.thespermwhale.com/jaseweston/papers/fbqa.pdf "Question Answering with Subgraph Embeddings"]. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, (2014)</ref> Memory networks fall under the latter approach, and existing variants of this approach can be seen as special cases of the memory network architecture (e.g., <ref>Bordes, A., S. Chopra, and J. Weston. [http://www.thespermwhale.com/jaseweston/papers/fbqa.pdf "Question Answering with Subgraph Embeddings"]. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, (2014)</ref>).

= Experimental Results =

The authors first test a simple memory network (i.e. <math>\ k=1 </math>) on a large-scale question answering task involving a dataset consisting of 14 million subject-relation-object triplets. Each triplet is stored as an item in memory, and the answer to each question is a single entity (i.e. a subject or object) in one of these triplets. The results in the table below indicate that memory networks perform quite well on this task. Note that the memory network with 'bag of words' features includes the extension designed to indicate the presence of exact matches of words in a query and a candidate answer. This seems to contribute significantly to improved performance.

[[File:largescale.png | frame | centre | Results on a large-scale QA task.]]

Scoring a query against all 14 million candidate memories is slow, so the authors also test their hashing techniques and report the resulting speed-accuracy tradeoffs. As shown in the figure below, the use of cluster-based hashing results in a negligible drop in performance while considering only 1/80th of the complete set of items stored in memory.

[[File:hash.png | frame | centre | Memory hashing results on a large-scale QA task.]]

To test their model on more complex tasks that require chains of inference, the authors create a synthetic dataset consisting of approximately 7 thousand statements and 3 thousand questions focused on a toy environment comprising 4 people, 3 objects, and 5 rooms. Stories involving multiple statements describing actions performed by these people (e.g. moving an object from one room to another) are used to define the question answering tasks. Questions are focused on a single entity mentioned in a story, and the difficulty of the task is controlled by varying how long ago the most recent mention of this entity is in the story (e.g. the most recent statement in the story vs. the 5th most recent statement in the story). The figure at the top of this page gives an example of these tasks being performed.

In the results below, 'Difficulty 1' tasks are those in which the entity being asked about was mentioned in the most recent statement of the story, while 'Difficulty 5' tasks are those in which the entity being asked about was mentioned in one of the 5 most recent statements. Questions about an 'actor' concern a statement that mentions a person but not an object (e.g. "John went to the garden"). The questions may ask for the current location of the person (e.g. "Where is John?") or the previous location of the person (e.g. "Where was John before the garden?"); the column labelled "actor w/o before" in the figure below excludes this latter type of question. More complex questions involve asking about the object in a statement that mentions both a person and an object (e.g. given "John dropped the milk", the question might be "Where is the milk?"). Note that this task is more challenging, since it requires using multiple pieces of information (i.e. where John was, and what he did while he was there). Comparisons using RNNs and LSTMs are also reported, and for multiword responses as in the first figure above, an LSTM is used in place of <math>\ R </math>.

[[File:toyqa.png | frame | centre | Test accuracy on a simulated world QA task.]]

What is most notable about these results is that the inclusion of time features in the MemNN seems to be responsible for most of the improvement over RNNs and LSTMs.

= Discussion =

One potential concern about the memory network architecture is its generalizability to large values of <math>\ k </math>. Specifically, each additional supporting memory increases the number of columns in the embedding matrices by the size of the model's vocabulary. This could become impractical for standard vocabularies with tens of thousands of terms.

A second concern is that the memory network, as described, is engineered to answer very particular kinds of questions (i.e. questions in which the order of events is important). To handle different kinds of questions, different features would likely need to be added (e.g. quantificational features to handle statements involving quantifiers such as 'some', 'many', etc.). This sort of ad-hoc design calls into question whether the architecture is capable of performing scalable, general-purpose question answering.

= Resources =

Memory Network implementations on [https://github.com/facebook/MemNN Github]

= Bibliography =

<references />

memory Networks

2015-11-27T00:18:44Z

Amirlk: /* Extensions to the Basic Implementation */

= Introduction =

Most supervised machine learning models are designed to approximate a function that maps input data to a desirable output (e.g. a class label for an image or a translation of a sentence from one language to another). In this sense,
such models perform inference using a 'fixed' memory in the form of a set of parameters learned during training. For example, the memory of a recurrent neural network is constituted largely by the weights on the recurrent connections to its hidden layer (along with the layer's activities). As is well known, this form of memory is inherently limited given the fixed dimensionality of the weights in question. It is largely for this reason that recurrent nets have difficulty learning long-range dependencies in sequential data. Learning such dependencies, note, requires ''remembering'' items in a sequence for a large number of time steps.

For an interesting class of problems, it is essential for a model to be able to learn long-term dependencies, and to more generally be able to learn to perform inferences using an arbitrarily large memory. Question-answering tasks are paradigmatic of this class of problems, since performing well on such tasks requires remembering all of the information that constitutes a possible answer to the questions being posed. In principle, a recurrent network such as an LSTM could learn to perform QA tasks, but in practice, the amount of information that can be retained by the weights and the hidden states in the LSTM is simply insufficient.

Given this need for a model architecture that combines inference and memory in a sophisticated manner, the authors of this paper propose what they refer to as a "Memory Network". In brief, a memory network is a model that learns to read and write data to an arbitrarily large long-term memory, while also using the data in this memory to perform inferences. The rest of this summary describes the components of a memory network in greater detail, along with some experiments describing its application to a question answering task involving short stories. Below is an example illustrating the model's ability to answer simple questions after being presented with short, multi-sentence stories.

[[File:QA_example.png | frame | centre | Example answers (in red) using a memory network for question answering. ]]

= Model Architecture =

A memory network is composed of a memory <math>\ m</math> (in the form of a collection of vectors or strings, indexed individually as <math>\ m_i</math>), and four possibly learned functions <math>\ I</math>, <math>\ G</math>, <math>\ O</math>, and <math>\ R</math>. The functions are defined as follows:
*<math>\ I</math> maps a natural language expression onto an 'input' feature representation (e.g., a real-valued vector). The input can either be a fact to be added to the memory <math>\ m</math> (e.g. 'John is at the university'), or a question for which an answer is being sought (e.g. 'Where is John?').
*<math>\ G</math> updates the contents of the memory <math>\ m</math> on the basis of an input. The updating can involve simply writing the input to a new memory location, or it can involve the modification or compression of existing memories to perform a kind of generalization on the state of the memory.
*<math>\ O</math> produces an 'output' feature representation given a new input and the current state of the memory. The input and output feature representations reside in the same embedding space.
*<math>\ R</math> produces a response given an output feature representation. This response is usually a word or a sentence, but in principle it could also be an action of some kind (e.g. the movement of a robot).

To give a quick overview of how the model operates, an input ''x'' will first be mapped to a feature representation <math>\ I(x)</math>. Then, for all memories ''i'', the following update is applied: <math>\ m_i = G(m_i, I(x), m) </math>. This means that each memory is updated on the basis of the input ''x'' and the current state of the memory <math>\ m</math>. In the case where each input is simply written to memory, <math>\ G</math> might function to simply select an index that is currently unused and write <math>\ I(x)</math> to the memory location corresponding to this index. Next, an output feature representation is computed as <math>\ o=O(I(x), m)</math>, and a response, <math>\ r</math>, is computed directly from this feature representation as <math>\ r=R(o)</math>. <math>\ O</math> can be interpreted as retrieving a small selection of memories that are relevant to producing a good response, and <math>\ R</math> actually produces the response given the feature representation produced from the relevant memories by <math>\ O</math>.
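The flow through the four components can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name <code>ToyMemoryNetwork</code> and the methods <code>tell</code> and <code>ask</code> are hypothetical, and the word-overlap heuristic in <code>O</code> and the last-word heuristic in <code>R</code> are stand-ins for the learned scoring functions described later. For simplicity, only facts (not questions) are written to memory.

```python
class ToyMemoryNetwork:
    """Hypothetical skeleton of the I, G, O, R pipeline (not the authors' code)."""

    def __init__(self):
        self.memory = []  # the memory m, stored here as a list of token lists

    def I(self, text):
        # Input map: lowercase, strip basic punctuation, and tokenize.
        return text.lower().replace("?", "").replace(".", "").split()

    def G(self, features):
        # Generalization: simply write the input to the next unused slot.
        self.memory.append(features)

    def O(self, features):
        # Output map: select the one memory sharing the most words with the
        # input. This overlap count is a stand-in for a learned score.
        best = max(self.memory, key=lambda m: len(set(m) & set(features)))
        return features, best

    def R(self, output):
        # Response: return the last word of the supporting memory, a
        # stand-in for scoring every candidate response word.
        _, support = output
        return support[-1]

    def tell(self, fact):
        self.G(self.I(fact))

    def ask(self, question):
        return self.R(self.O(self.I(question)))


net = ToyMemoryNetwork()
net.tell("John is at the university")
net.tell("Mary is in the kitchen")
answer = net.ask("Where is John?")
```

In this toy run the question shares two words with the first fact and one with the second, so the first fact is retrieved and its final word is returned as the response.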

= A Basic Implementation =

In a simple version of the memory network, input text is just written to memory in unaltered form. Or in other words, <math>\ I(x) </math> simply returns ''x'', and <math>\ G </math> writes this text to a new memory slot <math>\ m_{N+1} </math> if <math>\ N </math> is the number of currently filled slots. The memory is accordingly an array of strings, and the inclusion of a new string does nothing to modify existing strings.

Given as much, most of the work being done by the model is performed by the functions <math>\ O </math> and <math>\ R </math>. The job of <math>\ O </math> is to produce an output feature representation by selecting <math>\ k </math> supporting memories from <math>\ m </math> on the basis of the input ''x''. In the experiments described in this paper, <math>\ k </math> is set to either 1 or 2. In the case that <math>\ k=1 </math>, the function <math>\ O </math> behaves as follows:

:<math>\ o_1 = O_1(x, m) = argmax_{i = 1 ... N} S_O(x, m_i) </math>

where <math>\ S_O </math> is a function that scores a candidate memory for its compatibility with ''x''. Essentially, one 'supporting' memory is selected from <math>\ m </math> as being most likely to contain the information needed to answer the question posed in <math>\ x </math>. In this case, the output is <math>\ o_1 = [x, m_{o_1}] </math>, or a list containing the input question and one supporting memory. Alternatively, in the case that <math>\ k=2 </math>, a second supporting memory is selected on the basis of the input and the first supporting memory, as follows:

:<math>\ o_2 = O_2(x, m) = argmax_{i = 1 ... N} S_O([x, m_{o_1}], m_i) </math>

Now, the overall output is <math>\ o_2 = [x, m_{o_1}, m_{o_2}] </math>. (These lists are translated into feature representations as described below). Finally, the result of <math>\ O </math> is used to produce a response in the form of a single word via <math>\ R </math> as follows:

:<math>\ r = argmax_{w \in W} S_R([x, m_{o_1}, m_{o_2}], w) </math>

In short, a response is produced by scoring each word in a set of candidate words against the representation produced by the combination of the input and the two supporting memories. The highest scoring candidate word is then chosen as the model's output. The learned portions of <math>\ O </math> and <math>\ R </math> are the parameters of the functions <math>\ S_O </math> and <math>\ S_R </math>, which perform embeddings of the raw text constituting each function argument, and then return the dot product of these two embeddings as a score. Formally, the function <math>\ S_O </math> can be defined as follows; <math>\ S_R </math> is defined analogously:

:<math>\ S_O(x, y) = \Phi_x(x)^T U^T U \Phi_y(y) </math>

In this equation, <math>\ U </math> is an <math>\ n \times D </math> matrix, where ''n'' is the dimension of the embedding space, and ''D'' is the number of features used to represent each function argument. <math>\ \Phi_x</math> and <math>\ \Phi_y </math> are functions that map each argument (which are strings) into the feature space. In the implementations considered in this paper, the feature space makes use of a bag-of-words representation, such that there are 3 binary features for each word in the model's vocabulary. The first feature corresponds to the presence of the word in the input ''x'', the second feature corresponds to the presence of the word in the first supporting memory that is being used to select a second supporting memory, and the third feature corresponds to the presence of the word in a candidate memory being scored (i.e. either the first or second supporting memory retrieved by the model). Having these different features allows the model to learn distinct representations for the same word depending on whether the word is present in an input question or in a string stored in memory.

Intuitively, it helps to think of the columns of <math>\ U </math> as containing distributed representations of each word in the vocabulary (specifically, there are 3 representations and hence 3 columns devoted to each word). The binary feature representation <math>\ \Phi_x(x)</math> maps the text in ''x'' onto a binary feature vector, where 1's in the vector indicate the presence of a particular word in ''x'', and 0's indicate the absence of this word. Note that different elements of the vector will be set to 1 depending on whether the word occurs in the input ''x'' or in a supporting memory (i.e. when ''x'' is a list containing the input and a supporting memory). The matrix-vector multiplications in the above equation effectively extract and sum the distributed representations corresponding to each of the inputs, ''x'' and ''y''. Thus, a single distributed representation is produced for each input, and the resulting score is the dot product of these two vectors (which in turn is the cosine of the angle between the vectors scaled by the product of the vector norms). In the case where ''x'' is the input query, and ''y'' is a candidate memory, a high dot product indicates that the model thinks that the candidate in question is very relevant to answering the input query. In the case where ''x'' is the output of <math>\ O</math> and ''y'' is a candidate response word, a high dot product indicates that the model thinks that the response word is an appropriate answer given the output feature representation produced by <math>\ O</math>. Distinct embedding matrices <math>\ U_O </math> and <math>\ U_R </math> are used to compute the output feature representation and the response.
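The scoring function can be made concrete with a small sketch. Everything here is an illustrative assumption rather than the authors' code: the tiny vocabulary, the embedding dimension, the randomly initialized (untrained) matrix, and the helper names <code>phi</code>, <code>embed</code> and <code>score</code> are all hypothetical. The sketch only shows how the three feature blocks per word and the dot product of summed columns fit together.

```python
import random

VOCAB = ["john", "mary", "is", "at", "in", "the", "university", "kitchen", "where"]
W = len(VOCAB)
D = 3 * W   # three binary features per vocabulary word
N_DIM = 4   # embedding dimension n (arbitrary choice for this sketch)


def phi(text, block):
    """Binary bag-of-words vector of length D.

    block 0: word occurs in the input x
    block 1: word occurs in the first supporting memory
    block 2: word occurs in the candidate being scored
    """
    vec = [0] * D
    for word in text.lower().split():
        if word in VOCAB:
            vec[block * W + VOCAB.index(word)] = 1
    return vec


def embed(U, features):
    # U is n x D; multiplying by a binary vector sums the selected columns.
    return [sum(u * f for u, f in zip(row, features)) for row in U]


def score(U, x_features, y_features):
    # S(x, y) = Phi_x(x)^T U^T U Phi_y(y): dot product of the two embeddings.
    ex, ey = embed(U, x_features), embed(U, y_features)
    return sum(a * b for a, b in zip(ex, ey))


rng = random.Random(0)
# Untrained, randomly initialized embedding matrix (rows = embedding dims).
U = [[rng.gauss(0.0, 0.1) for _ in range(D)] for _ in range(N_DIM)]

x = phi("where is john", block=0)
y = phi("john is at the university", block=2)
s = score(U, x, y)
```

When the first argument is a list such as <math>\ [x, m_{o_1}] </math>, the same idea applies: the feature vector would have block 0 set from the question and block 1 set from the first supporting memory, so the two texts select different columns of <math>\ U </math>.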

The goal of learning is to find embedding matrices in which the representations produced for queries, supporting memories, and responses are spatially related such that representations of relevant supporting memories are close to the representations of a query, and such that representations of individual words are close to the output feature representations of the questions they answer. The method used to perform this learning is described in the next section.

= The Training Procedure =

Learning is conducted in a supervised manner; the correct responses and supporting memories for each query are provided during training. The following margin-ranking loss function is used in tandem with stochastic gradient descent to learn the parameters of <math>\ U_O </math> and <math>\ U_R </math>, given an input ''x'', a desired response ''r'', and desired supporting memories, <math>\ m_{o_1}</math> and <math>\ m_{o_2}</math>:

:<math> \sum_{f \neq m_{o_1}} \max(0, \gamma + S_O (x, f) - S_O (x, m_{o_1})) + \sum_{f' \neq m_{o_2}} \max(0, \gamma + S_O ([x, m_{o_1}], f') - S_O ([x, m_{o_1}], m_{o_2})) + </math>
:<math> \sum_{r' \neq r} \max(0, \gamma + S_R ([x, m_{o_1}, m_{o_2}], r') - S_R ([x, m_{o_1}, m_{o_2}], r)) </math>

where <math>\ f</math>, <math>\ f'</math> and <math>\ r'</math> correspond to incorrect candidates for the first supporting memory, the second supporting memory, and the output response, and <math> \gamma</math> corresponds to the margin. Intuitively, each term in the sum penalizes the current parameters in proportion to the number of incorrect memories and responses that get assigned a score within the margin of the score of the correct memories and responses. In other words, if the score of a correct candidate memory or response is higher than the score of every incorrect candidate by at least <math> \gamma </math>, the cost is 0. Otherwise, the cost is the sum, over all incorrect candidates that violate the margin, of the differences between the incorrect scores (plus <math> \gamma </math>) and the correct score. In fact, this is just the standard hinge loss function. Weston et al. speed up gradient descent by sampling incorrect candidates instead of using all incorrect candidates in the calculation of the gradient for each training example.
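A single term of this loss is easy to compute once the scores are known. The sketch below is an illustration only: the function name <code>margin_ranking_loss</code>, the example scores, and the margin value are all made up for this example, and the scores are given directly as numbers rather than computed from embeddings.

```python
def margin_ranking_loss(correct_score, incorrect_scores, gamma=0.1):
    """Sum of hinge terms max(0, gamma + s_incorrect - s_correct)."""
    return sum(max(0.0, gamma + s - correct_score) for s in incorrect_scores)


# Every incorrect candidate is at least gamma below the correct score:
# no term is active, so the loss is zero.
loss_zero = margin_ranking_loss(1.0, [0.2, 0.5, 0.8], gamma=0.1)

# Two incorrect candidates (0.95 and 1.3) fall inside or above the margin,
# so only those two contribute to the loss.
loss_pos = margin_ranking_loss(1.0, [0.2, 0.95, 1.3], gamma=0.1)
```

The full objective is the sum of three such terms, one for each of the first supporting memory, the second supporting memory, and the response. Sampling incorrect candidates, as the authors do, would correspond to passing a random subset of the incorrect scores instead of all of them.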

= Extensions to the Basic Implementation =

Some limitations of the basic implementation are that it can only output single-word responses, can only accept pre-segmented statements (rather than unsegmented word sequences) as input, and cannot use its memory in efficient or otherwise interesting ways. The authors propose a series of extensions to the basic implementation described in the previous section that are designed to overcome these limitations. First, they propose a segmenting function that learns when to segment an input sequence into discrete chunks that get written to individual memory slots. The segmenter is modeled similarly to other components, as an embedding model of the form:

<math>
seg(c)=W^T_{seg}U_s\Phi_{seg}(c)
</math>

where <math>W_{seg}</math> is a vector (effectively the parameters of a linear classifier in embedding space), and <math>c</math> is the sequence of input words represented as a bag of words using a separate dictionary. If <math>seg(c) > \gamma</math>, where <math>\gamma</math> is the margin, then this sequence is recognized as a segment.
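As a small worked example of the segmenter formula, the sketch below evaluates it for hand-picked numbers. The function names <code>seg</code> and <code>is_segment</code>, the 3-word dictionary, the 2-dimensional embedding, and all parameter values are assumptions chosen only so that the arithmetic is easy to follow; in the model these parameters are learned.

```python
def seg(w_seg, U_s, phi_c):
    # seg(c) = W_seg^T U_s Phi_seg(c): embed the bag of words, then apply
    # a linear classifier in the embedding space.
    embedded = [sum(u * f for u, f in zip(row, phi_c)) for row in U_s]
    return sum(w * e for w, e in zip(w_seg, embedded))


def is_segment(w_seg, U_s, phi_c, gamma=0.1):
    # The word sequence is treated as a segment when the score exceeds gamma.
    return seg(w_seg, U_s, phi_c) > gamma


# Hand-picked toy parameters: a 2 x 3 embedding matrix and a 2-vector.
U_s = [[1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
w_seg = [0.5, -1.0]

score_a = seg(w_seg, U_s, [1, 0, 1])   # embedding [2, 0], score 0.5 * 2
score_b = seg(w_seg, U_s, [0, 1, 0])   # embedding [0, 1], score -1.0 * 1
```

With these values the first bag of words scores above the margin and would be written to memory as a segment, while the second would not.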

Second, they propose the use of hashing to avoid scoring a prohibitively large number of candidate memories. Each input corresponding to a query is hashed into some number of buckets, and only candidates within these buckets are scored during the selection of supporting memories. Hashing is done either by making a bucket per word in the model's vocabulary, or by clustering the learned word embeddings and creating a bucket per cluster.
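The word-based variant of this hashing scheme can be sketched as an inverted index. The helper names <code>build_word_buckets</code> and <code>candidate_indices</code> and the example sentences are hypothetical; the cluster-based variant would replace words with cluster identifiers of the learned embeddings and is not shown.

```python
from collections import defaultdict


def build_word_buckets(memories):
    # One bucket per word: each bucket holds the indices of the memories
    # that contain that word.
    buckets = defaultdict(set)
    for index, memory in enumerate(memories):
        for word in set(memory.lower().split()):
            buckets[word].add(index)
    return buckets


def candidate_indices(query, buckets):
    # Only memories sharing at least one bucket with the query are kept
    # as candidates for scoring.
    candidates = set()
    for word in set(query.lower().split()):
        candidates |= buckets.get(word, set())
    return sorted(candidates)


memories = [
    "john went to the garden",
    "mary picked up the milk",
    "sandra is in the kitchen",
]
buckets = build_word_buckets(memories)
cands = candidate_indices("where is john", buckets)
```

Only the memories returned by <code>candidate_indices</code> would then be passed to the scoring function, rather than the whole memory.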

The most important extension proposed by the authors involves incorporating information about the time at which a memory was written into the scoring function <math>\ S_O </math>. The model needs to be able to make use of such information to correctly answer questions such as "Where was John before the university?" (assuming the model has been told some story about John). To handle temporal information, the feature space is extended to include features that indicate the relative time between when two items were written to memory. Formally, this yields the following revised scoring function:

:<math>\ S_{O_t}(x, y, y') = \Phi_x(x)^T U^T U (\Phi_y(y)-\Phi_y(y')+\Phi_t(x,y,y'))</math>

The novelty here lies in the feature mapping function <math> \Phi_t </math>, which takes an input and two candidate supporting memories, and returns a binary feature vector as before, but with the addition of three features that indicate whether <math>x</math> is older than <math>y</math>, whether <math>x</math> is older than <math>y'</math>, and whether <math>y</math> is older than <math>y'</math>. The model loops over all candidate memories, comparing candidates <math>y</math> and <math>y'</math>. If <math> S_{O_t}(x, y, y') </math> is greater than 0, then <math>y</math> is preferred over <math>y'</math>; otherwise, <math>y'</math> is preferred. If <math>y'</math> is preferred, <math>y</math> is replaced by <math>y'</math> and the loop continues to the next candidate memory (i.e. the new <math>y'</math>). Once the loop finishes iterating over the entire memory, the winning candidate <math>y</math> is chosen as the supporting memory.
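The selection loop described above can be sketched as follows. The function name <code>select_supporting_memory</code> is hypothetical, and <code>toy_score</code> is a hand-written stand-in for the learned function <math> S_{O_t} </math>: it prefers the memory with more word overlap with the query and breaks ties in favour of the more recently written memory. Memories are represented as (timestamp, text) pairs purely for this illustration.

```python
def select_supporting_memory(x, memories, score_triple):
    """Return the index of the winner of the pairwise comparison loop.

    score_triple(x, y, y_prime) > 0 means y is preferred over y_prime.
    """
    best = 0
    for candidate in range(1, len(memories)):
        # If the current winner y is not preferred, y_prime replaces it.
        if score_triple(x, memories[best], memories[candidate]) <= 0:
            best = candidate
    return best


def toy_score(x, y, y_prime):
    # Stand-in for S_Ot: compare word overlap with the query first, then
    # fall back on write time so that the newer memory wins ties.
    def overlap(memory):
        return len(set(x.split()) & set(memory[1].split()))

    diff = overlap(y) - overlap(y_prime)
    if diff != 0:
        return diff
    return y[0] - y_prime[0]


memories = [
    (0, "john went to the kitchen"),
    (1, "mary went to the garden"),
    (2, "john went to the office"),
]
winner = select_supporting_memory("where is john", memories, toy_score)
```

In this toy story the two statements about John tie on word overlap, so the later one wins, which is the behaviour needed to answer a question about John's current location.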

Some further extensions concern allowing the model to deal with words not included in its vocabulary, and to more effectively take advantage of exact word matches between input queries and candidate supporting memories.

= Related work =

There are two general approaches to performing question answering that have been developed in the literature. The first makes use of a technique known as semantic parsing to map a query expressed in natural language onto a representation in some formal language that directly extracts information from some external memory such as a knowledge base.<ref>J. Berant, A. Chou, R. Frostig, and P. Liang. [http://cs.stanford.edu/~pliang/papers/freebase-emnlp2013.pdf "Semantic parsing on Freebase from question-answer pairs."]. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref>P. Liang, M. Jordan, and D. Klein. [http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00127 "Learning dependency-based compositional semantics"]. In Computational Linguistics, 39.2, p. 389-446. </ref> The second makes use of embedding methods to represent queries and candidate answers (typically extracted from a knowledge base) as high-dimensional vectors. Learning involves producing embeddings that place query vectors close to the vectors that correspond to their answers.<ref>Bordes, A., S. Chopra, and J. Weston. [http://www.thespermwhale.com/jaseweston/papers/fbqa.pdf "Question Answering with Subgraph Embeddings"]. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, (2014)</ref> Memory networks fall under the latter approach, and existing variants of this approach can be seen as special cases of the memory network architecture (e.g., <ref>Bordes, A., S. Chopra, and J. Weston. [http://www.thespermwhale.com/jaseweston/papers/fbqa.pdf "Question Answering with Subgraph Embeddings"]. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, (2014)</ref>).

= Experimental Results =

The authors first test a simple memory network (i.e. <math>\ k=1 </math>) on a large-scale question answering task involving a dataset consisting of 14 million subject-relation-object triplets. Each triplet is stored as an item in memory, and the answer to each question is a single entity (i.e. a subject or object) in one of these triplets. The results in the table below indicate that memory networks perform quite well on this task. Note that the memory network with 'bag of words' features includes the extension designed to indicate the presence of exact matches of words in a query and a candidate answer. This seems to contribute significantly to improved performance.

[[File:largescale.png | frame | centre | Results on a large-scale QA task.]]

Scoring a query against all 14 million candidate memories is slow, so the authors also test their hashing techniques and report the resulting speed-accuracy tradeoffs. As shown in the figure below, the use of cluster-based hashing results in a negligible drop in performance while considering only 1/80th of the complete set of items stored in memory.

[[File:hash.png | frame | centre | Memory hashing results on a large-scale QA task.]]

To test their model on more complex tasks that require chains of inference, the authors create a synthetic dataset consisting of approximately 7 thousand statements and 3 thousand questions focused on a toy environment comprising 4 people, 3 objects, and 5 rooms. Stories involving multiple statements describing actions performed by these people (e.g. moving an object from one room to another) are used to define the question answering tasks. Questions are focused on a single entity mentioned in a story, and the difficulty of the task is controlled by varying how long ago the most recent mention of this entity is in the story (e.g. the most recent statement in the story vs. the 5th most recent statement in the story). The figure at the top of this page gives an example of these tasks being performed.

In the results below, 'Difficulty 1' tasks are those in which the entity being asked about was mentioned in the most recent statement of the story, while 'Difficulty 5' tasks are those in which the entity being asked about was mentioned in one of the 5 most recent statements. Questions about an 'actor' concern a statement that mentions a person but not an object (e.g. "John went to the garden"). The questions may ask for the current location of the person (e.g. "Where is John?") or the previous location of the person (e.g. "Where was John before the garden?"); the column labelled "actor w/o before" in the figure below excludes this latter type of question. More complex questions involve asking about the object in a statement that mentions both a person and an object (e.g. given "John dropped the milk", the question might be "Where is the milk?"). Note that this task is more challenging, since it requires using multiple pieces of information (i.e. where John was, and what he did while he was there). Comparisons using RNNs and LSTMs are also reported, and for multiword responses as in the first figure above, an LSTM is used in place of <math>\ R </math>.

[[File:toyqa.png | frame | centre | Test accuracy on a simulated world QA task.]]

What is most notable about these results is that the inclusion of time features in the MemNN seems to be responsible for most of the improvement over RNNs and LSTMs.

= Discussion =

One potential concern about the memory network architecture is its generalizability to large values of <math>\ k </math>. Specifically, each additional supporting memory increases the number of columns in the embedding matrices by the size of the model's vocabulary. This could become impractical for standard vocabularies with tens of thousands of terms.

A second concern is that the memory network, as described, is engineered to answer very particular kinds of questions (i.e. questions in which the order of events is important). To handle different kinds of questions, different features would likely need to be added (e.g. quantificational features to handle statements involving quantifiers such as 'some', 'many', etc.). This sort of ad-hoc design calls into question whether the architecture is capable of performing scalable, general-purpose question answering.

= Resources =

Memory Network implementations on [https://github.com/facebook/MemNN Github]

= Bibliography =

<references />

memory Networks

2015-11-27T00:16:07Z

Amirlk: /* Extensions to the Basic Implementation */

= Introduction =

Most supervised machine learning models are designed to approximate a function that maps input data to a desired output (e.g. a class label for an image or a translation of a sentence from one language to another). In this sense, such models perform inference using a 'fixed' memory in the form of a set of parameters learned during training. For example, the memory of a recurrent neural network is constituted largely by the weights on the recurrent connections to its hidden layer (along with the layer's activities). As is well known, this form of memory is inherently limited given the fixed dimensionality of the weights in question. It is largely for this reason that recurrent nets have difficulty learning long-range dependencies in sequential data. Note that learning such dependencies requires ''remembering'' items in a sequence for a large number of time steps.

For an interesting class of problems, it is essential for a model to be able to learn long-term dependencies, and to more generally be able to learn to perform inferences using an arbitrarily large memory. Question-answering tasks are paradigmatic of this class of problems, since performing well on such tasks requires remembering all of the information that constitutes a possible answer to the questions being posed. In principle, a recurrent network such as an LSTM could learn to perform QA tasks, but in practice, the amount of information that can be retained by the weights and the hidden states in the LSTM is simply insufficient.

Given this need for a model architecture that combines inference and memory in a sophisticated manner, the authors of this paper propose what they refer to as a "Memory Network". In brief, a memory network is a model that learns to read and write data to an arbitrarily large long-term memory, while also using the data in this memory to perform inferences. The rest of this summary describes the components of a memory network in greater detail, along with some experiments describing its application to a question answering task involving short stories. Below is an example illustrating the model's ability to answer simple questions after being presented with short, multi-sentence stories.

[[File:QA_example.png | frame | centre | Example answers (in red) using a memory network for question answering. ]]

= Model Architecture =

A memory network is composed of a memory <math>\ m</math> (in the form of a collection of vectors or strings, indexed individually as <math>\ m_i</math>), and four possibly learned functions <math>\ I</math>, <math>\ G</math>, <math>\ O</math>, and <math>\ R</math>. The functions are defined as follows:
*<math>\ I</math> maps a natural language expression onto an 'input' feature representation (e.g., a real-valued vector). The input can either be a fact to be added to the memory <math>\ m</math> (e.g. 'John is at the university'), or a question for which an answer is being sought (e.g. 'Where is John?').
*<math>\ G</math> updates the contents of the memory <math>\ m</math> on the basis of an input. The updating can involve simply writing the input to a new memory location, or it can involve the modification or compression of existing memories to perform a kind of generalization on the state of the memory.
*<math>\ O</math> produces an 'output' feature representation given a new input and the current state of the memory. The input and output feature representations reside in the same embedding space.
*<math>\ R</math> produces a response given an output feature representation. This response is usually a word or a sentence, but in principle it could also be an action of some kind (e.g. the movement of a robot).

To give a quick overview of how the model operates, an input ''x'' will first be mapped to a feature representation <math>\ I(x)</math>. Then, for all memories ''i'', the following update is applied: <math>\ m_i = G(m_i, I(x), m) </math>. This means that each memory is updated on the basis of the input ''x'' and the current state of the memory <math>\ m</math>. In the case where each input is simply written to memory, <math>\ G</math> might function to simply select an index that is currently unused and write <math>\ I(x)</math> to the memory location corresponding to this index. Next, an output feature representation is computed as <math>\ o=O(I(x), m)</math>, and a response, <math>\ r</math>, is computed directly from this feature representation as <math>\ r=R(o)</math>. <math>\ O</math> can be interpreted as retrieving a small selection of memories that are relevant to producing a good response, and <math>\ R</math> actually produces the response given the feature representation produced from the relevant memories by <math>\ O</math>.
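The flow through the four components can be sketched in a few lines of Python. This is only an illustrative toy, not the authors' implementation: the class name <code>ToyMemoryNetwork</code>, the word-overlap scorer inside <code>O</code>, and the "answer with the last word of the supporting memory" rule inside <code>R</code> are all assumptions made here so that the example runs without any learned parameters.

```python
from typing import List, Tuple


class ToyMemoryNetwork:
    """Toy I/G/O/R pipeline. The scorer and response rule are hand-written
    stand-ins (assumptions), not the learned functions from the paper."""

    def __init__(self) -> None:
        self.memory: List[List[str]] = []  # one token list per memory slot

    def I(self, text: str) -> List[str]:
        # Input map: lower-cased tokens with surrounding punctuation stripped.
        return [w.strip("?.,!").lower() for w in text.split()]

    def G(self, features: List[str]) -> None:
        # Generalization: here, simply write the input to the next free slot.
        self.memory.append(features)

    def O(self, features: List[str]) -> Tuple[List[str], List[str]]:
        # Output map: pick the memory sharing the most words with the input
        # (ties go to the most recently written memory).
        best = max(
            range(len(self.memory)),
            key=lambda i: (len(set(features) & set(self.memory[i])), i),
        )
        return features, self.memory[best]

    def R(self, output: Tuple[List[str], List[str]]) -> str:
        # Response: hypothetical rule that returns the last word of the
        # supporting memory (works for "X is at the Y" style facts only).
        _, support = output
        return support[-1]


net = ToyMemoryNetwork()
for fact in ["John is at the university", "Mary is in the kitchen"]:
    net.G(net.I(fact))

answer = net.R(net.O(net.I("Where is John?")))
print(answer)
```

In the actual model, the overlap count in <code>O</code> and the fixed rule in <code>R</code> are replaced by learned scoring functions, as described in the following sections.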

= A Basic Implementation =

In a simple version of the memory network, input text is just written to memory in unaltered form. In other words, <math>\ I(x) </math> simply returns ''x'', and <math>\ G </math> writes this text to a new memory slot <math>\ m_{N+1} </math>, where <math>\ N </math> is the number of currently filled slots. The memory is accordingly an array of strings, and the inclusion of a new string does nothing to modify existing strings.

Given as much, most of the work being done by the model is performed by the functions <math>\ O </math> and <math>\ R </math>. The job of <math>\ O </math> is to produce an output feature representation by selecting <math>\ k </math> supporting memories from <math>\ m </math> on the basis of the input ''x''. In the experiments described in this paper, <math>\ k </math> is set to either 1 or 2. In the case that <math>\ k=1 </math>, the function <math>\ O </math> behaves as follows:

:<math>\ o_1 = O_1(x, m) = argmax_{i = 1 ... N} S_O(x, m_i) </math>

where <math>\ S_O </math> is a function that scores a candidate memory for its compatibility with ''x''. Essentially, one 'supporting' memory is selected from <math>\ m </math> as being most likely to contain the information needed to answer the question posed in <math>\ x </math>. In this case, the output is <math>\ [x, m_{o_1}] </math>, or a list containing the input question and one supporting memory. Alternatively, in the case that <math>\ k=2 </math>, a second supporting memory is selected on the basis of the input and the first supporting memory, as follows:

:<math>\ o_2 = O_2(x, m) = argmax_{i = 1 ... N} S_O([x, m_{o_1}], m_i) </math>

Now, the overall output is <math>\ [x, m_{o_1}, m_{o_2}] </math>. (These lists are translated into feature representations as described below.) Finally, the result of <math>\ O </math> is used to produce a response in the form of a single word via <math>\ R </math> as follows:

:<math>\ r = argmax_{w \in W} S_R([x, m_{o_1}, m_{o_2}], w) </math>

In short, a response is produced by scoring each word in a set of candidate words against the representation produced by the combination of the input and the two supporting memories. The highest scoring candidate word is then chosen as the model's output. The learned portions of <math>\ O </math> and <math>\ R </math> are the parameters of the functions <math>\ S_O </math> and <math>\ S_R </math>, which perform embeddings of the raw text constituting each function argument, and then return the dot product of these two embeddings as a score. Formally, the function <math>\ S_O </math> can be defined as follows; <math>\ S_R </math> is defined analogously:

:<math>\ S_O(x, y) = \Phi_x(x)^T U^T U \Phi_y(y) </math>

In this equation, <math>\ U </math> is an <math>\ n \times D </math> matrix, where ''n'' is the dimension of the embedding space, and ''D'' is the number of features used to represent each function argument. <math>\ \Phi_x</math> and <math>\ \Phi_y </math> are functions that map each argument (which are strings) into the feature space. In the implementations considered in this paper, the feature space makes use of a bag-of-words representation, such that there are 3 binary features for each word in the model's vocabulary. The first feature corresponds to the presence of the word in the input ''x'', the second feature corresponds to the presence of the word in the first supporting memory that is being used to select a second supporting memory, and the third feature corresponds to the presence of the word in a candidate memory being scored (i.e. either the first or second supporting memory retrieved by the model). Having these different features allows the model to learn distinct representations for the same word depending on whether the word is present in an input question or in a string stored in memory.

Intuitively, it helps to think of the columns of <math>\ U </math> as containing distributed representations of each word in the vocabulary (specifically, there are 3 representations and hence 3 columns devoted to each word). The binary feature representation <math>\ \Phi_x(x)</math> maps the text in ''x'' onto a binary feature vector, where 1's in the vector indicate the presence of a particular word in ''x'', and 0's indicate the absence of this word. Note that different elements of the vector will be set to 1 depending on whether the word occurs in the input ''x'' or in a supporting memory (i.e. when ''x'' is a list containing the input and a supporting memory). The matrix-vector multiplications in the above equation effectively extract and sum the distributed representations corresponding to each of the inputs, ''x'' and ''y''. Thus, a single distributed representation is produced for each input, and the resulting score is the dot product of these two vectors (which in turn is the cosine of the angle between the vectors scaled by the product of the vector norms). In the case where ''x'' is the input query, and ''y'' is a candidate memory, a high dot product indicates that the model thinks that the candidate in question is very relevant to answering the input query. In the case where ''x'' is the output of <math>\ O</math> and ''y'' is a candidate response word, a high dot product indicates that the model thinks that the response word is an appropriate answer given the output feature representation produced by <math>\ O</math>. Distinct embedding matrices <math>\ U_O </math> and <math>\ U_R </math> are used to compute the output feature representation and the response.
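The scoring function and the two-step retrieval for <math>\ k=2 </math> can be illustrated with a small pure-Python sketch. Everything concrete here is an assumption made for illustration: the toy vocabulary, the embedding size, the randomly initialized (untrained) matrix <code>U</code>, and the helper names <code>phi</code>, <code>embed</code>, <code>score</code> and <code>retrieve_two</code>. Because <code>U</code> is random, the selected memories are arbitrary; the point is only to show how the three feature blocks and the product <math>\ \Phi_x(x)^T U^T U \Phi_y(y) </math> fit together.

```python
import random
from typing import List, Sequence, Tuple

# Toy vocabulary (an assumption for this sketch).
VOCAB = ["john", "mary", "milk", "garden", "kitchen",
         "went", "dropped", "where", "is", "the", "to"]
V = len(VOCAB)   # vocabulary size
D = 3 * V        # three binary features per vocabulary word
N_EMBED = 5      # embedding dimension n (arbitrary choice)


def phi(words: Sequence[str], block: int) -> List[float]:
    """Binary bag-of-words features placed in one of three blocks:
    0 = input question, 1 = first supporting memory, 2 = candidate memory."""
    vec = [0.0] * D
    for w in words:
        if w in VOCAB:
            vec[block * V + VOCAB.index(w)] = 1.0
    return vec


def add(a: Sequence[float], b: Sequence[float]) -> List[float]:
    # Combine feature vectors, e.g. for the list [x, m_o1].
    return [p + q for p, q in zip(a, b)]


def embed(U: List[List[float]], features: Sequence[float]) -> List[float]:
    # U * features: sums the columns of U selected by the active features.
    return [sum(u * f for u, f in zip(row, features)) for row in U]


def score(U: List[List[float]], feat_x: Sequence[float],
          feat_y: Sequence[float]) -> float:
    # Dot product of the two embeddings, i.e. feat_x^T U^T U feat_y.
    ex, ey = embed(U, feat_x), embed(U, feat_y)
    return sum(p * q for p, q in zip(ex, ey))


def retrieve_two(U: List[List[float]], question: Sequence[str],
                 memories: List[List[str]]) -> Tuple[int, int]:
    x = phi(question, 0)
    # First supporting memory: best match for the question alone.
    o1 = max(range(len(memories)),
             key=lambda i: score(U, x, phi(memories[i], 2)))
    # Second supporting memory: best match for [x, m_o1].
    x_m1 = add(x, phi(memories[o1], 1))
    o2 = max(range(len(memories)),
             key=lambda i: score(U, x_m1, phi(memories[i], 2)))
    return o1, o2


rng = random.Random(0)
U = [[rng.uniform(-0.1, 0.1) for _ in range(D)] for _ in range(N_EMBED)]
memories = [["john", "went", "to", "the", "kitchen"],
            ["john", "dropped", "the", "milk"],
            ["mary", "went", "to", "the", "garden"]]
o1, o2 = retrieve_two(U, ["where", "is", "the", "milk"], memories)
print(o1, o2)
```

Note that the same word (e.g. "john") activates a different column of <code>U</code> depending on the block it is placed in, which is how the model can hold distinct representations for a word in a question and in a stored memory.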

The goal of learning is to find embedding matrices in which the representations produced for queries, supporting memories, and responses are spatially related such that representations of relevant supporting memories are close to the representations of a query, and such that representations of individual words are close to the output feature representations of the questions they answer. The method used to perform this learning is described in the next section.

= The Training Procedure =

Learning is conducted in a supervised manner; the correct responses and supporting memories for each query are provided during training. The following margin-ranking loss function is used in tandem with stochastic gradient descent to learn the parameters of <math>\ U_O </math> and <math>\ U_R </math>, given an input ''x'', a desired response ''r'', and desired supporting memories, <math>\ m_{o_1}</math> and <math>\ m_{o_2}</math>:

:<math> \sum_{f \neq m_{o_1}} \max(0, \gamma + S_O (x, f) - S_O (x, m_{o_1})) + \sum_{f' \neq m_{o_2}} \max(0, \gamma + S_O ([x, m_{o_1}], f') - S_O ([x, m_{o_1}], m_{o_2})) + </math>
:<math> \sum_{r' \neq r} \max(0, \gamma + S_R ([x, m_{o_1}, m_{o_2}], r') - S_R ([x, m_{o_1}, m_{o_2}], r)) </math>

where <math>\ f</math>, <math>\ f'</math> and <math>\ r'</math> correspond to incorrect candidates for the first supporting memory, the second supporting memory, and the output response, and <math> \gamma</math> corresponds to the margin. Intuitively, each term in the sum penalizes the current parameters in proportion to the number of incorrect memories and responses that get assigned a score within the margin of the score of the correct memories and responses. In other words, if the score of a correct candidate memory / response is higher than the score of every incorrect candidate by at least <math> \gamma </math>, the cost is 0. Otherwise, the cost is the sum, over all incorrect candidates that violate the margin, of the difference between the incorrect score (plus <math> \gamma </math>) and the correct score. In fact, this is just the standard hinge loss function. Weston et al. speed up gradient descent by sampling incorrect candidates instead of using all incorrect candidates in the calculation of the gradient for each training example.
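As a worked illustration of the loss, the following Python sketch evaluates the three hinge terms for some made-up scores. The function names <code>margin_ranking_term</code> and <code>total_loss</code>, and all of the numbers, are hypothetical; in the model the scores would come from <math>\ S_O </math> and <math>\ S_R </math>.

```python
from typing import Sequence, Tuple


def margin_ranking_term(correct: float, incorrect: Sequence[float],
                        gamma: float) -> float:
    """Sum of max(0, gamma + s_incorrect - s_correct) over incorrect candidates."""
    return sum(max(0.0, gamma + s - correct) for s in incorrect)


def total_loss(first: Tuple[float, Sequence[float]],
               second: Tuple[float, Sequence[float]],
               response: Tuple[float, Sequence[float]],
               gamma: float = 0.1) -> float:
    """Each argument is (score of the correct candidate, scores of incorrect ones)
    for the first supporting memory, second supporting memory, and response."""
    return sum(margin_ranking_term(c, wrong, gamma)
               for c, wrong in (first, second, response))


# Made-up scores for one training example.
loss = total_loss(
    first=(1.0, [0.2, 0.95]),     # 0.95 is within the margin of 1.0
    second=(0.8, [0.1]),          # margin satisfied, contributes 0
    response=(0.5, [0.45, 0.7]),  # both incorrect words violate the margin
    gamma=0.1,
)
print(loss)
```

Only candidates whose score comes within <math> \gamma </math> of the correct score contribute; with sampling, <code>incorrect</code> would hold a random subset of the wrong candidates rather than all of them.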

= Extensions to the Basic Implementation =

Some limitations of the basic implementation are that it can only output single word responses, can only accept strings (rather than sequences) as input, and cannot use its memory in efficient or otherwise interesting ways. The authors propose a series of extensions to the basic implementation described in the previous section that are designed to overcome these limitations. First, they propose a segmenting function that learns when to segment an input sequence into discrete chunks that get written to individual memory slots. The segmenter is modeled similarly to other components, as an embedding model of the form:

:<math>\ seg(c)=W^T_{seg}U_s\Phi_{seg}(c) </math>

Second, they propose the use of hashing to avoid scoring a prohibitively large number of candidate memories. Each input corresponding to a query is hashed into some number of buckets, and only candidates within these buckets are scored during the selection of supporting memories. Hashing is done either by making a bucket per word in the model's vocabulary, or by clustering the learned word embeddings, and creating a bucket per cluster.
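The word-based variant of hashing can be sketched as an inverted index from words to memory slots. This is a minimal sketch under the assumption that memories and queries are whitespace-tokenized strings; the function names <code>build_word_buckets</code> and <code>candidate_memories</code> are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_word_buckets(memories: List[str]) -> Dict[str, Set[int]]:
    """One bucket per word: the set of memory indices containing that word."""
    buckets: Dict[str, Set[int]] = defaultdict(set)
    for idx, mem in enumerate(memories):
        for word in mem.lower().split():
            buckets[word].add(idx)
    return buckets


def candidate_memories(query: str, buckets: Dict[str, Set[int]]) -> List[int]:
    """Only memories sharing at least one word with the query get scored."""
    candidates: Set[int] = set()
    for word in query.lower().split():
        candidates |= buckets.get(word, set())
    return sorted(candidates)


memories = ["john went to the garden",
            "mary picked up the milk",
            "sandra is in the kitchen"]
buckets = build_word_buckets(memories)
shortlist = candidate_memories("where is john", buckets)
print(shortlist)
```

The scoring function is then applied only to the shortlisted indices. The cluster-based variant would replace the per-word buckets with one bucket per cluster of word embeddings.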

The most important extension proposed by the authors involves incorporating information about the time at which a memory was written into the scoring function <math>\ S_O </math>. The model needs to be able to make use of such information to correctly answer questions such as "Where was John before the university?" (assuming the model has been told some story about John). To handle temporal information, the feature space is extended to include features that indicate the relative time between when two items were written to memory. Formally, this yields the following revised scoring function:

:<math>\ S_{O_t}(x, y, y') = \Phi_x(x)^T U^T U (\Phi_y(y)-\Phi_y(y')+\Phi_t(x,y,y'))</math>

The novelty here lies in the feature mapping function <math> \Phi_t </math>, which takes an input and two candidate supporting memories, and returns a binary feature vector as before, but with the addition of three features that indicate whether <math>x</math> is older than <math>y</math>, whether <math>x</math> is older than <math>y'</math>, and whether <math>y</math> is older than <math>y'</math>. The model loops over all candidate memories, comparing candidates <math>y</math> and <math>y'</math>. If <math> S_{O_t}(x, y, y') </math> is greater than 0, then <math>y</math> is preferred over <math>y'</math>; otherwise, <math>y'</math> is preferred. If <math>y'</math> is preferred, <math>y</math> is replaced by <math>y'</math> and the loop continues to the next candidate memory (i.e. the new <math>y'</math>). Once the loop finishes iterating over the entire memory, the winning candidate <math>y</math> is chosen as the supporting memory.
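The pairwise selection loop can be sketched as follows. The loop itself follows the description above; the scorer <code>toy_s_ot</code> is a hand-written stand-in (an assumption, not the learned <math> S_{O_t} </math>) that prefers memories sharing more words with the query and breaks ties in favour of the more recently written memory. Memories are represented here as hypothetical <code>(write_time, text)</code> pairs.

```python
from typing import Callable, List, Tuple

Memory = Tuple[int, str]  # (time the memory was written, text)


def select_supporting_memory(
    x: str,
    memories: List[Memory],
    s_ot: Callable[[str, Memory, Memory], float],
) -> Memory:
    """Keep a running winner y; replace it whenever s_ot(x, y, y') <= 0."""
    if not memories:
        raise ValueError("memory is empty")
    y = memories[0]
    for y_prime in memories[1:]:
        if s_ot(x, y, y_prime) <= 0:
            y = y_prime  # y' is preferred, so it becomes the new y
    return y


def toy_s_ot(x: str, y: Memory, y_prime: Memory) -> float:
    """Stand-in scorer: positive if y should be preferred over y'."""
    def overlap(m: Memory) -> int:
        return len(set(x.split()) & set(m[1].split()))

    diff = overlap(y) - overlap(y_prime)
    if diff != 0:
        return float(diff)
    # Tie on word overlap: prefer the memory written later.
    return 1.0 if y[0] > y_prime[0] else -1.0


story = [(0, "john went to the kitchen"),
         (1, "mary went to the garden"),
         (2, "john went to the garden")]
winner = select_supporting_memory("where is john", story, toy_s_ot)
print(winner)
```

A single pass over the memory suffices because each comparison only needs the current winner and the next candidate.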

Some further extensions concern allowing the model to deal with words not included in its vocabulary, and to more effectively take advantage of exact word matches between input queries and candidate supporting memories.

= Related work =

There are two general approaches to performing question answering that have been developed in the literature. The first makes use of a technique known as semantic parsing to map a query expressed in natural language onto a representation in some formal language that directly extracts information from some external memory such as a knowledge base.<ref>J. Berant, A. Chou, R. Frostig, and P. Liang. [http://cs.stanford.edu/~pliang/papers/freebase-emnlp2013.pdf "Semantic parsing on Freebase from question-answer pairs"]. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref>P. Liang, M. Jordan, and D. Klein. [http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00127 "Learning dependency-based compositional semantics"]. In Computational Linguistics, 39.2, p. 389-446.</ref> The second makes use of embedding methods to represent queries and candidate answers (typically extracted from a knowledge base) as high-dimensional vectors. Learning involves producing embeddings that place query vectors close to the vectors that correspond to their answers.<ref>Bordes, A., S. Chopra, and J. Weston. [http://www.thespermwhale.com/jaseweston/papers/fbqa.pdf "Question Answering with Subgraph Embeddings"]. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, (2014)</ref> Memory networks fall under the latter approach, and existing variants of this approach can be seen as special cases of the memory network architecture (e.g., <ref>Bordes, A., S. Chopra, and J. Weston. [http://www.thespermwhale.com/jaseweston/papers/fbqa.pdf "Question Answering with Subgraph Embeddings"]. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, (2014)</ref>).

= Experimental Results =

The authors first test a simple memory network (i.e. <math>\ k=1 </math>) on a large-scale question answering task involving a dataset consisting of 14 million subject-relation-object triplets. Each triplet is stored as an item in memory, and the answer to each question is a single entity (i.e. a subject or object) in one of these triplets. The results in the table below indicate that memory networks perform quite well on this task. Note that the memory network with 'bag of words' features includes the extension designed to indicate the presence of exact matches of words in a query and a candidate answer. This seems to contribute significantly to improved performance.

[[File:largescale.png | frame | centre | Results on a large-scale QA task.]]

Scoring a query against all 14 million candidate memories is slow, so the authors also test their hashing techniques and report the resulting speed-accuracy tradeoffs. As shown in the figure below, the use of cluster-based hashing results in a negligible drop in performance while considering only 1/80th of the complete set of items stored in memory.

[[File:hash.png | frame | centre | Memory hashing results on a large-scale QA task.]]

To test their model on more complex tasks that require chains of inference, the authors create a synthetic dataset consisting of approximately 7 thousand statements and 3 thousand questions focused on a toy environment comprising 4 people, 3 objects, and 5 rooms. Stories involving multiple statements describing actions performed by these people (e.g. moving an object from one room to another) are used to define the question answering tasks. Questions are focused on a single entity mentioned in a story, and the difficulty of the task is controlled by varying how long ago the most recent mention of this entity is in the story (e.g. the most recent statement in the story vs. the 5th most recent statement in the story). The figure at the top of this page gives an example of these tasks being performed.

In the results below, 'Difficulty 1' tasks are those in which the entity being asked about was mentioned in the most recent statement of the story, while 'Difficulty 5' tasks are those in which the entity being asked about was mentioned in one of the 5 most recent statements. Questions about an 'actor' concern a statement that mentions a person but not an object (e.g. "John went to the garden"). The questions may ask for the current location of the person (e.g. "Where is John?") or the previous location of the person (e.g. "Where was John before the garden?"); the column labelled "actor w/o before" in the figure below excludes this latter type of question. More complex questions involve asking about the object in a statement that mentions both a person and an object (e.g. given "John dropped the milk", the question might be "Where is the milk?"). Note that this task is more challenging, since it requires using multiple pieces of information (i.e. where John was, and what he did while he was there). Comparisons using RNNs and LSTMs are also reported, and for multiword responses as in the first figure above, an LSTM is used in place of <math>\ R </math>.

[[File:toyqa.png | frame | centre | Test accuracy on a simulated world QA task.]]

What is most notable about these results is that the inclusion of time features in the MemNN seems to be responsible for most of the improvement over RNNs and LSTMs.

= Discussion =

One potential concern about the memory network architecture is its generalizability to large values of <math>\ k </math>. To explain, each additional supporting memory increases the number of columns in the embedding matrices by the size of the model's vocabulary. This could become impractical for standard vocabularies with tens of thousands of terms.

A second concern is that the memory network, as described, is engineered to answer very particular kinds of questions (i.e. questions in which the order of events is important). To handle different kinds of questions, different features would likely need to be added (e.g. quantificational features to handle statements involving quantifiers such as 'some', 'many', etc.). This sort of ad-hoc design calls into question whether the architecture is capable of performing scalable, general-purpose question answering.

= Resources =

Memory Network implementations on [https://github.com/facebook/MemNN Github]

= Bibliography =

<references />

f15Stat946PaperSignUp

2015-11-26T23:10:12Z

Amirlk: /* Set A */

=[https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/listofpapers1.pdf List of Papers]=

= Record your contributions [https://docs.google.com/spreadsheets/d/1A_0ej3S6ns3bBMwWLS4pwA6zDLz_0Ivwujj-d1Gr9eo/edit?usp=sharing here:]=

Use the following notations:

S: You have written a summary on the paper

T: You had technical contribution on a paper (excluding the paper that you present from set A or critique from set B)

E: You had editorial contribution on a paper (excluding the paper that you present from set A or critique from set B)

[http://goo.gl/forms/RASFRZXoxJ Your feedback on presentations]

=Set A=
{| class="wikitable"

{| border="1" cellpadding="3"
|-
|width="60pt"|Date
|width="100pt"|Name
|width="30pt"|Paper number
|width="400pt"|Title
|width="30pt"|Link to the paper
|width="30pt"|Link to the summary
|-
|Oct 16 || Pascal Poupart || || Guest Lecturer||||
|-
|Oct 16 ||Pascal Poupart || ||Guest Lecturer ||||
|-
|Oct 23 || Ali Ghodsi || || Lecturer||||
|-
|Oct 23 || Ali Ghodsi || || Lecturer||||
|-
|Oct 23 ||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]
|-
|Oct 23 || Deepak Rishi || || Parsing natural scenes and natural language with recursive neural networks || [http://www-nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf Paper] || [[Parsing natural scenes and natural language with recursive neural networks | Summary]]
|-
|Oct 30 || Ali Ghodsi || || Lecturer||||
|-
|Oct 30 || Ali Ghodsi || || Lecturer||||
|-
|Oct 30 ||Rui Qiao || ||Going deeper with convolutions || [http://arxiv.org/pdf/1409.4842v1.pdf Paper]|| [[GoingDeeperWithConvolutions|Summary]]
|-
|Oct 30 ||Amirreza Lashkari|| 21 ||Overfeat: integrated recognition, localization and detection using convolutional networks. || [http://arxiv.org/pdf/1312.6229v4.pdf Paper]|| [[Overfeat: integrated recognition, localization and detection using convolutional networks|Summary]]
|-
|Nov 6 || Ali Ghodsi || || Lecturer||||
|-
|Nov 6 || Ali Ghodsi || || Lecturer||||
|-
|Nov 6 || Anthony Caterini ||56 || Human-level control through deep reinforcement learning ||[http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf Paper]|| [[Human-level control through deep reinforcement learning|Summary]]
|-
|Nov 6 || Sean Aubin || ||Learning Hierarchical Features for Scene Labeling ||[http://yann.lecun.com/exdb/publis/pdf/farabet-pami-13.pdf Paper]||[[Learning Hierarchical Features for Scene Labeling|Summary]]
|-
|Nov 13|| Mike Hynes || 12 ||Speech recognition with deep recurrent neural networks || [http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf Paper] || [[Graves et al., Speech recognition with deep recurrent neural networks|Summary]]
|-
|Nov 13 || Tim Tse || || Question Answering with Subgraph Embeddings || [http://arxiv.org/pdf/1406.3676v3.pdf Paper] || [[Question Answering with Subgraph Embeddings | Summary ]]
|-
|Nov 13 || Maysum Panju || ||Neural machine translation by jointly learning to align and translate ||[http://arxiv.org/pdf/1409.0473v6.pdf Paper] || [[Neural Machine Translation: Jointly Learning to Align and Translate|Summary]]
|-
|Nov 13 || Abdullah Rashwan || || Deep neural networks for acoustic modeling in speech recognition. ||[http://research.microsoft.com/pubs/171498/HintonDengYuEtAl-SPM2012.pdf paper]|| [[Deep neural networks for acoustic modeling in speech recognition| Summary]]
|-
|Nov 20 || Valerie Platsko || ||Natural language processing (almost) from scratch. ||[http://arxiv.org/pdf/1103.0398.pdf Paper]|| [[Natural language processing (almost) from scratch. | Summary]]
|-
|Nov 20 || Brent Komer || ||Show, Attend and Tell: Neural Image Caption Generation with Visual Attention || [http://arxiv.org/pdf/1502.03044v2.pdf Paper]||[[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention|Summary]]
|-
|Nov 20 || Luyao Ruan || || Dropout: A Simple Way to Prevent Neural Networks from Overfitting || [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf Paper]|| [[dropout | Summary]]
|-
|Nov 20 || Ali Mahdipour || || The human splicing code reveals new insights into the genetic determinants of disease ||[https://www.sciencemag.org/content/347/6218/1254806.full.pdf Paper] || [[Genetics | Summary]]
|-
|Nov 27 ||Mahmood Gohari || ||Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships ||[http://pubs.acs.org/doi/pdf/10.1021/ci500747n paper]|| [[Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships|Summary]]
|-
|Nov 27 || Derek Latremouille || ||Learning Fast Approximations of Sparse Coding || [http://yann.lecun.com/exdb/publis/pdf/gregor-icml-10.pdf Paper] ||[[Learning Fast Approximations of Sparse Coding|Summary]]
|-
|Nov 27 ||Xinran Liu || ||ImageNet Classification with Deep Convolutional Neural Networks ||[http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Paper]||[[ImageNet Classification with Deep Convolutional Neural Networks|Summary]]
|-
|TBA ||Ali Sarhadi|| ||Strategies for Training Large Scale Neural Network Language Models|| [http://www.msr-waypoint.com/pubs/175561/ASRU-2011.pdf Paper]||[[Strategies for Training Large Scale Neural Network Language Models|Summary]]
|-
|Nov 27 || Peter Blouw|| ||Memory Networks.|| [http://arxiv.org/pdf/1410.3916v10.pdf Paper]|| [[Memory Networks|Summary]]
|-
|Dec 4 || Chris Choi || || On the difficulty of training recurrent neural networks || [http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf Paper] || [[On the difficulty of training recurrent neural networks | Summary]]
|-
|Dec 4 || Fatemeh Karimi || ||MULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION||[http://arxiv.org/pdf/1412.7755v2.pdf Paper]||[[MULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION | Summary]]
|-
|Dec 4 || Jan Gosmann || || On the Number of Linear Regions of Deep Neural Networks || [http://arxiv.org/pdf/1402.1869v2.pdf Paper] || [[On the Number of Linear Regions of Deep Neural Networks | Summary]]
|-
|Dec 4 || Dylan Drover || 54 || Semi-supervised Learning with Deep Generative Models || [http://papers.nips.cc/paper/5352-semi-supervised-learning-with-deep-generative-models.pdf Paper] || [[Semi-supervised Learning with Deep Generative Models | Summary]]
|-
|}
|}

=Set B=

{| class="wikitable"

{| border="1" cellpadding="3"
|-
|width="100pt"|Name
|width="30pt"|Paper number
|width="400pt"|Title
|width="30pt"|Link to the paper
|width="30pt"|Link to the summary
|-
|Anthony Caterini ||1 ||The Manifold Tangent Classifier ||[http://papers.nips.cc/paper/4409-the-manifold-tangent-classifier.pdf Paper]|| [[The Manifold Tangent Classifier|Summary]]
|-
|Jan Gosmann ||2 || Neural Turing machines || [http://arxiv.org/abs/1410.5401 Paper] || [[Neural Turing Machines|Summary]]
|-
|Brent Komer ||3 || Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers || [http://arxiv.org/pdf/1202.2160v2.pdf Paper] || [[Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers Machines|Summary]]
|-
|Sean Aubin ||4 || Deep Sparse Rectifier Neural Networks || [http://jmlr.csail.mit.edu/proceedings/papers/v15/glorot11a/glorot11a.pdf Paper] || [[Deep Sparse Rectifier Neural Networks|Summary]]
|-
|Peter Blouw||5 || Generating text with recurrent neural networks || [http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf Paper] || [[Generating text with recurrent neural networks|Summary]]
|-
|Tim Tse||6 || From Machine Learning to Machine Reasoning || [http://research.microsoft.com/pubs/206768/mlj-2013.pdf Paper] || [[From Machine Learning to Machine Reasoning | Summary ]]
|-
|Rui Qiao|| 7 || Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation || [http://arxiv.org/pdf/1406.1078v3.pdf Paper] || [[Learning Phrase Representations|Summary]]
|-
|Fatemeh Karimi|| 8 || Very Deep Convolutional Networks for Large-Scale Image Recognition || [http://arxiv.org/pdf/1409.1556.pdf Paper] || [[Very Deep Convoloutional Networks for Large-Scale Image Recognition|Summary]]
|-
|Amirreza Lashkari|| 9 || Distributed Representations of Words and Phrases and their Compositionality || [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf Paper] || [[Distributed Representations of Words and Phrases and their Compositionality|Summary]]
|-
|Xinran Liu|| 10 || Joint training of a convolutional network and a graphical model for human pose estimation || [http://papers.nips.cc/paper/5573-joint-training-of-a-convolutional-network-and-a-graphical-model-for-human-pose-estimation.pdf Paper] || [[Joint training of a convolutional network and a graphical model for human pose estimation|Summary]]
|-
|Chris Choi|| 11 || Learning Long-Range Vision for Autonomous Off-Road Driving || [http://yann.lecun.com/exdb/publis/pdf/hadsell-jfr-09.pdf Paper] || [[Learning Long-Range Vision for Autonomous Off-Road Driving|Summary]]
|-
|Luyao Ruan|| 12 || Deep Learning of the tissue-regulated splicing code || [http://bioinformatics.oxfordjournals.org/content/30/12/i121.full.pdf+html Paper] || [[Deep Learning of the tissue-regulated splicing code| Summary]]
|-
|Abdullah Rashwan|| 13 || Deep Convolutional Neural Networks For LVCSR || [http://www.cs.toronto.edu/~asamir/papers/icassp13_cnn.pdf paper] || [[Deep Convolutional Neural Networks For LVCSR| Summary]]
|-
|Mahmood Gohari|| 14 || On using very large target vocabulary for neural machine translation || [http://arxiv.org/pdf/1412.2007v2.pdf paper] || [[On using very large target vocabulary for neural machine translation| Summary]]
|-
|Valerie Platsko|| 15 || Learning Convolutional Feature Hierarchies for Visual Recognition || [http://papers.nips.cc/paper/4133-learning-convolutional-feature-hierarchies-for-visual-recognition Paper] || [[Learning Convolutional Feature Hierarchies for Visual Recognition | Summary]]
|-
|Derek Latremouille|| 16 || The Wake-Sleep Algorithm for Unsupervised Neural Networks || [http://www.gatsby.ucl.ac.uk/~dayan/papers/hdfn95.pdf Paper] || [[The Wake-Sleep Algorithm for Unsupervised Neural Networks | Summary]]
|-
|Ri Wang|| 17 || Continuous space language models || [https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenSemester2_2009_10/sdarticle.pdf Paper] || [[Continuous space language models | Summary]]
|-
|Deepak Rishi|| 18 || Extracting and Composing Robust Features with Denoising Autoencoders || [http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf Paper] || [[Extracting and Composing Robust Features with Denoising Autoencoders | Summary]]
|-
|Maysum Panju|| 19 || A fast learning algorithm for deep belief nets || [https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf Paper] || [[A fast learning algorithm for deep belief nets | Summary]]
|-
|Michael Hynes|| 20 || The loss surfaces of multilayer networks || [http://arxiv.org/abs/1412.0233 Paper] || [[The loss surfaces of multilayer networks (Choromanska et al.) | Summary]]
|-
|Dylan Drover|| 21 || Deep Generative Stochastic Networks Trainable by Backprop || [http://jmlr.org/proceedings/papers/v32/bengio14.pdf Paper] || [[Deep Generative Stochastic Networks Trainable by Backprop| Summary]]

f15Stat946PaperSignUp

2015-11-26T23:09:33Z

Amirlk: /* Set A */

=[https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/listofpapers1.pdf List of Papers]=

= Record your contributions [https://docs.google.com/spreadsheets/d/1A_0ej3S6ns3bBMwWLS4pwA6zDLz_0Ivwujj-d1Gr9eo/edit?usp=sharing here:]=

Use the following notations:

S: You have written a summary on the paper

T: You had technical contribution on a paper (excluding the paper that you present from set A or critique from set B)

E: You had editorial contribution on a paper (excluding the paper that you present from set A or critique from set B)

[http://goo.gl/forms/RASFRZXoxJ Your feedback on presentations]

=Set A=
{| class="wikitable"

{| border="1" cellpadding="3"
|-
|width="60pt"|Date
|width="100pt"|Name
|width="30pt"|Paper number
|width="400pt"|Title
|width="30pt"|Link to the paper
|width="30pt"|Link to the summary
|-
|Oct 16 || Pascal Poupart || || Guest Lecturer||||
|-
|Oct 16 ||Pascal Poupart || ||Guest Lecturer ||||
|-
|Oct 23 || Ali Ghodsi || || Lecturer||||
|-
|Oct 23 || Ali Ghodsi || || Lecturer||||
|-
|Oct 23 ||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]
|-
|Oct 23 || Deepak Rishi || || Parsing natural scenes and natural language with recursive neural networks || [http://www-nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf Paper] || [[Parsing natural scenes and natural language with recursive neural networks | Summary]]
|-
|Oct 30 || Ali Ghodsi || || Lecturer||||
|-
|Oct 30 || Ali Ghodsi || || Lecturer||||
|-
|Oct 30 ||Rui Qiao || ||Going deeper with convolutions || [http://arxiv.org/pdf/1409.4842v1.pdf Paper]|| [[GoingDeeperWithConvolutions|Summary]]
|-
|Oct 30 ||Amirreza Lashkari|| 21 ||Overfeat: integrated recognition, localization and detection using convolutional networks. || [http://arxiv.org/pdf/1312.6229v4.pdf Paper]|| [[Overfeat: integrated recognition, localization and detection using convolutional networks|Summary]]
|-
|Nov 6 || Ali Ghodsi || || Lecturer||||
|-
|Nov 6 || Ali Ghodsi || || Lecturer||||
|-
|Nov 6 || Anthony Caterini ||56 || Human-level control through deep reinforcement learning ||[http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf Paper]|| [[Human-level control through deep reinforcement learning|Summary]]
|-
|Nov 6 || Sean Aubin || ||Learning Hierarchical Features for Scene Labeling ||[http://yann.lecun.com/exdb/publis/pdf/farabet-pami-13.pdf Paper]||[[Learning Hierarchical Features for Scene Labeling|Summary]]
|-
|Nov 13|| Mike Hynes || 12 ||Speech recognition with deep recurrent neural networks || [http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf Paper] || [[Graves et al., Speech recognition with deep recurrent neural networks|Summary]]
|-
|Nov 13 || Tim Tse || || Question Answering with Subgraph Embeddings || [http://arxiv.org/pdf/1406.3676v3.pdf Paper] || [[Question Answering with Subgraph Embeddings | Summary ]]
|-
|Nov 13 || Maysum Panju || ||Neural machine translation by jointly learning to align and translate ||[http://arxiv.org/pdf/1409.0473v6.pdf Paper] || [[Neural Machine Translation: Jointly Learning to Align and Translate|Summary]]
|-
|Nov 13 || Abdullah Rashwan || || Deep neural networks for acoustic modeling in speech recognition. ||[http://research.microsoft.com/pubs/171498/HintonDengYuEtAl-SPM2012.pdf paper]|| [[Deep neural networks for acoustic modeling in speech recognition| Summary]]
|-
|Nov 20 || Valerie Platsko || ||Natural language processing (almost) from scratch. ||[http://arxiv.org/pdf/1103.0398.pdf Paper]|| [[Natural language processing (almost) from scratch. | Summary]]
|-
|Nov 20 || Brent Komer || ||Show, Attend and Tell: Neural Image Caption Generation with Visual Attention || [http://arxiv.org/pdf/1502.03044v2.pdf Paper]||[[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention|Summary]]
|-
|Nov 20 || Luyao Ruan || || Dropout: A Simple Way to Prevent Neural Networks from Overfitting || [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf Paper]|| [[dropout | Summary]]
|-
|Nov 20 || Ali Mahdipour || || The human splicing code reveals new insights into the genetic determinants of disease ||[https://www.sciencemag.org/content/347/6218/1254806.full.pdf Paper] || [[Genetics | Summary]]
|-
|Nov 27 ||Mahmood Gohari || ||Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships ||[http://pubs.acs.org/doi/pdf/10.1021/ci500747n paper]|| [[Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships|Summary]]
|-
|Nov 27 || Derek Latremouille || ||Learning Fast Approximations of Sparse Coding || [http://yann.lecun.com/exdb/publis/pdf/gregor-icml-10.pdf Paper] ||[[Learning Fast Approximations of Sparse Coding|Summary]]
|-
|Nov 27 ||Xinran Liu || ||ImageNet Classification with Deep Convolutional Neural Networks ||[http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Paper]||[[ImageNet Classification with Deep Convolutional Neural Networks|Summary]]
|-
|TBA ||Ali Sarhadi|| ||Strategies for Training Large Scale Neural Network Language Models|| [http://www.msr-waypoint.com/pubs/175561/ASRU-2011.pdf Paper]||[[Strategies for Training Large Scale Neural Network Language Models|Summary]]
|-
|Nov 27 || Peter Blouw|| ||Memory Networks.|| [http://arxiv.org/pdf/1410.3916v10.pdf Paper]|| [[Memory Networks|Summary]]
|-
|Dec 4 || Chris Choi || || On the difficulty of training recurrent neural networks || [http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf Paper] || [[On the difficulty of training recurrent neural networks | Summary]]
|-
|Dec 4 || Fatemeh Karimi || ||MULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION||[http://arxiv.org/pdf/1412.7755v2.pdf Paper]||[[MULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION | Summary]]
|-
|Dec 4 || Jan Gosmann || || On the Number of Linear Regions of Deep Neural Networks || [http://arxiv.org/abs/1402.1869 Paper] || [[On the Number of Linear Regions of Deep Neural Networks | Summary]]
|-
|Dec 4 || Dylan Drover || 54 || Semi-supervised Learning with Deep Generative Models || [http://papers.nips.cc/paper/5352-semi-supervised-learning-with-deep-generative-models.pdf Paper] || [[Semi-supervised Learning with Deep Generative Models | Summary]]
|-
|}
|}

=Set B=

{| class="wikitable"

{| border="1" cellpadding="3"
|-
|width="100pt"|Name
|width="30pt"|Paper number
|width="400pt"|Title
|width="30pt"|Link to the paper
|width="30pt"|Link to the summary
|-
|Anthony Caterini ||1 ||The Manifold Tangent Classifier ||[http://papers.nips.cc/paper/4409-the-manifold-tangent-classifier.pdf Paper]|| [[The Manifold Tangent Classifier|Summary]]
|-
|Jan Gosmann ||2 || Neural Turing machines || [http://arxiv.org/abs/1410.5401 Paper] || [[Neural Turing Machines|Summary]]
|-
|Brent Komer ||3 || Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers || [http://arxiv.org/pdf/1202.2160v2.pdf Paper] || [[Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers Machines|Summary]]
|-
|Sean Aubin ||4 || Deep Sparse Rectifier Neural Networks || [http://jmlr.csail.mit.edu/proceedings/papers/v15/glorot11a/glorot11a.pdf Paper] || [[Deep Sparse Rectifier Neural Networks|Summary]]
|-
|Peter Blouw||5 || Generating text with recurrent neural networks || [http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf Paper] || [[Generating text with recurrent neural networks|Summary]]
|-
|Tim Tse||6 || From Machine Learning to Machine Reasoning || [http://research.microsoft.com/pubs/206768/mlj-2013.pdf Paper] || [[From Machine Learning to Machine Reasoning | Summary ]]
|-
|Rui Qiao|| 7 || Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation || [http://arxiv.org/pdf/1406.1078v3.pdf Paper] || [[Learning Phrase Representations|Summary]]
|-
|Fatemeh Karimi|| 8 || Very Deep Convolutional Networks for Large-Scale Image Recognition || [http://arxiv.org/pdf/1409.1556.pdf Paper] || [[Very Deep Convoloutional Networks for Large-Scale Image Recognition|Summary]]
|-
|Amirreza Lashkari|| 9 || Distributed Representations of Words and Phrases and their Compositionality || [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf Paper] || [[Distributed Representations of Words and Phrases and their Compositionality|Summary]]
|-
|Xinran Liu|| 10 || Joint training of a convolutional network and a graphical model for human pose estimation || [http://papers.nips.cc/paper/5573-joint-training-of-a-convolutional-network-and-a-graphical-model-for-human-pose-estimation.pdf Paper] || [[Joint training of a convolutional network and a graphical model for human pose estimation|Summary]]
|-
|Chris Choi|| 11 || Learning Long-Range Vision for Autonomous Off-Road Driving || [http://yann.lecun.com/exdb/publis/pdf/hadsell-jfr-09.pdf Paper] || [[Learning Long-Range Vision for Autonomous Off-Road Driving|Summary]]
|-
|Luyao Ruan|| 12 || Deep Learning of the tissue-regulated splicing code || [http://bioinformatics.oxfordjournals.org/content/30/12/i121.full.pdf+html Paper] || [[Deep Learning of the tissue-regulated splicing code| Summary]]
|-
|Abdullah Rashwan|| 13 || Deep Convolutional Neural Networks For LVCSR || [http://www.cs.toronto.edu/~asamir/papers/icassp13_cnn.pdf paper] || [[Deep Convolutional Neural Networks For LVCSR| Summary]]
|-
|Mahmood Gohari|| 14 || On using very large target vocabulary for neural machine translation || [http://arxiv.org/pdf/1412.2007v2.pdf paper] || [[On using very large target vocabulary for neural machine translation| Summary]]
|-
|Valerie Platsko|| 15 || Learning Convolutional Feature Hierarchies for Visual Recognition || [http://papers.nips.cc/paper/4133-learning-convolutional-feature-hierarchies-for-visual-recognition Paper] || [[Learning Convolutional Feature Hierarchies for Visual Recognition | Summary]]
|-
|Derek Latremouille|| 16 || The Wake-Sleep Algorithm for Unsupervised Neural Networks || [http://www.gatsby.ucl.ac.uk/~dayan/papers/hdfn95.pdf Paper] || [[The Wake-Sleep Algorithm for Unsupervised Neural Networks | Summary]]
|-
|Ri Wang|| 17 || Continuous space language models || [https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenSemester2_2009_10/sdarticle.pdf Paper] || [[Continuous space language models | Summary]]
|-
|Deepak Rishi|| 18 || Extracting and Composing Robust Features with Denoising Autoencoders || [http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf Paper] || [[Extracting and Composing Robust Features with Denoising Autoencoders | Summary]]
|-
|Maysum Panju|| 19 || A fast learning algorithm for deep belief nets || [https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf Paper] || [[A fast learning algorithm for deep belief nets | Summary]]
|-
|Michael Hynes|| 20 || The loss surfaces of multilayer networks || [http://arxiv.org/abs/1412.0233 Paper] || [[The loss surfaces of multilayer networks (Choromanska et al.) | Summary]]
|-
|Dylan Drover|| 21 || Deep Generative Stochastic Networks Trainable by Backprop || [http://jmlr.org/proceedings/papers/v32/bengio14.pdf Paper] || [[Deep Generative Stochastic Networks Trainable by Backprop| Summary]]
|}
|}

f15Stat946PaperSignUp

2015-11-26T23:08:37Z

Amirlk: /* Set A */

=[https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/listofpapers1.pdf List of Papers]=

= Record your contributions [https://docs.google.com/spreadsheets/d/1A_0ej3S6ns3bBMwWLS4pwA6zDLz_0Ivwujj-d1Gr9eo/edit?usp=sharing here:]=

Use the following notations:

S: You have written a summary on the paper

T: You had technical contribution on a paper (excluding the paper that you present from set A or critique from set B)

E: You had editorial contribution on a paper (excluding the paper that you present from set A or critique from set B)

[http://goo.gl/forms/RASFRZXoxJ Your feedback on presentations]

=Set A=
{| class="wikitable"

{| border="1" cellpadding="3"
|-
|width="60pt"|Date
|width="100pt"|Name
|width="30pt"|Paper number
|width="400pt"|Title
|width="30pt"|Link to the paper
|width="30pt"|Link to the summary
|-
|Oct 16 || Pascal Poupart || || Guest Lecturer||||
|-
|Oct 16 ||Pascal Poupart || ||Guest Lecturer ||||
|-
|Oct 23 || Ali Ghodsi || || Lecturer||||
|-
|Oct 23 || Ali Ghodsi || || Lecturer||||
|-
|Oct 23 ||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]
|-
|Oct 23 || Deepak Rishi || || Parsing natural scenes and natural language with recursive neural networks || [http://www-nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf Paper] || [[Parsing natural scenes and natural language with recursive neural networks | Summary]]
|-
|Oct 30 || Ali Ghodsi || || Lecturer||||
|-
|Oct 30 || Ali Ghodsi || || Lecturer||||
|-
|Oct 30 ||Rui Qiao || ||Going deeper with convolutions || [http://arxiv.org/pdf/1409.4842v1.pdf Paper]|| [[GoingDeeperWithConvolutions|Summary]]
|-
|Oct 30 ||Amirreza Lashkari|| 21 ||Overfeat: integrated recognition, localization and detection using convolutional networks. || [http://arxiv.org/pdf/1312.6229v4.pdf Paper]|| [[Overfeat: integrated recognition, localization and detection using convolutional networks|Summary]]
|-
|Nov 6 || Ali Ghodsi || || Lecturer||||
|-
|Nov 6 || Ali Ghodsi || || Lecturer||||
|-
|Nov 6 || Anthony Caterini ||56 || Human-level control through deep reinforcement learning ||[http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf Paper]|| [[Human-level control through deep reinforcement learning|Summary]]
|-
|Nov 6 || Sean Aubin || ||Learning Hierarchical Features for Scene Labeling ||[http://yann.lecun.com/exdb/publis/pdf/farabet-pami-13.pdf Paper]||[[Learning Hierarchical Features for Scene Labeling|Summary]]
|-
|Nov 13|| Mike Hynes || 12 ||Speech recognition with deep recurrent neural networks || [http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf Paper] || [[Graves et al., Speech recognition with deep recurrent neural networks|Summary]]
|-
|Nov 13 || Tim Tse || || Question Answering with Subgraph Embeddings || [http://arxiv.org/pdf/1406.3676v3.pdf Paper] || [[Question Answering with Subgraph Embeddings | Summary ]]
|-
|Nov 13 || Maysum Panju || ||Neural machine translation by jointly learning to align and translate ||[http://arxiv.org/pdf/1409.0473v6.pdf Paper] || [[Neural Machine Translation: Jointly Learning to Align and Translate|Summary]]
|-
|Nov 13 || Abdullah Rashwan || || Deep neural networks for acoustic modeling in speech recognition. ||[http://research.microsoft.com/pubs/171498/HintonDengYuEtAl-SPM2012.pdf paper]|| [[Deep neural networks for acoustic modeling in speech recognition| Summary]]
|-
|Nov 20 || Valerie Platsko || ||Natural language processing (almost) from scratch. ||[http://arxiv.org/pdf/1103.0398.pdf Paper]|| [[Natural language processing (almost) from scratch. | Summary]]
|-
|Nov 20 || Brent Komer || ||Show, Attend and Tell: Neural Image Caption Generation with Visual Attention || [http://arxiv.org/pdf/1502.03044v2.pdf Paper]||[[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention|Summary]]
|-
|Nov 20 || Luyao Ruan || || Dropout: A Simple Way to Prevent Neural Networks from Overfitting || [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf Paper]|| [[dropout | Summary]]
|-
|Nov 20 || Ali Mahdipour || || The human splicing code reveals new insights into the genetic determinants of disease ||[https://www.sciencemag.org/content/347/6218/1254806.full.pdf Paper] || [[Genetics | Summary]]
|-
|Nov 27 ||Mahmood Gohari || ||Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships ||[http://pubs.acs.org/doi/pdf/10.1021/ci500747n paper]|| [[Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships|Summary]]
|-
|Nov 27 || Derek Latremouille || ||Learning Fast Approximations of Sparse Coding || [http://yann.lecun.com/exdb/publis/pdf/gregor-icml-10.pdf Paper] ||[[Learning Fast Approximations of Sparse Coding|Summary]]
|-
|Nov 27 ||Xinran Liu || ||ImageNet Classification with Deep Convolutional Neural Networks ||[http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Paper]||[[ImageNet Classification with Deep Convolutional Neural Networks|Summary]]
|-
|TBA ||Ali Sarhadi|| ||Strategies for Training Large Scale Neural Network Language Models|| [http://www.msr-waypoint.com/pubs/175561/ASRU-2011.pdf Paper]||[[Strategies for Training Large Scale Neural Network Language Models|Summary]]
|-
|Nov 27 || Peter Blouw|| ||Memory Networks.|| [http://arxiv.org/abs/1410.3916 Paper]|| [[Memory Networks|Summary]]
|-
|Dec 4 || Chris Choi || || On the difficulty of training recurrent neural networks || [http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf Paper] || [[On the difficulty of training recurrent neural networks | Summary]]
|-
|Dec 4 || Fatemeh Karimi || ||MULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION||[http://arxiv.org/pdf/1412.7755v2.pdf Paper]||[[MULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION | Summary]]
|-
|Dec 4 || Jan Gosmann || || On the Number of Linear Regions of Deep Neural Networks || [http://arxiv.org/abs/1402.1869 Paper] || [[On the Number of Linear Regions of Deep Neural Networks | Summary]]
|-
|Dec 4 || Dylan Drover || 54 || Semi-supervised Learning with Deep Generative Models || [http://papers.nips.cc/paper/5352-semi-supervised-learning-with-deep-generative-models.pdf Paper] || [[Semi-supervised Learning with Deep Generative Models | Summary]]
|-
|}
|}

=Set B=

{| class="wikitable"

{| border="1" cellpadding="3"
|-
|width="100pt"|Name
|width="30pt"|Paper number
|width="400pt"|Title
|width="30pt"|Link to the paper
|width="30pt"|Link to the summary
|-
|Anthony Caterini ||1 ||The Manifold Tangent Classifier ||[http://papers.nips.cc/paper/4409-the-manifold-tangent-classifier.pdf Paper]|| [[The Manifold Tangent Classifier|Summary]]
|-
|Jan Gosmann ||2 || Neural Turing machines || [http://arxiv.org/abs/1410.5401 Paper] || [[Neural Turing Machines|Summary]]
|-
|Brent Komer ||3 || Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers || [http://arxiv.org/pdf/1202.2160v2.pdf Paper] || [[Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers Machines|Summary]]
|-
|Sean Aubin ||4 || Deep Sparse Rectifier Neural Networks || [http://jmlr.csail.mit.edu/proceedings/papers/v15/glorot11a/glorot11a.pdf Paper] || [[Deep Sparse Rectifier Neural Networks|Summary]]
|-
|Peter Blouw||5 || Generating text with recurrent neural networks || [http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf Paper] || [[Generating text with recurrent neural networks|Summary]]
|-
|Tim Tse||6 || From Machine Learning to Machine Reasoning || [http://research.microsoft.com/pubs/206768/mlj-2013.pdf Paper] || [[From Machine Learning to Machine Reasoning | Summary ]]
|-
|Rui Qiao|| 7 || Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation || [http://arxiv.org/pdf/1406.1078v3.pdf Paper] || [[Learning Phrase Representations|Summary]]
|-
|Fatemeh Karimi|| 8 || Very Deep Convolutional Networks for Large-Scale Image Recognition || [http://arxiv.org/pdf/1409.1556.pdf Paper] || [[Very Deep Convoloutional Networks for Large-Scale Image Recognition|Summary]]
|-
|Amirreza Lashkari|| 9 || Distributed Representations of Words and Phrases and their Compositionality || [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf Paper] || [[Distributed Representations of Words and Phrases and their Compositionality|Summary]]
|-
|Xinran Liu|| 10 || Joint training of a convolutional network and a graphical model for human pose estimation || [http://papers.nips.cc/paper/5573-joint-training-of-a-convolutional-network-and-a-graphical-model-for-human-pose-estimation.pdf Paper] || [[Joint training of a convolutional network and a graphical model for human pose estimation|Summary]]
|-
|Chris Choi|| 11 || Learning Long-Range Vision for Autonomous Off-Road Driving || [http://yann.lecun.com/exdb/publis/pdf/hadsell-jfr-09.pdf Paper] || [[Learning Long-Range Vision for Autonomous Off-Road Driving|Summary]]
|-
|Luyao Ruan|| 12 || Deep Learning of the tissue-regulated splicing code || [http://bioinformatics.oxfordjournals.org/content/30/12/i121.full.pdf+html Paper] || [[Deep Learning of the tissue-regulated splicing code| Summary]]
|-
|Abdullah Rashwan|| 13 || Deep Convolutional Neural Networks For LVCSR || [http://www.cs.toronto.edu/~asamir/papers/icassp13_cnn.pdf paper] || [[Deep Convolutional Neural Networks For LVCSR| Summary]]
|-
|Mahmood Gohari|| 14 || On using very large target vocabulary for neural machine translation || [http://arxiv.org/pdf/1412.2007v2.pdf paper] || [[On using very large target vocabulary for neural machine translation| Summary]]
|-
|Valerie Platsko|| 15 || Learning Convolutional Feature Hierarchies for Visual Recognition || [http://papers.nips.cc/paper/4133-learning-convolutional-feature-hierarchies-for-visual-recognition Paper] || [[Learning Convolutional Feature Hierarchies for Visual Recognition | Summary]]
|-
|Derek Latremouille|| 16 || The Wake-Sleep Algorithm for Unsupervised Neural Networks || [http://www.gatsby.ucl.ac.uk/~dayan/papers/hdfn95.pdf Paper] || [[The Wake-Sleep Algorithm for Unsupervised Neural Networks | Summary]]
|-
|Ri Wang|| 17 || Continuous space language models || [https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenSemester2_2009_10/sdarticle.pdf Paper] || [[Continuous space language models | Summary]]
|-
|Deepak Rishi|| 18 || Extracting and Composing Robust Features with Denoising Autoencoders || [http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf Paper] || [[Extracting and Composing Robust Features with Denoising Autoencoders | Summary]]
|-
|Maysum Panju|| 19 || A fast learning algorithm for deep belief nets || [https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf Paper] || [[A fast learning algorithm for deep belief nets | Summary]]
|-
|Michael Hynes|| 20 || The loss surfaces of multilayer networks || [http://arxiv.org/abs/1412.0233 Paper] || [[The loss surfaces of multilayer networks (Choromanska et al.) | Summary]]
|-
|Dylan Drover|| 21 || Deep Generative Stochastic Networks Trainable by Backprop || [http://jmlr.org/proceedings/papers/v32/bengio14.pdf Paper] || [[Deep Generative Stochastic Networks Trainable by Backprop| Summary]]
|}
|}

strategies for Training Large Scale Neural Network Language Models

2015-11-26T23:01:15Z

Amirlk: /* References */

== Introduction ==
Statistical models of natural language are a key part of many systems today. The best-known applications are automatic speech recognition, machine translation, and optical character recognition. In recent years, language models based on recurrent neural networks and maximum entropy have gained a lot of attention and are considered the most successful models. However, their main drawback is their very high computational complexity.
This paper introduces a hash-based implementation of a class-based maximum entropy model that makes it easy to control the trade-off between memory complexity and computational complexity.

== Motivation ==
Because computational complexity is an issue for many types of deep neural network language models, this study briefly presents simple techniques that reduce the computational cost of the training and test phases. The study also mentions that training neural network language models jointly with maximum entropy models leads to better performance in terms of computational complexity.
The maximum entropy model can be viewed as a neural network model with no hidden layer, in which the input layer is connected directly to the output layer.

== Model description ==
The main difference between a neural network language model (NN LM) and a maximum entropy (ME) model is that the features of the NN LM are learned automatically as a function of the history. In addition, the usual features of the ME model are binary, while NN models use continuous-valued features. After the model is trained, similar words have similar low-dimensional representations.

== Recurrent Neural Network Models ==
The standard neural network language model has a form very similar to that of the maximum entropy (ME) model. The main difference is that its features are learned automatically as a function of the history. Also, the usual features of the ME model are binary, while NN models use continuous-valued features. The NN LM can be described as:

<math>P(w|h)=\frac{e^{\sum_{i=1}^N \lambda_i f_i(s,w)}}{\sum_{w'} e^{\sum_{i=1}^N \lambda_i f_i(s,w')}}</math>

where f is a set of features, λ is a set of weights, and s is the state of the hidden layer. In the feedforward NN LM architecture, the state of the hidden layer depends on a projection layer, which is formed by projecting the N − 1 most recent words into a low-dimensional space. After the model is trained, similar words have similar low-dimensional representations. Alternatively, the state of the hidden layer can depend on the most recent word and on the state at the previous time step, so that time is not represented explicitly. This recurrence allows the hidden layer to hold a low-dimensional representation of the entire history (in other words, it provides the model with a memory). This architecture is called the recurrent neural network based language model (RNN LM)<ref name=MiT1>
Mikolov, Tomas, ''et al.'' [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5947611 "Extensions of recurrent neural network language model."] in Acoustics, Speech and Signal Processing (ICASSP), (2011).
</ref> <ref name=MiT2> Mikolov, Tomas, ''et al.'' [http://www.fit.vutbr.cz/~imikolov/rnnlm/is2011_emp.pdf "Empirical evaluation and combination of advanced language modeling techniques."] in Proceedings of Interspeech, (2011). </ref>.

[[File:Fig.jpg |center]]
Feedforward neural network 4-gram model (on the left) and Recurrent neural network language model (on the right)
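The recurrence described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy weights, and the sigmoid activation are assumptions made only for illustration. The output scores play the role of the weighted feature sum in the formula for P(w|h).

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities (a normalized exponential)."""
    m = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_lm_step(word_id, prev_state, w_in, w_rec, w_out):
    """One time step of a toy recurrent LM; all weights are nested lists.

    w_in:  H x vocab  (current word -> hidden)
    w_rec: H x H      (previous hidden state -> hidden)
    w_out: vocab x H  (hidden -> output scores)
    """
    size = len(prev_state)
    state = []
    for j in range(size):
        # The new state sees only the current word and the previous state.
        a = w_in[j][word_id] + sum(w_rec[j][k] * prev_state[k] for k in range(size))
        state.append(1.0 / (1.0 + math.exp(-a)))  # sigmoid (assumed activation)
    scores = [sum(row[j] * state[j] for j in range(size)) for row in w_out]
    return state, softmax(scores)

# Hypothetical toy weights: hidden size 2, vocabulary of 3 words.
w_in = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]]
w_rec = [[0.2, 0.1], [-0.3, 0.4]]
w_out = [[0.5, -0.1], [0.2, 0.3], [-0.6, 0.1]]

state = [0.0, 0.0]
for word_id in [0, 2, 1]:  # feed a three-word history one word at a time
    state, probs = rnn_lm_step(word_id, state, w_in, w_rec, w_out)
```

After the loop, `state` summarizes the whole three-word history even though each step looked only at one word and the previous state, and `probs` is the distribution over the next word.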

== Maximum Entropy model ==
A maximum entropy model has the following form:

<math>P(w|h)=\frac{e^{\sum_{i=1}^N \lambda_i f_i(h,w)}}{\sum_{w'} e^{\sum_{i=1}^N \lambda_i f_i(h,w')}}</math>

where h is the history and f is the set of features, which in the maximum entropy case are n-grams. The features are usually chosen manually, and this choice significantly affects the overall performance of the model. Training a maximum entropy model consists of learning the set of weights λ.
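As a rough illustration of the formula above, the following sketch treats each n-gram (context, word) pair as a binary feature whose weight is looked up in a dictionary. The helper names, the toy vocabulary, and the weight values are hypothetical and are not taken from the paper.

```python
import math

def ngram_features(history, word, max_order=3):
    """Active binary features: one (context, word) pair per n-gram order."""
    feats = []
    for n in range(max_order):  # n = number of context words used
        if n <= len(history):
            context = tuple(history[len(history) - n:])
            feats.append((context, word))
    return feats

def maxent_prob(word, history, vocab, weights, max_order=3):
    """P(word | history): exponentiated sum of active weights, normalized over vocab."""
    scores = {
        w: sum(weights.get(f, 0.0) for f in ngram_features(history, w, max_order))
        for w in vocab
    }
    m = max(scores.values())  # for numerical stability
    z = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[word] - m) / z

# Hypothetical toy model.
vocab = ["sat", "ran", "the"]
weights = {
    ((), "sat"): 0.2,              # unigram feature
    (("cat",), "sat"): 1.5,        # bigram feature
    (("the", "cat"), "ran"): 0.7,  # trigram feature
}
p_sat = maxent_prob("sat", ["the", "cat"], vocab, weights)
```

Because the features are binary, the inner sum reduces to adding up the weights of the n-grams that actually occur, which is why no hidden layer is needed.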

== Computational complexity ==
The training time of an N-gram feedforward neural network language model is proportional to:

<math>I \times W \times ((N-1) \times D \times H + H \times V)</math>

where I is the number of training epochs before convergence is achieved, W is the number of tokens in the training set, N is the N-gram order, D is the dimensionality of words in the low-dimensional space, H is the size of the hidden layer, and V is the size of the vocabulary.

The computational complexity of the recurrent NN LM is:

<math>I \times W \times (H \times H + H \times V)</math>

As the order N increases, the complexity of the feedforward architecture grows linearly, while that of the recurrent architecture remains constant.

The computational complexity of the maximum entropy model is:

<math>I \times W \times (N \times V)</math>
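To make the three expressions concrete, the sketch below evaluates them for a few N-gram orders. The sizes used are hypothetical values chosen only to illustrate the formulas; they are not the settings reported in the paper.

```python
def feedforward_cost(I, W, N, D, H, V):
    """Feedforward NN LM: I * W * ((N - 1) * D * H + H * V)."""
    return I * W * ((N - 1) * D * H + H * V)

def recurrent_cost(I, W, H, V):
    """Recurrent NN LM: I * W * (H * H + H * V)."""
    return I * W * (H * H + H * V)

def maxent_cost(I, W, N, V):
    """Maximum entropy model: I * W * (N * V)."""
    return I * W * (N * V)

# Hypothetical sizes, for illustration only.
I, W, D, H, V = 10, 1_000_000, 100, 200, 100_000

for N in (3, 4, 5):
    print(
        N,
        feedforward_cost(I, W, N, D, H, V),
        recurrent_cost(I, W, H, V),
        maxent_cost(I, W, N, V),
    )
```

With these sizes the H × V term accounts for almost all of the per-token cost of both neural models, which is why the techniques that follow concentrate on the number of epochs I, the number of tokens W, the vocabulary V, and the hidden layer size H.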

The simple techniques used in the present study to reduce the computational complexity are:

== A. Reduction of training epochs ==
Training is usually performed by stochastic gradient descent and takes 10-50 training epochs to converge. This study demonstrates that good performance can be achieved with as few as 7 training epochs rather than the usual 10-50. This is achieved by sorting the training data by complexity.

== B. Reduction of number of training tokens ==
In the vast majority of cases, NN LMs for large-vocabulary continuous speech recognition (LVCSR) tasks are trained on 5-30M tokens. Although the subsampling trick can be used to claim that the neural network model has seen all the training data at least once, simple subsampling techniques lead to severe performance degradation compared with a model trained on all the data.

In this study, NN LMs are trained on only a small part of the data (the in-domain corpora) plus a randomly subsampled part of the out-of-domain data.
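A minimal sketch of this kind of training-set construction is shown below. The function name, the `fraction` parameter, and the fixed seed are assumptions for illustration; the summary does not state how much out-of-domain data is kept.

```python
import random

def build_training_set(in_domain, out_of_domain, fraction, seed=0):
    """Keep all in-domain sentences plus a random fraction of out-of-domain ones."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    k = int(len(out_of_domain) * fraction)
    return list(in_domain) + rng.sample(list(out_of_domain), k)

# Hypothetical corpora.
in_domain = ["in-domain sentence 1", "in-domain sentence 2"]
out_of_domain = ["out-of-domain sentence %d" % i for i in range(100)]
training_set = build_training_set(in_domain, out_of_domain, fraction=0.25)
```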

== C. Reduction of vocabulary ==
One technique is to compute the probability distribution with the neural network only for the top M words and to use backoff n-gram probabilities for the rest of the words. The list of top M words is called a shortlist. However, it has been shown that this technique causes severe performance degradation for small values of M, and even with M = 2000 the complexity of the H × V term is still significant.
Goodman's trick can instead be used to speed up the model with respect to the vocabulary. Each word in the vocabulary is assigned to a single class, and the probability distribution over classes is computed first; the distribution over words is then computed only within the relevant class. As the number of classes can be very small (several hundred), this is a more effective solution than using shortlists, and the performance degradation is smaller.
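The class-based factorization can be sketched as follows, under the assumption that each word belongs to exactly one class, so that P(w|h) = P(class(w)|h) × P(w|class(w), h). All names and the toy weights are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)  # for numerical stability
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def class_factored_prob(word, hidden, word_to_class, class_members, w_class, w_word):
    """P(word | h) = P(class(word) | h) * P(word | class(word), h)."""
    c = word_to_class[word]
    # Step 1: distribution over the (few) classes.
    p_class = softmax([dot(row, hidden) for row in w_class])[c]
    # Step 2: distribution over the words of that one class only.
    members = class_members[c]
    p_in_class = softmax([dot(w_word[m], hidden) for m in members])
    return p_class * p_in_class[members.index(word)]

# Hypothetical toy setup: 4 words, 2 classes, hidden size 2.
word_to_class = {0: 0, 1: 0, 2: 1, 3: 1}
class_members = {0: [0, 1], 1: [2, 3]}
hidden = [0.5, -1.0]
w_class = [[0.3, 0.1], [-0.2, 0.4]]
w_word = [[0.1, 0.2], [0.4, -0.3], [-0.5, 0.6], [0.2, 0.2]]

p = class_factored_prob(2, hidden, word_to_class, class_members, w_class, w_word)
```

Only the class scores and the scores of one class's members are evaluated for each prediction, so the output cost falls from roughly H × V to H × (number of classes + size of the class), while the word probabilities still sum to one.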

== D. Reduction of size of the hidden layer ==
Another way to reduce the H × V term is to choose a small value of H. Techniques that combine the NN model with other methods are introduced to help choose a suitable size for the hidden layer.

== E. Parallelization ==
Because the state of the hidden layer depends on the previous state, recurrent networks are hard to parallelize. One option is to parallelize only the computation between the hidden and output layers. Another is to parallelize the whole network by training from multiple points in the training data at the same time. However, parallelization is a highly architecture-specific optimization problem, so the present study instead addresses computational complexity with algorithmic approaches.

== Automatic data selection and sorting ==
The full training set is divided into 560 equally sized chunks, and the perplexity on the development data is computed for each chunk. Chunks with perplexity above 600 are discarded to obtain the reduced sorted training set.

[[File:fig2.jpg | center]]
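The selection step can be sketched as a filter followed by a sort. The threshold of 600 comes from the text; the ordering used here (highest perplexity first, so that the chunks closest to the development data are seen last) is an assumption, since the summary does not state the sort direction.

```python
def select_and_sort(chunk_perplexities, threshold=600.0):
    """Return indices of kept chunks, ordered from highest to lowest perplexity.

    Chunks whose perplexity is above `threshold` are discarded.
    The descending order is an assumption of this sketch.
    """
    kept = [(ppl, idx) for idx, ppl in enumerate(chunk_perplexities) if ppl <= threshold]
    kept.sort(reverse=True)
    return [idx for ppl, idx in kept]

# Hypothetical perplexities for five chunks.
order = select_and_sort([700.0, 300.0, 599.0, 601.0, 450.0])
```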
'''
== Experiment with large RNN models ==
'''

By training the RNN model on the reduced sorted dataset and increasing the size of the hidden layer, better results than the baseline backoff model are obtained. However, the performance of RNN models is strongly correlated with the size of the hidden layer. Combining the RNN models with the baseline 4-gram model and tuning the weights of the individual models on the development set leads to a notable reduction in WER.

[[File:table.jpg | center]]

'''
== Hash-based implementation of class-based maximum entropy model ==
'''

The maximum entropy model can be seen in the context of neural network models as a weight matrix that directly connects the input and output layers. In the present study, direct connections are added to the class-based RNN architecture. Direct parameters are used to connect input and output layers, and input and class layers. This model is denoted as RNNME.

Using direct connections leads to problems with memory complexity. To avoid this, a hash function is used to map the huge sparse matrix into a one-dimensional array. With this method, the achieved perplexity is better than the baseline perplexity of the KN4 model. Even better results are obtained after interpolating both models and in the rescoring experiment.
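
The hashing idea can be illustrated with a small sketch. It is not the paper's implementation: the hash function (CRC32), the table size, and the way n-gram features are keyed are all assumptions chosen for illustration.

```python
import zlib

TABLE_SIZE = 1 << 16  # hypothetical size: smaller saves memory but causes more collisions

def slot(context, word, table_size=TABLE_SIZE):
    """Map an n-gram feature (context words + predicted word) to an array index."""
    key = " ".join(context + (word,)).encode("utf-8")
    return zlib.crc32(key) % table_size

def direct_score(weights, history, word, max_order=3):
    """Sum the hashed weights of all n-gram features ending in `word`."""
    score = 0.0
    for n in range(max_order):  # n = number of context words (0 means unigram)
        context = history[-n:] if n else ()
        score += weights[slot(context, word, len(weights))]
    return score

# One flat array replaces the huge sparse input-to-output weight matrix.
weights = [0.0] * TABLE_SIZE
history = ("the", "black")
weights[slot((), "cat")] += 0.5          # unigram feature "cat"
weights[slot(("black",), "cat")] += 1.5  # bigram feature "black cat"
score = direct_score(weights, history, "cat")
```

Features that collide simply share a weight, so the array length is the knob that trades memory for accuracy.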

'''
== References ==
'''
<references />

strategies for Training Large Scale Neural Network Language Models

2015-11-26T23:01:03Z

Amirlk: /* Recurrent Neural Network Models */

strategies for Training Large Scale Neural Network Language Models

2015-11-26T22:58:08Z

Amirlk: /* References */

'''
== Introduction ==
'''
Statistical models of natural language are a key part of many systems today. The best-known applications are automatic speech recognition, machine translation, and optical character recognition. In recent years, language models including recurrent neural network and maximum entropy-based models have gained a lot of attention and are considered the most successful models. However, the main drawback of these models is their high computational complexity.
This paper introduces a hash-based implementation of a class-based maximum entropy model that makes it easy to control the trade-off between memory complexity and computational complexity.
'''

== Motivation==
'''
As computational complexity is an issue for different types of deep neural network language models, this study briefly presents simple techniques that can be used to reduce computational cost of the training and test phases. The study also mentions that training neural network language models with maximum entropy models leads to better performance in terms of computational complexity.
The maximum entropy model can be viewed as a neural network model with no hidden layer, in which the input layer is directly connected to the output layer.

'''

== Model description==
The main difference between a neural network language model (NN LM) and a maximum entropy (ME) model is that the features for the NN LM are automatically learned as a function of the history. Also, the usual features for the ME model are binary, while NN models use continuous-valued features. After the model is trained, similar words have similar low-dimensional representations.
'''
'''

== Recurrent Neural Network Models==
'''
The standard neural network language model has a very similar form to the maximum entropy model. The main difference is that the features for this model are automatically learned as a function of the history. Also, the usual features for the ME model are binary, while NN models use continuous-valued features. The NN LM can be described as:

<math>P(w|h)=\frac{e^{\sum_{i=1}^N \lambda_i f_i(s,w)}} {\sum_{w} e^{\sum_{i=1}^N\lambda_i f_i(s,w)}}</math>

where f is a set of features, λ is a set of weights, and s is the state of the hidden layer. For the feedforward NN LM architecture, the state of the hidden layer depends on a projection layer, which is formed as a projection of the N − 1 most recent words into a low-dimensional space. After the model is trained, similar words have similar low-dimensional representations. Alternatively, the state of the hidden layer can depend on the most recent word and the state in the previous time step, so time is not represented explicitly. This recurrence allows the hidden layer to hold a low-dimensional representation of the entire history (in other words, it provides the model with a memory). This architecture is called the recurrent neural network based language model (RNN LM).<ref name=MiT2>
Mikolov, Tomas, ''et al.'' [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5947611 "Extensions of recurrent neural network language model."] in Acoustics, Speech and Signal Processing (ICASSP), (2011).
</ref>

[[File:Fig.jpg |center]]
Feedforward neural network 4-gram model (on the left) and Recurrent neural network language model (on the right)
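
The recurrence described above, where the hidden state depends on the most recent word and the previous state, can be sketched in a few lines. The sigmoid activation, the matrix names U and W, and the toy sizes (3 words, 2 hidden units) are assumptions made for illustration only.

```python
import math

def rnn_step(word_id, prev_state, U, W):
    """One recurrent update: s(t) = sigmoid(U[:, word] + W s(t-1)).

    The input word is one-hot, so multiplying by U selects column word_id.
    """
    hidden = len(prev_state)
    new_state = []
    for j in range(hidden):
        activation = U[j][word_id] + sum(W[j][k] * prev_state[k] for k in range(hidden))
        new_state.append(1.0 / (1.0 + math.exp(-activation)))
    return new_state

def encode_history(word_ids, U, W):
    """Fold a whole word sequence into a single hidden state vector."""
    state = [0.0] * len(W)
    for word_id in word_ids:
        state = rnn_step(word_id, state, U, W)
    return state

# Hypothetical weights: U is hidden x vocabulary, W is hidden x hidden.
U = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]]
W = [[0.2, -0.1], [0.4, 0.3]]
state_a = encode_history([0, 2, 1], U, W)
state_b = encode_history([2, 0, 1], U, W)
```

The two histories end in the same word but produce different states, which is the sense in which the hidden layer acts as a memory of the entire history.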

'''

== Maximum Entropy model ==
'''
'''
A maximum entropy model has the following form:

<math>P(w|h)=\frac{e^{\sum_{i=1}^N \lambda_i f_i(h,w)}} {\sum_{w} e^{\sum_{i=1}^N\lambda_i f_i(h,w)}}</math>

where h is the history and f is the set of features, which in the maximum entropy case are n-grams. The choice of features is usually made manually and significantly affects the overall performance of the model. Training a maximum entropy model consists of learning the set of weights λ.
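
As a rough illustration of the formula above, the sketch below computes P(w|h) with binary n-gram features, where a feature fires when the suffix of the history followed by the candidate word is a known n-gram. The vocabulary, the weight values, and the dictionary layout are hypothetical.

```python
import math

def maxent_prob(word, history, vocab, weights):
    """P(w | h) = exp(sum_i lambda_i f_i(h, w)) / sum_w' exp(sum_i lambda_i f_i(h, w')).

    weights: dict mapping an n-gram tuple to its lambda; a missing n-gram
    means the corresponding binary feature has weight 0.
    """
    def score(w):
        total = 0.0
        for n in range(len(history) + 1):          # n context words, 0 = unigram
            context = history[len(history) - n:]
            total += weights.get(context + (w,), 0.0)
        return total

    scores = {w: score(w) for w in vocab}
    z = sum(math.exp(s) for s in scores.values())  # normalization over the vocabulary
    return math.exp(scores[word]) / z

vocab = ["cat", "dog", "the"]
weights = {("cat",): 0.2, ("black", "cat"): 1.0, ("dog",): 0.1}
history = ("the", "black")
probs = {w: maxent_prob(w, history, vocab, weights) for w in vocab}
```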

'''

== Computational complexity ==
'''
'''
The training time of an N-gram neural network language model is proportional to:

<math>I*W*((N-1) *D*H+H*V)</math>

where I is the number of training epochs before convergence is achieved, W is the number of tokens in the training set, N is the N-gram order, D is the dimensionality of words in the low-dimensional space, H is the size of the hidden layer, and V is the size of the vocabulary.

The recurrent NN LM has computational complexity proportional to:

<math>I*W*(H*H+H*V)</math>

It can be seen that as the order N increases, the complexity of the feedforward architecture increases linearly, while it remains constant for the recurrent one.

The computational complexity of the maximum entropy model is proportional to:

<math>I*W*(N*V)</math>
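
To get a feel for which term dominates, the three expressions can be evaluated directly. The sizes used below are hypothetical and are not taken from the paper.

```python
def feedforward_cost(I, W, N, D, H, V):
    """I * W * ((N - 1) * D * H + H * V) for the feedforward N-gram NN LM."""
    return I * W * ((N - 1) * D * H + H * V)

def recurrent_cost(I, W, H, V):
    """I * W * (H * H + H * V) for the recurrent NN LM."""
    return I * W * (H * H + H * V)

def maxent_cost(I, W, N, V):
    """I * W * (N * V) for the maximum entropy model."""
    return I * W * (N * V)

# Illustrative sizes (assumptions): 4-gram model, 100-dimensional projection,
# 200 hidden units, 100k-word vocabulary. I = W = 1 gives the per-token cost.
N, D, H, V = 4, 100, 200, 100_000
ff_per_token = feedforward_cost(1, 1, N, D, H, V)   # 60,000 + 20,000,000
rnn_per_token = recurrent_cost(1, 1, H, V)          # 40,000 + 20,000,000
me_per_token = maxent_cost(1, 1, N, V)              # 400,000
```

With these made-up sizes the H × V term accounts for nearly all of the per-token cost of both neural architectures, which is why the techniques that follow target I, W, V, and H.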

The simple techniques used in the present study to reduce the computational complexity are:

'''
== A. Reduction of training epochs==
'''
'''
Training is usually performed by stochastic gradient descent and takes 10-50 training epochs to converge.
In this study, it is demonstrated that good performance can be achieved with as few as 7 training epochs instead of using thousands of epochs. This is achieved by sorting the training data by complexity.

'''

== B. Reduction of number of training tokens==
'''
In the vast majority of cases, NN LMs for LVCSR tasks are trained on 5-30M tokens. Although the subsampling trick can be used to claim that the neural network model has seen all training data at least once, simple subsampling techniques lead to severe performance degradation compared with a model that is trained on all data.

In this study, NN LMs are trained only on a small part of the data (the in-domain corpora) plus a randomly subsampled part of the out-of-domain data.

'''

== C. Reduction of vocabulary==
'''
One technique is to compute the probability distribution only for the top M words in the neural network model and to use backoff n-gram probabilities for the rest of the words. The list of top M words is then called a shortlist. However, it was shown that this technique causes severe degradation of performance for small values of M, and even with M = 2000, the complexity of the H × V term is still significant.
Goodman’s trick can be used to speed up the models in terms of vocabulary. Each word from the vocabulary is assigned to a class, and only the probability distribution over classes is computed. As the number of classes can be very small (several hundred), this is a more effective solution than using shortlists, and the performance degradation is smaller.

'''

== D. Reduction of size of the hidden layer==
'''

Another way to reduce H×V is to choose a small value of H. Some techniques with respect to the combination of NN model with other methods are introduced for choosing the proper size of the hidden layer.

'''
== E. Parallelization ==
'''

As the state of the hidden layer depends on the previous state, recurrent networks are hard to parallelize. One can parallelize just the computation between the hidden and output layers. Another way is to parallelize the whole network by training from multiple points in the training data at the same time. However, parallelization is a highly architecture-specific optimization problem. In the current study, the problem is instead addressed with algorithmic approaches for reducing computational complexity.

'''
== Automatic data selection and sorting==
'''

The full training set is divided into 560 equally-sized chunks, and the perplexity on the development data is computed on each chunk. The data chunks with perplexity above 600 are discarded to obtain the reduced sorted training set.

[[File:fig2.jpg | center]]
'''
== Experiment with large RNN models ==
'''

By training the RNN model on the reduced sorted dataset and increasing the size of the hidden layer, better results than the baseline backoff model are obtained. However, the performance of RNN models is strongly correlated with the size of the hidden layer. Combining the RNN models with the baseline 4-gram model and tuning the weights of the individual models on the development set leads to a notable reduction in WER.

[[File:table.jpg | center]]

'''
== Hash-based implementation of class-based maximum entropy model ==
'''

The maximum entropy model can be seen in the context of neural network models as a weight matrix that directly connects the input and output layers. In the present study, direct connections are added to the class-based RNN architecture. Direct parameters are used to connect input and output layers, and input and class layers. This model is denoted as RNNME.

Using direct connections leads to problems with memory complexity. To avoid this, a hash function is used to map the huge sparse matrix into a one-dimensional array. With this method, the achieved perplexity is better than the baseline perplexity of the KN4 model. Even better results are obtained after interpolating both models and in the rescoring experiment.

'''
== References ==
'''
<references />

T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model”

strategies for Training Large Scale Neural Network Language Models

2015-11-26T22:56:44Z

Amirlk: /* Recurrent Neural Network Models */

'''
== Introduction ==
'''
Statistical models of natural language are a key part of many systems today. The best-known applications are automatic speech recognition, machine translation, and optical character recognition. In recent years, language models including recurrent neural network and maximum entropy-based models have gained a lot of attention and are considered the most successful models. However, the main drawback of these models is their high computational complexity.
This paper introduces a hash-based implementation of a class-based maximum entropy model that makes it easy to control the trade-off between memory complexity and computational complexity.
'''

== Motivation==
'''
As computational complexity is an issue for different types of deep neural network language models, this study briefly presents simple techniques that can be used to reduce computational cost of the training and test phases. The study also mentions that training neural network language models with maximum entropy models leads to better performance in terms of computational complexity.
The maximum entropy model can be viewed as a neural network model with no hidden layer, in which the input layer is directly connected to the output layer.

'''

== Model description==
The main difference between a neural network language model (NN LM) and a maximum entropy (ME) model is that the features for the NN LM are automatically learned as a function of the history. Also, the usual features for the ME model are binary, while NN models use continuous-valued features. After the model is trained, similar words have similar low-dimensional representations.
'''
'''

== Recurrent Neural Network Models==
'''
The standard neural network language model has a very similar form to the maximum entropy model. The main difference is that the features for this model are automatically learned as a function of the history. Also, the usual features for the ME model are binary, while NN models use continuous-valued features. The NN LM can be described as:

<math>P(w|h)=\frac{e^{\sum_{i=1}^N \lambda_i f_i(s,w)}} {\sum_{w} e^{\sum_{i=1}^N\lambda_i f_i(s,w)}}</math>

where f is a set of features, λ is a set of weights, and s is the state of the hidden layer. For the feedforward NN LM architecture, the state of the hidden layer depends on a projection layer, which is formed as a projection of the N − 1 most recent words into a low-dimensional space. After the model is trained, similar words have similar low-dimensional representations. Alternatively, the state of the hidden layer can depend on the most recent word and the state in the previous time step, so time is not represented explicitly. This recurrence allows the hidden layer to hold a low-dimensional representation of the entire history (in other words, it provides the model with a memory). This architecture is called the recurrent neural network based language model (RNN LM).<ref name=MiT2>
Mikolov, Tomas, ''et al.'' [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5947611 "Extensions of recurrent neural network language model."] in Acoustics, Speech and Signal Processing (ICASSP), (2011).
</ref>

[[File:Fig.jpg |center]]
Feedforward neural network 4-gram model (on the left) and Recurrent neural network language model (on the right)

'''

== Maximum Entropy model ==
'''
'''
A maximum entropy model has the following form:

<math>P(w|h)=\frac{e^{\sum_{i=1}^N \lambda_i f_i(h,w)}} {\sum_{w} e^{\sum_{i=1}^N\lambda_i f_i(h,w)}}</math>

where h is the history and f is the set of features, which in the maximum entropy case are n-grams. The choice of features is usually made manually and significantly affects the overall performance of the model. Training a maximum entropy model consists of learning the set of weights λ.

'''

== Computational complexity ==
'''
'''
The training time of an N-gram neural network language model is proportional to:

<math>I*W*((N-1) *D*H+H*V)</math>

where I is the number of training epochs before convergence is achieved, W is the number of tokens in the training set, N is the N-gram order, D is the dimensionality of words in the low-dimensional space, H is the size of the hidden layer, and V is the size of the vocabulary.

The recurrent NN LM has computational complexity proportional to:

<math>I*W*(H*H+H*V)</math>

It can be seen that as the order N increases, the complexity of the feedforward architecture increases linearly, while it remains constant for the recurrent one.

The computational complexity of the maximum entropy model is proportional to:

<math>I*W*(N*V)</math>

The simple techniques used in the present study to reduce the computational complexity are:

'''
== A. Reduction of training epochs==
'''
'''
Training is usually performed by stochastic gradient descent and takes 10-50 training epochs to converge.
In this study, it is demonstrated that good performance can be achieved with as few as 7 training epochs instead of using thousands of epochs. This is achieved by sorting the training data by complexity.

'''

== B. Reduction of number of training tokens==
'''
In the vast majority of cases, NN LMs for LVCSR tasks are trained on 5-30M tokens. Although the subsampling trick can be used to claim that the neural network model has seen all training data at least once, simple subsampling techniques lead to severe performance degradation compared with a model that is trained on all data.

In this study, NN LMs are trained only on a small part of the data (the in-domain corpora) plus a randomly subsampled part of the out-of-domain data.

'''

== C. Reduction of vocabulary==
'''
One technique is to compute the probability distribution only for the top M words in the neural network model and to use backoff n-gram probabilities for the rest of the words. The list of top M words is then called a shortlist. However, it was shown that this technique causes severe degradation of performance for small values of M, and even with M = 2000, the complexity of the H × V term is still significant.
Goodman’s trick can be used to speed up the models in terms of vocabulary. Each word from the vocabulary is assigned to a class, and only the probability distribution over classes is computed. As the number of classes can be very small (several hundred), this is a more effective solution than using shortlists, and the performance degradation is smaller.

'''

== D. Reduction of size of the hidden layer==
'''

Another way to reduce H×V is to choose a small value of H. Some techniques with respect to the combination of NN model with other methods are introduced for choosing the proper size of the hidden layer.

'''
== E. Parallelization ==
'''

As the state of the hidden layer depends on the previous state, recurrent networks are hard to parallelize. One can parallelize just the computation between the hidden and output layers. Another way is to parallelize the whole network by training from multiple points in the training data at the same time. However, parallelization is a highly architecture-specific optimization problem. In the current study, the problem is instead addressed with algorithmic approaches for reducing computational complexity.

'''
== Automatic data selection and sorting==
'''

The full training set is divided into 560 equally-sized chunks, and the perplexity on the development data is computed on each chunk. The data chunks with perplexity above 600 are discarded to obtain the reduced sorted training set.

[[File:fig2.jpg | center]]
'''
== Experiment with large RNN models ==
'''

By training the RNN model on the reduced sorted dataset and increasing the size of the hidden layer, better results than the baseline backoff model are obtained. However, the performance of RNN models is strongly correlated with the size of the hidden layer. Combining the RNN models with the baseline 4-gram model and tuning the weights of the individual models on the development set leads to a notable reduction in WER.

[[File:table.jpg | center]]

'''
== Hash-based implementation of class-based maximum entropy model ==
'''

The maximum entropy model can be seen in the context of neural network models as a weight matrix that directly connects the input and output layers. In the present study, direct connections are added to the class-based RNN architecture. Direct parameters are used to connect input and output layers, and input and class layers. This model is denoted as RNNME.

Using direct connections leads to problems with memory complexity. To avoid this, a hash function is used to map the huge sparse matrix into a one-dimensional array. With this method, the achieved perplexity is better than the baseline perplexity of the KN4 model. Even better results are obtained after interpolating both models and in the rescoring experiment.

'''
== References ==
'''

T. Mikolov, S. Kombrink, L. Burget, J. Černocký, and S. Khudanpur, “Extensions of recurrent neural network language model,” in Proceedings of ICASSP, 2011.

T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model”

strategies for Training Large Scale Neural Network Language Models

2015-11-26T22:53:36Z

Amirlk: /* Recurrent Neural Network Models */

'''
== Introduction ==
'''
Statistical models of natural language are a key part of many systems today. The best-known applications are automatic speech recognition, machine translation, and optical character recognition. In recent years, language models including recurrent neural network and maximum entropy-based models have gained a lot of attention and are considered the most successful models. However, the main drawback of these models is their high computational complexity.
This paper introduces a hash-based implementation of a class-based maximum entropy model that makes it easy to control the trade-off between memory complexity and computational complexity.
'''

== Motivation==
'''
As computational complexity is an issue for different types of deep neural network language models, this study briefly presents simple techniques that can be used to reduce computational cost of the training and test phases. The study also mentions that training neural network language models with maximum entropy models leads to better performance in terms of computational complexity.
The maximum entropy model can be viewed as a neural network model with no hidden layer, in which the input layer is directly connected to the output layer.

'''

== Model description==
The main difference between a neural network language model (NN LM) and a maximum entropy (ME) model is that the features for the NN LM are automatically learned as a function of the history. Also, the usual features for the ME model are binary, while NN models use continuous-valued features. After the model is trained, similar words have similar low-dimensional representations.
'''
'''

== Recurrent Neural Network Models==
'''
The standard neural network language model has a very similar form to the maximum entropy model. The main difference is that the features for this model are automatically learned as a function of the history. Also, the usual features for the ME model are binary, while NN models use continuous-valued features. The NN LM can be described as:

<math>P(w|h)=\frac{e^{\sum_{i=1}^N \lambda_i f_i(s,w)}} {\sum_{w} e^{\sum_{i=1}^N\lambda_i f_i(s,w)}}</math>

where f is a set of features, λ is a set of weights, and s is the state of the hidden layer. For the feedforward NN LM architecture, the state of the hidden layer depends on a projection layer, which is formed as a projection of the N − 1 most recent words into a low-dimensional space. After the model is trained, similar words have similar low-dimensional representations. Alternatively, the state of the hidden layer can depend on the most recent word and the state in the previous time step, so time is not represented explicitly. This recurrence allows the hidden layer to hold a low-dimensional representation of the entire history (in other words, it provides the model with a memory). This architecture is called the recurrent neural network based language model (RNN LM).

[[File:Fig.jpg |center]]
Feedforward neural network 4-gram model (on the left) and Recurrent neural network language model (on the right)

'''

== Maximum Entropy model ==
'''
'''
A maximum entropy model has the following form:

<math>P(w|h)=\frac{e^{\sum_{i=1}^N \lambda_i f_i(h,w)}} {\sum_{w} e^{\sum_{i=1}^N\lambda_i f_i(h,w)}}</math>

where h is the history and f is the set of features, which in the maximum entropy case are n-grams. The choice of features is usually made manually and significantly affects the overall performance of the model. Training a maximum entropy model consists of learning the set of weights λ.

'''

== Computational complexity ==
'''
'''
The training time of an N-gram neural network language model is proportional to:

<math>I*W*((N-1) *D*H+H*V)</math>

where I is the number of training epochs before convergence is achieved, W is the number of tokens in the training set, N is the N-gram order, D is the dimensionality of words in the low-dimensional space, H is the size of the hidden layer, and V is the size of the vocabulary.

The recurrent NN LM has computational complexity proportional to:

<math>I*W*(H*H+H*V)</math>

It can be seen that as the order N increases, the complexity of the feedforward architecture increases linearly, while it remains constant for the recurrent one.

The computational complexity of the maximum entropy model is proportional to:

<math>I*W*(N*V)</math>

The simple techniques used in the present study to reduce the computational complexity are:

'''
== A. Reduction of training epochs==
'''
'''
Training is usually performed by stochastic gradient descent and takes 10-50 training epochs to converge.
In this study, it is demonstrated that good performance can be achieved with as few as 7 training epochs instead of using thousands of epochs. This is achieved by sorting the training data by complexity.

'''

== B. Reduction of number of training tokens==
'''
In the vast majority of cases, NN LMs for LVCSR tasks are trained on 5-30M tokens. Although the subsampling trick can be used to claim that the neural network model has seen all training data at least once, simple subsampling techniques lead to severe performance degradation compared with a model that is trained on all data.

In this study, NN LMs are trained only on a small part of the data (the in-domain corpora) plus a randomly subsampled part of the out-of-domain data.

'''

== C. Reduction of vocabulary==
'''
One technique is to compute the probability distribution only for the top M words in the neural network model and to use backoff n-gram probabilities for the rest of the words. The list of top M words is then called a shortlist. However, it was shown that this technique causes severe degradation of performance for small values of M, and even with M = 2000, the complexity of the H × V term is still significant.
Goodman’s trick can be used to speed up the models in terms of vocabulary. Each word from the vocabulary is assigned to a class, and only the probability distribution over classes is computed. As the number of classes can be very small (several hundred), this is a more effective solution than using shortlists, and the performance degradation is smaller.

'''

== D. Reduction of size of the hidden layer==
'''

Another way to reduce H×V is to choose a small value of H. Some techniques with respect to the combination of NN model with other methods are introduced for choosing the proper size of the hidden layer.

'''
== E. Parallelization ==
'''

As the state of the hidden layer depends on the previous state, recurrent networks are hard to parallelize. One can parallelize just the computation between the hidden and output layers. Another way is to parallelize the whole network by training from multiple points in the training data at the same time. However, parallelization is a highly architecture-specific optimization problem. In the current study, the problem is instead addressed with algorithmic approaches for reducing computational complexity.

'''
== Automatic data selection and sorting==
'''

The full training set is divided into 560 equally-sized chunks, and the perplexity on the development data is computed on each chunk. The data chunks with perplexity above 600 are discarded to obtain the reduced sorted training set.

[[File:fig2.jpg | center]]
'''
== Experiment with large RNN models ==
'''

By training the RNN model on the reduced sorted dataset and increasing the size of the hidden layer, better results than the baseline backoff model are obtained. However, the performance of RNN models is strongly correlated with the size of the hidden layer. Combining the RNN models with the baseline 4-gram model and tuning the weights of the individual models on the development set leads to a notable reduction in WER.

[[File:table.jpg | center]]

'''
== Hash-based implementation of class-based maximum entropy model ==
'''

The maximum entropy model can be seen in the context of neural network models as a weight matrix that directly connects the input and output layers. In the present study, direct connections are added to the class-based RNN architecture. Direct parameters are used to connect input and output layers, and input and class layers. This model is denoted as RNNME.

Using direct connections leads to problems with memory complexity. To avoid this, a hash function is used to map the huge sparse matrix into a one-dimensional array. With this method, the achieved perplexity is better than the baseline perplexity of the KN4 model. Even better results are obtained after interpolating both models and in the rescoring experiment.

'''
== References ==
'''

T. Mikolov, S. Kombrink, L. Burget, J. Černocký, and S. Khudanpur, “Extensions of recurrent neural network language model,” in Proceedings of ICASSP, 2011.

T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model”

strategies for Training Large Scale Neural Network Language Models

2015-11-26T22:48:29Z

Amirlk: /* Recurrent Neural Network Models */

'''
== Introduction ==
'''
Statistical models of natural language are a key part of many systems today. The best-known applications are automatic speech recognition, machine translation, and optical character recognition. In recent years, language models including recurrent neural network and maximum entropy-based models have gained a lot of attention and are considered the most successful models. However, the main drawback of these models is their high computational complexity.
This paper introduces a hash-based implementation of a class-based maximum entropy model that makes it easy to control the trade-off between memory complexity and computational complexity.
'''

== Motivation==
'''
As computational complexity is an issue for different types of deep neural network language models, this study briefly presents simple techniques that can be used to reduce computational cost of the training and test phases. The study also mentions that training neural network language models with maximum entropy models leads to better performance in terms of computational complexity.
The maximum entropy model can be viewed as a neural network model with no hidden layer, in which the input layer is directly connected to the output layer.

'''

== Model description==
The main difference between a neural network language model (NN LM) and a maximum entropy (ME) model is that the features for the NN LM are automatically learned as a function of the history. Also, the usual features for the ME model are binary, while NN models use continuous-valued features. After the model is trained, similar words have similar low-dimensional representations.
'''
'''

== Recurrent Neural Network Models==
'''
The standard neural network language model has a very similar form to the maximum entropy model. The main difference is that the features for this model are automatically learned as a function of the history. Also, the usual features for the ME model are binary, while NN models use continuous-valued features. The NN LM can be described as:

<math>P(w|h)=\frac{e^{\sum_{i=1}^N \lambda_i f_i(s,w)}} {\sum_{w} e^{\sum_{i=1}^N\lambda_i f_i(s,w)}}</math>

where f is a set of features, λ is a set of weights, and s is the state of the hidden layer. The state of the hidden layer can depend on the most recent word and the state in the previous time step. This recurrence allows the hidden layer to hold a low-dimensional representation of the entire history.

[[File:Fig.jpg |center]]
Feedforward neural network 4-gram model (on the left) and Recurrent neural network language model (on the right)

'''

== Maximum Entropy model ==
'''
'''
A maximum entropy model has the following form:

<math>P(w|h)=\frac{e^{\sum_{i=1}^N \lambda_i f_i(h,w)}} {\sum_{w} e^{\sum_{i=1}^N\lambda_i f_i(h,w)}}</math>

where h is the history and f is the set of features, which in the maximum entropy case are n-grams. The choice of features is usually made manually and significantly affects the overall performance of the model. Training a maximum entropy model consists of learning the set of weights λ.

'''

== Computational complexity ==
'''
'''
The training time of N-gram neural network language model is proportional to:

<math>I*W*((N-1) *D*H+H*V)</math>

where I is the number of training epochs before convergence is achieved, W is the number of tokens in the training set, N is the N-gram order, D is the dimensionality of words in the low-dimensional space, H is the size of the hidden layer, and V is the size of the vocabulary.

The recurrent NN LM has computational complexity:

<math>I*W*(H*H+H*V)</math>

It can be seen that by increasing order N, the complexity of the feedforward architecture increases linearly, while it remains constant for the recurrent one.
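
The two formulas above can be compared numerically. The following is a minimal sketch; the function names and the sizes (I, W, D, H, V) are hypothetical values chosen only for illustration and are not taken from the paper.

```python
def feedforward_cost(I, W, N, D, H, V):
    """Cost of the N-gram feedforward NN LM: I * W * ((N - 1) * D * H + H * V)."""
    return I * W * ((N - 1) * D * H + H * V)


def recurrent_cost(I, W, H, V):
    """Cost of the recurrent NN LM: I * W * (H * H + H * V)."""
    return I * W * (H * H + H * V)


# Hypothetical sizes, for illustration only.
I, W, D, H, V = 10, 1_000_000, 100, 200, 50_000

for N in (3, 4, 5, 6):
    ff = feedforward_cost(I, W, N, D, H, V)
    rnn = recurrent_cost(I, W, H, V)
    print(f"N={N}: feedforward={ff:.3e}  recurrent={rnn:.3e}")
```

Each increase of N by one adds the same amount, I × W × D × H, to the feedforward cost, while the recurrent cost does not depend on N at all.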

The computational complexity of the maximum entropy model is:

<math>I*W*(N*V)</math>

The simple techniques used in the present study to reduce the computational complexity are:

'''
== A. Reduction of training epochs==
'''
'''
Training is usually performed by stochastic gradient descent and takes 10-50 training epochs to converge.
This study demonstrates that good performance can be achieved with as few as 7 training epochs, rather than thousands of epochs. This is achieved by sorting the training data by complexity.

'''

== B. Reduction of number of training tokens==
'''
In the vast majority of cases, NN LMs for LVCSR tasks are trained on 5-30M tokens. Although the subsampling trick can be used to claim that the neural network model has seen all the training data at least once, simple subsampling techniques lead to severe performance degradation compared with a model that is trained on all the data.

In this study, NN LMs are trained only on a small part of the data (the in-domain corpora) plus a randomly subsampled part of the out-of-domain data.

'''

== C. Reduction of vocabulary==
'''
One technique is to compute the probability distribution only for the top M words in the neural network model and to use backoff n-gram probabilities for the rest of the words. The list of top M words is called a shortlist. However, this technique has been shown to cause severe degradation of performance for small values of M, and even with M = 2000, the complexity of the H × V term is still significant.

Goodman’s trick can be used to speed up the models with respect to the vocabulary. Each word in the vocabulary is assigned to a class, and the probability distribution over classes is computed first; the probability of a word is then computed only among the words of its class. As the number of classes can be very small (several hundred), this is a more effective solution than using shortlists, and the performance degradation is smaller.
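
The class-based factorization can be sketched as follows. This is only an illustration under stated assumptions: the word-to-class assignment, the scores, and the function names are hypothetical, and the scores stand in for the output-layer activations of a trained model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_factored_distribution(classes, class_scores, word_scores):
    """Return P(w|h) = P(class(w)|h) * P(w|class(w), h) for every word.

    classes:      dict mapping class id -> list of member words
    class_scores: dict mapping class id -> unnormalized class score
    word_scores:  dict mapping word -> unnormalized word score
    """
    class_ids = list(classes)
    p_class = dict(zip(class_ids, softmax([class_scores[c] for c in class_ids])))
    dist = {}
    for c in class_ids:
        members = classes[c]
        # Normalization runs only over the members of one class.
        p_within = softmax([word_scores[w] for w in members])
        for w, p in zip(members, p_within):
            dist[w] = p_class[c] * p
    return dist


# Hypothetical toy vocabulary with two classes.
classes = {0: ["the", "a"], 1: ["cat", "dog", "bird"]}
class_scores = {0: 0.2, 1: 1.1}
word_scores = {"the": 0.5, "a": 0.1, "cat": 1.0, "dog": 0.7, "bird": -0.3}

dist = class_factored_distribution(classes, class_scores, word_scores)
print(dist)
```

The sketch fills in the whole distribution only to show that it still sums to one. To score a single word, only the class scores and the scores of the words in that word's class are needed, which is where the saving over the full H × V output layer comes from.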

'''

== D. Reduction of size of the hidden layer==
'''

Another way to reduce the H × V term is to choose a small value of H. The study introduces techniques that combine the NN model with other methods so that a suitable size of the hidden layer can be chosen.

'''
== E. Parallelization ==
'''

As the state of the hidden layer depends on the previous state, recurrent networks are hard to parallelize. One option is to parallelize only the computation between the hidden and output layers. Another is to parallelize the whole network by training from multiple points in the training data at the same time. However, parallelization is a highly architecture-specific optimization problem. In the current study, the problem is addressed with algorithmic approaches for reducing computational complexity.

'''
== Automatic data selection and sorting==
'''

The full training set is divided into 560 equally-sized chunks, and the perplexity on the development data is computed for each chunk. The chunks with perplexity above 600 are discarded to obtain the reduced sorted training set.
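
A minimal sketch of this selection step is shown below. The perplexity values are made up, the function name is hypothetical, and the sort direction (highest perplexity first, so that the most in-domain chunks are seen last in each epoch) is an assumption, since the summary above does not state the order.

```python
def select_and_sort_chunks(chunk_perplexities, threshold=600.0):
    """Return indices of the kept chunks in training order.

    Chunks whose development-set perplexity is above `threshold` are discarded.
    Assumption: the remaining chunks are ordered by descending perplexity.
    """
    kept = [(index, ppl) for index, ppl in enumerate(chunk_perplexities)
            if ppl <= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [index for index, _ in kept]


# Hypothetical perplexities for four chunks.
order = select_and_sort_chunks([700.0, 150.0, 600.0, 300.0])
print(order)
```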

[[File:fig2.jpg | center]]
'''
== Experiment with large RNN models ==
'''

By training the RNN model on the reduced sorted dataset and increasing the size of the hidden layer, better results than the baseline backoff model are obtained. The performance of the RNN models is strongly correlated with the size of the hidden layer. Combining the RNN models with the baseline 4-gram model and tuning the weights of the individual models on the development set leads to a substantial reduction of WER.

[[File:table.jpg | center]]

'''
== Hash-based implementation of class-based maximum entropy model ==
'''

The maximum entropy model can be seen in the context of neural network models as a weight matrix that directly connects the input and output layers. In the present study, direct connections are added to the class-based RNN architecture. Direct parameters connect the input and output layers, and the input and class layers. This model is denoted RNNME.

Using direct connections leads to problems with memory complexity. To avoid this, a hash function is used to map the huge sparse matrix into a one-dimensional array. With this method, the achieved perplexity is better than the baseline perplexity of the KN4 model. Even better results are obtained after interpolating both models and in the rescoring experiments.
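
The hashing idea can be sketched as follows. This is an illustration only and not the implementation used in the paper: the class name, the use of CRC32 as the hash function, and the array size are assumptions. Collisions are simply tolerated, so several n-gram features may share one weight.

```python
import zlib


class HashedNgramWeights:
    """N-gram feature weights stored in a single fixed-size array."""

    def __init__(self, size):
        self.size = size
        self.weights = [0.0] * size

    def _slots(self, history, target):
        """Array positions of the unigram, bigram, ... features ending in `target`."""
        history = tuple(history)
        slots = []
        for order in range(len(history) + 1):
            context = history[len(history) - order:]
            key = " ".join(context + (target,)).encode("utf-8")
            slots.append(zlib.crc32(key) % self.size)
        return slots

    def score(self, history, target):
        """Sum of the weights of all active n-gram features."""
        return sum(self.weights[s] for s in self._slots(history, target))

    def update(self, history, target, delta):
        """Add `delta` to the weight of every active n-gram feature."""
        for s in self._slots(history, target):
            self.weights[s] += delta


table = HashedNgramWeights(size=1000)
table.update(("the", "cat"), "sat", 0.5)
print(table.score(("the", "cat"), "sat"))
```

The memory use is fixed by `size` regardless of how many distinct n-grams occur in the training data, which is the point of the technique.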

'''
== References ==
'''

T. Mikolov, S. Kombrink, L. Burget, J. Černocký, and S. Khudanpur, “Extensions of recurrent neural network language model,” in Proceedings of ICASSP, 2011.

T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model.”

imageNet Classification with Deep Convolutional Neural Networks

2015-11-26T22:39:18Z

Amirlk: /* Overall Architecture */

== Introduction ==

In this paper, the authors trained a large, deep neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into 1000 different classes. To learn about thousands of objects from millions of images, a Convolutional Neural Network (CNN) is used because of its large learning capacity, its smaller number of connections and parameters, and its outstanding performance on image classification.

Moreover, current GPUs provide a powerful tool for training interestingly-large CNNs. Thus, they trained one of the largest convolutional neural networks to date on the ILSVRC-2010 and ILSVRC-2012 datasets and achieved the best results reported on these datasets at the time the paper was written.

The code of their work is available here<ref>
[http://code.google.com/p/cuda-convnet/ "High-performance C++/CUDA implementation of convolutional neural networks"]
</ref>.

== Dataset ==

ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has roughly 1.2 million labeled high-resolution training images, 50 thousand validation images, and 150 thousand testing images over 1000 categories.

In this paper, the images in this dataset are down-sampled to a fixed resolution of 256 x 256. The only image pre-processing they used is subtracting the mean activity over the training set from each pixel.

== Architecture ==

=== ReLU Nonlinearity ===

Non-saturating nonlinearity ''f(x) = max(0,x)'' also known as Rectified Linear Units (ReLUs)<ref>
Nair V, Hinton G E. [http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_NairH10.pdf Rectified linear units improve restricted boltzmann machines.] Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010: 807-814.
</ref> is used as the nonlinearity, which trains several times faster than equivalent networks with standard saturating neurons. Neural networks are usually ill-conditioned and converge very slowly. With nonlinearities such as rectifiers or max-pooling units, gradients flow along a few paths instead of all possible paths, resulting in faster convergence. Thus, better performance can be achieved by reducing the training time for each epoch and training on larger datasets to prevent overfitting.
Deep convolutional neural networks with ReLUs train several times faster than their equivalents with tanh units. The following figure illustrates this: it shows the number of iterations required to reach 25% training error on the CIFAR-10 dataset for a particular four-layer convolutional network.

[[File:Fig1.png]]

A four-layer convolutional neural
network with ReLUs (solid line) reaches a 25%
training error rate on CIFAR-10 six times faster
than an equivalent network with tanh neurons
(dashed line). The learning rates for each network
were chosen independently to make training
as fast as possible. No regularization of
any kind was employed. The magnitude of the
effect demonstrated here varies with network
architecture, but networks with ReLUs consistently
learn several times faster than equivalents
with saturating neurons.
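
One way to see why saturating units slow learning down is to compare the derivatives of the two nonlinearities. The sketch below is only an illustration; the helper names are hypothetical.

```python
import math


def relu_grad(x):
    """Derivative of f(x) = max(0, x): 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Derivative of tanh(x), which shrinks towards 0 as |x| grows."""
    return 1.0 - math.tanh(x) ** 2


for x in (0.5, 2.0, 5.0):
    print(f"x={x}: relu'={relu_grad(x):.4f}  tanh'={tanh_grad(x):.6f}")
```

For large positive inputs the tanh derivative is close to zero, so very little gradient passes through the unit, while the ReLU derivative stays at one.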

=== Training on Multiple GPUs ===

They spread the net across two GPUs by putting half of the kernels (or neurons) on each GPU and letting the GPUs communicate only in certain layers. Choosing the pattern of connectivity is a problem for cross-validation, but this scheme lets them tune the amount of communication precisely until it is an acceptable fraction of the amount of computation.

=== Local Response Normalization ===

ReLUs have the desirable property that they do not require input normalization to prevent them from saturating. However, they find that a local response normalization scheme applied after the ReLU nonlinearity reduces their top-1 and top-5 error rates by 1.4% and 1.2%, respectively.

The response normalization is given by the expression

<math>b_{x,y}^{i}=a_{x,y}^{i}/\left ( k+\alpha \sum_{j=\max\left ( 0,i-n/2 \right )}^{\min\left ( N-1,i+n/2 \right )}\left ( a_{x,y}^{j} \right )^{2} \right )^{\beta }</math>

where <math>a_{x,y}^{i}</math> is the activity of the neuron computed by applying kernel i at position (x, y) and then applying the ReLU nonlinearity, <math>b_{x,y}^{i}</math> is the response-normalized activity, N is the total number of kernels in the layer, and the sum runs over n “adjacent” kernel maps at the same spatial position. This response normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities amongst neuron outputs computed using different kernels.

The constants k, n, α, and β are hyper-parameters whose values are determined using a validation set; k = 2, n = 5, α = 10<sup>−4</sup>, and β = 0.75 were used in this research. This normalization was applied after the ReLU nonlinearity in certain layers.
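
The normalization at a single spatial position can be sketched as below. The function name is hypothetical, and n/2 is taken as integer division (2 for n = 5), which is an assumption about how the summation limits are meant to be read.

```python
def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize the activities a[0..N-1] of all kernels at one position (x, y).

    Each activity is divided by (k + alpha * sum of squares over up to n
    neighbouring kernel maps) ** beta.
    """
    N = len(a)
    out = []
    for i in range(N):
        lo = max(0, i - n // 2)
        hi = min(N - 1, i + n // 2)
        sum_sq = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * sum_sq) ** beta)
    return out


# Hypothetical ReLU outputs of six kernels at the same spatial position.
print(local_response_norm([0.0, 3.0, 50.0, 1.0, 0.5, 0.0]))
```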

=== Overlapping Pooling ===

Unlike traditional non-overlapping pooling, they use overlapping pooling throughout their network, with pooling window size z = 3 and stride s = 2. This scheme reduces their top-1 and top-5 error rates by 0.4% and 0.3%, respectively, and makes the network slightly more difficult to overfit.
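
The difference between the two pooling schemes can be seen in a one-dimensional sketch. The function name and the input values are hypothetical; the network itself pools over two-dimensional feature maps.

```python
def max_pool_1d(x, z=3, s=2):
    """Max pooling with window size z and stride s; windows overlap when s < z."""
    return [max(x[i:i + z]) for i in range(0, len(x) - z + 1, s)]


signal = [1, 5, 2, 4, 3, 9, 0]
overlapping = max_pool_1d(signal, z=3, s=2)      # neighbouring windows share one element
non_overlapping = max_pool_1d(signal, z=2, s=2)  # traditional scheme, s = z
print(overlapping, non_overlapping)
```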

=== Overall Architecture ===

[[File:network.JPG | center]]

As shown in the figure above, the net contains eight layers with 60 million parameters; the first five are convolutional and the remaining three are fully connected layers. The first convolutional layer filters the 224 × 224 × 3 input image with 96 kernels of size 11 × 11 × 3 with a stride of 4 pixels (this is the distance between the receptive field centers of neighboring neurons in a kernel map). The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5 × 5 × 48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The third convolutional layer has 384 kernels of size 3 × 3 × 256 connected to the (normalized, pooled) outputs of the second convolutional layer. The fourth convolutional layer has 384 kernels of size 3 × 3 × 192, and the fifth convolutional layer has 256 kernels of size 3 × 3 × 192. The fully-connected layers have 4096 neurons each. The output of the last layer is fed to a 1000-way softmax. Their network maximizes the average across training cases of the log-probability of the correct label under the prediction distribution.

Response-normalization layers follow the first and second convolutional layers. Max-pooling layers follow both response-normalization layers as well as the fifth convolutional layer. The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.
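
As a rough check of the figure of 60 million parameters, the kernel sizes listed above can be summed up. The helper names are hypothetical, and the 6 × 6 × 256 input size of the first fully-connected layer is an assumption that is not stated in this summary.

```python
def conv_params(num_kernels, height, width, depth):
    """Weights of all kernels plus one bias per kernel."""
    return num_kernels * (height * width * depth + 1)


def fc_params(num_inputs, num_outputs):
    """Weights of a fully-connected layer plus one bias per output neuron."""
    return num_outputs * (num_inputs + 1)


layers = {
    "conv1": conv_params(96, 11, 11, 3),
    "conv2": conv_params(256, 5, 5, 48),
    "conv3": conv_params(384, 3, 3, 256),
    "conv4": conv_params(384, 3, 3, 192),
    "conv5": conv_params(256, 3, 3, 192),
    # Assumption: the pooled output of conv5 is 6 x 6 x 256.
    "fc6": fc_params(6 * 6 * 256, 4096),
    "fc7": fc_params(4096, 4096),
    "fc8": fc_params(4096, 1000),
}

total = sum(layers.values())
for name, count in layers.items():
    print(f"{name}: {count:,}")
print(f"total: {total:,}")
```

Under these assumptions the total comes out at roughly 61 million, and most of the parameters sit in the fully-connected layers rather than in the convolutional ones.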

== Reducing overfitting ==

=== Data Augmentation ===

The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations. In this paper, the transformed images are generated on the CPU while the GPU is training on the previous batch, so they do not need to be stored on disk.

The first form of data augmentation consists of generating image translations and horizontal reflections.
They extract random 224 × 224 patches (and their horizontal reflections) from the 256 × 256 images and train the network on these extracted patches.

The second form alters the intensities of the RGB channels. They perform principal component analysis (PCA) on the set of RGB pixel values. To each training image, they add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1. Therefore, the following quantity is added to each RGB image pixel:

[[File:Fig2.png]]

This scheme helps to capture the object identity invariant with respect to its intensity and color, which reduces the top-1 error rate by over 1%.
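
Both forms of augmentation can be sketched in plain Python. The function names are hypothetical, images are represented as nested lists, and the 50% flip probability is an assumption. The eigenvectors and eigenvalues are assumed to come from a PCA that has already been computed on the RGB values.

```python
import random


def random_crop_and_flip(image, crop=224, rng=random):
    """Cut a random crop x crop patch out of `image` and mirror it half of the time.

    image: list of rows, each row a list of pixels.
    """
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop)
    left = rng.randint(0, width - crop)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < 0.5:  # assumption: reflect with probability 0.5
        patch = [row[::-1] for row in patch]
    return patch


def pca_color_offset(eigvecs, eigvals, sigma=0.1, rng=random):
    """RGB offset: sum over i of p_i * (alpha_i * lambda_i), with alpha_i ~ N(0, sigma).

    eigvecs: the three principal components p_1..p_3, each a list of 3 numbers
    eigvals: the corresponding eigenvalues lambda_1..lambda_3
    """
    alphas = [rng.gauss(0.0, sigma) for _ in eigvals]
    return [sum(p[c] * a * lam for p, a, lam in zip(eigvecs, alphas, eigvals))
            for c in range(3)]


# Hypothetical 256 x 256 image whose pixels just record their coordinates.
image = [[(r, c) for c in range(256)] for r in range(256)]
patch = random_crop_and_flip(image)
offset = pca_color_offset([[1, 0, 0], [0, 1, 0], [0, 0, 1]], [0.3, 0.2, 0.1])
print(len(patch), len(patch[0]), offset)
```

The same offset is added to every pixel of one training image, so the scheme changes overall colour and intensity without changing the object shown.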

=== Dropout ===

The “dropout” technique is implemented in the first two fully-connected layers by setting to zero the output of each hidden neuron with probability 0.5. This scheme roughly doubles the number of iterations required to converge. However, it forces the network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.
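
A minimal sketch of dropout on a list of activations is given below. The function names are hypothetical. The test-time rule, in which all neurons are kept and their outputs are multiplied by 0.5, follows the original paper and is not described in the summary above.

```python
import random


def dropout_train(activations, p_drop=0.5, rng=random):
    """Training time: set each output to zero with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]


def dropout_test(activations, p_drop=0.5):
    """Test time: keep every neuron but scale its output by (1 - p_drop)."""
    return [a * (1.0 - p_drop) for a in activations]


hidden = [0.8, 1.5, 0.0, 2.2]
print(dropout_train(hidden))
print(dropout_test(hidden))
```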

== Details of learning ==

They trained the network using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. The update rule for weight w was

<math>v_{i+1}:=0.9\cdot v_{i}-0.0005\cdot \epsilon \cdot w_{i}-\epsilon \cdot \left \langle \frac{\partial L}{\partial w}|_{w_{i}} \right \rangle_{D_{i}}</math>

<math>w_{i+1}:=w_{i}+v_{i+1}</math>

where <math>v</math> is the momentum variable, <math>\epsilon</math> is the learning rate, and <math>\left \langle \frac{\partial L}{\partial w}|_{w_{i}} \right \rangle_{D_{i}}</math> is the average over the i-th batch <math>D_{i}</math> of the derivative of the objective with respect to w, evaluated at <math>w_{i}</math>. The weights in each layer are initialized from a zero-mean Gaussian distribution with standard deviation 0.01. The biases in the second, fourth, and fifth convolutional layers and in the fully-connected hidden layers are initialized to 1, while those in the remaining layers are initialized to 0. This initialization accelerates the early stages of learning by providing the ReLUs with positive inputs. Initializing the network with sparse weights also reduces the ill-conditioning issue and helps the network work well.

An equal learning rate was used for all layers and was adjusted manually throughout training. The heuristic was to divide the learning rate by 10 when the validation error rate stopped improving with the current learning rate. The learning rate was initialized at 0.01 and reduced three times prior to termination. The network was trained for roughly 90 cycles through the training set of 1.2 million images, which took five to six days on two NVIDIA GTX 580 3GB GPUs.
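
The update rule can be written out for a single scalar weight as follows. This is a sketch with a hypothetical function name; `grad` stands for the batch-averaged derivative of the objective, which a real implementation would obtain by backpropagation.

```python
def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One update of weight w and its momentum variable v.

    v_next = momentum * v - weight_decay * lr * w - lr * grad
    w_next = w + v_next
    """
    v_next = momentum * v - weight_decay * lr * w - lr * grad
    return w + v_next, v_next


# One step from w = 1.0 with zero initial momentum and a made-up gradient of 2.0.
w, v = sgd_momentum_step(1.0, 0.0, 2.0)
print(w, v)

# The learning rate starts at 0.01 and is divided by 10 three times.
lr = 0.01
for _ in range(3):
    lr /= 10
print(lr)
```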

== Results ==

On the ILSVRC-2010 dataset, their network achieves top-1 and top-5 test set error rates of 37.5% and 17.0%, respectively, which was the state of the art at that time.

The following table shows the results.

[[File:Tt1.png]]

Comparison of results on ILSVRC-
2010 test set. In italics are best results
achieved by others.

On the ILSVRC-2012 dataset, the CNN described in this paper achieves a top-5 error rate of 18.2%. Averaging the predictions of five similar CNNs gives an error rate of 16.4%. The following table summarizes the results on the ILSVRC-2012 dataset.

[[File:Tt3.png]]

The following figure shows the learnt kernels

[[File:Figg3.png]]

96 convolutional kernels of size 11×11×3 learned by the first convolutional layer on the 224×224×3 input images. The top 48 kernels were learned on GPU 1 while the bottom 48 kernels were learned on GPU 2.

== Discussion ==

1. The main techniques that allowed this success include the following: efficient GPU training, a large number of labeled examples, a convolutional architecture with max-pooling, rectifying non-linearities, careful initialization, careful parameter updates and adaptive learning rate heuristics, layerwise feature normalization, and a dropout trick based on injecting strong binary multiplicative noise on hidden units.

2. It is notable that their network’s performance degrades if a single convolutional layer is removed. So the depth of the network is important for achieving their results.

3. Their experiments suggest that the results can be improved simply by waiting for faster GPUs and bigger datasets to become available.

== Bibliography ==
<references />

imageNet Classification with Deep Convolutional Neural Networks

2015-11-26T21:39:58Z

Amirlk: /* Local Response Normalization */

== Introduction ==

In this paper, they trained a large, deep neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. To learn about thousands of objects from millions of images, Convolutional Neural Network (CNN) is utilized due to its large learning capacity, fewer connections and parameters and outstanding performance on image classification.

Moreover, current GPU provides a powerful tool to facilitate the training of interestingly-large CNNs. Thus, they trained one of the largest convolutional neural networks to date on the datasets of ILSVRC-2010 and ILSVRC-2012 and achieved the best results ever reported on these datasets by the time this paper was written.

The code of their work is available here<ref>
[http://code.google.com/p/cuda-convnet/ "High-performance C++/CUDA implementation of convolutional neural networks"]
</ref>.

== Dataset ==

ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has roughly 1.2 million labeled high-resolution training images, 50 thousand validation images, and 150 thousand testing images over 1000 categories.

In this paper, the images in this dataset are down-sampled to a fixed resolution of 256 x 256. The only image pre-processing they used is subtracting the mean activity over the training set from each pixel.

== Architecture ==

=== ReLU Nonlinearity ===

Non-saturating nonlinearity ''f(x) = max(0,x)'' also known as Rectified Linear Units (ReLUs)<ref>
Nair V, Hinton G E. [http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_NairH10.pdf Rectified linear units improve restricted boltzmann machines.] Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010: 807-814.
</ref> is used as the nonlinearity, which trains several times faster than equivalent networks with standard saturating neurons. Neural networks are usually ill-conditioned and converge very slowly. With nonlinearities such as rectifiers or max-pooling units, gradients flow along a few paths instead of all possible paths, resulting in faster convergence. Thus, better performance can be achieved by reducing the training time for each epoch and training on larger datasets to prevent overfitting.
Deep convolutional neural networks with ReLUs train several times faster than their equivalents with tanh units. The following figure illustrates this: it shows the number of iterations required to reach 25% training error on the CIFAR-10 dataset for a particular four-layer convolutional network.

[[File:Fig1.png]]

A four-layer convolutional neural
network with ReLUs (solid line) reaches a 25%
training error rate on CIFAR-10 six times faster
than an equivalent network with tanh neurons
(dashed line). The learning rates for each network
were chosen independently to make training
as fast as possible. No regularization of
any kind was employed. The magnitude of the
effect demonstrated here varies with network
architecture, but networks with ReLUs consistently
learn several times faster than equivalents
with saturating neurons.

=== Training on Multiple GPUs ===

They spread the net across two GPUs by putting half of the kernels (or neurons) on each GPU and letting the GPUs communicate only in certain layers. Choosing the pattern of connectivity is a problem for cross-validation, but this scheme lets them tune the amount of communication precisely until it is an acceptable fraction of the amount of computation.

=== Local Response Normalization ===

ReLUs have the desirable property that they do not require input normalization to prevent them from saturating. However, they find that a local response normalization scheme after applying the ReLU nonlinearity can reduce their top-1 and top-5 error rates by 1.4% and 1.2%.

The response normalization is given by the expression

<math>b_{x,y}^{i}=a_{x,y}^{i}/\left ( k+\alpha \sum_{j=\max\left ( 0,i-n/2 \right )}^{\min\left ( N-1,i+n/2 \right )}\left ( a_{x,y}^{j} \right )^{2} \right )^{\beta }</math>

where the sum runs over n “adjacent” kernel maps at the same spatial position. This response normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities amongst neuron outputs computed using different kernels.

The constants k, n, α, and β are hyper-parameters whose values are determined using a validation set; k = 2, n = 5, α = 10<sup>−4</sup>, and β = 0.75 were used in this research. This normalization was applied after the ReLU nonlinearity in certain layers.

=== Overlapping Pooling ===

Unlike traditional non-overlapping pooling, they use overlapping pooling throughout their network, with pooling window size z = 3 and stride s = 2. This scheme reduces their top-1 and top-5 error rates by 0.4% and 0.3% and makes the network more difficult to overfit.

=== Overall Architecture ===

[[File:network.JPG | center]]

As shown in the figure above, the net contains eight layers with 60 million parameters; the first five are convolutional and the remaining three are fully connected layers. The output of the last layer is fed to a 1000-way softmax. Their network maximizes the average across training cases of the log-probability of the correct label under the prediction distribution.

Response-normalization layers follow the first and second convolutional layers. Max-pooling layers follow both response-normalization layers as well as the fifth convolutional layer. The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.

== Reducing overfitting ==

=== Data Augmentation ===

The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations. In this paper, the transformed images are generated on the CPU while the GPU is training on the previous batch, so they do not need to be stored on disk.

The first form of data augmentation consists of generating image translations and horizontal reflections.
They extract random 224 × 224 patches (and their horizontal reflections) from the 256 × 256 images and train the network on these extracted patches.

The second form alters the intensities of the RGB channels. They perform principal component analysis (PCA) on the set of RGB pixel values. To each training image, they add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1. Therefore, the following quantity is added to each RGB image pixel:

[[File:Fig2.png]]

This scheme helps to capture the object identity invariant with respect to its intensity and color, which reduces the top-1 error rate by over 1%.

=== Dropout ===

The “dropout” technique is implemented in the first two fully-connected layers by setting to zero the output of each hidden neuron with probability 0.5. This scheme roughly doubles the number of iterations required to converge. However, it forces the network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.

== Details of learning ==

They trained the network using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. The update rule for weight w was

<math>v_{i+1}:=0.9\cdot v_{i}-0.0005\cdot \epsilon \cdot w_{i}-\epsilon \cdot \left \langle \frac{\partial L}{\partial w}|_{w_{i}} \right \rangle_{D_{i}}</math>

<math>w_{i+1}:=w_{i}+v_{i+1}</math>

where <math>v</math> is the momentum variable, <math>\epsilon</math> is the learning rate, and <math>\left \langle \frac{\partial L}{\partial w}|_{w_{i}} \right \rangle_{D_{i}}</math> is the average over the i-th batch <math>D_{i}</math> of the derivative of the objective with respect to w, evaluated at <math>w_{i}</math>. The weights in each layer are initialized from a zero-mean Gaussian distribution with standard deviation 0.01. The biases in the second, fourth, and fifth convolutional layers and in the fully-connected hidden layers are initialized to 1, while those in the remaining layers are initialized to 0. This initialization accelerates the early stages of learning by providing the ReLUs with positive inputs. Initializing the network with sparse weights also reduces the ill-conditioning issue and helps the network work well.

An equal learning rate was used for all layers and was adjusted manually throughout training. The heuristic was to divide the learning rate by 10 when the validation error rate stopped improving with the current learning rate. The learning rate was initialized at 0.01 and reduced three times prior to termination. The network was trained for roughly 90 cycles through the training set of 1.2 million images, which took five to six days on two NVIDIA GTX 580 3GB GPUs.

== Results ==

For ILSVRC-2010 dataset, their network achieves top-1 and top-5 test set error rates of 37.5% and 17.0%, which was the state of the art at that time.

The following table shows the results

[[File:Tt1.png]]

Comparison of results on ILSVRC-
2010 test set. In italics are best results
achieved by others.

On the ILSVRC-2012 dataset, the CNN described in this paper achieves a top-5 error rate of 18.2%. Averaging the predictions of five similar CNNs gives an error rate of 16.4%. The following table summarizes the results on the ILSVRC-2012 dataset.

[[File:Tt3.png]]

The following figure shows the learnt kernels

[[File:Figg3.png]]

96 convolutional kernels of size
11×11×3 learned by the first convolutional
layer on the 224×224×3 input images. The
top 48 kernels were learned on GPU 1 while
the bottom 48 kernels were learned on GPU
2. See Section 6.1 for details.

== Discussion ==

1. The main techniques that allowed this success include the following: efficient GPU training, number of labeled examples, convolutional architecture with max-pooling , rectifying non-linearities , careful initialization , careful parameter update and adaptive learning rate heuristics, layerwise feature normalization , and a dropout trick based on injecting strong binary multiplicative noise on hidden units.

2. It is notable that their network’s performance degrades if a single convolutional layer is removed. So the depth of the network is important for achieving their results.

3. Their experiments suggest that the results can be improved simply by waiting for faster GPUs and bigger datasets to become available.

== Bibliography ==
<references />

imageNet Classification with Deep Convolutional Neural Networks

2015-11-26T21:36:47Z

Amirlk: /* ReLU Nonlinearity */

== Introduction ==

In this paper, they trained a large, deep neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. To learn about thousands of objects from millions of images, Convolutional Neural Network (CNN) is utilized due to its large learning capacity, fewer connections and parameters and outstanding performance on image classification.

Moreover, current GPU provides a powerful tool to facilitate the training of interestingly-large CNNs. Thus, they trained one of the largest convolutional neural networks to date on the datasets of ILSVRC-2010 and ILSVRC-2012 and achieved the best results ever reported on these datasets by the time this paper was written.

The code of their work is available here<ref>
[http://code.google.com/p/cuda-convnet/ "High-performance C++/CUDA implementation of convolutional neural networks"]
</ref>.

== Dataset ==

ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has roughly 1.2 million labeled high-resolution training images, 50 thousand validation images, and 150 thousand testing images over 1000 categories.

In this paper, the images in this dataset are down-sampled to a fixed resolution of 256 x 256. The only image pre-processing they used is subtracting the mean activity over the training set from each pixel.

== Architecture ==

=== ReLU Nonlinearity ===

Non-saturating nonlinearity ''f(x) = max(0,x)'' also known as Rectified Linear Units (ReLUs)<ref>
Nair V, Hinton G E. [http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_NairH10.pdf Rectified linear units improve restricted boltzmann machines.] Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010: 807-814.
</ref> is used as the nonlinearity, which trains several times faster than equivalent networks with standard saturating neurons. Neural networks are usually ill-conditioned and converge very slowly. With nonlinearities such as rectifiers or max-pooling units, gradients flow along a few paths instead of all possible paths, resulting in faster convergence. Thus, better performance can be achieved by reducing the training time for each epoch and training on larger datasets to prevent overfitting.
Deep convolutional neural networks with ReLUs train several times faster than their equivalents with tanh units. The following figure illustrates this: it shows the number of iterations required to reach 25% training error on the CIFAR-10 dataset for a particular four-layer convolutional network.

[[File:Fig1.png]]

A four-layer convolutional neural
network with ReLUs (solid line) reaches a 25%
training error rate on CIFAR-10 six times faster
than an equivalent network with tanh neurons
(dashed line). The learning rates for each network
were chosen independently to make training
as fast as possible. No regularization of
any kind was employed. The magnitude of the
effect demonstrated here varies with network
architecture, but networks with ReLUs consistently
learn several times faster than equivalents
with saturating neurons.

=== Training on Multiple GPUs ===

They spread the net across two GPUs by putting half of the kernels (or neurons) on each GPU and letting the GPUs communicate only in certain layers. Choosing the pattern of connectivity is a problem for cross-validation, but this scheme lets them tune the amount of communication precisely until it is an acceptable fraction of the amount of computation.

=== Local Response Normalization ===

ReLUs have the desirable property that they do not require input normalization to prevent them from saturating. However, they find that a local response normalization scheme after applying the ReLU nonlinearity can reduce their top-1 and top-5 error rates by 1.4% and 1.2%.

The response normalization is given by the expression

<math>b_{x,y}^{i}=a_{x,y}^{i}/\left ( k+\alpha \sum_{j=\max\left ( 0,i-n/2 \right )}^{\min\left ( N-1,i+n/2 \right )}\left ( a_{x,y}^{j} \right )^{2} \right )^{\beta }</math>

where the sum runs over n “adjacent” kernel maps at the same spatial position. This response normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities amongst neuron outputs computed using different kernels.

=== Overlapping Pooling ===

Unlike traditional non-overlapping pooling, they use overlapping pooling throughout their network, with pooling window size z = 3 and stride s = 2. This scheme reduces their top-1 and top-5 error rates by 0.4% and 0.3% and makes the network more difficult to overfit.

=== Overall Architecture ===

[[File:network.JPG | center]]

As shown in the figure above, the net contains eight layers with 60 million parameters; the first five are convolutional and the remaining three are fully connected layers. The output of the last layer is fed to a 1000-way softmax. Their network maximizes the average across training cases of the log-probability of the correct label under the prediction distribution.

Response-normalization layers follow the first and second convolutional layers. Max-pooling layers follow both response-normalization layers as well as the fifth convolutional layer. The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.

== Reducing overfitting ==

=== Data Augmentation ===

The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations. In this paper, the transformed images are generated on the CPU while the GPU is training, so they do not need to be stored on disk.

The first form of data augmentation consists of generating image translations and horizontal reflections. They extract random 224 x 224 patches (and their horizontal reflections) from the 256 x 256 images and train the network on these extracted patches.

The second form alters the intensities of the RGB channels. They perform principal components analysis (PCA) on the set of RGB pixel values. To each training image they add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1. Therefore, to each RGB image pixel the following quantity is added:

[[File:Fig2.png]]

This scheme helps to capture the object identity invariant with respect to its intensity and color, which reduces the top-1 error rate by over 1%.
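The first form of augmentation in this section (random crops and horizontal reflections) can be sketched as follows. This is a simplified illustration on a single-channel image stored as a list of rows; the function name, the flip probability of 0.5, and the toy image are assumptions, and the PCA-based colour perturbation is not shown.

```python
import random


def random_crop_flip(image, crop=224, rng=random):
    """Return a random crop x crop patch, mirrored with probability 0.5."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - crop)    # inclusive bounds
    left = rng.randint(0, w - crop)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # horizontal reflection
    return patch


# Toy 256 x 256 image whose pixel value encodes its position.
image = [[r * 256 + c for c in range(256)] for r in range(256)]
patch = random_crop_flip(image, rng=random.Random(0))
```

Because a new patch is drawn every time an image is visited, the patches can be generated on the fly rather than stored.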

=== Dropout ===

The “dropout” technique is implemented in the first two fully-connected layers by setting to zero the output of each hidden neuron with probability 0.5. This scheme roughly doubles the number of iterations required to converge. However, it forces the network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.

== Details of learning ==

They trained the network using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. The update rule for weight w was

<math>v_{i+1}:=0.9\cdot v_{i}-0.0005\cdot \epsilon \cdot w_{i}-\epsilon \cdot \left \langle \frac{\partial L}{\partial w}|_{w_{i}} \right \rangle_{D_{i}}</math>

<math>w_{i+1}:=w_{i}+v_{i+1}</math>

where <math>v</math> is the momentum variable and <math>\epsilon</math> is the learning rate, which is adjusted manually throughout training. The weights in each layer are initialized from a zero-mean Gaussian distribution with standard deviation 0.01. The biases in the second, fourth, and fifth convolutional layers and in the fully-connected hidden layers are initialized to 1, which accelerates the early stages of learning by providing the ReLUs with positive inputs. The biases in the remaining layers are initialized to 0. Initializing the network with sparse weights is another factor that reduces the ill-conditioning issue and helps this network work well.

An equal learning rate was used for all layers. The heuristic followed was to divide the learning rate by 10 when the validation error rate stopped improving at the current learning rate. The learning rate was initialized at 0.01 and reduced three times prior to termination. The network was trained for roughly 90 cycles through the training set of 1.2 million images, which took five to six days on two NVIDIA GTX 580 3GB GPUs.
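The update rule and the learning-rate heuristic of this section can be sketched for a single scalar weight as follows. The function names are hypothetical, and <code>grad</code> stands for the batch-averaged derivative of the loss with respect to the weight.

```python
def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One update of weight w with momentum variable v."""
    v_next = momentum * v - weight_decay * lr * w - lr * grad
    w_next = w + v_next
    return w_next, v_next


def decay_learning_rate(lr, validation_improved):
    """Divide the learning rate by 10 when the validation error stops improving."""
    return lr if validation_improved else lr / 10.0


w1, v1 = sgd_momentum_step(w=1.0, v=0.0, grad=2.0, lr=0.01)
lr_after_stall = decay_learning_rate(0.01, validation_improved=False)
```

Starting from 0.01 and applying the decay three times gives a final learning rate of 0.00001, which matches the schedule described above.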

== Results ==

On the ILSVRC-2010 dataset, their network achieves top-1 and top-5 test set error rates of 37.5% and 17.0%, respectively, which was the state of the art at that time.

The following table shows the results

[[File:Tt1.png]]

Comparison of results on the ILSVRC-2010 test set. Results in italics are the best achieved by others.

On the ILSVRC-2012 dataset, the CNN described in this paper achieves a top-5 error rate of 18.2%. Averaging the predictions of five similar CNNs gives an error rate of 16.4%. The following table summarizes the results for the ILSVRC-2012 dataset:

[[File:Tt3.png]]

The following figure shows the learnt kernels

[[File:Figg3.png]]

96 convolutional kernels of size 11×11×3 learned by the first convolutional layer on the 224×224×3 input images. The top 48 kernels were learned on GPU 1 while the bottom 48 kernels were learned on GPU 2. See Section 6.1 of the original paper for details.

== Discussion ==

1. The main techniques that allowed this success include the following: efficient GPU training, a large number of labeled examples, a convolutional architecture with max-pooling, rectifying non-linearities, careful initialization, careful parameter updates and adaptive learning rate heuristics, layerwise feature normalization, and a dropout trick based on injecting strong binary multiplicative noise on hidden units.

2. It is notable that their network’s performance degrades if a single convolutional layer is removed. So the depth of the network is important for achieving their results.

3. Their experiments suggest that the results can be improved simply by waiting for faster GPUs and bigger datasets to become available.

== Bibliography ==
<references />

learning Fast Approximations of Sparse Coding

2015-11-23T20:00:16Z

Amirlk: /* Pre-existing Approximations: Iterative Shrinkage Algorithms */

= Background =

In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space.

The introduction of a larger set of spanning vectors is a consequence of the desire to produce accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.

Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which utilizes these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.

= Review of Sparse Coding =

For an input <math> X \in \mathbb{R}^n </math>, we seek a new representation <math> Z \in \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \in \mathbb{R}^{n \times m} </math>, the matrix whose columns are the normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.

These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:

:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>,

where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = \arg\min_Z E_{W_d}(X, Z) </math>.
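As a small worked illustration of this loss, the sketch below evaluates it for two codes that reconstruct the same input exactly. The dictionary, input and codes are toy values chosen for easy arithmetic, and the dictionary is stored as n rows of m entries so that the product of the dictionary and the code lies in the input space.

```python
def energy(X, Z, W_d, alpha):
    """0.5 * squared reconstruction error + alpha * L1 norm of the code."""
    recon = [sum(w * z for w, z in zip(row, Z)) for row in W_d]
    sq_err = sum((x - r) ** 2 for x, r in zip(X, recon))
    return 0.5 * sq_err + alpha * sum(abs(z) for z in Z)


# n = 2 inputs, m = 3 code components; every column has unit norm.
W_d = [[1.0, 0.0, 0.6],
       [0.0, 1.0, 0.8]]
X = [0.6, 0.8]

dense_code = [0.6, 0.8, 0.0]   # uses two basis vectors
sparse_code = [0.0, 0.0, 1.0]  # uses a single basis vector

e_dense = energy(X, dense_code, W_d, alpha=0.5)
e_sparse = energy(X, sparse_code, W_d, alpha=0.5)
```

Both codes have zero reconstruction error, so the representation is not unique, and the L1 term is what favours the code with fewer and smaller active components.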

From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.

=Pre-existing Approximations: Iterative Shrinkage Algorithms=

Here baseline iterative shrinkage algorithms for finding sparse codes are introduced and explained. The ISTA and FISTA methods update the whole code vector in parallel, while the more efficient Coordinate Descent method (CoD) updates the components one at a time and carefully selects which component to update at each step.
Both families of methods refine the initial guess through a form of mutual inhibition between code components, and component-wise shrinkage.

==Iterative Shrinkage & Thresholding (ISTA)==

The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we shift our current code along the negative gradient of the squared reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:

:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.

Here, <math> \, L </math> is an upper bound on the largest eigenvalue of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = \mathrm{sign}(V_i) \max(|V_i| - \theta_i, 0) </math>, where <math> \theta \in \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.
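The ISTA recursion (**) can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions rather than the authors' code: the dictionary is stored as n rows of m entries, a single scalar threshold α/L is used for every component, the iteration count is fixed, and the toy dictionary and input are hypothetical.

```python
def shrink(V, theta):
    """Component-wise shrinkage: sign(v) * max(|v| - theta, 0)."""
    return [(1.0 if v > 0 else -1.0) * max(abs(v) - theta, 0.0) for v in V]


def ista(X, W_d, alpha, L, num_iters=100):
    """Run a fixed number of ISTA iterations starting from Z = 0."""
    n, m = len(W_d), len(W_d[0])
    Z = [0.0] * m
    for _ in range(num_iters):
        # Reconstruction residual W_d Z - X.
        resid = [sum(W_d[i][j] * Z[j] for j in range(m)) - X[i] for i in range(n)]
        # Gradient of the squared error: W_d^T (W_d Z - X).
        grad = [sum(W_d[i][j] * resid[i] for i in range(n)) for j in range(m)]
        # Gradient step of size 1/L followed by shrinkage with threshold alpha/L.
        Z = shrink([Z[j] - grad[j] / L for j in range(m)], alpha / L)
    return Z


# Toy dictionary with unit-norm columns; the largest eigenvalue of
# W_d^T W_d is 2 for this matrix, so L = 2 is a valid bound.
W_d = [[1.0, 0.0, 0.6],
       [0.0, 1.0, 0.8]]
Z = ista([0.6, 0.8], W_d, alpha=0.5, L=2.0)
```

For this toy input the iterates approach the code (0, 0, 0.5), which uses only the single basis vector aligned with the input.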

=== Fast ISTA ===

Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.

Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:

:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{(k-1)} - h_{\theta}^{(k - 2)}) </math>

In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term captures the rate at which the approximated code is changing.

== Coordinate Descent ==

Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent (CoD) adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.

The CoD algorithm is presented below:

<blockquote>
<math>\textbf{function} \, \textbf{CoD}\left(X, Z, W_d, S, \alpha\right)</math>
: <math>\textbf{Require:} \,S = I - W_d^T W_d</math>
: <math>\textbf{Initialize:} \,Z = 0; B = W_d^TX</math>
: <math> \textbf{repeat}</math>
:: <math>\bar{Z} = h_{\alpha}\left(B\right)</math>
:: <math> \,k = \mbox{ index of largest component of} \left|Z - \bar{Z}\right|</math>
:: <math> \forall j \in \left[1, m\right]: B_j = B_j + S_{jk}\left(\bar{Z}_k - Z_k\right)</math>
:: <math> Z_k = \bar{Z}_k</math>
: <math>\textbf{until}\,\text{change in}\,Z\,\text{is below a threshold}</math>
: <math> Z = h_{\alpha}\left(B\right)</math>
<math> \textbf{end} \, \textbf{function} </math>
</blockquote>
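The pseudocode above can be turned into the following runnable sketch. The stopping tolerance, the iteration cap, and the toy dictionary are assumptions; the dictionary is stored as n rows of m entries with unit-norm columns, as the definition of S requires.

```python
def shrink(V, theta):
    """Component-wise shrinkage: sign(v) * max(|v| - theta, 0)."""
    return [(1.0 if v > 0 else -1.0) * max(abs(v) - theta, 0.0) for v in V]


def cod(X, W_d, alpha, tol=1e-8, max_iters=1000):
    """Coordinate Descent for sparse coding, following the pseudocode."""
    n, m = len(W_d), len(W_d[0])
    # S = I - W_d^T W_d (mutual-inhibition matrix).
    S = [[(1.0 if j == k else 0.0)
          - sum(W_d[i][j] * W_d[i][k] for i in range(n))
          for k in range(m)] for j in range(m)]
    # B = W_d^T X.
    B = [sum(W_d[i][j] * X[i] for i in range(n)) for j in range(m)]
    Z = [0.0] * m
    for _ in range(max_iters):
        Z_bar = shrink(B, alpha)
        # Component whose update would change the code the most.
        k = max(range(m), key=lambda j: abs(Z[j] - Z_bar[j]))
        delta = Z_bar[k] - Z[k]
        if abs(delta) < tol:
            break
        # Propagate the change of component k to every B_j.
        for j in range(m):
            B[j] += S[j][k] * delta
        Z[k] = Z_bar[k]
    return shrink(B, alpha)


W_d = [[1.0, 0.0, 0.6],
       [0.0, 1.0, 0.8]]
Z = cod([0.6, 0.8], W_d, alpha=0.5)
```

On this toy input a single coordinate update is enough: once the third component is set, its inhibition pushes the other two below the threshold.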

In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and so, also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. This algorithm has a similar feedback concept to ISTA, but it can be expressed as a linear feedback operation with a very sparse matrix (since only one component is updated at a time). Either way, it turns out that, when both are deployed for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.

= Encoders for Sparse Code Approximation =

In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \in \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.

==A Simplistic Architecture and its Limitations==

The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.

Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.
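The limitation can be seen in a two-component toy case. In the sketch below, the filter values, the input and the threshold are hypothetical numbers chosen for easy arithmetic, and the inhibition step is a hand-written illustration of what an interaction term between components would do, not the authors' architecture.

```python
def shrink(V, theta):
    """Component-wise shrinkage: sign(v) * max(|v| - theta, 0)."""
    return [(1.0 if v > 0 else -1.0) * max(abs(v) - theta, 0.0) for v in V]


# Two highly similar unit-norm filters (rows of the weight matrix).
W_e = [[1.0, 0.0],
       [0.96, 0.28]]
X = [1.0, 0.0]
theta = 0.5

# Single-layer encoder: each component responds independently.
pre = [sum(w * x for w, x in zip(row, X)) for row in W_e]
feedforward_code = shrink(pre, theta)

# With an interaction term: component 0 already explains X, so its
# contribution is subtracted from the input of the similar component 1.
overlap = sum(a * b for a, b in zip(W_e[0], W_e[1]))
inhibited = [pre[0], pre[1] - overlap * feedforward_code[0]]
inhibited_code = shrink(inhibited, theta)
```

The feed-forward code activates both components at nearly equal strength, whereas the inhibited code keeps the first and drives the redundant second one to zero.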

== Learned ISTA & Learned Coordinate Descent ==

To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.

Before understanding the rationale behind this approach, we must recognize a few relevant values which are inherently fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms of parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach proposes to learn <math> \, \theta </math>, <math> \, W_e </math>, <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away.

In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times ''T''. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.
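The forward pass of such an encoder can be sketched as below, assuming the parameters are given as plain nested lists. Training by stochastic gradient descent and back-propagation through time is not shown, the helper names are hypothetical, and the parameter values in the example are simply the analytic ISTA values for a toy dictionary with L = 2 and α = 0.5, used only to check the recursion.

```python
def shrink_vec(V, theta):
    """Shrinkage with a per-component threshold vector theta."""
    return [(1.0 if v > 0 else -1.0) * max(abs(v) - t, 0.0)
            for v, t in zip(V, theta)]


def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def lista_forward(X, W_e, S, theta, T):
    """Apply Z <- h_theta(W_e X + S Z) exactly T times, starting from Z = 0."""
    B = matvec(W_e, X)          # computed once and reused at every step
    Z = [0.0] * len(B)
    for _ in range(T):
        C = [b + s for b, s in zip(B, matvec(S, Z))]
        Z = shrink_vec(C, theta)
    return Z


# Analytic (untrained) parameters for a toy 2 x 3 dictionary.
W_e = [[0.5, 0.0], [0.0, 0.5], [0.3, 0.4]]
S = [[0.5, 0.0, -0.3], [0.0, 0.5, -0.4], [-0.3, -0.4, 0.5]]
theta = [0.25, 0.25, 0.25]
Z2 = lista_forward([0.6, 0.8], W_e, S, theta, T=2)
```

Since the same matrices are reused at every step, the cost of the encoder is fixed by T, and learning the parameters is what allows a small T to reach a useful accuracy.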

Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogously to the procedure for LISTA, except for the technicality that sub-gradients are propagated, resulting from the fact that we search for the component inducing the largest update in the code ''Z''.

The algorithm for LCoD can be summarized as

[[File:Q12.png]]

= Empirical Performance =

Two sets of experiments were undertaken:

* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.
* The MNIST digits dataset was used in assessing whether improved error-rates in code-prediction yield superior performance in recognition tasks.

Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.

== Berkeley Image Database ==

From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>, and the squared-error loss for the distance between the estimate and the optimal code, as computed by Coordinate Descent.

Figure 1 suggests that, for a small number of iterations, LISTA notably out-performs FISTA here. Furthermore, 18 iterations of FISTA are required to achieve the error-rate produced by 1 iteration of LISTA when ''m'' = 100, and 35 iterations when ''m'' = 400. This leads the authors to conclude that LISTA is ~20 times faster than FISTA.

<center>
[[File:LISTA2.png |frame | center |Figure 1: Comparison of LISTA and FISTA. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400 ]]
</center>

Figure 2 indicates that, for a moderately-large number of iterations, LCoD has superior accuracy to Coordinate Descent. When the number of iterations is larger, Coordinate Descent out-performs LCoD, but this can be reversed by initializing the LCoD matrices with their Coordinate Descent values prior to training.

<center>
[[File:LCOD.png |frame | center |Figure 2: Comparison of LCoD and Coordinate Descent. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" denotes ''m'' = 400. Open circles indicate that the LCoD matrices have been initialized with their Coordinate Descent values prior to training ]]
</center>

== MNIST Digits ==

Here, the authors examined whether these coding approximations could be effectively applied in classification tasks. Evaluation was conducted using both the entire 28x28-pixel images to create 784-dimensional codes, as well as extracted 16x16-pixel patches for codes with 256 components. In addition to the finding that increasing the number of iterations reduced classification error across all procedures, it was observed that Learned Coordinate Descent approached the optimal error rate after only 10 iterations.

A complete feature vector consisted of 25 such vectors concatenated, extracted from all 16 × 16 patches shifted by 3 pixels on the input. The features were extracted for all digits using CoD with exact inference, CoD with a fixed number of iterations, and LCoD. Additionally, a version of CoD (denoted CoD’) used inference with a fixed number of iterations during training of the filters, and used the same number of iterations during test (same complexity as LCoD). A logistic regression classifier was trained on the features thereby obtained.

Classification errors on the test set are shown in the following figures. While the error rate decreases with the number of iterations for all methods, the error rate of LCoD with 10 iterations is very close to the optimal (differences in error rates of less than 0.1% are insignificant on MNIST).

[[File:T1.png]]

MNIST results with 784-D sparse codes

MNIST results with 25 256-D sparse codes extracted
from 16 × 16 patches every 3 pixels

[[File:T2.png]]

learning Fast Approximations of Sparse Coding

2015-11-23T19:59:36Z

Amirlk: /* Pre-existing Approximations: Iterative Shrinkage Algorithms */

= Background =

In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space.

The introduction of a larger set of spanning vectors is a consequence of the desire to produce accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.

Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which utilizes these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.

= Review of Sparse Coding =

For an input <math> X \in \mathbb{R}^n </math>, we seek a new representation <math> Z \in \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \in \mathbb{R}^{n \times m} </math>, the matrix whose columns are the normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.

These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:

:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>,

where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = \arg\min_Z E_{W_d}(X, Z) </math>.

From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.

=Pre-existing Approximations: Iterative Shrinkage Algorithms=

Here baseline iterative shrinkage algorithms for finding sparse codes are introduced and explained.

==Iterative Shrinkage & Thresholding (ISTA)==

The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we shift our current code along the negative gradient of the squared reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:

:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.

Here, <math> \, L </math> is an upper bound on the largest eigenvalue of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = \mathrm{sign}(V_i) \max(|V_i| - \theta_i, 0) </math>, where <math> \theta \in \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.

=== Fast ISTA ===

Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.

Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:

:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{(k-1)} - h_{\theta}^{(k - 2)}) </math>

In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term captures the rate at which the approximated code is changing.

== Coordinate Descent ==

Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent (CoD) adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.

The CoD algorithm is presented below:

<blockquote>
<math>\textbf{function} \, \textbf{CoD}\left(X, Z, W_d, S, \alpha\right)</math>
: <math>\textbf{Require:} \,S = I - W_d^T W_d</math>
: <math>\textbf{Initialize:} \,Z = 0; B = W_d^TX</math>
: <math> \textbf{repeat}</math>
:: <math>\bar{Z} = h_{\alpha}\left(B\right)</math>
:: <math> \,k = \mbox{ index of largest component of} \left|Z - \bar{Z}\right|</math>
:: <math> \forall j \in \left[1, m\right]: B_j = B_j + S_{jk}\left(\bar{Z}_k - Z_k\right)</math>
:: <math> Z_k = \bar{Z}_k</math>
: <math>\textbf{until}\,\text{change in}\,Z\,\text{is below a threshold}</math>
: <math> Z = h_{\alpha}\left(B\right)</math>
<math> \textbf{end} \, \textbf{function} </math>
</blockquote>

In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and so, after also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. This algorithm has a similar feedback concept to ISTA, but it can be expressed as a linear feedback operation with a very sparse matrix (since only one component is updated at a time). Either way, when both are run for an approximately equal amount of time, Coordinate Descent out-performs the ISTA methods in its approximation to an optimal code.
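The CoD pseudocode above can be transcribed almost line by line. The following is a minimal sketch in plain Python, assuming that the columns of <math> W_d </math> have unit norm (so that <math> S </math> has a zero diagonal); the names <code>soft</code> and <code>cod</code>, the tolerance <code>tol</code> and the cap <code>max_iter</code> are assumptions introduced for illustration.

```python
def soft(x, t):
    """Scalar shrinkage: sign(x) * max(|x| - t, 0)."""
    mag = max(abs(x) - t, 0.0)
    return mag if x >= 0 else -mag


def cod(X, W_d, alpha, tol=1e-10, max_iter=10000):
    """Coordinate Descent for sparse coding, following the pseudocode above.

    Assumes the columns of W_d have unit norm, so S has a zero diagonal.
    """
    n, m = len(W_d), len(W_d[0])
    # S = I - W_d^T W_d
    S = [[(1.0 if i == j else 0.0)
          - sum(W_d[r][i] * W_d[r][j] for r in range(n))
          for j in range(m)] for i in range(m)]
    # B = W_d^T X
    B = [sum(W_d[r][i] * X[r] for r in range(n)) for i in range(m)]
    Z = [0.0] * m
    for _ in range(max_iter):
        Z_bar = [soft(b, alpha) for b in B]
        # Greedy choice: the component whose update would change the most.
        k = max(range(m), key=lambda i: abs(Z[i] - Z_bar[i]))
        delta = Z_bar[k] - Z[k]
        if abs(delta) < tol:      # stopping rule; tol is an assumed value
            break
        for j in range(m):        # propagate the change through column k of S
            B[j] += S[j][k] * delta
        Z[k] = Z_bar[k]
    return [soft(b, alpha) for b in B]
```

For a dictionary with two correlated atoms and an input equal to the first atom, the first update activates that atom and drives the second entry of <math> B </math> below the threshold, so the redundant component stays at zero.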

= Encoders for Sparse Code Approximation =

In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we train a neural network which takes our original input <math> \, X \in \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set consists of the original inputs <math> \, X </math>, with their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network is chosen with consideration of its feasibility for online processing.

==A Simplistic Architecture and its Limitations==

The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.

Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.

== Learned ISTA & Learned Coordinate Descent ==

To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.

Before understanding the rationale behind this approach, we must recognize a few relevant values which are inherently fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms of parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach proposes to learn <math> \, \theta </math>, <math> \, W_e </math>, and <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away.

In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times ''T''. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.
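The forward pass of such an encoder is simply the update (**) unrolled ''T'' times with free parameters. The sketch below shows only this forward pass, not the training by back-propagation through time. The names <code>lista_forward</code> and <code>ista_init</code>, the starting point <math> Z = 0 </math>, and the idea of starting the parameters at their ISTA values are assumptions made for illustration.

```python
def shrink(v, theta):
    """Shrinkage h_theta applied component-wise."""
    out = []
    for x, t in zip(v, theta):
        mag = max(abs(x) - t, 0.0)
        out.append(mag if x >= 0 else -mag)
    return out


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def lista_forward(X, W_e, S, theta, T):
    """Unrolled encoder: T applications of Z <- h_theta(W_e X + S Z), from Z = 0.

    W_e, S and theta are free parameters; S is shared by all T steps.
    """
    B = matvec(W_e, X)
    Z = [0.0] * len(theta)
    for _ in range(T):
        Z = shrink([b + s for b, s in zip(B, matvec(S, Z))], theta)
    return Z


def ista_init(W_d, alpha, L):
    """ISTA values of (W_e, S, theta): one possible (assumed) starting point."""
    n, m = len(W_d), len(W_d[0])
    W_e = [[W_d[r][i] / L for r in range(n)] for i in range(m)]
    S = [[(1.0 if i == j else 0.0)
          - sum(W_d[r][i] * W_d[r][j] for r in range(n)) / L
          for j in range(m)] for i in range(m)]
    theta = [alpha / L] * m
    return W_e, S, theta
```

With the parameters left at their ISTA values, <code>lista_forward</code> is just ISTA truncated after ''T'' steps; the point of training is that learned values of <math> \, W_e </math>, <math> \, S </math> and <math> \, \theta </math> can give a much better code for the same small ''T''.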

Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogously to the procedure for LISTA, except for the technicality that sub-gradients are propagated, resulting from the fact that we search for the component inducing the largest update in the code ''Z''.

The algorithm for LCoD can be summarized as

[[File:Q12.png]]

= Empirical Performance =

Two sets of experiments were undertaken:

* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.
* The MNIST digits dataset was used in assessing whether improved error-rates in code-prediction yield superior performance in recognition tasks.

Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.

== Berkeley Image Database ==

From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>, and the squared-error loss for the distance between the estimate and the optimal code, as computed by Coordinate Descent.

Figure 1 suggests that, for a small number of iterations, LISTA notably out-performs FISTA here. Furthermore, 18 iterations of FISTA are required to achieve the error-rate produced by 1 iteration of LISTA when ''m'' = 100, and 35 iterations when ''m'' = 400. This leads the authors to conclude that LISTA is ~20 times faster than FISTA.

<center>
[[File:LISTA2.png |frame | center |Figure 1: Comparison of LISTA and FISTA. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400 ]]
</center>

Figure 2 indicates that, for a moderately-large number of iterations, LCoD has superior accuracy to Coordinate Descent. When the number of iterations is larger, Coordinate Descent out-performs LCoD, but this can be reversed by initializing matrices with their LCoD values prior to training.

<center>
[[File:LCOD.png |frame | center |Figure 2: Comparison of LCoD and Coordinate Descent. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400. Open Circles indicate that matrices have been initialized with their LCoD values prior to training ]]
</center>

== MNIST Digits ==

Here, the authors examined whether these coding approximations could be effectively applied in classification tasks. Evaluation was conducted using both the entire 28x28-pixel images to create 784-dimensional codes, as well as extracted 16x16-pixel patches for codes with 256 components. In addition to the finding that increasing the number of iterations reduced classification error across all procedures, it was observed that Learned Coordinate Descent approached the optimal error rate after only 10 iterations.

For the patch-based codes, a complete feature vector consisted of 25 concatenated 256-dimensional codes, extracted from all 16 × 16 patches shifted by 3 pixels on the input. The features were extracted for all digits using CoD with exact inference, CoD with a fixed number of iterations, and LCoD. Additionally, a version of CoD (denoted CoD’) used inference with a fixed number of iterations during training of the filters, and used the same number of iterations during test (same complexity as LCoD). A logistic regression classifier was trained on the features thereby obtained.

Classification errors on the test set are shown in the following figures. While the error rate decreases with the number of iterations for all methods, the error rate of LCoD with 10 iterations is very close to the optimal (differences in error rates of less than 0.1% are insignificant on MNIST).

[[File:T1.png]]

MNIST results with 784-D sparse codes

MNIST results with 25 256-D sparse codes extracted
from 16 × 16 patches every 3 pixels

[[File:T2.png]]

= References =
Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm with application to wavelet-based image deblurring. ICASSP’09, pp. 693–696, 2009.

Chen, S.S., Donoho, D.L., and Saunders, M.A. Atomic decomposition by basis pursuit. SIAM review, 43(1): 129–159, 2001.

Daubechies, I, Defrise, M., and De Mol, C. An iterative
thresholding algorithm for linear inverse problems with a
sparsity constraint. Comm. on Pure and Applied Mathematics,
57:1413–1457, 2004.

Donoho, D.L. and Elad, M. Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. PNAS, 100(5):2197–2202, 2003.

Elad, M. and Aharon, M. Image denoising via learned dictionaries and sparse representation. In CVPR’06, 2006.

Hale, E.T., Yin, W., and Zhang, Y. Fixed-point continuation for l1-minimization: Methodology and convergence. SIAM J. on Optimization, 19:1107, 2008.

Hoyer, P. O. Non-negative matrix factorization with sparseness constraints. JMLR, 5:1457–1469, 2004.

Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. What is the best multi-stage architecture for object recognition? In ICCV’09. IEEE, 2009.

Kavukcuoglu, Koray, Ranzato, Marc’Aurelio, and LeCun,
Yann. Fast inference in sparse coding algorithms
with applications to object recognition. Technical Report
CBLL-TR-2008-12-01, Computational and Biological
Learning Lab, Courant Institute, NYU, 2008.

Lee, H., Battle, A., Raina, R., and Ng, A.Y. Efficient
sparse coding algorithms. In NIPS’06, 2006.

Lee, H., Chaitanya, E., and Ng, A. Y. Sparse deep belief
net model for visual area v2. In Advances in Neural
Information Processing Systems, 2007.

Lee, H., Grosse, R., Ranganath, R., and Ng, A.Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In International Conference on Machine Learning. ACM New York, 2009.

Li, Y. and Osher, S. Coordinate descent optimization for l1 minimization with application to compressed sensing; a greedy algorithm. Inverse Problems and Imaging, 3(3):487–503, 2009.

Mairal, J., Elad, M., and Sapiro, G. Sparse representation
for color image restoration. IEEE T. Image Processing,
17(1):53–69, January 2008.

Mairal, J., Bach, F., Ponce, J., and Sapiro, G. Online dictionary learning for sparse coding. In ICML’09, 2009.

Olshausen, B.A. and Field, D. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.

Ranzato, M., Huang, F.-J., Boureau, Y.-L., and LeCun,
Y. Unsupervised learning of invariant feature hierarchies
with applications to object recognition. In CVPR’07.
IEEE, 2007a.

Ranzato, M.-A., Boureau, Y.-L., Chopra, S., and LeCun,
Y. A unified energy-based framework for unsupervised
learning. In AI-Stats’07, 2007b.

Rozell, C.J., Johnson, D.H, Baraniuk, R.G., and Olshausen,
B.A. Sparse coding via thresholding and local
competition in neural circuits. Neural Computation, 20:
2526–2563, 2008.

Vonesch, C. and Unser, M. A fast iterative thresholding algorithm
for wavelet-regularized deconvolution. In IEEE
ISBI, 2007.

Wu, T.T. and Lange, K. Coordinate descent algorithms
for lasso penalized regression. Ann. Appl. Stat, 2(1):
224–244, 2008.

Yang, Jianchao, Yu, Kai, Gong, Yihong, and Huang, Thomas. Linear spatial pyramid matching using sparse coding for image classification. In CVPR’09, 2009.

Yu, Kai, Zhang, Tong, and Gong, Yihong. Nonlinear learning using local coordinate coding. In NIPS’09, 2009.

deep Neural Nets as a Method for Quantitative Structure–Activity Relationships

2015-11-23T19:30:26Z

Amirlk: /* Results */

== Introduction ==
This abstract is a summary of the paper "Deep Neural Nets as a Method for Quantitative Structure−Activity Relationships" by Ma J. et al. <ref> Ma J, Sheridan R. et al. [http://pubs.acs.org/doi/pdf/10.1021/ci500747n.pdf "QSAR deep nets"] Journal of Chemical Information and Modeling. 2015, 55, 263-274</ref>. The paper applies machine learning methods, specifically Deep Neural Networks (DNNs) <ref> Hinton, G. E.; Osindero, S.; Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation 2006, 18, 1527−1554</ref> and Random Forest (RF) models <ref> Breiman L. Random Forests, Machine Learning. 2001, 45, 5-32</ref>, to drug discovery in the pharmaceutical industry. Discovering a drug requires selecting, from chemical compounds with different molecular structures, the combination that achieves the best biological activity. Quantitative Structure-Activity Relationship (QSAR) models are currently used routinely for this purpose. Structure-Activity Relationship (SAR) modelling, or Quantitative SAR (QSAR), is an approach designed to find relationships between the chemical structure and the biological activity (or target property) of the studied compounds. QSAR models are classification or regression models in which the predictors consist of physico-chemical properties or theoretical molecular descriptors, and the response variable is a biological activity of the chemicals, such as the concentration of a substance required to give a certain biological response. The basic idea behind these methods is that the activity of a molecule is reflected in its structure, so that similar molecules have similar activities. If we learn the activities of a set of molecular structures (or combinations of molecules), we can then predict the activities of similar molecules.
QSAR methods are often computationally intensive or require the adjustment of many sensitive parameters to achieve good predictions. Machine learning methods can help here, and two of them, support vector machines (SVM) and random forests (RF), are commonly used <ref>Svetnik, V. et al., [http://pubs.acs.org/doi/pdf/10.1021/ci034160g.pdf Random forest: a classification and regression tool for compound classification and QSAR modeling], J. Chem. Inf. Comput. Sci. 2003, 43, 1947−1958 </ref>. In this paper the authors investigate the prediction performance of DNNs as a QSAR method and compare it with that of RF, which is often considered a gold standard in this field.

== Motivation ==
At the first stage of drug discovery there is a huge number of candidate compounds that can be combined to produce a new drug. This process may involve a large number of compounds (>100 000) with different biological activities and a large number of descriptors (several thousand). Predicting all biological activities for all compounds experimentally would require a very large number of experiments. In silico discovery, together with optimization algorithms, can substantially reduce the experimental work that needs to be done. The authors hypothesized that DNN models outperform RF models.

== Methods ==
To compare the prediction performance of the methods, DNNs and RF were fitted to 15 data sets from the pharmaceutical company Merck (referred to below as the Kaggle data sets). The smallest data set has 2092 molecules with 4596 unique AP and DP descriptors. Each molecule is represented by a list of features, i.e. “descriptors” in QSAR nomenclature. The descriptors are substructure descriptors (e.g., atom pairs (AP), MACCS keys, circular fingerprints, etc.) and donor-acceptor pair (DP) descriptors. Both AP and DP descriptors are of the following form:

atom type i − (distance in bonds) − atom type j

For AP, the atom type includes the element, the number of non-hydrogen neighbors, and the number of pi electrons. For DP, the atom type is one of seven categories (cation, anion, neutral donor, neutral acceptor, polar, hydrophobe, and other). A separate group of 15 data sets, the Additional Data Sets, was used to validate the conclusions drawn from the Kaggle data sets. Each data set was split into a training set and a test set. The metric used to evaluate the prediction performance of the methods is the coefficient of determination (<math>R^2</math>).
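This summary does not spell out how <math>R^2</math> is computed, and two definitions are in common use. The sketch below shows both under hypothetical function names: the usual coefficient of determination, and the squared Pearson correlation between observed and predicted activities. Which of the two the authors used is not stated here, so neither should be read as the paper's definition.

```python
def r2_determination(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def r2_pearson(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    cov = sum((o - mean_obs) * (p - mean_pred)
              for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    return cov * cov / (var_obs * var_pred)
```

The two agree for perfect predictions but can differ sharply otherwise: predictions that are perfectly correlated with the observations yet on the wrong scale give a squared correlation of 1 while the coefficient of determination can be negative.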

For RF, 100 trees were generated with m/3 descriptors used at each branch-point, where m is the number of unique descriptors in the training set. Tree nodes with 5 or fewer molecules were not split further. The trees were built in parallel, one tree per processor on a cluster, so that larger data sets could be run in a reasonable time.

DNNs that take the descriptors X of a molecule as input, with each neuron producing an output of the form <math>O=f(\sum_{i=1}^{N} w_ix_i+b)</math>, were fitted to the data sets. Here <math>w_i</math> are the weights, <math>b</math> is the bias, and <math>f</math> is the activation function. Since many parameters, such as the number of layers and the number of neurons, influence the performance of a deep neural net, Ma and his colleagues carried out a sensitivity analysis. They trained 71 DNNs with different parameters for each data set. The parameters they considered relate to:

-Data: descriptor transformation (no transformation, logarithmic transformation, or binary transformation).

-Network architecture: number of hidden layers, number of neurons in each hidden layer.

-Activation functions: sigmoid or rectified linear unit.

-The DNN training strategy: single training set or joint from multiple sets, percentage of neurons to drop-out in each layer.

-The mini-batched stochastic gradient descent procedure in the BP algorithm: the minibatch size, number of epochs.

-Control of the gradient descent optimization procedure: learning rate, momentum strength, and weight cost strength.
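To make the data and activation choices in this list concrete, here is a small sketch of the descriptor transformations and of a single neuron's output. The specific forms used for the transformations (<code>log(x + 1)</code> and "1 if the descriptor is present, else 0") are assumptions, since the summary only names them, and all function names are hypothetical.

```python
import math


def log_transform(x):
    """Logarithmic transformation, assumed here to be log(x + 1)."""
    return [math.log(v + 1.0) for v in x]


def binary_transform(x):
    """Binary transformation, assumed here to mean 'descriptor present or not'."""
    return [1.0 if v > 0 else 0.0 for v in x]


def relu(a):
    """Rectified linear unit."""
    return max(a, 0.0)


def sigmoid(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-a))


def neuron_output(x, w, b, f):
    """O = f(sum_i w_i * x_i + b) for one neuron with activation f."""
    return f(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

For example, <code>neuron_output(log_transform(x), w, b, relu)</code> gives the output of one ReLU unit applied to log-transformed descriptor counts.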

In addition to the effect of these parameters on the DNN, the authors were interested in evaluating the consistency of results across a diverse set of QSAR tasks. Because evaluating the effect of the large number of adjustable parameters is time-consuming, a reasonable number of parameter settings were selected by adjusting the values of one or two parameters at a time, and <math>R^2</math> was then calculated for the DNNs trained with the selected settings. These results allowed them to focus on a smaller number of parameters, and finally to generate a set of recommended values for all algorithmic parameters, which can lead to consistently good predictions.

== Results ==

For the first objective of the paper, comparing the performance of DNNs with RF, more than 50 DNNs were trained using different parameter settings. These parameter settings were arbitrarily selected, but they attempted to cover a sufficient range of values for each adjustable parameter. Figure 1 shows the difference in <math>R^2</math> between DNNs and RF for each Kaggle data set. Each column represents a QSAR data set, and each circle represents the improvement of a DNN over RF.

<center>
[[File: fig1.PNG | frame | center |Figure 1. Overall DNN vs RF using arbitrarily selected parameter values. Each column represents a QSAR data set, and each circle represents the
improvement, measured in <math>R^2</math>, of a DNN over RF ]]
</center>

Comparing the performance of the different models shows that even when the worst DNN parameter setting was used for each QSAR task, the average <math>R^2</math> would degrade only from 0.423 to 0.412, merely a 2.6% reduction. These results suggest that DNNs can generally outperform RF (Table 1).

<center>
[[File: table1.PNG | frame | center |Table 1. Comparing test <math>R^2</math> of different models ]]
</center>

The difference in <math>R^2</math> between DNN and RF as the network architecture changes is shown in Figure 2. To limit the number of parameter combinations, the authors used the same number of neurons in every hidden layer. Thirty-two DNNs were trained for each data set by varying the number of hidden layers and the number of neurons in each layer, while the other key adjustable parameters were kept unchanged. When there are two hidden layers, having a small number of neurons in the layers degrades the predictive capability of DNNs. It can also be seen that, given any number of hidden layers, once the number of neurons per layer is sufficiently large, increasing the number of neurons further has only a marginal benefit. In Figure 2 we can see that a neural network with only one hidden layer of 12 neurons achieved the same average predictive capability as RF. This size of neural network is indeed comparable with that of the classical neural networks used in QSAR.

<center>
[[File: fig2.PNG | frame | center |Figure 2. Impacts of Network Architecture. Each marker in the plot represents a choice of DNN network architecture. Markers that share the same number of hidden layers are connected with a line. The measurement (i.e., y-axis) is the difference of the mean <math>R^2</math> between DNNs and RF. ]]
</center>

To decide whether Sigmoid or ReLU performs better as the activation function, at least 15 pairs of DNNs were trained for each data set. Each pair of DNNs shared the same adjustable parameter settings, except that one DNN used ReLU as the activation function while the other used Sigmoid. The difference was tested with a one-sample Wilcoxon test. In Figure 3, the data sets where ReLU is significantly better than Sigmoid are colored in blue and marked at the bottom with “+”s, while the data set where Sigmoid is significantly better than ReLU is colored in black and marked at the bottom with “−”s. In 8 of the 15 data sets (53.3%), ReLU is statistically significantly better than Sigmoid. Overall, ReLU improves the average <math>R^2</math> over Sigmoid by 0.016.

<center>
[[File: fig3.PNG | frame | center |Figure 3. Choice of activation functions. Each column represents a QSAR data set, and each circle represents the difference, measured in <math>R^2</math>, of a pair of
DNNs trained with ReLU and Sigmoid, respectively ]]
</center>

Figure 4 presents the difference between joint DNNs trained with multiple data sets and individual DNNs trained with single data sets. Averaged over all data sets, the joint DNNs seem to perform better than the individually trained ones. However, the size of the training set plays a critical role in whether a joint DNN is beneficial. For the two largest data sets (i.e., 3A4 and LOGD), the individual DNNs seem better, indicating that joint DNNs are better suited to smaller data sets.

<center>
[[File: fig4.PNG | frame | center |Figure 4. Difference between joint DNNs trained with multiple data sets and the individual DNNs trained with single data sets ]]
</center>

The authors refined their selection of adjustable DNN parameters by studying the results of the previous runs. They used the logarithmic transformation, two hidden layers, at least 250 neurons in each hidden layer, and ReLU as the activation function. The results are shown in Figure 5. Comparison of these results with those in Figure 1 indicates that there are now 9 out of 15 data sets where DNNs outperform RF even with the “worst” parameter setting, compared with 4 out of 15 before. The <math>R^2</math> averaged over all DNNs and all 15 data sets is 0.051 higher than that of RF.

<center>
[[File: fig5.PNG | frame | center |Figure 5. DNN vs RF with refined parameter settings ]]
</center>

As a conclusion of the sensitivity analysis carried out in this work, the authors recommend the following settings for the adjustable parameters of DNNs:

-Logarithmic transformation.

-Four hidden layers, with 4000, 2000, 1000, and 1000 neurons, respectively.

-Dropout rates of 0 in the input layer, 25% in the first 3 hidden layers, and 10% in the last hidden layer.

-The activation function of ReLU.

-No unsupervised pretraining. The network parameters should be initialized as random values.

-Large number of epochs.

-Learning rate of 0.05, momentum strength of 0.9, and weight cost strength of 0.0001.
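The recommended settings listed above can be collected into a single configuration object. The sketch below is only one hypothetical way to write them down: the key names, the helper functions and the single-output assumption are illustrative. The input size of 4596 is the descriptor count quoted earlier for the smallest data set.

```python
# Recommended settings from the list above; key names are illustrative only.
RECOMMENDED = {
    "descriptor_transform": "log",
    "hidden_layers": [4000, 2000, 1000, 1000],
    "dropout": {"input": 0.0, "hidden": [0.25, 0.25, 0.25, 0.10]},
    "activation": "relu",
    "unsupervised_pretraining": False,   # weights start from random values
    "learning_rate": 0.05,
    "momentum": 0.9,
    "weight_cost": 0.0001,
    # The number of epochs is only described as "large"; no value is given.
}


def layer_shapes(n_inputs, hidden_layers, n_outputs=1):
    """(fan_in, fan_out) of each weight matrix, assuming one output unit."""
    sizes = [n_inputs] + list(hidden_layers) + [n_outputs]
    return list(zip(sizes[:-1], sizes[1:]))


def count_weights(shapes):
    """Total number of connection weights, excluding biases."""
    return sum(fan_in * fan_out for fan_in, fan_out in shapes)


shapes = layer_shapes(4596, RECOMMENDED["hidden_layers"])
n_weights = count_weights(shapes)
```

Counting the weights this way shows that the first layer dominates the size of the network when the number of descriptors is in the thousands.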

To check the consistency of DNN predictions, which was one of the authors' concerns, they compared the performance of RF and DNN on 15 additional QSAR data sets. Each additional data set was time-split into training and test sets in the same way as the Kaggle data sets. Individual DNNs were trained on the training set using the recommended parameters, and the <math>R^2</math> of the DNN and RF were calculated on the test sets. The table below presents the results for the additional data sets. The DNN with recommended parameters outperforms RF in 13 out of the 15 additional data sets. The mean <math>R^2</math> of DNNs is 0.411, while that of RFs is 0.361, which is an improvement of 13.9%.
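
As a quick check, the relative improvement follows directly from the two mean values quoted above:

:: <math> \frac{0.411 - 0.361}{0.361} = \frac{0.050}{0.361} \approx 0.139 </math>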

<center>
[[File: table2.PNG | frame | center |Comparing RF with DNN trained using recommended parameter settings on 15 additional datasets]]
</center>

Both RF and DNN can be efficiently sped up using high-performance computing technologies, but in different ways due to the inherent difference in their algorithms. RF can be accelerated using coarse parallelization on a cluster by giving one tree per node. In contrast, DNN can efficiently make use of the parallel computation capability of a modern GPU. With the dramatic advance in GPU hardware and increasing availability of GPU computing resources, DNN can become comparable, if not more advantageous, to RF in various aspects, including ease of implementation, computation time, and hardware cost.

== Discussion ==
This paper demonstrates that DNNs can in most cases be used as a practical QSAR method in place of RF, which is currently regarded as a gold standard in the field of drug discovery. Although the magnitude of the change in the coefficient of determination relative to RF is small for some data sets, on average DNNs perform better than RF. The paper recommends a set of values for all DNN algorithmic parameters, which are appropriate for large QSAR data sets in an industrial drug discovery environment. The authors also gave some recommendations about how RF and DNN can be efficiently sped up using high-performance computing technologies. They suggest that RF can be accelerated using coarse parallelization on a cluster by giving one tree per node. In contrast, DNN can efficiently make use of the parallel computation capability of a modern GPU.

== Future Works ==

Contrary to the expectation that unsupervised pretraining plays a critical role in the success of DNNs, in this study it had an adverse effect on performance in QSAR tasks, which needs further investigation.
Although the paper makes some recommendations about the adjustable parameters of DNNs, there is still a need to develop an effective and efficient strategy for refining these parameters for each particular QSAR task.
The results of the paper suggest that cross-validation failed to be effective for fine-tuning the algorithmic parameters. Therefore, instead of automatic methods for tuning DNN parameters, new approaches that can better indicate a DNN’s predictive capability on a time-split test set need to be developed.

== Bibliography ==
<references />

deep Neural Nets as a Method for Quantitative Structure–Activity Relationships

2015-11-23T19:29:31Z

Amirlk: /* Results */

== Introduction ==
This abstract is a summary of the paper "Deep Neural Nets as a Method for Quantitative Structure−Activity Relationships" by Ma J. et al. <ref> Ma J, Sheridan R. et al. [http://pubs.acs.org/doi/pdf/10.1021/ci500747n.pdf "QSAR deep nets"] Journal of Chemical Information and Modeling. 2015, 55, 263-274</ref>. The paper applies machine learning methods, specifically Deep Neural Networks (DNNs) <ref> Hinton, G. E.; Osindero, S.; Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation 2006, 18, 1527−1554</ref> and Random Forest (RF) models <ref> Breiman L. Random Forests, Machine Learning. 2001, 45, 5-32</ref>, to drug discovery in the pharmaceutical industry. Discovering a drug requires selecting, from chemical compounds with different molecular structures, the combination that achieves the best biological activity. Quantitative Structure-Activity Relationship (QSAR) models are currently used routinely for this purpose. Structure-Activity Relationship (SAR) modelling, or Quantitative SAR (QSAR), is an approach designed to find relationships between the chemical structure and the biological activity (or target property) of the studied compounds. QSAR models are classification or regression models in which the predictors consist of physico-chemical properties or theoretical molecular descriptors, and the response variable is a biological activity of the chemicals, such as the concentration of a substance required to give a certain biological response. The basic idea behind these methods is that the activity of a molecule is reflected in its structure, so that similar molecules have similar activities. If we learn the activities of a set of molecular structures (or combinations of molecules), we can then predict the activities of similar molecules.
QSAR methods are often computationally intensive or require the adjustment of many sensitive parameters to achieve good predictions. Machine learning methods can help here, and two of them, support vector machines (SVM) and random forests (RF), are commonly used <ref>Svetnik, V. et al., [http://pubs.acs.org/doi/pdf/10.1021/ci034160g.pdf Random forest: a classification and regression tool for compound classification and QSAR modeling], J. Chem. Inf. Comput. Sci. 2003, 43, 1947−1958 </ref>. In this paper the authors investigate the prediction performance of DNNs as a QSAR method and compare it with that of RF, which is often considered a gold standard in this field.

== Motivation ==
At the first stage of drug discovery there is a huge number of candidate compounds that can be combined to produce a new drug. This process may involve a large number of compounds (>100 000) with different biological activities and a large number of descriptors (several thousand). Predicting all biological activities for all compounds experimentally would require a very large number of experiments. In silico discovery, together with optimization algorithms, can substantially reduce the experimental work that needs to be done. The authors hypothesized that DNN models outperform RF models.

== Methods ==
To compare the prediction performance of the two methods, DNNs and RF were fitted to 15 data sets from the pharmaceutical company Merck (referred to below as the Kaggle data sets). The smallest data set has 2092 molecules with 4596 unique AP and DP descriptors. Each molecule is represented by a list of features, called “descriptors” in QSAR nomenclature. Commonly used substructure descriptors include atom pairs (AP), MACCS keys, and circular fingerprints; this study uses AP together with donor-acceptor pair (DP) descriptors. Both descriptor types have the following form:

atom type i − (distance in bonds) − atom type j

Here, for AP, the atom type includes the element, the number of nonhydrogen neighbors, and the number of pi electrons. For DP, the atom type is one of seven classes (cation, anion, neutral donor, neutral acceptor, polar, hydrophobe, and other). A separate group of 15 data sets, the Additional Data Sets, was used to validate the conclusions drawn from the Kaggle data sets. Each data set was split into a training set and a test set. The metric used to evaluate prediction performance is the coefficient of determination (<math>R^2</math>).
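As a rough illustration of descriptors of the form "atom type i − (distance in bonds) − atom type j", the sketch below counts atom-pair style descriptors on a small molecular graph. The simplified atom-type labels, the toy molecule, and the function names are assumptions made for this example and are not taken from the paper; real AP types also encode neighbor counts and pi electrons.

```python
from collections import Counter, deque


def bond_distances(adjacency, start):
    """Breadth-first search: shortest path length in bonds from `start`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


def atom_pair_counts(atom_types, bonds):
    """Count descriptors 'type_i-(distance)-type_j' over all atom pairs."""
    n = len(atom_types)
    adjacency = {i: [] for i in range(n)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    counts = Counter()
    for i in range(n):
        dist = bond_distances(adjacency, i)
        for j in range(i + 1, n):
            if j not in dist:  # atoms in disconnected fragments form no pair
                continue
            # Sort the two types so that i-j and j-i give the same descriptor.
            t1, t2 = sorted((atom_types[i], atom_types[j]))
            counts[f"{t1}-({dist[j]})-{t2}"] += 1
    return counts


# Toy heavy-atom graph C-C-O with hypothetical simplified atom types.
ethanol = atom_pair_counts(["C", "C", "O"], [(0, 1), (1, 2)])
```

Each molecule then becomes a sparse vector of such counts, one entry per unique descriptor seen in the data set.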

To fit an RF, 100 trees were generated with m/3 descriptors considered at each branch point, where m is the number of unique descriptors in the training set. Tree nodes with 5 or fewer molecules were not split further. The trees were built in parallel, one tree per processor on a cluster, so that larger data sets could be run in a reasonable time.

DNNs take the descriptors X of a molecule as input, and each neuron computes an output of the form <math>O=f(\sum_{i=1}^{N} w_ix_i+b)</math>, where <math>f</math> is the activation function, <math>w_i</math> are the weights and <math>b</math> is the bias. Because many parameters, such as the number of layers and the number of neurons, influence the performance of a deep neural net, Ma and his colleagues carried out a sensitivity analysis. They trained 71 DNNs with different parameter settings for each data set. The parameters they considered relate to:

-Data: descriptor transformation (no transformation, logarithmic transformation, or binary transformation).

-Network architecture: number of hidden layers and number of neurons in each hidden layer.

-Activation function: sigmoid or rectified linear unit (ReLU).

-DNN training strategy: training on a single data set or jointly on multiple data sets, and the percentage of neurons to drop out in each layer.

-Mini-batch stochastic gradient descent in the backpropagation (BP) algorithm: the mini-batch size and the number of epochs.

-Control of the gradient descent optimization: learning rate, momentum strength, and weight cost strength.
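The neuron output formula and the two activation functions listed above can be sketched in a few lines. The weights, bias and inputs are made-up numbers used only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def neuron_output(x, w, b, activation):
    """Compute O = f(sum_i w_i * x_i + b) for one neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)


x = [1.0, 0.0, 2.0]      # hypothetical descriptor values
w = [0.5, -1.0, 0.25]    # hypothetical weights
b = -2.0                 # hypothetical bias

out_relu = neuron_output(x, w, b, relu)
out_sigmoid = neuron_output(x, w, b, sigmoid)
```

With these numbers the weighted sum is negative, so ReLU outputs exactly zero while the sigmoid still returns a small positive value, which is the main practical difference between the two choices.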

In addition to the effect of these parameters on the DNN, the authors were interested in the consistency of results across a diverse set of QSAR tasks. Because evaluating the effect of a large number of adjustable parameters is time-consuming, they selected a manageable number of parameter settings by adjusting the values of one or two parameters at a time, and then calculated the <math>R^2</math> of DNNs trained with the selected settings. These results allowed them to focus on a smaller number of parameters and, finally, to produce a set of recommended values for all algorithmic parameters that lead to consistently good predictions.
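The idea of changing one parameter at a time, rather than searching a full grid, can be sketched as follows. The baseline and the candidate values are hypothetical and chosen only to show how much smaller the resulting set of settings is.

```python
def one_at_a_time(baseline, alternatives):
    """Yield the baseline plus settings that change exactly one parameter."""
    yield dict(baseline)
    for name, values in alternatives.items():
        for value in values:
            if value != baseline[name]:
                setting = dict(baseline)
                setting[name] = value
                yield setting


# Hypothetical baseline and candidate values, for illustration only.
baseline = {"hidden_layers": 2, "neurons": 1000, "activation": "relu"}
alternatives = {
    "hidden_layers": [1, 2, 3, 4],
    "neurons": [250, 1000, 4000],
    "activation": ["relu", "sigmoid"],
}
settings = list(one_at_a_time(baseline, alternatives))

# Size of the full grid over the same candidate values, for comparison.
full_grid_size = 1
for values in alternatives.values():
    full_grid_size *= len(values)
```

Here the one-at-a-time scheme trains 7 networks where the full grid would need 24; the gap widens quickly as more parameters are added.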

== Results ==

For the first objective of the paper, comparing the performance of DNNs with RF, over 50 DNNs were trained using different parameter settings. These settings were selected arbitrarily, but were intended to cover a sufficient range of values for each adjustable parameter. Figure 1 shows the difference in <math>R^2</math> between DNNs and RF for each Kaggle data set. Each column represents a QSAR data set, and each circle represents the improvement of a DNN over RF.

<center>
[[File: fig1.PNG | frame | center |Figure 1. Overall DNN vs RF using arbitrarily selected parameter values. Each column represents a QSAR data set, and each circle represents the improvement, measured in <math>R^2</math>, of a DNN over RF. ]]
</center>
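The quantity plotted in Figure 1, the difference in <math>R^2</math> between a DNN and RF on a test set, could be computed along the following lines. Here <math>R^2</math> is taken to be the squared Pearson correlation between observed and predicted activities; that choice is an assumption of this sketch (the alternative definition, one minus the ratio of residual to total sum of squares, can give different values), and all activity values are invented.

```python
def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


# Invented test-set activities and predictions from two hypothetical models.
observed = [1.0, 2.0, 3.0, 4.0]
dnn_pred = [1.1, 1.9, 3.2, 3.8]
rf_pred = [1.5, 1.5, 3.5, 3.5]

improvement = r_squared(observed, dnn_pred) - r_squared(observed, rf_pred)
```

A positive `improvement` corresponds to a circle above zero in a plot like Figure 1.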

A comparison of the models shows that, even when the worst DNN parameter setting is used for each QSAR task, the average <math>R^2</math> degrades only from 0.423 to 0.412, a reduction of just 2.6%. These results suggest that DNNs can generally outperform RF (Table 1).

<center>
[[File: table1.PNG | frame | center |Table 1. Comparison of test <math>R^2</math> for different models ]]
</center>
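The 2.6% figure quoted above is a relative reduction, which can be checked directly from the two average <math>R^2</math> values. The helper name is an assumption of this sketch.

```python
def relative_reduction(before, after):
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before


reduction_percent = 100 * relative_reduction(0.423, 0.412)
```

Rounded to one decimal place, `reduction_percent` agrees with the value given in the text.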

Figure 2 shows how the difference in <math>R^2</math> between DNN and RF changes with the network architecture. To limit the number of parameter combinations, the authors used the same number of neurons in every hidden layer. Thirty-two DNNs were trained for each data set by varying the number of hidden layers and the number of neurons per layer while keeping the other key adjustable parameters unchanged. When the number of hidden layers is two, having a small number of neurons in the layers degrades the predictive capability of DNNs. For any number of hidden layers, once the number of neurons per layer is sufficiently large, increasing it further yields only a marginal benefit. Figure 2 also shows that a neural network with only one hidden layer of 12 neurons achieves the same average predictive capability as RF. A network of this size is comparable to the classical neural networks used in QSAR.

<center>
[[File: fig2.PNG | frame | center |Figure 2. Impact of network architecture. Each marker in the plot represents a choice of DNN network architecture. Markers that share the same number of hidden layers are connected by a line. The y-axis is the difference in mean <math>R^2</math> between DNNs and RF. ]]
</center>

To decide which activation function, sigmoid or ReLU, performs better, at least 15 pairs of DNNs were trained for each data set. The two DNNs in each pair shared the same adjustable parameter settings, except that one used ReLU as the activation function and the other used sigmoid. The difference within pairs was tested with a one-sample Wilcoxon test. In Figure 3, the data sets where ReLU is significantly better than sigmoid are colored blue and marked at the bottom with “+”, whereas the data set where sigmoid is significantly better than ReLU is colored black and marked at the bottom with “−”. ReLU is statistically significantly better than sigmoid in 8 of the 15 data sets (53.3%). Overall, ReLU improves the average <math>R^2</math> over sigmoid by 0.016.

<center>
[[File: fig3.PNG | frame | center |Figure 3. Choice of activation functions. Each column represents a QSAR data set, and each circle represents the difference, measured in <math>R^2</math>, of a pair of DNNs trained with ReLU and Sigmoid, respectively. ]]
</center>
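The summary does not say how the one-sample Wilcoxon test was run. As a minimal sketch, the plain-Python function below applies an exact two-sided Wilcoxon signed-rank test to paired differences in <math>R^2</math> (ReLU minus sigmoid). The differences are hypothetical, and the exact enumeration is only practical for small samples.

```python
from itertools import product


def signed_rank_test(differences):
    """Exact two-sided one-sample Wilcoxon signed-rank test against zero.

    Returns (w_plus, p_value). Zero differences are dropped and tied
    absolute values receive average ranks.
    """
    d = [x for x in differences if x != 0]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and abs(d[order[end + 1]]) == abs(d[order[pos]]):
            end += 1
        average_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    expected = sum(ranks) / 2
    observed_dev = abs(w_plus - expected)
    # Under the null hypothesis every sign pattern is equally likely; count
    # the patterns whose statistic is at least as far from its expectation.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - expected) >= observed_dev - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Hypothetical R^2 differences (ReLU minus sigmoid) for eight DNN pairs.
diffs = [0.02, 0.01, 0.03, -0.005, 0.015, 0.025, 0.04, 0.012]
w_plus, p_value = signed_rank_test(diffs)
```

A small `p_value` together with mostly positive differences would correspond to a data set marked “+” in Figure 3.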

Figure 4 presents the difference between joint DNNs trained on multiple data sets and individual DNNs trained on single data sets. Averaged over all data sets, joint DNNs appear to perform better than individually trained DNNs. However, the size of the training set plays a critical role in whether a joint DNN is beneficial. For the two largest data sets (3A4 and LOGD), the individual DNNs seem better, indicating that joint DNNs are better suited to data sets that are not very large.

<center>
[[File: fig4.PNG | frame | center |Figure 4. Difference between joint DNNs trained with multiple data sets and the individual DNNs trained with single data sets ]]
</center>
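The summary does not describe how a joint DNN combines several data sets. One plausible reading, assumed here, is a network with one output per QSAR task in which a molecule contributes to the loss only for the tasks where it has a measured activity. The sketch below shows such a masked loss; the predictions and targets are invented.

```python
def masked_squared_error(predictions, targets):
    """Mean squared error over the (molecule, task) pairs that have a label.

    `targets` uses None where a molecule was not measured for a task, so
    that task's output does not contribute to the loss for that molecule.
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(predictions, targets):
        for pred, target in zip(pred_row, target_row):
            if target is not None:
                total += (pred - target) ** 2
                count += 1
    return total / count if count else 0.0


# Two hypothetical tasks; each molecule is labeled for only one of them.
predictions = [[1.0, 5.0], [2.0, 3.0], [0.5, 4.0]]
targets = [[1.5, None], [None, 3.0], [0.5, None]]
loss = masked_squared_error(predictions, targets)
```

Under this reading the hidden layers are shared by all tasks, which is one way smaller data sets could benefit from the larger ones.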

The authors refined their selection of adjustable DNN parameters by studying the results of the previous runs. They used the logarithmic transformation, two hidden layers, at least 250 neurons in each hidden layer, and ReLU as the activation function. The results are shown in Figure 5. Comparing these results with those in Figure 1 shows that DNNs now outperform RF on 9 out of 15 data sets even with the “worst” parameter setting, compared with 4 out of 15 in Figure 1. The <math>R^2</math> averaged over all DNNs and all 15 data sets is 0.051 higher than that of RF.

<center>
[[File: fig5.PNG | frame | center |Figure 5. DNN vs RF with refined parameter settings ]]
</center>

As a conclusion of the sensitivity analysis, the authors recommend the following settings for the adjustable parameters of DNNs:

-Logarithmic transformation of the descriptors.

-Four hidden layers, with 4000, 2000, 1000, and 1000 neurons, respectively.

-Dropout rates of 0 in the input layer, 25% in the first three hidden layers, and 10% in the last hidden layer.

-ReLU as the activation function.

-No unsupervised pretraining. The network parameters should be initialized with random values.

-A large number of epochs.

-A learning rate of 0.05, a momentum strength of 0.9, and a weight cost strength of 0.0001.
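The recommended values can be collected in a small configuration object. The class and field names below are assumptions of this sketch, as are the log(x + 1) form of the logarithmic transformation and the fully connected layout used to count weights. The number of epochs is left out because the text only says it should be large.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RecommendedDNNConfig:
    hidden_layers: tuple = (4000, 2000, 1000, 1000)
    dropout_input: float = 0.0
    dropout_hidden: tuple = (0.25, 0.25, 0.25, 0.10)
    activation: str = "relu"
    unsupervised_pretraining: bool = False
    learning_rate: float = 0.05
    momentum: float = 0.9
    weight_cost: float = 0.0001


def log_transform(descriptors):
    """Assumed form of the logarithmic transformation: log(x + 1)."""
    return [math.log(x + 1.0) for x in descriptors]


def weight_count(n_inputs, config, n_outputs=1):
    """Number of weights (without biases) in a fully connected network."""
    sizes = (n_inputs, *config.hidden_layers, n_outputs)
    return sum(a * b for a, b in zip(sizes, sizes[1:]))


config = RecommendedDNNConfig()
# 4596 is the descriptor count of the smallest data set mentioned earlier.
n_weights = weight_count(4596, config)
```

Even for the smallest data set this layout has tens of millions of weights, which helps explain why dropout and a weight cost appear among the recommendations.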

To check the consistency of DNN predictions, which was one of the authors' concerns, they compared the performance of RF and DNN on 15 additional QSAR data sets. Each additional data set was time-split into training and test sets in the same way as the Kaggle data sets. Individual DNNs were trained on the training set using the recommended parameters, and the <math>R^2</math> of the DNN and RF were calculated on the test sets. The table below presents the results for the additional data sets. The DNN with recommended parameters outperforms RF in 13 out of the 15 additional data sets. The mean <math>R^2</math> of the DNNs is 0.411, while that of RF is 0.361, an improvement of 13.9%.

<center>
[[File: table2.PNG | frame | center |Comparing RF with DNN trained using recommended parameter settings on 15 additional datasets]]
</center>
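Time-splitting is named but not defined in the text above. A common reading, assumed here, is that molecules are ordered by the date they were tested, with the earliest ones used for training and the most recent ones held out for testing. The 75/25 ratio, the record layout and the molecule identifiers are all assumptions of this sketch.

```python
from datetime import date


def time_split(records, train_fraction=0.75):
    """Sort records by date and put the earliest ones in the training set.

    `records` is a list of (date, molecule_id) pairs.
    """
    ordered = sorted(records, key=lambda record: record[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical assay dates and molecule identifiers.
records = [
    (date(2012, 5, 1), "mol_c"),
    (date(2010, 1, 15), "mol_a"),
    (date(2013, 9, 30), "mol_d"),
    (date(2011, 3, 8), "mol_b"),
]
train, test = time_split(records)
```

Compared with a random split, this keeps every test molecule later in time than every training molecule, which is closer to how a model would be used prospectively.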

Both RF and DNN can be sped up efficiently using high-performance computing technologies, but in different ways because of the inherent differences in their algorithms. RF can be accelerated using coarse parallelization on a cluster, with one tree per node.

== Discussion ==
This paper demonstrates that, in most cases, DNNs can be used as a practical QSAR method in place of RF, which is currently regarded as a gold standard in drug discovery. Although the improvement in the coefficient of determination relative to RF is small for some data sets, DNNs are better than RF on average. The paper recommends a set of values for all DNN algorithmic parameters that is appropriate for large QSAR data sets in an industrial drug discovery environment. The authors also make recommendations on how RF and DNN can be sped up efficiently using high-performance computing technologies. They suggest that RF can be accelerated using coarse parallelization on a cluster, with one tree per node, whereas DNNs can make efficient use of the parallel computation capability of a modern GPU.

== Future Works ==

Contrary to the expectation that unsupervised pretraining plays a critical role in the success of DNNs, in this study it had an adverse effect on performance in QSAR tasks, which needs further investigation.
Although the paper makes some recommendations about the adjustable parameters of DNNs, an effective and efficient strategy for refining these parameters for each particular QSAR task still needs to be developed.
The results of the paper suggest that cross-validation is not effective for fine-tuning the algorithmic parameters. Therefore, rather than relying on automatic methods for tuning DNN parameters, researchers need new approaches that better indicate a DNN’s predictive capability on a time-split test set.

== Bibliography ==
<references />

deep Neural Nets as a Method for Quantitative Structure–Activity Relationships

2015-11-23T19:29:02Z

Amirlk: /* Results */


f15Stat946PaperSignUp

2015-11-20T22:44:38Z

Amirlk: /* Set A */

=[https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/listofpapers1.pdf List of Papers]=

= Record your contributions [https://docs.google.com/spreadsheets/d/1A_0ej3S6ns3bBMwWLS4pwA6zDLz_0Ivwujj-d1Gr9eo/edit?usp=sharing here:]=

Use the following notations:

S: You have written a summary of the paper

T: You made a technical contribution to a paper (excluding the paper that you present from set A or critique from set B)

E: You made an editorial contribution to a paper (excluding the paper that you present from set A or critique from set B)

[http://goo.gl/forms/RASFRZXoxJ Your feedback on presentations]

=Set A=
{| class="wikitable"

{| border="1" cellpadding="3"
|-
|width="60pt"|Date
|width="100pt"|Name
|width="30pt"|Paper number
|width="400pt"|Title
|width="30pt"|Link to the paper
|width="30pt"|Link to the summary
|-
|Oct 16 || Pascal Poupart || || Guest Lecturer||||
|-
|Oct 16 ||Pascal Poupart || ||Guest Lecturer ||||
|-
|Oct 23 || Ali Ghodsi || || Lecturer||||
|-
|Oct 23 || Ali Ghodsi || || Lecturer||||
|-
|Oct 23 ||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]
|-
|Oct 23 || Deepak Rishi || || Parsing natural scenes and natural language with recursive neural networks || [http://www-nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf Paper] || [[Parsing natural scenes and natural language with recursive neural networks | Summary]]
|-
|Oct 30 || Ali Ghodsi || || Lecturer||||
|-
|Oct 30 || Ali Ghodsi || || Lecturer||||
|-
|Oct 30 ||Rui Qiao || ||Going deeper with convolutions || [http://arxiv.org/pdf/1409.4842v1.pdf Paper]|| [[GoingDeeperWithConvolutions|Summary]]
|-
|Oct 30 ||Amirreza Lashkari|| 21 ||Overfeat: integrated recognition, localization and detection using convolutional networks. || [http://arxiv.org/pdf/1312.6229v4.pdf Paper]|| [[Overfeat: integrated recognition, localization and detection using convolutional networks|Summary]]
|-
|Makeup Class (TBA) || Peter Blouw|| ||Memory Networks.|| [http://arxiv.org/abs/1410.3916 Paper]|| [[Memory Networks|Summary]]
|-
|Nov 6 || Ali Ghodsi || || Lecturer||||
|-
|Nov 6 || Ali Ghodsi || || Lecturer||||
|-
|Nov 6 || Anthony Caterini ||56 || Human-level control through deep reinforcement learning ||[http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf Paper]|| [[Human-level control through deep reinforcement learning|Summary]]
|-
|Nov 6 || Sean Aubin || ||Learning Hierarchical Features for Scene Labeling ||[http://yann.lecun.com/exdb/publis/pdf/farabet-pami-13.pdf Paper]||[[Learning Hierarchical Features for Scene Labeling|Summary]]
|-
|Nov 13|| Mike Hynes || 12 ||Speech recognition with deep recurrent neural networks || [http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf Paper] || [[Graves et al., Speech recognition with deep recurrent neural networks|Summary]]
|-
|Nov 13 || Tim Tse || || Question Answering with Subgraph Embeddings || [http://arxiv.org/pdf/1406.3676v3.pdf Paper] || [[Question Answering with Subgraph Embeddings | Summary ]]
|-
|Nov 13 || Maysum Panju || ||Neural machine translation by jointly learning to align and translate ||[http://arxiv.org/pdf/1409.0473v6.pdf Paper] || [[Neural Machine Translation: Jointly Learning to Align and Translate|Summary]]
|-
|Nov 13 || Abdullah Rashwan || || Deep neural networks for acoustic modeling in speech recognition. ||[http://research.microsoft.com/pubs/171498/HintonDengYuEtAl-SPM2012.pdf paper]|| [[Deep neural networks for acoustic modeling in speech recognition| Summary]]
|-
|Nov 20 || Valerie Platsko || ||Natural language processing (almost) from scratch. ||[http://arxiv.org/pdf/1103.0398.pdf Paper]|| [[Natural language processing (almost) from scratch. | Summary]]
|-
|Nov 20 || Brent Komer || ||Show, Attend and Tell: Neural Image Caption Generation with Visual Attention || [http://arxiv.org/pdf/1502.03044v2.pdf Paper]||[[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention|Summary]]
|-
|Nov 20 || Luyao Ruan || || Dropout: A Simple Way to Prevent Neural Networks from Overfitting || [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf Paper]|| [[dropout | Summary]]
|-
|Nov 20 || Ali Mahdipour || || The human splicing code reveals new insights into the genetic determinants of disease ||[https://www.sciencemag.org/content/347/6218/1254806.full.pdf Paper] || [[Genetics | Summary]]
|-
|Nov 27 ||Mahmood Gohari || ||Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships ||[http://pubs.acs.org/doi/pdf/10.1021/ci500747n paper]|| [[Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships|Summary]]
|-
|Nov 27 || Derek Latremouille || ||Learning Fast Approximations of Sparse Coding || [http://yann.lecun.com/exdb/publis/pdf/gregor-icml-10.pdf Paper] ||
|-
|Nov 27 ||Xinran Liu || ||ImageNet Classification with Deep Convolutional Neural Networks ||[http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Paper]||[[ImageNet Classification with Deep Convolutional Neural Networks|Summary]]
|-
|Nov 27 ||Ali Sarhadi|| ||Strategies for Training Large Scale Neural Network Language Models|| [http://www.msr-waypoint.com/pubs/175561/ASRU-2011.pdf Paper]||[[Strategies for Training Large Scale Neural Network Language Models|Summary]]
|-
|Dec 4 || Chris Choi || || On the difficulty of training recurrent neural networks || [http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf Paper] || [[On the difficulty of training recurrent neural networks | Summary]]
|-
|Dec 4 || Fatemeh Karimi || ||MULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION||[http://arxiv.org/pdf/1412.7755v2.pdf Paper]||
|-
|Dec 4 || Jan Gosmann || || On the Number of Linear Regions of Deep Neural Networks || [http://arxiv.org/abs/1402.1869 Paper] || [[On the Number of Linear Regions of Deep Neural Networks | Summary]]
|-
|Dec 4 || Dylan Drover || 54 || Semi-supervised Learning with Deep Generative Models || [http://papers.nips.cc/paper/5352-semi-supervised-learning-with-deep-generative-models.pdf Paper] || [[Semi-supervised Learning with Deep Generative Models | Summary]]
|-
|}
|}

=Set B=

{| class="wikitable"

{| border="1" cellpadding="3"
|-
|width="100pt"|Name
|width="30pt"|Paper number
|width="400pt"|Title
|width="30pt"|Link to the paper
|width="30pt"|Link to the summary
|-
|Anthony Caterini ||15 ||The Manifold Tangent Classifier ||[http://papers.nips.cc/paper/4409-the-manifold-tangent-classifier.pdf Paper]|| [[The Manifold Tangent Classifier|Summary]]
|-
|Jan Gosmann || || Neural Turing machines || [http://arxiv.org/abs/1410.5401 Paper] || [[Neural Turing Machines|Summary]]
|-
|Brent Komer || || Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers || [http://arxiv.org/pdf/1202.2160v2.pdf Paper] || [[Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers Machines|Summary]]
|-
|Sean Aubin || || Deep Sparse Rectifier Neural Networks || [http://jmlr.csail.mit.edu/proceedings/papers/v15/glorot11a/glorot11a.pdf Paper] || [[Deep Sparse Rectifier Neural Networks|Summary]]
|-
|Peter Blouw|| || Generating text with recurrent neural networks || [http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf Paper] || [[Generating text with recurrent neural networks|Summary]]
|-
|Tim Tse|| || From Machine Learning to Machine Reasoning || [http://research.microsoft.com/pubs/206768/mlj-2013.pdf Paper] || [[From Machine Learning to Machine Reasoning | Summary ]]
|-
|Rui Qiao|| 40 || Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation || [http://arxiv.org/pdf/1406.1078v3.pdf Paper] || [[Learning Phrase Representations|Summary]]
|-
|Fatemeh Karimi|| 23 || Very Deep Convolutional Networks for Large-Scale Image Recognition || [http://arxiv.org/pdf/1409.1556.pdf Paper] || [[Very Deep Convoloutional Networks for Large-Scale Image Recognition|Summary]]
|-
|Amirreza Lashkari|| 43 || Distributed Representations of Words and Phrases and their Compositionality || [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf Paper] || [[Distributed Representations of Words and Phrases and their Compositionality|Summary]]
|-
|Xinran Liu|| 19 || Joint training of a convolutional network and a graphical model for human pose estimation || [http://papers.nips.cc/paper/5573-joint-training-of-a-convolutional-network-and-a-graphical-model-for-human-pose-estimation.pdf Paper] || [[Joint training of a convolutional network and a graphical model for human pose estimation|Summary]]
|-
|Chris Choi|| || Learning Long-Range Vision for Autonomous Off-Road Driving || [http://yann.lecun.com/exdb/publis/pdf/hadsell-jfr-09.pdf Paper] || [[Learning Long-Range Vision for Autonomous Off-Road Driving|Summary]]
|-
|Luyao Ruan|| || Deep Learning of the tissue-regulated splicing code || [http://bioinformatics.oxfordjournals.org/content/30/12/i121.full.pdf+html Paper] || [[Deep Learning of the tissue-regulated splicing code| Summary]]
|-
|Abdullah Rashwan|| || Deep Convolutional Neural Networks For LVCSR || [http://www.cs.toronto.edu/~asamir/papers/icassp13_cnn.pdf Paper] || [[Deep Convolutional Neural Networks For LVCSR| Summary]]
|-
|Mahmood Gohari||37 || On using very large target vocabulary for neural machine translation || [http://arxiv.org/pdf/1412.2007v2.pdf Paper] || [[On using very large target vocabulary for neural machine translation| Summary]]
|-
|Valerie Platsko|| || Learning Convolutional Feature Hierarchies for Visual Recognition || [http://papers.nips.cc/paper/4133-learning-convolutional-feature-hierarchies-for-visual-recognition Paper] || [[Learning Convolutional Feature Hierarchies for Visual Recognition | Summary]]
|-
|Derek Latremouille|| || The Wake-Sleep Algorithm for Unsupervised Neural Networks || [http://www.gatsby.ucl.ac.uk/~dayan/papers/hdfn95.pdf Paper] || [[The Wake-Sleep Algorithm for Unsupervised Neural Networks | Summary]]
|-
|Ri Wang|| || Continuous space language models || [https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenSemester2_2009_10/sdarticle.pdf Paper] || [[Continuous space language models | Summary]]
|-
|Deepak Rishi|| || Extracting and Composing Robust Features with Denoising Autoencoders || [http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf Paper] || [[Extracting and Composing Robust Features with Denoising Autoencoders | Summary]]
|-
|Maysum Panju|| || A fast learning algorithm for deep belief nets || [https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf Paper] || [[A fast learning algorithm for deep belief nets | Summary]]
|-
|Michael Hynes|| || The loss surfaces of multilayer networks || [http://arxiv.org/abs/1412.0233 Paper] || [[The loss surfaces of multilayer networks (Choromanska et al.) | Summary]]
|-
|Dylan Drover|| 53 || Deep Generative Stochastic Networks Trainable by Backprop || [http://jmlr.org/proceedings/papers/v32/bengio14.pdf Paper] || [[Deep Generative Stochastic Networks Trainable by Backprop| Summary]]

genetics

2015-11-20T04:00:40Z

Amirlk: /* Conclusion */

'''
== Genetic Application of Deep Learning ==
'''
This presentation is based on the paper [Hui Y. Xiong ''et al'', Science '''347''', 2015], which shows how deep learning methods can contribute to the genetic study of disease and how machine-learning approaches can support a more precise annotation of genetic variants. These techniques have been applied to a wide variety of diseases, including different cancers, and have led to important findings on mutation-driven splicing. To reach this goal, various intronic and exonic disease mutations were taken into account to detect variants that alter splicing. This procedure should support the prognosis, diagnosis, and/or control of a wide variety of diseases.

'''
== Introduction ==
'''
Whole-genome sequencing has been used for some time to trace the genetic sources of diseases and malignancies. The idea is to find a hierarchy of mutations that lead to such diseases by examining genetic variation across the genome, particularly variation that occurs outside protein-coding regions. The paper presents a computational method for detecting genetic variants that influence RNA splicing. RNA splicing is a modification of pre-messenger RNA (pre-mRNA) in which introns are removed and exons are joined. Disruption of this important step of gene expression can lead to various kinds of disease, such as cancers and neurological disorders.

[[File:Stat1.jpg]]

'''

== Rationale ==
'''

A deep learning algorithm is used to construct a computational model that takes DNA sequences as input and predicts splicing in human tissues. The model can score variants located up to 300 nucleotides into an intron for their effect on splicing. It is not biased by existing disease annotations or population data, and it was derived in such a way that it can be used to study diverse diseases and disorders and to determine the consequences of common, rare, and even spontaneous variants.

[[File:Stat3.jpg]]

'''

== Materials and Methods ==
'''

The human splicing regulatory model is trained with a Bayesian machine learning method. A total of 10,698 cassette exons were used as training cases. The goal is to maximize an information-theoretic code quality measure <math>CQ=\sum_e \sum_t \left[ D_{KL} (q_{t,e} \| r_t ) - D_{KL} (q_{t,e} \| p_{t,e} ) \right]</math>, where <math>q_{t,e}</math> is the target splicing pattern for exon <math>e</math> in tissue <math>t</math>, <math>r_t</math> is an optimized guesser's prediction that ignores the RNA features, <math>p_{t,e}</math> is the regulatory model's prediction for exon <math>e</math> in tissue <math>t</math>, and <math>D_{KL}</math> is the Kullback-Leibler divergence between two distributions. Up to terms that do not depend on the model, CQ is a log-likelihood of <math>p_{t,e}</math>.
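As an illustration of the code quality measure, the following sketch computes CQ for a toy set of splicing patterns. The three-category distributions, the helper names <code>kl_divergence</code> and <code>code_quality</code>, and all numbers are assumptions made for this example, not values from the paper.

```python
import math

def kl_divergence(q, p):
    """Kullback-Leibler divergence D_KL(q || p) between two discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def code_quality(targets, guesses, predictions):
    """CQ = sum over exons e and tissues t of D_KL(q_te || r_t) - D_KL(q_te || p_te).

    targets[e][t]     : target splicing pattern q_{t,e}
    guesses[t]        : feature-blind guess r_t for tissue t
    predictions[e][t] : model prediction p_{t,e}
    """
    total = 0.0
    for q_e, p_e in zip(targets, predictions):
        for q, r, p in zip(q_e, guesses, p_e):
            total += kl_divergence(q, r) - kl_divergence(q, p)
    return total

# Toy data (hypothetical): 2 exons, 2 tissues, 3 splicing categories.
targets = [[[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]],
           [[0.2, 0.6, 0.2], [0.6, 0.3, 0.1]]]
guesses = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]

good_predictions = targets              # a model that matches the targets exactly
blind_predictions = [guesses, guesses]  # a model no better than the guesser

cq_good = code_quality(targets, guesses, good_predictions)
cq_blind = code_quality(targets, guesses, blind_predictions)
print(cq_good, cq_blind)
```

A model that only reproduces the guesser scores zero, and any model that is closer to the targets than the guesser scores above zero, which is why a higher CQ indicates a better code.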

Each model is a two-layer neural network with sigmoidal hidden units. In this study, the relationship between the RNA features and splicing is modelled as nonlinear and tissue-dependent. The RNA features provide the inputs to at most 30 hidden variables. Each hidden variable applies a sigmoidal non-linearity to its input. A softmax function is then applied to the hidden variables to produce the prediction. The tissues are trained jointly, each with its own output units.
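A minimal sketch of such a forward pass is given below. It assumes one softmax output group per tissue over a small number of splicing categories; the layer sizes, the random weights and the function name <code>predict</code> are hypothetical and only serve to show the data flow.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(features, w_hidden, b_hidden, w_out, b_out):
    """Return one probability vector per tissue."""
    # Hidden layer: sigmoid of a weighted sum of the RNA features.
    hidden = [sigmoid(sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    # Output layer: a separate softmax over splicing categories for each tissue.
    return [softmax([sum(w * h for w, h in zip(row, hidden)) + b
                     for row, b in zip(w_t, b_t)])
            for w_t, b_t in zip(w_out, b_out)]

# Hypothetical sizes: 5 features, 4 hidden units (the source allows up to 30),
# 3 tissues and 3 splicing categories.
rng = random.Random(0)
n_features, n_hidden, n_tissues, n_categories = 5, 4, 3, 3
features = [rng.random() for _ in range(n_features)]
w_hidden = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [[[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_categories)]
         for _ in range(n_tissues)]
b_out = [[0.0] * n_categories for _ in range(n_tissues)]

prediction = predict(features, w_hidden, b_hidden, w_out, b_out)
print(prediction)
```

Because the hidden layer is shared by all tissue outputs, training the tissues jointly lets them reuse the same learned combinations of RNA features.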

Given the complexity of this model, training it with a maximum likelihood method leads to overfitting. The main learning algorithm applied in this paper is taken from <ref>
Xiong H.Y. ''et al'', Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context, Bioinformatics 27, pp. 2554-2562, 2011.
</ref>. A multinomial regression model, which is a generalization of logistic regression, is also considered; it is linear in the log odds ratio domain and has no hidden variables. This model is trained with the same objective function, RNA features, splicing patterns, and dataset partitioning as the Bayesian neural network described above.

'''
== Experimental Validation ==
'''

To check the accuracy of the proposed splicing regulatory model, the study uses experimental results from several databases, including RNA-seq data, RT-PCR data, RNA-binding protein affinity data, splicing factor knockdown data, and phenotypic/genotypic data.

[[File:Stat2.jpg]]
 
[[File:Stat6.jpg]]

'''

== Genome-wide Analysis ==
'''

As an important implication of genetic variation for splicing regulation, 658,420 SNVs were mapped to exonic and intronic sequences. The effect of each SNV on splicing regulation was then scored by applying the regulatory model and taking the largest difference in predicted splicing level, <math>\Delta \psi</math>, across tissues.
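The scoring step can be sketched as follows. The sketch assumes that "largest" refers to the difference with the largest magnitude and that the model returns one predicted splicing level per tissue for both the reference and the variant sequence; the function name <code>max_delta_psi</code> and the numbers are hypothetical.

```python
def max_delta_psi(psi_reference, psi_variant):
    """Return the signed per-tissue difference with the largest magnitude."""
    deltas = [v - r for r, v in zip(psi_reference, psi_variant)]
    return max(deltas, key=abs)

# Hypothetical predicted splicing levels in three tissues.
psi_reference = [0.80, 0.75, 0.90]
psi_variant = [0.78, 0.40, 0.88]

score = max_delta_psi(psi_reference, psi_variant)
print(score)  # the second tissue dominates the score
```

Keeping the sign makes it possible to distinguish variants that decrease exon inclusion from those that increase it.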

[[File:Stat5.jpg]]

[[File:Stat8.jpg]]

'''

== Conclusion ==
'''

The method introduced in this paper is a technique for classifying disease-causing variants and for studying aberrant splicing in malignancies. The computational model was trained to predict splicing from DNA sequence without disease annotations or other existing population data, so it can be compared with the experimental data as an unbiased approach. The model therefore provides a way to understand the genetic basis of various diseases. It accurately classifies disease-causing variants and provides insights into the role of aberrant splicing in disease. It predicts substantial and unexpected aberrant splicing due to variants within introns and exons, including those far from the splice site.
There are several practical considerations when using Bayesian neural networks. For instance, because they rely on methods such as MCMC, they are difficult to speed up and to scale to a large number of hidden variables. Leung et al. <ref>
Leung M. ''et al'', Deep learning of the tissue-regulated splicing code, Bioinformatics 30, 2014.
</ref> proposed an architecture that can have thousands of hidden units with multiple non-linear layers and millions of model parameters.

[[File:Stat7.jpg]]

'''

== References ==
'''

[1] Hui Y. Xiong ''et al'', The human splicing code reveals new insights into the genetic determinants of disease, Science '''347''', 2015.

[2] Xiong H.Y. ''et al'', Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context, Bioinformatics '''27''', pp. 2554-2562, 2011.

genetics

2015-11-20T03:58:02Z

Amirlk: /* Rationale */

'''
== Genetic Application of Deep Learning ==
'''
This presentation is based on the paper [Hui Y. Xiong ''et al'', Science '''347''', 2015], which shows how deep learning methods can contribute to the genetic study of disease and how machine-learning approaches can support a more precise annotation of genetic variants. These techniques have been applied to a wide variety of diseases, including different cancers, and have led to important findings on mutation-driven splicing. To reach this goal, various intronic and exonic disease mutations were taken into account to detect variants that alter splicing. This procedure should support the prognosis, diagnosis, and/or control of a wide variety of diseases.

'''
== Introduction ==
'''
Whole-genome sequencing has been used for some time to trace the genetic sources of diseases and malignancies. The idea is to find a hierarchy of mutations that lead to such diseases by examining genetic variation across the genome, particularly variation that occurs outside protein-coding regions. The paper presents a computational method for detecting genetic variants that influence RNA splicing. RNA splicing is a modification of pre-messenger RNA (pre-mRNA) in which introns are removed and exons are joined. Disruption of this important step of gene expression can lead to various kinds of disease, such as cancers and neurological disorders.

[[File:Stat1.jpg]]

'''

== Rationale ==
'''

A deep learning algorithm is used to construct a computational model that takes DNA sequences as input and predicts splicing in human tissues. The model can score variants located up to 300 nucleotides into an intron for their effect on splicing. It is not biased by existing disease annotations or population data, and it was derived in such a way that it can be used to study diverse diseases and disorders and to determine the consequences of common, rare, and even spontaneous variants.

[[File:Stat3.jpg]]

'''

== Materials and Methods ==
'''

The human splicing regulatory model is trained with a Bayesian machine learning method. A total of 10,698 cassette exons were used as training cases. The goal is to maximize an information-theoretic code quality measure <math>CQ=\sum_e \sum_t \left[ D_{KL} (q_{t,e} \| r_t ) - D_{KL} (q_{t,e} \| p_{t,e} ) \right]</math>, where <math>q_{t,e}</math> is the target splicing pattern for exon <math>e</math> in tissue <math>t</math>, <math>r_t</math> is an optimized guesser's prediction that ignores the RNA features, <math>p_{t,e}</math> is the regulatory model's prediction for exon <math>e</math> in tissue <math>t</math>, and <math>D_{KL}</math> is the Kullback-Leibler divergence between two distributions. Up to terms that do not depend on the model, CQ is a log-likelihood of <math>p_{t,e}</math>.

Each model is a two-layer neural network with sigmoidal hidden units. In this study, the relationship between the RNA features and splicing is modelled as nonlinear and tissue-dependent. The RNA features provide the inputs to at most 30 hidden variables. Each hidden variable applies a sigmoidal non-linearity to its input. A softmax function is then applied to the hidden variables to produce the prediction. The tissues are trained jointly, each with its own output units.

Given the complexity of this model, training it with a maximum likelihood method leads to overfitting. The main learning algorithm applied in this paper is taken from <ref>
Xiong H.Y. ''et al'', Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context, Bioinformatics 27, pp. 2554-2562, 2011.
</ref>. A multinomial regression model, which is a generalization of logistic regression, is also considered; it is linear in the log odds ratio domain and has no hidden variables. This model is trained with the same objective function, RNA features, splicing patterns, and dataset partitioning as the Bayesian neural network described above.

'''
== Experimental Validation ==
'''

To check the accuracy of the proposed splicing regulatory model, the study uses experimental results from several databases, including RNA-seq data, RT-PCR data, RNA-binding protein affinity data, splicing factor knockdown data, and phenotypic/genotypic data.

[[File:Stat2.jpg]]
 
[[File:Stat6.jpg]]

'''

== Genome-wide Analysis ==
'''

As an important implication of genetic variation for splicing regulation, 658,420 SNVs were mapped to exonic and intronic sequences. The effect of each SNV on splicing regulation was then scored by applying the regulatory model and taking the largest difference in predicted splicing level, <math>\Delta \psi</math>, across tissues.

[[File:Stat5.jpg]]

[[File:Stat8.jpg]]

'''

== Conclusion ==
'''

The method introduced in this paper is a technique for classifying disease-causing variants and for studying aberrant splicing in malignancies. The computational model was trained to predict splicing from DNA sequence without disease annotations or other existing population data, so it can be compared with the experimental data as an unbiased approach. The model therefore provides a way to understand the genetic basis of various diseases.
There are several practical considerations when using Bayesian neural networks. For instance, because they rely on methods such as MCMC, they are difficult to speed up and to scale to a large number of hidden variables. Leung et al. <ref>
Leung M. ''et al'', Deep learning of the tissue-regulated splicing code, Bioinformatics 30, 2014.
</ref> proposed an architecture that can have thousands of hidden units with multiple non-linear layers and millions of model parameters.

[[File:Stat7.jpg]]

'''
== References ==
'''

[1] Hui Y. Xiong ''et al'', The human splicing code reveals new insights into the genetic determinants of disease, Science '''347''', 2015.

[2] Xiong H.Y. ''et al'', Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context, Bioinformatics '''27''', pp. 2554-2562, 2011.

dropout

2015-11-20T03:44:45Z

Amirlk: /* Model */

= Introduction =
Dropout<ref>https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf</ref> is a technique for preventing overfitting in deep neural networks, which contain a large number of parameters. The key idea is to randomly drop units from the neural network during training. During training, dropout samples from an exponential number of different “thinned” networks. At test time, we approximate the effect of averaging the predictions of all these thinned networks.

[[File:intro.png]]

By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability <math>p</math>, independently of the other units (<math>p</math> can be chosen using a validation set, or can simply be set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).

= Model =

Consider a neural network with <math>\ L </math> hidden layers. Let <math>\bold{z}^{(l)} </math> denote the vector of inputs into layer <math> l </math> and <math>\bold{y}^{(l)} </math> denote the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:

:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math>

:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product.

:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^{(l)}+b^{(l+1)}_i </math>

:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math> , where <math> f </math> is the activation function.

For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables, each of which has probability <math>p </math> of being 1. <math>\tilde {\bold y}^{(l)} </math> is the thinned output of layer <math>l </math>, which is used as the input to the next layer after some hidden units have been dropped. The rest of the model remains the same as in a regular feed-forward neural network.
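The four equations above can be sketched for a single layer as follows. The function name <code>dropout_layer</code>, the choice of <code>tanh</code> as the activation function and the toy weights are assumptions made for this example.

```python
import math
import random

def dropout_layer(y, weights, biases, p, rng, f=math.tanh):
    """One feed-forward step with dropout applied to the layer input y.

    y       : outputs of layer l
    weights : one weight row w_i per unit of layer l+1
    biases  : one bias b_i per unit of layer l+1
    p       : probability of retaining each unit of layer l
    """
    # r_j ~ Bernoulli(p): 1 keeps unit j, 0 drops it.
    r = [1 if rng.random() < p else 0 for _ in y]
    # Thinned outputs: element-wise product of the mask and y.
    y_tilde = [rj * yj for rj, yj in zip(r, y)]
    # Pre-activations and activations of layer l+1.
    z = [sum(w * yt for w, yt in zip(row, y_tilde)) + b
         for row, b in zip(weights, biases)]
    return [f(zi) for zi in z], r

rng = random.Random(0)
y = [1.0, 2.0, -0.5]
weights = [[0.5, -0.5, 0.2], [0.1, 0.3, -0.4]]
biases = [0.1, -0.2]
outputs, mask = dropout_layer(y, weights, biases, p=0.5, rng=rng)
print(mask, outputs)
```

A new mask is drawn on every call, so each training case sees a different thinned network.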

'''Backpropagation in Dropout Case (Training)'''

A dropout neural network can be trained using stochastic gradient descent in a manner similar to a standard neural network. The only difference is that, for each training case, we backpropagate only through the sampled thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch. Any training case that does not use a parameter contributes a gradient of zero for that parameter.

Dropout can also be applied to finetune nets that have been pretrained using stacks of RBMs, autoencoders or Deep Boltzmann Machines. Pretraining followed by finetuning with backpropagation has been shown to give significant performance boosts over finetuning from random initializations in certain cases. The regular pretraining procedure (RBMs, autoencoders, etc.) is performed first. After pretraining, the weights are scaled up by a factor of <math> 1/p </math>, and then dropout finetuning is applied. A smaller learning rate should be used in order to retain the information in the pretrained weights.

''' Max-norm Regularization '''

Using dropout along with max-norm regularization, large decaying learning rates and high momentum provides a significant boost over using dropout alone. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be bounded above by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we impose the constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constraint is that it makes it possible to use a very large learning rate without the weights blowing up.
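One common way to enforce such a constraint, sketched here as an assumption rather than as the exact training code of the paper, is to rescale the weight vector back onto the ball of radius <math>c </math> whenever an update pushes it outside. The function name <code>max_norm_project</code> is hypothetical.

```python
import math

def max_norm_project(w, c):
    """Rescale w so that its Euclidean norm is at most c."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm <= c:
        return list(w)          # already inside the ball: leave unchanged
    scale = c / norm
    return [wi * scale for wi in w]

print(max_norm_project([3.0, 4.0], 2.0))   # norm 5 is rescaled to norm 2
print(max_norm_project([0.3, 0.4], 2.0))   # norm 0.5 is left unchanged
```

The projection would be applied to the incoming weight vector of each hidden unit after every gradient step.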

'''Unsupervised Pretraining'''

Neural networks can be pretrained using stacks of RBMs<ref name=GeH>
Hinton, Geoffrey, ''et al'' [https://www.cs.toronto.edu/~hinton/science.pdf "Reducing the dimensionality of data with neural networks."] in Science, (2006).
</ref>, autoencoders<ref name=ViP>
Vincent, Pascal, ''et al'' [http://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion."] in Journal of Machine Learning Research 11, (2010).
</ref> or Deep Boltzmann Machines<ref name=SaR>
Salakhutdinov, Ruslan, ''et al'' [http://www.utstat.toronto.edu/~rsalakhu/papers/dbm.pdf "Deep Boltzmann Machines."] in Proceedings of the International Conference on Artificial Intelligence and Statistics, (2009).
</ref>. Pretraining is an effective way of making use of unlabeled data. Pretraining followed by finetuning with backpropagation has been shown to give significant performance boosts over finetuning from random initializations in certain cases. Dropout can be applied to finetune nets that have been pretrained using these techniques. The pretraining procedure stays the same.

'''Test Time'''

If a neural net has <math>n </math> units, there are <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. The figure below shows the intuition.
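The weight scaling rule can be checked on a small example. The sketch below enumerates all masks for a single linear unit and compares the exact expected pre-activation under dropout with the pre-activation obtained from scaled weights; the function names and numbers are hypothetical. The two agree exactly only for the linear pre-activation; once a non-linear activation is applied, the scaled network is an approximation to the average.

```python
from itertools import product

def expected_preactivation(w, y, p):
    """Exact expectation of sum_j w_j * r_j * y_j over all 2^n dropout masks."""
    total = 0.0
    for mask in product([0, 1], repeat=len(y)):
        prob = 1.0
        for m in mask:
            prob *= p if m == 1 else (1.0 - p)
        total += prob * sum(wi * mi * yi for wi, mi, yi in zip(w, mask, y))
    return total

def scaled_preactivation(w, y, p):
    """Pre-activation of the single test-time network with weights p * w."""
    return sum(p * wi * yi for wi, yi in zip(w, y))

w = [0.5, -1.0, 2.0, 0.25]
y = [1.0, 0.5, -0.3, 2.0]
p = 0.5
print(expected_preactivation(w, y, p), scaled_preactivation(w, y, p))
```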

[[File:test.png]]

''' Multiplicative Gaussian Noise '''

Dropout uses Bernoulli-distributed random variables, which take the value 1 with probability <math>p </math> and 0 otherwise. This idea can be generalized by multiplying the activations by a random variable drawn from <math>\mathcal{N}(1, 1) </math>, which works just as well as, or perhaps better than, Bernoulli noise. That is, each hidden activation <math>h_i </math> is perturbed to <math>h_i+h_ir </math> where <math>r \sim \mathcal{N}(0,1) </math>, which is equal to <math>h_ir' </math> where <math>r' \sim \mathcal{N}(1, 1) </math>. This can be generalized further to <math>r' \sim \mathcal{N}(1, \sigma^2) </math>, where <math>\sigma^2</math> is a hyperparameter to tune.
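A minimal sketch of the Gaussian variant is shown below; the function name <code>gaussian_dropout</code> and the sample values are assumptions made for the example. Because the noise has mean 1, the expected activation is unchanged, so no weight scaling is needed at test time.

```python
import random

def gaussian_dropout(h, sigma, rng):
    """Multiply each activation by independent noise drawn from N(1, sigma^2)."""
    return [hi * rng.gauss(1.0, sigma) for hi in h]

rng = random.Random(0)
h = [2.0, -1.0, 0.5]
print(gaussian_dropout(h, sigma=1.0, rng=rng))

# Averaging many noisy copies recovers the original activation.
samples = [gaussian_dropout([2.0], 1.0, rng)[0] for _ in range(20000)]
mean_activation = sum(samples) / len(samples)
print(mean_activation)
```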

== Applying dropout to linear regression ==

Let <math>X \in \mathbb{R}^{N\times D}</math> be a data matrix of <math>N</math> data points and let <math>\mathbf{y}\in \mathbb{R}^N</math> be a vector of targets. Linear regression tries to find a <math>\mathbf{w}\in \mathbb{R}^D</math> that minimizes <math>\parallel \mathbf{y}-X\mathbf{w}\parallel^2</math>.

When the input <math>X</math> is dropped out such that any input dimension is retained with probability <math>p</math>, the input can be expressed as <math>R*X</math> where <math>R\in \{0,1\}^{N\times D}</math> is a random matrix with <math>R_{ij}\sim Bernoulli(p)</math> and <math>*</math> denotes element-wise product. Marginalizing the noise, the objective function becomes

<math>\min_{\mathbf{w}} \mathbb{E}_{R\sim Bernoulli(p)}[\parallel \mathbf{y}-(R*X)\mathbf{w}\parallel^2 ]
</math>

This reduces to

<math>\min_{\mathbf{w}} \parallel \mathbf{y}-pX\mathbf{w}\parallel^2+p(1-p)\parallel \Gamma\mathbf{w}\parallel^2
</math>

where <math>\Gamma=(diag(X^TX))^{\frac{1}{2}}</math>. Therefore, dropout with linear regression is equivalent to ridge regression with a particular form for <math>\Gamma</math>. This form of <math>\Gamma</math> essentially scales the weight cost for weight <math>w_i</math> by the standard deviation of the <math>i</math>th dimension of the data. If a particular data dimension varies a lot, the regularizer tries to squeeze its weight more.
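The reduction can be verified numerically on a small example. The sketch below computes the expected dropout loss exactly by enumerating every mask <math>R</math> and compares it with the ridge-style expression; the function names and the toy data are hypothetical. The identity follows because each masked prediction has mean <math>p\,\mathbf{x}_i^T\mathbf{w}</math> and variance <math>p(1-p)\sum_j X_{ij}^2 w_j^2</math>.

```python
from itertools import product

def expected_dropout_loss(X, y, w, p):
    """Exact E_R ||y - (R * X) w||^2 with R_ij ~ Bernoulli(p), by enumeration."""
    n, d = len(X), len(X[0])
    total = 0.0
    for bits in product([0, 1], repeat=n * d):
        prob = 1.0
        for b in bits:
            prob *= p if b else 1.0 - p
        loss = 0.0
        for i in range(n):
            pred = sum(bits[i * d + j] * X[i][j] * w[j] for j in range(d))
            loss += (y[i] - pred) ** 2
        total += prob * loss
    return total

def ridge_form_loss(X, y, w, p):
    """||y - p X w||^2 + p (1 - p) ||Gamma w||^2 with Gamma = diag(X^T X)^(1/2)."""
    n, d = len(X), len(X[0])
    fit = sum((y[i] - p * sum(X[i][j] * w[j] for j in range(d))) ** 2
              for i in range(n))
    penalty = sum(sum(X[i][j] ** 2 for i in range(n)) * w[j] ** 2
                  for j in range(d))
    return fit + p * (1.0 - p) * penalty

X = [[1.0, 2.0], [0.5, -1.0]]
y = [1.0, -0.5]
w = [0.3, -0.7]
p = 0.6
print(expected_dropout_loss(X, y, w, p), ridge_form_loss(X, y, w, p))
```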

== Bayesian Neural Networks and Dropout ==

For some data set <math>\,{(x_i,y_i)}^n_{i=1}</math>, the Bayesian approach to estimating <math>\,y_{n+1}</math> given <math>\,x_{n+1}</math> is to pick some prior distribution, <math>\,P(\theta)</math>, and assign probabilities for <math>\,y_{n+1}</math> using the posterior distribution based on the prior distribution and the data set.

The general formula is:

<math>\,P(y_{n+1}|y_1,\dots,y_n,x_1,\dots,x_n,x_{n+1})=\int P(y_{n+1}|x_{n+1},\theta)P(\theta|y_1,\dots,y_n,x_1,\dots,x_n)d\theta</math>

To obtain a prediction, it is common to take the expected value of this distribution to get the formula:

<math>\,\hat y_{n+1}=\int y_{n+1}P(y_{n+1}|x_{n+1},\theta)P(\theta|y_1,\dots,y_n,x_1,\dots,x_n)d\theta</math>

This formula can be applied to a neural network by thinking of <math>\,\theta</math> as all of the parameters in the neural network, and <math>\,P(y_{n+1}|x_{n+1},\theta)</math> can be thought of as the output of the neural network given some set of weights and the input. Since the output of a neural network is deterministic given its weights and input, the probability is 1 for a single output and 0 for all other possible outputs, and the formula can be rewritten as:

<math>\,\hat y_{n+1}=\int f(x_{n+1},\theta)P(\theta|y_1,\dots,y_n,x_1,\dots,x_n)d\theta</math>

where <math>\,f(x_{n+1},\theta)</math> is the output of the neural network given some weights and input. Looking more closely at this expected value, it is essentially the average of infinitely many possible neural network outputs, each weighted by its probability of occurring given the data set.

In the dropout model, the researchers do something very similar, in that they take the average of the outputs of a wide variety of models with different weights. However, unlike Bayesian neural networks, where each of these outputs and their respective models are weighted by their proper probability of occurring, the dropout model assigns equal probability to each model. This necessarily reduces the accuracy of dropout neural networks compared to Bayesian neural networks, but it brings very strong advantages in training speed and ability to scale.

Although the equal weighting is only an approximation to the Bayesian weighting, the researchers compared the two models and found that dropout, while less accurate than Bayesian neural networks, still performs better than standard neural network models, as can be seen in their chart below (higher is better):

[[File:BNN.PNG]]

= Effects of Dropout =

''' Effect on Features '''

In a standard neural network, a unit may change in a way that fixes up the mistakes of other units, which may lead to complex co-adaptations and to overfitting, because these co-adaptations do not generalize to unseen data. Dropout breaks the co-adaptations between hidden units by making the presence of other units unreliable. Figure 7a (without dropout) shows that the hidden units have co-adapted in order to produce good reconstructions: no hidden unit detects a meaningful feature on its own. In Figure 7b (with dropout), the hidden units seem to detect edges, strokes and spots in different parts of the image.
[[File:feature.png]]

''' Effect on Sparsity '''

Sparsity helps prevent overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations, we can see that fewer hidden units have high activations in Figure 8b (with dropout) than in Figure 8a (without dropout).
[[File:sparsity.png]]

'''Effect of Dropout Rate'''

The paper also examines the effect of the tunable hyperparameter <math>p </math>. The comparison is done in two situations:
# The number of hidden units is held constant (fixed <math>n </math>).
# The expected number of hidden units retained after dropout is held constant (fixed <math>pn </math>).
The optimal <math>p </math> in case 1 lies between 0.4 and 0.8, while in case 2 it is around 0.6. The usual default value in practice is 0.5, which is close to optimal.
[[File:pvalue.png]]

'''Effect of Data Set Size'''

This section explores the effect of changing the data set size when dropout is used with feed-forward networks. Figure 10 shows that dropout does not give any improvement on very small data sets (100 or 500 examples). As the size of the data set increases, the gain from dropout grows up to a point and then declines.

[[File:Datasize.png]]

= Comparison =

The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all other chosen methods. The result is below:

[[File:Comparison.png]]

= Result =

The authors applied dropout to the MNIST data set and compared it with other methods. The MNIST data set consists of 28 x 28 pixel handwritten digit images. The task is to classify the images into 10 digit classes. According to the results table, the Deep Boltzmann Machine with dropout finetuning performs best, with an error rate of only 0.79%.

[[File:Result.png]]

In order to test the robustness of dropout, they did classification experiments with networks of many different architectures keeping all hyperparameters fixed. The figure below shows the test error rates obtained for these different architectures as training progresses. Dropout gives a huge improvement across all architectures.

[[File:dropout.PNG]]

The authors also applied dropout to many neural networks and tested it on different data sets, such as Street View House Numbers (SVHN), CIFAR, ImageNet and TIMIT. In these experiments, adding dropout consistently reduced the error rate and improved the performance of the networks.

=Conclusion=

Dropout is a technique for preventing overfitting in deep neural networks that have a large number of parameters. It can also be applied to convolutional networks and extended to Restricted Boltzmann Machines and other graphical models. One drawback of dropout is that it increases training time, which creates a trade-off between overfitting and training time.

=Reference=
<references />

dropout

2015-11-20T03:36:01Z

Amirlk: /* Model */

= Introduction =
Dropout<ref>https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf</ref> is a technique for preventing overfitting in deep neural networks, which contain a large number of parameters. The key idea is to randomly drop units from the neural network during training. During training, dropout samples from an exponential number of different “thinned” networks. At test time, we approximate the effect of averaging the predictions of all these thinned networks.

[[File:intro.png]]

By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability p, independently of other units (p can be chosen using a validation set, or simply set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).

= Model =

Consider a neural network with <math>\ L </math> hidden layers. Let <math>\bold{z}^{(l)} </math> denote the vector of inputs into layer <math> l </math> and <math>\bold{y}^{(l)} </math> denote the vector of outputs from layer <math> l </math>. <math>\ \bold{W}^{(l)} </math> and <math>\ \bold{b}^{(l)} </math> are the weights and biases at layer <math>l </math>. With dropout, the feed-forward operation becomes:

:::::::<math>\ r^{(l)}_j \sim Bernoulli(p) </math>

:::::::<math>\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}</math> , here * denotes an element-wise product.

:::::::<math>\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^{(l)}+b^{(l+1)}_i </math>

:::::::<math>\ y^{(l+1)}_i=f(z^{(l+1)}_i) </math> , where <math> f </math> is the activation function.

For any layer <math>l </math>, <math>\bold r^{(l)} </math> is a vector of independent Bernoulli random variables, each of which has probability <math>p </math> of being 1. <math>\tilde {\bold y}^{(l)} </math> is the thinned output of layer <math>l </math>, which is used as the input to the next layer. The rest of the model remains the same as the regular feed-forward neural network.
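The feed-forward equations above can be sketched in NumPy as follows. This is only an illustrative sketch: the function name <code>dropout_forward</code>, the choice of tanh as the activation <math>f</math> and the toy weights are assumptions made for the example, not part of the paper.

```python
import numpy as np

def dropout_forward(y, W, b, p, rng, f=np.tanh):
    """One dropout feed-forward step from layer l to layer l+1.

    y : outputs of layer l, shape (n_in,)
    W : weights of layer l+1, shape (n_out, n_in)
    b : biases of layer l+1, shape (n_out,)
    p : probability of retaining a unit
    """
    r = rng.binomial(1, p, size=y.shape)  # r_j ~ Bernoulli(p)
    y_tilde = r * y                       # element-wise product (thinned output)
    z = W @ y_tilde + b                   # pre-activation of layer l+1
    return f(z), r

# Toy values chosen only for illustration.
rng = np.random.default_rng(0)
y = np.array([1.0, 2.0, 3.0, 4.0])
W = np.ones((2, 4))
b = np.zeros(2)
out, mask = dropout_forward(y, W, b, p=0.5, rng=rng)
```

With all weights equal to one, each output is simply the activation applied to the sum of the retained inputs, which makes the effect of the mask easy to inspect.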

'''Backpropagation in Dropout Case (Training)'''

Dropout neural networks can be trained using stochastic gradient descent in a manner similar to standard neural networks. The only difference is that, for each training case, we back propagate only through the sampled thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.
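The statement that unused parameters receive a zero gradient can be illustrated for a single linear layer <math>z = W(r * y) + b</math>. In the sketch below the upstream gradients <code>delta</code> are assumed to be given, and the function name and toy numbers are hypothetical.

```python
import numpy as np

def minibatch_weight_gradient(delta, y, masks):
    """Average dL/dW over a mini-batch for z = W (r * y) + b.

    delta : upstream gradients dL/dz, shape (batch, n_out)
    y     : layer inputs before dropout, shape (batch, n_in)
    masks : Bernoulli masks r, shape (batch, n_in)
    """
    y_tilde = masks * y
    # Sum of the outer products delta_i y_tilde_i^T, divided by the batch size.
    return delta.T @ y_tilde / y.shape[0]

delta = np.array([[1.0, -1.0], [0.5, 2.0]])
y = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
masks = np.array([[1, 0, 1], [1, 0, 0]])  # input unit 1 is dropped in both cases
grad = minibatch_weight_gradient(delta, y, masks)
```

The column of <code>grad</code> that belongs to the unit dropped in every training case is exactly zero, while the other columns average only over the cases in which the unit was retained.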

Dropout can also be applied to finetune nets that have been pretrained using stacks of RBMs, autoencoders or Deep Boltzmann Machines. The regular pretraining procedure is performed first. After pretraining, the weights are scaled up by a factor of <math> 1/p </math>, so that the expected output of each unit under dropout matches its output during pretraining, and then dropout finetuning is applied. A smaller learning rate should be used so that the information in the pretrained weights is retained.

''' Max-norm Regularization '''

Using dropout along with max-norm regularization, large decaying learning rates and high momentum provides a significant boost over just using dropout. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant <math>c </math>. Mathematically, if <math>\bold w </math> represents the vector of weights incident on any hidden unit, then we impose the constraint <math>||\bold w ||_2 \leq c </math>. A justification for this constraint is that it makes it possible to use a huge learning rate without the possibility of the weights blowing up.
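One common way to enforce such a constraint is to project each incoming weight vector back onto the ball of radius <math>c</math> after every update. The sketch below assumes that each row of <code>W</code> holds the incoming weights of one hidden unit; the function name is hypothetical.

```python
import numpy as np

def max_norm_project(W, c):
    """Rescale each row of W so that its L2 norm is at most c."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Rows that already satisfy the constraint get a scale of exactly 1.
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 4.0],    # norm 5, violates the constraint for c = 1
              [0.3, 0.4]])   # norm 0.5, already inside the ball
W_proj = max_norm_project(W, c=1.0)
```

Only the first row is rescaled; the second row is left untouched because its norm is already below <math>c</math>.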

'''Unsupervised Pretraining'''

Neural networks can be pretrained using stacks of RBMs (Hinton and Salakhutdinov, 2006), autoencoders (Vincent et al., 2010) or Deep Boltzmann Machines (Salakhutdinov and Hinton, 2009). Pretraining is an effective way of making use of unlabeled data. Pretraining followed by finetuning with backpropagation has been shown to give significant performance boosts over finetuning from random initializations in certain cases. Dropout can be applied to finetune nets that have been pretrained using these techniques. The pretraining procedure stays the same.

'''Test Time'''

If a neural net has <math>n </math> units, there are <math>2^n </math> possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weights of this network are scaled-down versions of the trained weights: if a unit is retained with probability <math>p </math> during training, the outgoing weights of that unit are multiplied by <math>p </math> at test time. The figure below shows the intuition.

[[File:test.png]]
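For a single linear unit the weight-scaling rule is exact, which can be checked by enumerating all <math>2^n</math> masks for a small <math>n</math>. The sketch below is only an illustration with made-up numbers; for networks with nonlinearities the scaled network is an approximation to the average rather than an identity.

```python
import itertools
import numpy as np

def expected_thinned_output(w, y, p):
    """Exact expectation of w . (r * y) over all 2^n Bernoulli(p) masks r."""
    total = 0.0
    for bits in itertools.product([0, 1], repeat=len(y)):
        r = np.array(bits)
        # Probability of this particular mask under independent Bernoulli(p) draws.
        prob = np.prod(np.where(r == 1, p, 1.0 - p))
        total += prob * (w @ (r * y))
    return total

w = np.array([0.5, -1.0, 2.0])
y = np.array([1.0, 3.0, 2.0])
p = 0.5
averaged = expected_thinned_output(w, y, p)  # average over all thinned units
scaled = (p * w) @ y                         # single unit with weights scaled by p
```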

''' Multiplicative Gaussian Noise '''

Dropout uses Bernoulli distributed random variables which take the value 1 with probability <math>p </math> and 0 otherwise. This idea can be generalized by multiplying the activations by a random variable drawn from <math>\mathcal{N}(1, 1) </math>. The authors report that this works just as well as, or perhaps better than, Bernoulli noise. That is, each hidden activation <math>h_i </math> is perturbed to <math>h_i+h_ir </math> where <math>r \sim \mathcal{N}(0,1) </math>, which is equal to <math>h_ir' </math> where <math>r' \sim \mathcal{N}(1, 1) </math>. We can generalize this to <math>r' \sim \mathcal{N}(1, \sigma^2) </math>, where <math>\sigma^2</math> is a hyperparameter to tune.
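A minimal sketch of multiplicative Gaussian noise is given below; the function name and the constant test activations are assumptions made for the example. Because the noise has mean 1, the expected activation is unchanged by the perturbation.

```python
import numpy as np

def gaussian_dropout(h, sigma, rng):
    """Multiply each activation by r' ~ N(1, sigma^2)."""
    r_prime = rng.normal(loc=1.0, scale=sigma, size=h.shape)
    return h * r_prime

rng = np.random.default_rng(0)
h = np.ones(100_000)  # constant activations, so the output statistics are those of r'
noisy = gaussian_dropout(h, sigma=1.0, rng=rng)
```

With constant activations of 1, the sample mean and standard deviation of <code>noisy</code> should both be close to 1.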

== Applying dropout to linear regression ==

Let <math>X \in \mathbb{R}^{N\times D}</math> be a data matrix of N data points and let <math>\mathbf{y}\in \mathbb{R}^N</math> be a vector of targets. Linear regression tries to find a <math>\mathbf{w}\in \mathbb{R}^D</math> that minimizes <math>\parallel \mathbf{y}-X\mathbf{w}\parallel^2</math>.

When the input <math>X</math> is dropped out such that any input dimension is retained with probability <math>p</math>, the input can be expressed as <math>R*X</math> where <math>R\in \{0,1\}^{N\times D}</math> is a random matrix with <math>R_{ij}\sim Bernoulli(p)</math> and <math>*</math> denotes element-wise product. Marginalizing the noise, the objective function becomes

<math>\min_{\mathbf{w}} \mathbb{E}_{R\sim Bernoulli(p)}[\parallel \mathbf{y}-(R*X)\mathbf{w}\parallel^2 ]
</math>

This reduces to

<math>\min_{\mathbf{w}} \parallel \mathbf{y}-pX\mathbf{w}\parallel^2+p(1-p)\parallel \Gamma\mathbf{w}\parallel^2
</math>

where <math>\Gamma=(diag(X^TX))^{\frac{1}{2}}</math>. Therefore, dropout with linear regression is equivalent to ridge regression with a particular form for <math>\Gamma</math>. This form of <math>\Gamma</math> essentially scales the weight cost for weight <math>w_i</math> by the standard deviation of the <math>i</math>th dimension of the data. If a particular data dimension varies a lot, the regularizer tries to squeeze its weight more.
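The reduction can be checked numerically on a tiny problem by enumerating every possible mask <math>R</math> and comparing the exact expectation with the closed form. The data below are arbitrary values chosen only for the check, and the function names are hypothetical.

```python
import itertools
import numpy as np

def marginal_dropout_loss(X, y, w, p):
    """Exact E_R ||y - (R * X) w||^2 by enumerating all 2^(N*D) masks."""
    N, D = X.shape
    total = 0.0
    for bits in itertools.product([0, 1], repeat=N * D):
        R = np.array(bits).reshape(N, D)
        kept = R.sum()
        prob = p ** kept * (1.0 - p) ** (N * D - kept)
        total += prob * np.sum((y - (R * X) @ w) ** 2)
    return total

def closed_form_loss(X, y, w, p):
    """||y - p X w||^2 + p (1 - p) ||Gamma w||^2 with Gamma = diag(X^T X)^(1/2)."""
    gamma = np.sqrt(np.diag(X.T @ X))  # diagonal entries of Gamma
    return np.sum((y - p * X @ w) ** 2) + p * (1.0 - p) * np.sum((gamma * w) ** 2)

X = np.array([[1.0, 2.0], [3.0, -1.0]])
y = np.array([1.0, 0.5])
w = np.array([0.3, -0.7])
p = 0.6
lhs = marginal_dropout_loss(X, y, w, p)
rhs = closed_form_loss(X, y, w, p)
```

The two values agree because the variance of each masked term <math>R_{ij}X_{ij}w_j</math> is <math>p(1-p)X_{ij}^2w_j^2</math>, and summing these variances over <math>i</math> and <math>j</math> gives the penalty term.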

== Bayesian Neural Networks and Dropout ==

For some data set <math>\,\{(x_i,y_i)\}^n_{i=1}</math>, the Bayesian approach to estimating <math>\,y_{n+1}</math> given <math>\,x_{n+1}</math> is to pick some prior distribution <math>\,P(\theta)</math> over the model parameters <math>\,\theta</math>, and assign probabilities for <math>\,y_{n+1}</math> using the posterior distribution based on the prior distribution and the data set.

The general formula is:

<math>\,P(y_{n+1}|y_1,\dots,y_n,x_1,\dots,x_n,x_{n+1})=\int P(y_{n+1}|x_{n+1},\theta)P(\theta|y_1,\dots,y_n,x_1,\dots,x_n)d\theta</math>

To obtain a prediction, it is common to take the expected value of this distribution to get the formula:

<math>\,\hat y_{n+1}=\int\int y_{n+1}P(y_{n+1}|x_{n+1},\theta)P(\theta|y_1,\dots,y_n,x_1,\dots,x_n)\,dy_{n+1}\,d\theta</math>

This formula can be applied to a neural network by thinking of <math>\,\theta</math> as all of the parameters in the neural network, while <math>\,P(y_{n+1}|x_{n+1},\theta)</math> can be thought of as the output of the neural network given some set of weights and the input. Since the output of a neural network is deterministic given its weights and input, the probability is 1 for a single output and 0 for all other possible outputs, and the formula can be rewritten as:

<math>\,\hat y_{n+1}=\int f(x_{n+1},\theta)P(\theta|y_1,\dots,y_n,x_1,\dots,x_n)d\theta</math>

where <math>\,f(x_{n+1},\theta)</math> is the output of the neural network given some weights and input. This expected value is essentially the average of infinitely many possible neural network outputs, each weighted by the posterior probability of its weights given the data set.

The dropout model does something very similar, in that it takes the average of the outputs of a wide variety of models with different weights. However, unlike Bayesian neural networks, where each of these outputs is weighted by the proper posterior probability of its model, the dropout model assigns equal probability to each model. This impacts the accuracy of dropout neural networks compared to Bayesian neural networks, but it has very strong advantages in training speed and ability to scale.

Despite this approximate probability weighting, the researchers compared the two models and found that, while dropout is less accurate than Bayesian neural networks, it is still better than standard neural network models. This can be seen in their chart below, where higher is better:

[[File:BNN.PNG]]
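The difference between posterior weighting and equal weighting can be illustrated with a discrete toy example. The three model outputs and their posterior probabilities below are invented purely for illustration and do not come from the paper.

```python
import numpy as np

# Hypothetical outputs f(x, theta_k) of three models for the same input x.
outputs = np.array([1.0, 2.0, 4.0])
# Hypothetical posterior probabilities P(theta_k | data); they sum to 1.
posterior = np.array([0.7, 0.2, 0.1])

bayes_prediction = float(posterior @ outputs)  # models weighted by their posterior
equal_prediction = float(outputs.mean())       # every model weighted equally
```

When the posterior is far from uniform, as in this toy example, the two averages differ noticeably.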

= Effects of Dropout =

''' Effect on Features '''

In a standard neural network, units may change in a way that they fix up the mistakes of the other units, which may lead to complex co-adaptations and overfitting because these co-adaptations do not generalize to unseen data. Dropout breaks the co-adaptations between hidden units by making the presence of other units unreliable. Figure 7a (without dropout) shows that the hidden units have co-adapted in order to produce good reconstructions: each hidden unit on its own does not detect a meaningful feature. In Figure 7b (with dropout), the hidden units seem to detect edges, strokes and spots in different parts of the image.
[[File:feature.png]]

''' Effect on Sparsity '''

Sparsity helps prevent overfitting. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low. Comparing the histograms of activations, we can see that fewer hidden units have high activations in Figure 8b (with dropout) than in Figure 8a (without dropout).
[[File:sparsity.png]]

'''Effect of Dropout Rate'''

The paper examines the effect of the tunable hyperparameter <math>p </math>. The comparison is done in two situations:
1. The number of hidden units is held constant (fixed <math>n </math>).
2. The expected number of hidden units that will be retained after dropout is held constant (fixed <math>pn </math>).
The optimal <math>p </math> in case 1 lies between 0.4 and 0.8, while in case 2 it is about 0.6. The usual default value in practice is 0.5, which is close to optimal.
[[File:pvalue.png]]

'''Effect of Data Set Size'''

This section explores the effect of changing the data set size when dropout is used with feed-forward networks. Figure 10 shows that dropout does not give any improvement on very small data sets (100, 500). As the size of the data set increases, the gain from dropout increases up to a point and then declines.

[[File:Datasize.png]]

= Comparison =

The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations. Dropout + Max-norm outperforms all other chosen methods. The result is below:

[[File:Comparison.png]]

= Result =

The authors applied dropout to the MNIST data set and compared it against several other methods. The MNIST data set consists of 28 × 28 pixel handwritten digit images, and the task is to classify the images into 10 digit classes. In the result table, the Deep Boltzmann Machine with dropout finetuning performs best, with an error rate of only 0.79%.

[[File:Result.png]]

In order to test the robustness of dropout, they did classification experiments with networks of many different architectures keeping all hyperparameters fixed. The figure below shows the test error rates obtained for these different architectures as training progresses. Dropout gives a huge improvement across all architectures.

[[File:dropout.PNG]]

The authors also applied dropout to many neural networks and tested it on different data sets, such as Street View House Numbers (SVHN), CIFAR, ImageNet and TIMIT. In these experiments, adding dropout reduced the error rate and improved the performance of the neural networks.

=Conclusion=

Dropout is a technique for preventing overfitting in deep neural networks that have a large number of parameters. It can also be extended to Restricted Boltzmann Machines and other graphical models, as well as to other architectures such as convolutional networks. One drawback of dropout is that it increases training time, which creates a trade-off between overfitting and training time.

=Reference=
<references />

show, Attend and Tell: Neural Image Caption Generation with Visual Attention

2015-11-20T03:22:43Z

Amirlk: /* Attention: Two Variants */

= Introduction =

This paper<ref>
Xu, Kelvin, et al. [http://arxiv.org/pdf/1502.03044v2.pdf "Show, attend and tell: Neural image caption generation with visual attention."] arXiv preprint arXiv:1502.03044 (2015).
</ref> introduces an attention based model that automatically learns to describe the content of images. It is able to focus on salient parts of the image while generating the corresponding word in the output sentence. A visualization is provided showing which part of the image was attended to to generate each specific word in the output. This can be used to get a sense of what is going on in the model and is especially useful for understanding the kinds of mistakes it makes. The model is tested on three datasets, Flickr8k, Flickr30k, and MS COCO.

= Motivation =
Caption generation, which requires compressing huge amounts of salient visual information into descriptive language, was recently improved by combining convolutional neural networks with recurrent neural networks. However, using representations from the top layer of a convolutional net, which distill the information in an image down to the most salient objects, can lose information that would be useful for richer, more descriptive captions. Retaining this information by using lower-level representations was the motivation for the current work.

= Contributions =

* Two attention-based image caption generators using a common framework. A "soft" deterministic attention mechanism and a "hard" stochastic mechanism.
* Show how to gain insight and interpret results of this framework by visualizing "where" and "what" the attention focused on.
* Quantitatively validate the usefulness of attention in caption generation with state of the art performance on three datasets (Flickr8k, Flickr30k, and MS COCO)

= Model =

The model takes in a single image and generates a caption of arbitrary length. The caption is a sequence of [http://stackoverflow.com/questions/17469835/one-hot-encoding-for-machine-learning one-hot encoded words] (binary vector) from a given vocabulary.

[[File:AttentionOneHotEncoding.png]]
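A caption in this representation is a matrix with one row per word and one column per vocabulary entry. The sketch below uses a made-up five-word vocabulary; the function name is hypothetical.

```python
import numpy as np

def one_hot_caption(words, vocab):
    """Encode a caption as a (caption length, vocabulary size) binary matrix."""
    index = {word: i for i, word in enumerate(vocab)}
    Y = np.zeros((len(words), len(vocab)))
    for t, word in enumerate(words):
        Y[t, index[word]] = 1.0  # exactly one entry per row is set
    return Y

vocab = ["a", "bird", "flying", "over", "water"]  # toy vocabulary
Y = one_hot_caption(["a", "bird", "flying"], vocab)
```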

[[File:AttentionNetwork.png]]

== Encoder: Convolutional Features ==

Feature vectors are extracted from a convolutional neural network to use as input for the attention mechanism. The extractor produces ''L'' ''D''-dimensional vectors, each corresponding to a part of the image.

[[File:AttentionAnnotationVectors.png]]

Unlike previous work, features are extracted from a lower convolutional layer instead of a fully connected layer. This allows the feature vectors to have a correspondence with portions of the 2D image.

== Decoder: Long Short-Term Memory Network ==

[[File:AttentionLSTM.png]]

The purpose of the LSTM is to output a sequence of 1-of-K encodings represented as:

<math>y=\{y_1,\dots,y_C\},y_i\in\mathbb{R}^K</math>, where C is the length of the caption and K is the vocabulary size.

To generate this sequence of outputs, a set of feature vectors was extracted from the image using a convolutional neural network and represented as:

<math>a=\{a_1,\dots,a_L\},a_i\in\mathbb{R}^D</math>, where D is the dimension of the feature vector extracted by the convolutional neural network.

Let <math>T_{s,t} : \mathbb{R}^s \rightarrow \mathbb{R}^t </math> be a simple affine transformation, i.e. <math>\,Wx + b</math> for some projection weight matrix W and some bias vector b learned as parameters in the LSTM.

The equations for the LSTM can then be simplified as:

<math>\begin{pmatrix}i_t\\f_t\\o_t\\g_t\end{pmatrix}=\begin{pmatrix}\sigma\\\sigma\\\sigma\\tanh\end{pmatrix}T_{D+m+n,n}\begin{pmatrix}Ey_{t-1}\\h_{t-1}\\\hat z_{t}\end{pmatrix}</math>

<math>c_t=f_t\odot c_{t-1} + i_t \odot g_t</math>

<math>h_t=o_t \odot tanh(c_t)</math>

where <math>\,i_t,f_t,o_t,g_t,c_t,h_t</math> correspond to the values and gate labels in the diagram: the input gate, forget gate, output gate, candidate cell update, memory cell and hidden state, respectively. Additionally, <math>\,\sigma</math> is the logistic sigmoid function, and both it and <math>\,\tanh</math> are applied element-wise in the first equation.
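One time step of these equations can be sketched in NumPy as follows. The sketch assumes that the affine map produces a vector of length 4n that is split into the four gates, and it uses random toy weights; the function names and shapes are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(Ey_prev, h_prev, c_prev, z_hat, W, b):
    """One LSTM step conditioned on the context vector z_hat.

    Ey_prev : embedded previous word, shape (m,)
    h_prev  : previous hidden state, shape (n,)
    c_prev  : previous memory cell, shape (n,)
    z_hat   : context vector, shape (D,)
    W, b    : affine parameters, shapes (4n, m + n + D) and (4n,)
    """
    n = h_prev.shape[0]
    x = np.concatenate([Ey_prev, h_prev, z_hat])
    a = W @ x + b                     # affine transformation T
    i = sigmoid(a[0:n])               # input gate
    f = sigmoid(a[n:2 * n])           # forget gate
    o = sigmoid(a[2 * n:3 * n])       # output gate
    g = np.tanh(a[3 * n:4 * n])       # candidate cell update
    c = f * c_prev + i * g            # new memory cell
    h = o * np.tanh(c)                # new hidden state
    return h, c

m, n, D = 3, 4, 5                     # toy dimensions
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * n, m + n + D))
b = np.zeros(4 * n)
h, c = lstm_step(rng.normal(size=m), np.zeros(n), np.zeros(n),
                 rng.normal(size=D), W, b)
```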

At each time step, the LSTM outputs the relative probability of every single word in the vocabulary given a context vector, the previous hidden state and the previously generated word. This is done through additional feedforward layers between the LSTM layers and the output layer, known as a deep output layer setup, that take the state of the LSTM <math>\,h_t</math> and apply additional transformations to get the relative probability:

<math>p(y_t|a,y_1^{t-1})\propto \exp(L_o(Ey_{t-1}+L_hh_t+L_z\hat z_t))</math>

where <math>L_o\in\mathbb{R}^{K\times m},L_h\in\mathbb{R}^{m\times n},L_z\in\mathbb{R}^{m\times D},E\in\mathbb{R}^{m\times K}</math> are randomly initialized parameters that are learned through training the LSTM. This series of matrix and vector multiplications results in a vector of dimension K, where each element represents the relative probability of the word indexed by that element being next in the sequence of outputs.
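The deep output layer can be sketched as below, with the proportionality turned into a proper distribution by a softmax normalization. The parameter values are random placeholders and the function name is hypothetical.

```python
import numpy as np

def word_distribution(Ey_prev, h, z_hat, L_o, L_h, L_z):
    """Normalized probabilities over the K vocabulary words."""
    logits = L_o @ (Ey_prev + L_h @ h + L_z @ z_hat)
    logits = logits - logits.max()   # subtract the maximum for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

K, m, n, D = 6, 3, 4, 5              # toy dimensions
rng = np.random.default_rng(0)
probs = word_distribution(
    rng.normal(size=m),              # E y_{t-1}
    rng.normal(size=n),              # h_t
    rng.normal(size=D),              # context vector
    rng.normal(size=(K, m)),         # L_o
    rng.normal(size=(m, n)),         # L_h
    rng.normal(size=(m, D)),         # L_z
)
next_word = int(np.argmax(probs))    # index of the most probable next word
```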

<math>\hat{z}</math> is the context vector, which is a function of the feature vectors <math>a=\{a_1,\dots,a_L\}</math> and the attention model, as discussed in the next section.

== Attention: Two Variants ==

The attention algorithm is one of the arguments that influences the state of the LSTM. There are two variants of the attention algorithm used: stochastic "hard" and deterministic "soft" attention. The visual differences between the two can be seen in the "Properties" section.

In stochastic "hard" attention, the context vector <math>\hat{z}_t</math> is formed from a one-hot encoded location variable <math>s_{t,i}</math> and the extracted features <math>a_{i}</math>, i.e. <math>\hat{z}_t=\sum_i s_{t,i}a_i</math>. It is called "hard" attention because a hard choice of a single location is made at each time step, and it is stochastic because <math>s_t</math> is sampled from a multinoulli distribution [http://cs.brown.edu/courses/cs195-5/spring2012/lectures/2012-01-31_probabilityDecisions.pdf (see page 11 of this link for an explanation of the distribution)]. In this approach the location variable <math>s_t</math> represents where the model decides to focus attention when generating the <math>t^{th}</math> word.

Learning stochastic attention requires sampling the attention location <math>s_t</math> each time. Instead, we can take the expectation of the context vector <math>\hat{z}_t</math> directly and formulate a deterministic attention model by computing a soft attention weighted annotation vector<ref name=BaD>
Bahdanau, Dzmitry, ''et al'' [http://arxiv.org/pdf/1409.0473.pdf "Neural machine translation by jointly learning to align and translate."] in arXiv, (2014).
</ref>. It is deterministic because <math>s_{t,i}</math> is not sampled from a distribution, and it is soft because every location contributes to the context vector according to its weight instead of a single location being chosen.
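The two ways of forming the context vector can be contrasted in a short sketch. Here the attention weights <code>alpha</code> are assumed to be given and to sum to 1 (how they are computed is not covered in this summary), and the feature vectors are toy values.

```python
import numpy as np

def soft_context(alpha, a):
    """Deterministic context: expectation sum_i alpha_i a_i."""
    return alpha @ a

def hard_context(alpha, a, rng):
    """Stochastic context: sample one location and return its feature vector."""
    i = rng.choice(len(alpha), p=alpha)  # draw the location from a multinoulli
    s = np.zeros(len(alpha))
    s[i] = 1.0                           # one-hot location variable s_t
    return s @ a, i

a = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])               # L = 3 feature vectors with D = 2
alpha = np.array([0.5, 0.25, 0.25])      # assumed attention weights
z_soft = soft_context(alpha, a)
rng = np.random.default_rng(0)
z_hard, location = hard_context(alpha, a, rng)
```

The soft context is a weighted blend of all feature vectors, while the hard context is always exactly one of the rows of <code>a</code>.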

The actual optimization methods for both of these attention methods are outside the scope of this summary.

== Properties ==

"Where" the network looks next depends on the sequence of words that has already been generated.

The attention framework learns latent alignments from scratch instead of explicitly using object detectors. This allows the model to go beyond "objectness" and learn to attend to abstract concepts.

[[File:AttentionHighlights.png]]

== Training ==

Each mini-batch used in training contained captions with similar length. This is because the implementation requires time proportional to the longest length sentence per update, so having all of the sentences in each update have similar length improved the convergence speed dramatically.
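One simple way to build such mini-batches is to group captions by their length before batching. The sketch below groups by exact word count, which is an assumption made for the example; the captions and the function name are made up.

```python
from collections import defaultdict

def length_bucketed_batches(captions, batch_size):
    """Group captions by word count, then cut each group into mini-batches."""
    buckets = defaultdict(list)
    for caption in captions:
        buckets[len(caption.split())].append(caption)
    batches = []
    for length in sorted(buckets):
        group = buckets[length]
        for start in range(0, len(group), batch_size):
            batches.append(group[start:start + batch_size])
    return batches

captions = [
    "a dog runs",
    "a cat sleeps",
    "two birds fly",
    "a man rides a horse",
    "a girl eats an apple",
]
batches = length_bucketed_batches(captions, batch_size=2)
```

Every mini-batch then contains captions of a single length, so no update is slowed down by one unusually long sentence.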

Two regularization techniques were used: dropout and early stopping on BLEU score. Since BLEU is the more commonly reported metric, BLEU on the validation set is used for model selection.

The MS COCO dataset has more than 5 reference sentences for some of the images, while the Flickr datasets have exactly 5. For consistency, the reference sentences for all images in the MS COCO dataset were truncated to 5. Some basic tokenization was also applied to the MS COCO dataset to be consistent with the tokenization in the Flickr datasets.

On the largest dataset (MS COCO) the attention model took less than 3 days to train on an NVIDIA Titan Black GPU.

= Results =

Results are reported with the [https://en.wikipedia.org/wiki/BLEU BLEU] and [https://en.wikipedia.org/wiki/METEOR METEOR] metrics. BLEU is one of the most common metrics for translation tasks, but due to some criticism of the metric, METEOR is used as well. Both of these metrics are designed for evaluating machine translation, which is typically from one language to another. Caption generation can be thought of as analogous to translation, where the image is a sentence in the original 'language' and the caption is its translation to English (or another language, but in this case the captions are only in English).

[[File:AttentionResults.png]]

[[File:AttentionGettingThingsRight.png]]

[[File:AttentionGettingThingsWrong.png]]

=References=
<references />

show, Attend and Tell: Neural Image Caption Generation with Visual Attention

2015-11-20T03:20:06Z

Amirlk: /* Attention: Two Variants */

= Introduction =

This paper<ref>
Xu, Kelvin, et al. [http://arxiv.org/pdf/1502.03044v2.pdf "Show, attend and tell: Neural image caption generation with visual attention."] arXiv preprint arXiv:1502.03044 (2015).
</ref> introduces an attention based model that automatically learns to describe the content of images. It is able to focus on salient parts of the image while generating the corresponding word in the output sentence. A visualization is provided showing which part of the image was attended to to generate each specific word in the output. This can be used to get a sense of what is going on in the model and is especially useful for understanding the kinds of mistakes it makes. The model is tested on three datasets, Flickr8k, Flickr30k, and MS COCO.

= Motivation =
Caption generation, which requires compressing huge amounts of salient visual information into descriptive language, was recently improved by combining convolutional neural networks with recurrent neural networks. However, using representations from the top layer of a convolutional net, which distill the information in an image down to the most salient objects, can lose information that would be useful for richer, more descriptive captions. Retaining this information by using lower-level representations was the motivation for the current work.

= Contributions =

* Two attention-based image caption generators using a common framework. A "soft" deterministic attention mechanism and a "hard" stochastic mechanism.
* Show how to gain insight and interpret results of this framework by visualizing "where" and "what" the attention focused on.
* Quantitatively validate the usefulness of attention in caption generation with state of the art performance on three datasets (Flickr8k, Flickr30k, and MS COCO)

= Model =

The model takes in a single image and generates a caption of arbitrary length. The caption is a sequence of [http://stackoverflow.com/questions/17469835/one-hot-encoding-for-machine-learning one-hot encoded words] (binary vector) from a given vocabulary.

[[File:AttentionOneHotEncoding.png]]

[[File:AttentionNetwork.png]]

== Encoder: Convolutional Features ==

Feature vectors are extracted from a convolutional neural network to use as input for the attention mechanism. The extractor produces ''L'' ''D''-dimensional vectors, each corresponding to a part of the image.

[[File:AttentionAnnotationVectors.png]]

Unlike previous work, features are extracted from a lower convolutional layer instead of a fully connected layer. This allows the feature vectors to have a correspondence with portions of the 2D image.

== Decoder: Long Short-Term Memory Network ==

[[File:AttentionLSTM.png]]

The purpose of the LSTM is to output a sequence of 1-of-K encodings represented as:

<math>y=\{y_1,\dots,y_C\},y_i\in\mathbb{R}^K</math>, where C is the length of the caption and K is the vocabulary size.

To generate this sequence of outputs, a set of feature vectors was extracted from the image using a convolutional neural network and represented as:

<math>a=\{a_1,\dots,a_L\},a_i\in\mathbb{R}^D</math>, where D is the dimension of the feature vector extracted by the convolutional neural network.

Let <math>T_{s,t} : \mathbb{R}^s \rightarrow \mathbb{R}^t </math> be a simple affine transformation, i.e. <math>\,Wx + b</math> for some projection weight matrix W and some bias vector b learned as parameters in the LSTM.

The equations for the LSTM can then be simplified as:

<math>\begin{pmatrix}i_t\\f_t\\o_t\\g_t\end{pmatrix}=\begin{pmatrix}\sigma\\\sigma\\\sigma\\tanh\end{pmatrix}T_{D+m+n,n}\begin{pmatrix}Ey_{t-1}\\h_{t-1}\\\hat z_{t}\end{pmatrix}</math>

<math>c_t=f_t\odot c_{t-1} + i_t \odot g_t</math>

<math>h_t=o_t \odot tanh(c_t)</math>

where <math>\,i_t,f_t,o_t,g_t,c_t,h_t</math> correspond to the values and gate labels in the diagram: the input gate, forget gate, output gate, candidate cell update, memory cell and hidden state, respectively. Additionally, <math>\,\sigma</math> is the logistic sigmoid function, and both it and <math>\,\tanh</math> are applied element-wise in the first equation.

At each time step, the LSTM outputs the relative probability of every single word in the vocabulary given a context vector, the previous hidden state and the previously generated word. This is done through additional feedforward layers between the LSTM layers and the output layer, known as a deep output layer setup, that take the state of the LSTM <math>\,h_t</math> and apply additional transformations to get the relative probability:

<math>p(y_t|a,y_1^{t-1})\propto \exp(L_o(Ey_{t-1}+L_hh_t+L_z\hat z_t))</math>

where <math>L_o\in\mathbb{R}^{K\times m},L_h\in\mathbb{R}^{m\times n},L_z\in\mathbb{R}^{m\times D},E\in\mathbb{R}^{m\times K}</math> are randomly initialized parameters that are learned through training the LSTM. This series of matrix and vector multiplications results in a vector of dimension K, where each element represents the relative probability of the word indexed by that element being next in the sequence of outputs.

<math>\hat{z}</math> is the context vector, which is a function of the feature vectors <math>a=\{a_1,\dots,a_L\}</math> and the attention model, as discussed in the next section.

== Attention: Two Variants ==

The attention algorithm is one of the arguments that influences the state of the LSTM. There are two variants of the attention algorithm used: stochastic "hard" and deterministic "soft" attention. The visual differences between the two can be seen in the "Properties" section.

In stochastic "hard" attention, the context vector <math>\hat{z}_t</math> is formed from a one-hot encoded location variable <math>s_{t,i}</math> and the extracted features <math>a_{i}</math>, i.e. <math>\hat{z}_t=\sum_i s_{t,i}a_i</math>. It is called "hard" attention because a hard choice of a single location is made at each time step, and it is stochastic because <math>s_t</math> is sampled from a multinoulli distribution [http://cs.brown.edu/courses/cs195-5/spring2012/lectures/2012-01-31_probabilityDecisions.pdf (see page 11 of this link for an explanation of the distribution)]. In this approach the location variable <math>s_t</math> represents where the model decides to focus attention when generating the <math>t^{th}</math> word.

Learning stochastic attention requires sampling the attention location <math>s_t</math> each time. Instead, we can take the expectation of the context vector <math>\hat{z}_t</math> directly and formulate a deterministic attention model by computing a soft attention weighted annotation vector. It is deterministic because <math>s_{t,i}</math> is not sampled from a distribution, and it is soft because every location contributes to the context vector according to its weight instead of a single location being chosen.

The actual optimization methods for both of these attention methods are outside the scope of this summary.

== Properties ==

"Where" the network looks next depends on the sequence of words that has already been generated.

The attention framework learns latent alignments from scratch instead of explicitly using object detectors. This allows the model to go beyond "objectness" and learn to attend to abstract concepts.

[[File:AttentionHighlights.png]]

== Training ==

Each mini-batch used in training contained captions with similar length. This is because the implementation requires time proportional to the longest length sentence per update, so having all of the sentences in each update have similar length improved the convergence speed dramatically.

Two regularization techniques were used: dropout and early stopping on BLEU score. Since BLEU is the more commonly reported metric, BLEU on the validation set is used for model selection.

The MS COCO dataset has more than 5 reference sentences for some of the images, while the Flickr datasets have exactly 5. For consistency, the reference sentences for all images in the MS COCO dataset were truncated to 5. Some basic tokenization was also applied to the MS COCO dataset to be consistent with the tokenization in the Flickr datasets.

On the largest dataset (MS COCO) the attention model took less than 3 days to train on an NVIDIA Titan Black GPU.

= Results =

Results are reported with the [https://en.wikipedia.org/wiki/BLEU BLEU] and [https://en.wikipedia.org/wiki/METEOR METEOR] metrics. BLEU is one of the most common metrics for translation tasks, but due to some criticism of the metric, METEOR is used as well. Both of these metrics are designed for evaluating machine translation, which is typically from one language to another. Caption generation can be thought of as analogous to translation, where the image is a sentence in the original 'language' and the caption is its translation to English (or another language, but in this case the captions are only in English).

[[File:AttentionResults.png]]

[[File:AttentionGettingThingsRight.png]]

[[File:AttentionGettingThingsWrong.png]]

=References=
<references />

show, Attend and Tell: Neural Image Caption Generation with Visual Attention

2015-11-20T03:17:08Z

Amirlk: /* Attention: Two Variants */

= Introduction =

This paper<ref>
Xu, Kelvin, et al. [http://arxiv.org/pdf/1502.03044v2.pdf "Show, attend and tell: Neural image caption generation with visual attention."] arXiv preprint arXiv:1502.03044 (2015).
</ref> introduces an attention based model that automatically learns to describe the content of images. It is able to focus on salient parts of the image while generating the corresponding word in the output sentence. A visualization is provided showing which part of the image was attended to to generate each specific word in the output. This can be used to get a sense of what is going on in the model and is especially useful for understanding the kinds of mistakes it makes. The model is tested on three datasets, Flickr8k, Flickr30k, and MS COCO.

= Motivation =
Caption generation, which requires compressing huge amounts of salient visual information into descriptive language, was recently improved by combining convolutional neural networks with recurrent neural networks. However, using representations from the top layer of a convolutional net, which distill the information in an image down to the most salient objects, can lose information that would be useful for richer, more descriptive captions. Retaining this information by using lower-level representations was the motivation for the current work.

= Contributions =

* Two attention-based image caption generators using a common framework: a "soft" deterministic attention mechanism and a "hard" stochastic mechanism.
* Show how to gain insight and interpret results of this framework by visualizing "where" and "what" the attention focused on.
* Quantitatively validate the usefulness of attention in caption generation with state-of-the-art performance on three datasets (Flickr8k, Flickr30k, and MS COCO).

= Model =

The model takes in a single image and generates a caption of arbitrary length. The caption is a sequence of [http://stackoverflow.com/questions/17469835/one-hot-encoding-for-machine-learning one-hot encoded words] (binary vector) from a given vocabulary.
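
As a minimal sketch of this encoding, assuming a toy vocabulary (the words and the names <code>vocab</code>, <code>word_to_index</code> and <code>one_hot</code> below are hypothetical and not taken from the paper), a caption can be turned into a stack of binary vectors like this:

```python
import numpy as np

# Hypothetical toy vocabulary; a real one would be built from the training captions.
vocab = ["<start>", "a", "bird", "flying", "over", "water", "<end>"]
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word, word_to_index):
    """Return a binary vector of length K with a single 1 at the word's index."""
    vec = np.zeros(len(word_to_index), dtype=np.int64)
    vec[word_to_index[word]] = 1
    return vec


caption = ["a", "bird", "flying", "over", "water"]
# One row per word: shape (C, K) with C = caption length, K = vocabulary size.
encoded = np.stack([one_hot(w, word_to_index) for w in caption])
```

Each row contains exactly one non-zero entry, so the caption is fully described by the positions of those entries.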

[[File:AttentionOneHotEncoding.png]]

[[File:AttentionNetwork.png]]

== Encoder: Convolutional Features ==

Feature vectors are extracted from a convolutional neural network to use as input for the attention mechanism. The extractor produces ''L'' vectors of dimension ''D'', each corresponding to a part of the image.

[[File:AttentionAnnotationVectors.png]]

Unlike previous work, features are extracted from a lower convolutional layer instead of a fully connected layer. This allows the feature vectors to have a correspondence with portions of the 2D image.
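
The correspondence between feature vectors and image regions can be illustrated with a short sketch. The feature-map shape (14 x 14 x 512) and the random stand-in data are assumptions made here for illustration only; this summary does not state which shape the network produces.

```python
import numpy as np

# Assumed shape of a convolutional feature map (height, width, channels).
H, W, D = 14, 14, 512
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((H, W, D))  # stand-in for real CNN output

# Flatten the spatial grid: row i of `annotations` plays the role of a_i.
annotations = feature_map.reshape(H * W, D)
L = annotations.shape[0]  # number of feature vectors


def grid_position(i, width=W):
    """Map the index of a feature vector back to its (row, column) grid cell."""
    return divmod(i, width)
```

Because each row of <code>annotations</code> comes from one cell of the spatial grid, attending to a row amounts to attending to a region of the image. A fully connected layer would not keep this mapping.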

== Decoder: Long Short-Term Memory Network ==

[[File:AttentionLSTM.png]]

The purpose of the LSTM is to output a sequence of 1-of-K encodings represented as:

<math>y=\{y_1,\dots,y_C\},\ y_i\in\mathbb{R}^K</math>, where C is the length of the caption and K is the vocabulary size.

To generate this sequence of outputs, a set of feature vectors was extracted from the image using a convolutional neural network and represented as:

<math>a=\{a_1,\dots,a_L\},\ a_i\in\mathbb{R}^D</math>, where D is the dimension of the feature vector extracted by the convolutional neural network.

Let <math>T_{s,t} : \mathbb{R}^s \rightarrow \mathbb{R}^t </math> be a simple affine transformation, i.e. <math>\,Wx + b</math> for some projection weight matrix W and some bias vector b learned as parameters in the LSTM.

The equations for the LSTM can then be simplified as:

<math>\begin{pmatrix}i_t\\f_t\\o_t\\g_t\end{pmatrix}=\begin{pmatrix}\sigma\\\sigma\\\sigma\\\tanh\end{pmatrix}T_{D+m+n,n}\begin{pmatrix}Ey_{t-1}\\h_{t-1}\\\hat z_{t}\end{pmatrix}</math>

<math>c_t=f_t\odot c_{t-1} + i_t \odot g_t</math>

<math>h_t=o_t \odot \tanh(c_t)</math>

where <math>\,i_t,f_t,o_t,g_t,c_t,h_t</math> correspond to the values and gate labels in the diagram. Additionally, <math>\,\sigma</math> is the logistic sigmoid function, and both it and <math>\,\tanh</math> are applied element-wise in the first equation.
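
The three equations can be sketched as a single NumPy step. This is only an illustration under stated assumptions: the affine map is stored as one stacked weight matrix <code>W</code> and bias <code>b</code> (hypothetical names) holding the four gate blocks, and the sizes are toy values.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(Ey_prev, h_prev, c_prev, z_hat, W, b):
    """One LSTM step. W has shape (4n, m + n + D); b has shape (4n,)."""
    n = h_prev.shape[0]
    x = np.concatenate([Ey_prev, h_prev, z_hat])  # stacked input, (m + n + D,)
    pre = W @ x + b                               # affine transformation
    i = sigmoid(pre[0:n])                         # input gate
    f = sigmoid(pre[n:2 * n])                     # forget gate
    o = sigmoid(pre[2 * n:3 * n])                 # output gate
    g = np.tanh(pre[3 * n:4 * n])                 # candidate values
    c = f * c_prev + i * g                        # new memory cell
    h = o * np.tanh(c)                            # new hidden state
    return h, c


# Toy sizes (assumed): embedding m, hidden size n, feature dimension D.
m, n, D = 4, 3, 5
rng = np.random.default_rng(0)
W = rng.standard_normal((4 * n, m + n + D)) * 0.1
b = np.zeros(4 * n)
h, c = lstm_step(rng.standard_normal(m), np.zeros(n), np.zeros(n),
                 rng.standard_normal(D), W, b)
```

The products <code>f * c_prev</code> and <code>i * g</code> are element-wise, matching the <math>\odot</math> operator in the equations.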

At each time step, the LSTM outputs the relative probability of every word in the vocabulary given a context vector, the previous hidden state, and the previously generated word. This is done through additional feedforward layers between the LSTM layers and the output layer, known as a deep output layer setup, which take the state of the LSTM <math>\,h_t</math> and apply additional transformations to get the relative probability:

<math>p(y_t \mid a,y_1^{t-1})\propto \exp(L_o(Ey_{t-1}+L_hh_t+L_z\hat z_t))</math>

where <math>L_o\in\mathbb{R}^{K\times m},L_h\in\mathbb{R}^{m\times n},L_z\in\mathbb{R}^{m\times D},E\in\mathbb{R}^{m\times K}</math> are randomly initialized parameters that are learned through training the LSTM. This series of matrix and vector multiplications then results in a vector of dimension K where each element represents the relative probability of the word indexed with that element being next in the sequence of outputs.
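
A minimal sketch of this output layer follows, assuming toy dimensions and random parameters. Normalizing with a softmax is an assumption made here to turn the proportionality into an actual probability distribution; the function name <code>word_distribution</code> is hypothetical.

```python
import numpy as np


def word_distribution(Ey_prev, h, z_hat, L_o, L_h, L_z):
    """Normalize exp(L_o(E y_{t-1} + L_h h_t + L_z z_hat_t)) over the vocabulary."""
    scores = L_o @ (Ey_prev + L_h @ h + L_z @ z_hat)  # shape (K,)
    scores = scores - scores.max()                    # for numerical stability
    p = np.exp(scores)
    return p / p.sum()


# Toy sizes (assumed): vocabulary K, embedding m, hidden size n, feature dimension D.
K, m, n, D = 6, 4, 3, 5
rng = np.random.default_rng(0)
E = rng.standard_normal((m, K))
L_o = rng.standard_normal((K, m))
L_h = rng.standard_normal((m, n))
L_z = rng.standard_normal((m, D))

y_prev = np.zeros(K)
y_prev[2] = 1.0  # one-hot encoding of the previous word
p = word_distribution(E @ y_prev, rng.standard_normal(n),
                      rng.standard_normal(D), L_o, L_h, L_z)
```

Multiplying <code>E</code> by a one-hot vector simply selects one column of <code>E</code>, which is the embedding of the previous word.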

<math>\hat{z}</math> is the context vector, which is a function of the feature vectors <math>a=\{a_1,\dots,a_L\}</math> and the attention model as discussed in the next section.

== Attention: Two Variants ==

The attention algorithm is one of the arguments that influences the state of the LSTM. There are two variants of the attention algorithm used: stochastic "hard" and deterministic "soft" attention. The visual differences between the two can be seen in the "Properties" section.

Stochastic "hard" attention builds the context vector <math>\hat{z}</math> from a one-hot encoded variable <math>s_{t,i}</math> and the extracted features <math>a_{i}</math>. The location variable <math>s_t</math> represents where the model decides to focus attention when generating the <math>t^{th}</math> word. This is called "hard" attention because a hard choice is made for each feature; it is stochastic because <math>s_{t,i}</math> is sampled from a multinoulli distribution [http://cs.brown.edu/courses/cs195-5/spring2012/lectures/2012-01-31_probabilityDecisions.pdf (see page 11 of this link for an explanation of the distribution)].

Deterministic "soft" attention instead works with the expectation of the context vector. It is deterministic because <math>s_{t,i}</math> is not sampled from a distribution, and it is soft because no individual choice is made: every feature contributes, weighted by the whole distribution.

The actual optimization methods for both of these attention methods are outside the scope of this summary.
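
Leaving the optimization aside, the two ways of forming the context vector can be sketched as follows. The name <code>alpha</code> for the multinoulli parameters is an assumption of this sketch, as are the toy values; how those weights are computed from the LSTM state is not shown.

```python
import numpy as np


def hard_context(annotations, alpha, rng):
    """Sample a one-hot location s from a multinoulli and pick that feature."""
    L = annotations.shape[0]
    idx = rng.choice(L, p=alpha)  # stochastic choice of one location
    s = np.zeros(L)
    s[idx] = 1.0                  # one-hot location variable
    return s @ annotations, s


def soft_context(annotations, alpha):
    """Expected context vector: every feature weighted by its probability."""
    return alpha @ annotations


rng = np.random.default_rng(0)
annotations = rng.standard_normal((4, 3))  # toy case: L = 4 features, D = 3
alpha = np.array([0.1, 0.2, 0.3, 0.4])     # hypothetical attention weights
z_hard, s = hard_context(annotations, alpha, rng)
z_soft = soft_context(annotations, alpha)
```

The hard variant returns exactly one of the feature vectors and changes from sample to sample, while the soft variant always returns the same weighted average for a given <code>alpha</code>.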

== Properties ==

"where" the network looks next depends on the sequence of words that has already been generated.

The attention framework learns latent alignments from scratch instead of explicitly using object detectors. This allows the model to go beyond "objectness" and learn to attend to abstract concepts.

[[File:AttentionHighlights.png]]

== Training ==

Each mini-batch used in training contained captions of similar length. This is because the implementation requires time proportional to the length of the longest sentence per update, so keeping all of the sentences in each update at a similar length improved the convergence speed dramatically.
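
One simple way to build such mini-batches is sketched below. It groups captions by exact length, which is a simplification of "similar length", and the function name <code>length_bucketed_batches</code> and the example captions are hypothetical rather than taken from the authors' implementation.

```python
import random
from collections import defaultdict


def length_bucketed_batches(captions, batch_size, seed=0):
    """Group tokenized captions by length, then cut each group into batches."""
    buckets = defaultdict(list)
    for caption in captions:
        buckets[len(caption)].append(caption)

    batches = []
    for same_length in buckets.values():
        for start in range(0, len(same_length), batch_size):
            batches.append(same_length[start:start + batch_size])

    # Shuffle the batch order so training does not see lengths in sequence.
    random.Random(seed).shuffle(batches)
    return batches


captions = [
    ["a", "dog"],
    ["a", "cat"],
    ["two", "birds"],
    ["a", "man", "runs"],
    ["a", "red", "car"],
]
batches = length_bucketed_batches(captions, batch_size=2)
```

Since every caption in a batch has the same length, no update spends time on padding for a single long sentence.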

Two regularization techniques were used: dropout and early stopping on BLEU score. Since BLEU is the more commonly reported metric, BLEU on the validation set is used for model selection.

The MS COCO dataset has more than 5 reference sentences for some of the images, while the Flickr datasets have exactly 5. For consistency, the reference sentences for all images in the MS COCO dataset were truncated to 5. Some basic tokenization was also applied to the MS COCO dataset to be consistent with the tokenization in the Flickr datasets.

On the largest dataset (MS COCO), the attention model took less than 3 days to train on an NVIDIA Titan Black GPU.

= Results =

Results are reported with the [https://en.wikipedia.org/wiki/BLEU BLEU] and [https://en.wikipedia.org/wiki/METEOR METEOR] metrics. BLEU is one of the most common metrics for translation tasks, but because it has drawn some criticism, METEOR is reported as well. Both metrics were designed for evaluating machine translation, which typically goes from one language to another. Caption generation can be thought of as analogous to translation, where the image is a sentence in the original 'language' and the caption is its translation into English (or another language, but in this case the captions are only in English).

[[File:AttentionResults.png]]

[[File:AttentionGettingThingsRight.png]]

[[File:AttentionGettingThingsWrong.png]]

=References=
<references />