distributed Representations of Words and Phrases and their Compositionality (2015-12-16)
<hr />
<div>= Introduction =<br />
<br />
This paper<ref><br />
Mikolov, Tomas, et al. [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf "Distributed representations of words and phrases and their compositionality."] Advances in neural information processing systems. 2013.<br />
</ref> presents several extensions of the Skip-gram model introduced by Mikolov et al. <ref name=MiT> Mikolov, Tomas, ''et al''. [http://arxiv.org/pdf/1301.3781v3.pdf "Efficient Estimation of Word Representations in Vector Space"] in ICLR Workshop, (2013). </ref>. The Skip-gram model is an efficient method for learning high-quality vector representations of words from large amounts of unstructured text data. The word representations computed using this model are very interesting because the learned vectors explicitly encode many linguistic regularities and patterns. Somewhat surprisingly, many of these patterns can be represented as linear translations. For example, the result of the vector calculation vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector. The authors of this paper show that subsampling frequent words during training results in a significant speedup and improves the accuracy of the representations of less frequent words. In addition, a simplified variant of Noise Contrastive Estimation (NCE) <ref name=GuM><br />
Gutmann, Michael U, ''et al''. [http://www.cs.helsinki.fi/u/ahyvarin/papers/Gutmann12JMLR.pdf "Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics"] in The Journal of Machine Learning Research, (2012).<br />
</ref> is presented for training the Skip-gram model; it results in faster training and better vector representations for frequent words, compared to the more complex hierarchical softmax used in the prior work <ref name=MiT></ref>. The paper also shows that a non-obvious degree of language understanding can be obtained by using basic mathematical operations on the word vector representations. For example, vec(“Russia”) + vec(“river”) is close to vec(“Volga River”), and vec(“Germany”) + vec(“capital”) is close to vec(“Berlin”).<br />
<br />
= The Skip-gram Model =<br />
<br />
The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document. More formally, given a sequence of training words <math>w_1, w_2,..., w_T</math> the objective of the Skip-gram model is to maximize the average log probability:<br />
<br />
<math><br />
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c\leq j\leq c,\, j\neq 0} \log p(w_{t+j}\,|\,w_t)<br />
</math><br />
<br /><br />
<br /><br />
where <math>c</math> is the size of the training context (which can be a function of the center word <math>w_t</math>) and <math>p(w_{t+j}|w_t)</math> is defined using the softmax function:<br />
<br />
<math><br />
p(w_O|w_I) = \frac{\exp\left({v'_{w_O}}^T v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_w}^T v_{w_I}\right)}<br />
</math><br />
<br />
Here, <math>v_w</math> and <math>v'_w</math> are the “''input''” and “''output''” vector representations of <math>w</math>, and <math>W</math> is the number of words in the vocabulary.<br />
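A direct implementation of this full softmax makes the <math>O(W)</math> cost per prediction explicit, which is the motivation for the approximations discussed in the following sections. This is a sketch with illustrative names and toy dimensions, not code from the paper:

```python
import numpy as np

def skipgram_softmax(v_in, v_out):
    """Full softmax p(w_O | w_I) over the whole vocabulary.

    v_in  : (d,)   input vector v_{w_I} of the center word
    v_out : (W, d) output vectors v'_w, one row per vocabulary word
    """
    scores = v_out @ v_in          # {v'_w}^T v_{w_I} for every word w: O(W d)
    scores -= scores.max()         # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

rng = np.random.default_rng(0)
W, d = 1000, 50                    # toy sizes; real vocabularies reach 10^5-10^6
p = skipgram_softmax(rng.normal(size=d), rng.normal(size=(W, d)))
```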
<br />
== Hierarchical Softmax ==<br />
<br />
Hierarchical Softmax is a computationally efficient approximation of the full softmax <ref name=MoF><br />
Morin, Frederic, ''et al''. [http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf "Hierarchical probabilistic neural network language model"] in Proceedings of the International Workshop on Artificial Intelligence and Statistics, (2005).<br />
</ref>. Hierarchical softmax evaluates only about <math>\log_2(W)</math> output nodes, instead of all <math>W</math> nodes in the neural network, to obtain the probability distribution.<br />
<br />
The hierarchical softmax uses a binary tree representation of the output layer with the <math>W</math> words as its leaves and, for each node, explicitly represents the relative probabilities of its child nodes. These define a random walk that assigns probabilities to words.<br />
<br />
Let <math>n(w,j)</math> be the <math>j^{th}</math> node on the path from the root to <math>w</math>, and let <math>L(w)</math> be the length of this path, so <math>n(w,1) = root</math> and <math>n(w,L(w)) = w</math>. In addition, for any inner node <math>n</math>, let <math>ch(n)</math> be an arbitrary fixed child of <math>n</math> and let <math>[[x]]</math> be 1 if <math>x</math> is true and -1 otherwise. Then the hierarchical softmax defines <math>p(w_O|w_I )</math> as follows:<br />
<br />
<math><br />
p(w|w_I) = \prod_{j=1}^{L(w)-1} \sigma\left([[n(w,j+1)=ch(n(w,j))]]\,{v'_{n(w,j)}}^T v_{w_I}\right) <br />
</math><br />
<br />
where<br />
<br />
<math><br />
\sigma (x)=\frac{1}{1+\exp(-x)}<br />
</math><br />
<br />
In this paper, a binary Huffman tree is used as the structure for the hierarchical softmax because it assigns short codes to the frequent words which results in fast training. It has been observed before that grouping words together by their frequency works well as a very simple speedup technique for the neural network based language models <ref name=MiT></ref><ref name=MiT2><br />
Mikolov, Tomas, ''et al'' [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5947611"Extensions of recurrent neural network language model."] in Acoustics, Speech and Signal Processing (ICASSP), (2011).<br />
</ref>.<br />
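The product of sigmoids along the root-to-leaf path can be sketched as follows. The tree encoding (node indices and ±1 signs per step) and all names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(path_nodes, path_signs, inner_vecs, v_in):
    """p(w | w_I) as a product of sigmoids along the root-to-leaf path.

    path_nodes : indices of the L(w)-1 inner nodes n(w,1), ..., n(w,L(w)-1)
    path_signs : +1 where n(w,j+1) = ch(n(w,j)), else -1 (the [[.]] term)
    inner_vecs : (num_inner_nodes, d) matrix of inner-node vectors v'_n
    v_in       : (d,) input vector v_{w_I}
    """
    p = 1.0
    for n, s in zip(path_nodes, path_signs):
        p *= sigmoid(s * (inner_vecs[n] @ v_in))
    return p
```

Because each inner node splits its probability mass between its two children, the probabilities of all leaves sum to 1, so this is a valid distribution evaluated in <math>O(\log_2 W)</math> per word.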
<br />
<br />
<br />
== Negative Sampling==<br />
<br />
Noise Contrastive Estimation (NCE) is an alternative to the hierarchical softmax. NCE posits that a good model should be able to differentiate data from noise by means of logistic regression. While NCE can be shown to approximately maximize the log probability of the softmax, the Skip-gram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality. Negative sampling (NEG) is defined by the objective:<br />
<br />
<math><br />
\log \sigma\left({v'_{w_O}}^T v_{w_I}\right)+\sum_{i=1}^{k} \mathbb{E}_{w_i\sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^T v_{w_I}\right)\right]<br />
</math><br />
<br />
The main difference between the Negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples. And while NCE approximately maximizes the log probability of the softmax, this property is not important for our application.<br /><br />
Both NCE and NEG have the noise distribution <math>P_n(w)</math> as a free parameter. The authors investigated a number of choices for <math>P_n(w)</math> and found that the unigram distribution <math>U(w)</math> raised to the 3/4 power (i.e., <math>U(w)^{3/4}/Z</math>) significantly outperformed both the unigram and the uniform distributions, for both NCE and NEG, on every task they tried, including language modeling.<br />
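A minimal sketch of the NEG objective for one training pair, together with the <math>U(w)^{3/4}</math> noise distribution (function names and shapes are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_objective(v_in, v_out_pos, v_out_negs):
    """NEG objective (to be maximized) for one (w_I, w_O) pair.

    v_in       : (d,)   input vector of the center word w_I
    v_out_pos  : (d,)   output vector of the observed context word w_O
    v_out_negs : (k, d) output vectors of k words sampled from P_n(w)
    """
    pos = np.log(sigmoid(v_out_pos @ v_in))
    neg = np.log(sigmoid(-(v_out_negs @ v_in))).sum()
    return pos + neg

def noise_distribution(unigram_counts):
    """P_n(w) proportional to U(w)^{3/4}, the choice found to work best."""
    p = np.asarray(unigram_counts, dtype=float) ** 0.75
    return p / p.sum()
```

Raising the counts to the 3/4 power flattens the distribution: frequent words are still sampled more often, but rare words get a larger share than under the raw unigram distribution.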
<br />
==Subsampling of Frequent Words==<br />
In very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g., “in”, “the”, and “a”). Such words usually provide less information about the surrounding words than rarer words (e.g., "the" provides little information about the next word because it co-occurs with a huge number of words), and the representation of a frequent word is unlikely to change significantly after many iterations. <br />
<br />
To counter the imbalance between the rare and frequent words, a simple subsampling approach is used. Each word <math>w_i</math> in the training set is discarded with probability computed by the formula:<br />
<br />
<math><br />
P(w_i)=1-\sqrt{\frac{t}{f(w_i)}}<br />
</math><br />
<br />
where <math>f(w_i)</math> is the frequency of word <math>w_i</math> and <math>t</math> is a chosen threshold, typically around <math>10^{-5}</math>. This formula was chosen heuristically: it aggressively subsamples words whose frequency is greater than ''t'' while preserving the ranking of the frequencies, and it works well in practice. It accelerates learning and even significantly improves the accuracy of the learned vectors of the rare words.<br />
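The per-occurrence discard decision can be sketched as follows (the function name is illustrative; note that for words with <math>f(w_i) \leq t</math> the discard probability is non-positive, so they are always kept):

```python
import random

def keep_token(freq, t=1e-5, rng=random.random):
    """Decide whether one occurrence of a word survives subsampling.

    freq : f(w_i), the word's relative frequency in the corpus
    t    : threshold, typically around 1e-5
    """
    # P(discard) = 1 - sqrt(t / f); non-positive for rare words (f <= t)
    discard_prob = 1.0 - (t / freq) ** 0.5
    return rng() >= discard_prob
```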
<br />
= Empirical Results=<br />
<br />
The Hierarchical Softmax (HS), Noise Contrastive Estimation, Negative Sampling, and subsampling of the training words are evaluated with the help of the analogical reasoning task <ref name=MiT></ref>. The task consists of analogies such as “Germany” : “Berlin” :: “France” : ?, which are solved by finding a word ''x'' such that vec(''x'') is closest to vec(“Berlin”) - vec(“Germany”) + vec(“France”) according to the cosine distance. This specific example is considered to have been answered correctly if ''x'' is “Paris”. The task has two broad categories: the syntactic analogies (such as “quick” : “quickly” :: “slow” : “slowly”) and the semantic analogies, such as the country to capital city relationship.<br />
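The analogy-solving procedure can be sketched as a nearest-neighbor search over normalized vectors (the toy embeddings in the test are fabricated for illustration; real word2vec vectors are learned, not hand-set):

```python
import numpy as np

def solve_analogy(a, b, c, embeddings):
    """Answer a : b :: c : ? by finding the word x whose vector has the
    highest cosine similarity to vec(b) - vec(a) + vec(c), excluding the
    three query words themselves."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue
        sim = (vec @ target) / np.linalg.norm(vec)
        if sim > best_sim:
            best, best_sim = word, sim
    return best
```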
<br />
For training the Skip-gram models, a large dataset consisting of various news articles is used (an internal Google dataset with one billion words). All words that occurred less than 5 times in the training data were discarded, which resulted in a vocabulary of size 692K. The performance of various Skip-gram models on the word analogy test set is reported in Table 1. The table shows that Negative Sampling outperforms the Hierarchical Softmax on the analogical reasoning task, and has even slightly better performance than the Noise Contrastive Estimation. The subsampling of the frequent words improves the training speed several times and makes the word representations significantly more accurate.<br />
<br />
<center><br />
[[File:Tb_1.PNG | frame | center |Table 1. Accuracy of various Skip-gram 300-dimensional models on the analogical reasoning task as defined in <ref name=MiT></ref>. NEG-''k'' stands for Negative Sampling with ''k'' negative samples for each positive sample; NCE stands for Noise Contrastive Estimation and HS-Huffman stands for the Hierarchical Softmax with the frequency-based Huffman codes. ]]<br />
</center><br />
<br />
=Learning Phrases=<br />
<br />
Many phrases have a meaning that is not a simple composition of the meanings of their individual words. To learn vector representations for phrases, we first find words that appear frequently together, and infrequently in other contexts. For example, “''New York Times''” and “''Toronto Maple Leafs''” are replaced by unique tokens in the training data, while a bigram “''this is''” will remain unchanged. This way, we can form many reasonable phrases without greatly increasing the size of the vocabulary; in theory, we could train the Skip-gram model using all n-grams, but that would be too memory intensive. A simple data-driven approach, in which phrases are formed based on the unigram and bigram counts, is applied to identify the phrases. In this approach, a ''score'' is calculated as:<br />
<br />
<math><br />
score(w_i,w_j)=\frac{count(w_iw_j)-\delta}{count(w_i)count(w_j)}<br />
</math><br />
<br />
The <math>\delta</math> is a discounting coefficient that prevents too many phrases consisting of very infrequent words from being formed. The bigrams with ''scores'' above the chosen threshold are then used as phrases. The quality of the phrase representations is evaluated using a new analogical reasoning task that involves phrases. Table 2 shows examples of the five categories of analogies used in this task.<br />
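The scoring step can be sketched directly from the formula; the default value of <math>\delta</math> below is illustrative, not taken from the paper:

```python
def phrase_scores(unigram, bigram, delta=5.0):
    """score(w_i, w_j) = (count(w_i w_j) - delta) / (count(w_i) * count(w_j)).

    unigram : dict mapping word -> count
    bigram  : dict mapping (w_i, w_j) -> count of the bigram "w_i w_j"
    delta   : discounting coefficient (the value 5 here is an assumption)
    """
    return {(wi, wj): (c - delta) / (unigram[wi] * unigram[wj])
            for (wi, wj), c in bigram.items()}
```

Bigrams whose score exceeds a threshold are merged into a single token; running this merging pass several times with decreasing thresholds allows phrases longer than two words to form.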
<br />
<center><br />
[[File:Tb_2.PNG | frame | center |Table 2. Examples of the analogical reasoning task for phrases (the full test set has 3218 examples). The goal is to compute the fourth phrase using the first three. Our best model achieved an accuracy of 72% on this dataset.]]<br />
</center><br />
<br />
==Phrase Skip-Gram Results==<br />
<br />
First, the phrase based training corpus is constructed and then Skip-gram models are trained using different hyperparameters. Table 3 shows the results using vector dimensionality 300 and context size 5. This setting already achieves good performance on the phrase dataset, and allowed us to quickly compare the Negative Sampling and the Hierarchical Softmax, both with and without subsampling of the frequent tokens. The results show that while Negative Sampling achieves a respectable accuracy even with ''k = 5'', using ''k = 15'' achieves considerably better performance. Also, the subsampling can result in faster training and can also improve accuracy, at least in some cases.<br />
<br />
<center><br />
[[File:Tb_3.PNG | frame | center |Table 3. Accuracies of the Skip-gram models on the phrase analogy dataset. The models were trained on approximately one billion words from the news dataset.]]<br />
</center><br />
<br />
<br />
The amount of training data was increased to 33 billion words in order to maximize the accuracy on the phrase analogy task. Hierarchical softmax, dimensionality of 1000, and the entire sentence as the context were used. This resulted in a model that reached an accuracy of 72%. Reducing the size of the training dataset to 6 billion words lowered the accuracy to 66%, which suggests that a large amount of training data is crucial. To gain further insight into how different the representations learned by different models are, nearest neighbors of infrequent phrases were inspected manually using various models. Table 4 shows a sample of such a comparison. Consistent with the previous results, it seems that the best representations of phrases are learned by a model with the hierarchical softmax and subsampling.<br />
<br />
<center><br />
[[File:Tb_4.PNG | frame | center |Table 4. Examples of the closest entities to the given short phrases, using two different models.]]<br />
</center><br />
<br />
=Additive Compositionality=<br />
<br />
The word and phrase representations learned by the Skip-gram model exhibit a linear structure that makes it possible to perform precise analogical reasoning using simple vector arithmetic. The Skip-gram representations also exhibit another kind of linear structure that makes it possible to meaningfully combine words by an element-wise addition of their vector representations. This phenomenon is illustrated in Table 5. The additive property of the vectors can be explained by inspecting the training objective. The word vectors are in a linear relationship with the inputs to the softmax nonlinearity. As the word vectors are trained to predict the surrounding words in the sentence, the vectors can be seen as representing the distribution of the contexts in which a word appears. These values are related logarithmically to the probabilities computed by the output layer, so the sum of two word vectors is related to the product of the two context distributions. The product works here as an AND function: contexts that are assigned high probabilities by both word vectors will have high probability, and the other contexts will have low probability.<br />
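The AND-like product of context distributions can be illustrated numerically. The two toy distributions below are fabricated for illustration; they stand in for the context distributions implied by two word vectors:

```python
import numpy as np

# Toy context distributions over three contexts for two words, expressed as
# log-probabilities (the quantity the word vectors are related to).
log_p_word1 = np.log(np.array([0.70, 0.20, 0.10]))   # assumed values
log_p_word2 = np.log(np.array([0.60, 0.10, 0.30]))   # assumed values

# Adding the log-probabilities corresponds to multiplying the distributions,
# which acts like an AND: only contexts probable under BOTH words stay large.
combined = np.exp(log_p_word1 + log_p_word2)
combined /= combined.sum()
```

The first context, favored by both words, dominates the combined distribution even more strongly than it dominates either input.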
<br />
<center><br />
[[File:Tb_5.PNG | frame | center |Table 5. Vector compositionality using element-wise addition. Four closest tokens to the sum of two vectors are shown, using the best Skip-gram model.]]<br />
</center><br />
<br />
=Comparison to Published Word Representations=<br />
<br />
Table 6 shows an empirical comparison between different neural network-based representations of words by showing the nearest neighbors of infrequent words. These examples show that the big Skip-gram model trained on a large corpus visibly outperforms all the other models in the quality of the learned representations. This can be attributed in part to the fact that this model has been trained on about 30 billion words, which is about two to three orders of magnitude more data than the typical size used in the prior work. Interestingly, although the training set is much larger, the training time of the Skip-gram model is just a fraction of the time required by the previous model architectures.<br />
<br />
<center><br />
[[File:Tb_6.PNG | frame | center |Table 6. Examples of the closest tokens given various well-known models and the Skip-gram model trained on phrases using over 30 billion training words. An empty cell means that the word was not in the vocabulary.]]<br />
</center><br />
<br />
=Conclusion=<br />
<br />
This work has the following key contributions:<br />
<br />
1. This work shows how to train distributed representations of words and phrases with the Skip-gram model and demonstrates that these representations exhibit a linear structure that makes precise analogical reasoning possible.<br />
<br />
2. The model architecture is computationally efficient, which makes it possible to successfully train models on several orders of magnitude more data than the previously published models.<br />
<br />
3. It introduces the Negative Sampling algorithm, an extremely simple training method that learns accurate representations, especially for frequent words.<br />
<br />
4. The choice of the training algorithm and of the hyper-parameters is a task-specific decision. It is shown that the most crucial decisions affecting performance are the choice of the model architecture, the size of the vectors, the subsampling rate, and the size of the training window.<br />
<br />
5. The word vectors can be meaningfully combined using just simple vector addition. Another approach for learning representations of phrases presented in this paper is to simply represent each phrase with a single token. Combining these two approaches gives a powerful yet simple way to represent longer pieces of text with minimal computational complexity.<br />
<br />
Le et al<ref><br />
Le Q, Mikolov T. [http://arxiv.org/pdf/1405.4053v2.pdf "Distributed Representations of Sentences and Documents"]. Proceedings of the 31st International Conference on Machine Learning, 2014 </ref> have used the idea of the current paper for learning paragraph vectors. In that later work, paragraph vectors are used for predicting the next word. Every word and every paragraph is mapped to a unique vector, stored as a column of matrix W or matrix D, respectively. The paragraph vector and the word vectors are then concatenated to predict the next word. <br />
<br />
= Recursive Autoencoder =<br />
<br />
This is taken from the paper 'Semi-supervised recursive autoencoders for predicting sentiment distributions'.<ref> Socher, et al. [http://www.socher.org/uploads/Main/SocherPenningtonHuangNgManning_EMNLP2011.pdf "Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions"] </ref><br />
=== Other techniques for sentence representation ===<br />
<br />
The idea of the Recursive Autoencoder is summarized in the figure below. The example illustrates a recursive autoencoder applied to a binary tree.<br />
<center><br />
[[File:Recur-auto.png]]<br />
</center><br />
<br />
Given a list of word vectors <math> x = (x_1, ..., x_m)</math>, the binary tree defines parent-child triplets: <math> (y_1 \rightarrow x_3x_4), (y_2 \rightarrow x_2y_1), (y_3 \rightarrow x_1y_2) </math>.<br />
<br />
The first parent <math> y_1 </math> is computed from the children <math> (c_1, c_2) = (x_3, x_4)</math> as <math> p=f(W^{(1)}[c_1; c_2] + b^{(1)})</math>, where <math>W^{(1)}</math> is a parameter matrix and <math>b^{(1)}</math> is a bias term. <br />
<br />
The autoencoder comes in by reconstructing the children, <math> [c_1'; c_2'] = W^{(2)}p + b^{(2)}</math>. The objective of this method is to minimize the mean squared error between the original children and the reconstructed children.<br />
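One encode-reconstruct step can be sketched as follows. The dimension, the random initialization, and the choice of tanh for <math>f</math> are illustrative assumptions (the original paper also normalizes parent vectors and weights the reconstruction error, which is omitted here for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4   # toy embedding dimension (illustrative, not from the paper)
W1 = rng.normal(scale=0.1, size=(d, 2 * d)); b1 = np.zeros(d)        # encoder
W2 = rng.normal(scale=0.1, size=(2 * d, d)); b2 = np.zeros(2 * d)    # decoder

def encode(c1, c2):
    """Parent representation p = f(W1 [c1; c2] + b1), with f = tanh."""
    return np.tanh(W1 @ np.concatenate([c1, c2]) + b1)

def reconstruction_error(c1, c2):
    """MSE between the children [c1; c2] and their reconstruction [c1'; c2']."""
    children = np.concatenate([c1, c2])
    recon = W2 @ encode(c1, c2) + b2
    return np.mean((children - recon) ** 2)
```

Applied bottom-up along the tree, each parent vector replaces its two children, so the root yields a fixed-size representation of the whole sentence.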
<br />
=Resources=<br />
<br />
The code for training the word and phrase vectors based on this paper is available in the open source project [https://code.google.com/p/word2vec/ word2vec]. This project also contains a set of pre-trained 300-dimensional vectors for 3 million words and phrases.<br />
<br />
=References=<br />
<references /></div>

on using very large target vocabulary for neural machine translation (2015-12-16)
<hr />
<div>==Overview==<br />
<br />
This is a summary of the paper by S. Jean, K. Cho, R Memisevic, and Y. Bengio entitled "On Using Very Large Target Vocabulary for Neural Machine Translation"<br />
<ref>S. Jean, K. Cho, R Memisevic, and Y. Bengio. [http://arxiv.org/pdf/1412.2007v2.pdf "On Using Very Large Target Vocabulary for Neural Machine Translation"], 2015.</ref><br />
The paper presents the application of importance sampling for neural machine translation with a very large target vocabulary. Despite the advantages of neural networks in translation over statistical machine translation systems, such as the phrase-based system, they suffer from some technical problems. Most importantly, they are limited to working with a small vocabulary because of the complexity and the number of parameters that have to be trained. To explain, the output layer of an RNN used for machine translation will have as many units as there are items in the vocabulary. If the vocabulary has hundreds of thousands of terms, then the RNN must compute a very expensive softmax on the output units at each time step when predicting an output sequence. Moreover, the number of parameters in the RNN will also grow very large in such cases, given that the number of weights between the hidden layer and output layer will be equal to the product of the number of units in each layer. For a non-trivially sized hidden layer, a large vocabulary could result in tens of millions of model parameters just associated with the hidden-to-output mapping performed by the model. In practice, researchers who apply RNNs to machine translation have avoided this problem by restricting the model vocabulary to only include some shortlist of words in the target language. Words not in this shortlist are treated as unknown by the model and assigned a special 'UNK' token. This technique understandably impairs translation performance when the target sentence includes a large number of words not present in the vocabulary. <br />
<br />
In this paper, Jean and his colleagues aim to solve this problem by proposing a training method based on importance sampling which uses a large target vocabulary without increasing training complexity. The proposed algorithm demonstrates better performance without losing efficiency in time or speed. The algorithm is tested on two machine translation tasks (English <math>\rightarrow</math> German, and English <math>\rightarrow</math> French), and it surpassed the best performance previously achieved by any single neural machine translation (NMT) system on the English <math>\rightarrow</math> French translation task.<br />
<br />
==Methods==<br />
<br />
Recall that the classic neural machine translation model is an encoder-decoder network. The encoder reads the source sentence x and encodes it into a sequence of hidden states h, where <math>h_t=f(x_t,h_{t-1})</math>. In the decoding step, another neural network generates the translation y based on the encoded sequence of hidden states h: <math>p(y_t\,|\,y_{<t},x)\propto \exp\{q(y_{t-1}, z_t, c_t)\}</math>, where <math>\, z_t=g(y_{t-1}, z_{t-1}, c_t)</math> and <math>\, c_t=r(z_{t-1}, h_1,..., h_T)</math>.<br />
<br />
The objective function to be maximized is the log-likelihood<br />
<math>\theta^*=\arg\max_{\theta}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\log p(y_t^n\,|\,y_{<t}^n, x^n)</math><br />
<br />
where <math>(x^n, y^n)</math> is the n-th training pair of sentence, and <math>T_n</math> is the length of n-th target sentence <math>y^n</math>.<br />
The proposed model is based on specific implementation of neural machine translation that uses an attention mechanism, as recently proposed in <ref><br />
Bahdanau et al.,[http://arxiv.org/pdf/1409.0473v6.pdf NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE], 2014<br />
</ref>.<br />
In this model the encoder is implemented by a bi-directional recurrent neural network, <math>h_t=[h_t^\leftarrow; h_t^\rightarrow]</math>. The decoder, at each time step, computes the context<br />
vector <math>c_t</math> as a convex sum of the hidden states <math>(h_1,...,h_T)</math> with the coefficients <math>(\alpha_1,...,\alpha_T)</math> computed by<br />
<br />
<math>\alpha_t=\frac{\exp\{a(h_t, z_{t-1})\}}{\sum_{k}\exp\{a(h_k, z_{t-1})\}}</math><br />
where a is a feedforward neural network with a single hidden layer and <math>z_{t-1}</math> is the previous decoder state. <br />
Then the probability of the next target word is <br />
<br />
<math>p(y_t\,|\,y_{<t}, x)=\frac{1}{Z} \exp\{w_t^T\phi(y_{t-1}, z_t, c_t)+b_t\}</math>. Here <math>\phi</math> is an affine transformation followed by a nonlinear activation, and <math>w_t</math> and <math>b_t</math> are the target word vector and the target word bias, respectively. Z is the normalization constant computed by<br />
<br />
<br />
<math> Z=\sum_{k:y_k\in V}\exp\left(w_k^T\phi(y_{t-1}, z_t, c_t)+b_k\right)</math> where V is the set of all target words. <br />
<br />
<br />
The dot product between the feature vector <math>\phi(y_{t-1}, z_t, c_t)</math> and <math>w_k</math> has to be computed for all words in the target vocabulary, which is computationally complex and time-consuming. Furthermore, the memory requirements grow linearly with the number of target words. This has been a major hurdle for neural machine translation. Recent approaches use a shortlist of the 30,000 to 80,000 most frequent words. This makes training feasible but has problems of its own; for example, the model degrades heavily if the translation of the source sentence requires many words that are not included in the shortlist. The approach of this paper uses only a subset of sampled target words to normalize the training objective, instead of all the target words. The most naive way to select a subset of target words would be to keep the K most frequent words; however, skipping the remaining words during training conflicts with the goal of using a large vocabulary, because it effectively removes words from the target dictionary. Instead, Jean et al. propose using an existing word alignment model to align the source and target words in the training corpus and build a dictionary. With the dictionary, for each source sentence, a target word set is constructed consisting of the K most frequent words (according to the estimated unigram probability) and, using the dictionary, at most <math>K'</math> likely target words for each source word. K and <math>K'</math> may be chosen either to meet the computational requirements or to maximize the translation performance on the development set. <br />
In order to avoid the growing complexity of computing the normalization constant, the authors propose to use only a small subset <math>V'</math> of the target vocabulary at each update<ref><br />
Bengio and Senécal, [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4443871.pdf "Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model"], IEEE Transactions on Neural Networks, 2008<br />
</ref>. <br />
Let us consider the gradient of the log probability of the output <math>y_t</math>. The gradient is composed of a positive and a negative part:<br />
<br />
<br />
<math>\nabla \log p(y_t\,|\,y_{<t}, x)=\nabla \mathcal{E}(y_t)-\sum_{k:y_k\in V} p(y_k\,|\,y_{<t}, x)\, \nabla \mathcal{E}(y_k) </math><br />
where the energy <math>\mathcal{E}</math> is defined as <math>\mathcal{E}(y_j)=w_j^T\phi(y_{j-1}, z_j, c_j)+b_j</math>. The second term of the gradient is in essence the expected gradient of the energy, <math>\mathbb{E}_P[\nabla \mathcal{E}(y)]</math>, where P denotes <math>p(y\,|\,y_{<t}, x)</math>. <br />
The idea of the proposed approach is to approximate this expectation of the gradient by importance sampling with a small number of samples. Given a predefined proposal distribution Q and a set <math>V'</math> of samples from Q, the expectation is approximated by <br />
<br />
<math>\mathbb{E}_P[\nabla \mathcal{E}(y)] \approx \sum_{k:y_k\in V'} \frac{w_k}{\sum_{k':y_{k'}\in V'}w_{k'}}\,\nabla \mathcal{E}(y_k)</math> where <math>\,w_k=\exp\{\mathcal{E}(y_k)-\log Q(y_k)\}</math><br />
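The self-normalized importance-sampling estimator can be sketched numerically. For brevity the sketch averages the scalar energy rather than its gradient; the same weights <math>w_k</math> apply when the averaged quantity is the gradient. All names are illustrative:

```python
import numpy as np

def softmax(e):
    z = np.exp(e - e.max())
    return z / z.sum()

def is_expected_energy(sample_energies, sample_log_q):
    """Self-normalized importance-sampling estimate of E_P[E(y)].

    sample_energies : E(y_k) for the words sampled from Q
    sample_log_q    : log Q(y_k) for the same words
    """
    w = np.exp(sample_energies - sample_log_q)   # w_k = exp(E(y_k) - log Q(y_k))
    w = w / w.sum()                              # normalize over the sample set V'
    return (w * sample_energies).sum()
```

As a sanity check: if the sample set covers the whole vocabulary once each under a uniform Q, the normalized weights reduce exactly to the softmax probabilities, so the estimate matches the exact expectation.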
<br />
In practice, the training corpus is partitioned and a subset <math>V'</math> of the target vocabulary is defined for each partition prior to training. Before training begins, each target sentence in the training corpus is examined sequentially, and unique target<br />
words are accumulated until their number reaches the predefined threshold τ. The accumulated vocabulary is then used for this partition of the corpus during training. This process is repeated until the end of the training set is reached. <br />
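This greedy partitioning can be sketched as follows; the exact bookkeeping (when a partition is closed, whether a sentence may straddle the threshold) is an interpretation of the description above, not the authors' code:

```python
def partition_corpus(sentences, tau):
    """Greedily split target sentences into partitions whose number of
    unique target words stays at or below the threshold tau.

    sentences : list of target sentences, each a list of word tokens
    tau       : maximum number of unique target words per partition
    """
    partitions = []
    current, vocab = [], set()
    for sent in sentences:
        new_words = set(sent) - vocab
        # close the current partition if this sentence would push it past tau
        if current and len(vocab) + len(new_words) > tau:
            partitions.append((current, sorted(vocab)))
            current, vocab = [], set()
        current.append(sent)
        vocab |= set(sent)
    if current:
        partitions.append((current, sorted(vocab)))
    return partitions
```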
<br />
In this approach, the alignments between the target words and source locations are obtained via the alignment model. This is useful when the model generates an UNK token: once a translation is generated for a source sentence, each UNK may be replaced using a translation-specific technique based on the aligned source word. In the experiments, the authors replaced each ''UNK'' token with the aligned source word or its most likely translation determined by another word alignment model.<br />
The proposed approach was evaluated on English→French and English→German translation. The neural machine translation model was trained on the bilingual, parallel corpora made available as part of WMT'14. The data sets used for English→French were Europarl v7, Common Crawl, UN, News Commentary, and Gigaword. The data sets for English→German were Europarl v7, Common Crawl, and News Commentary. <br />
<br />
The models were evaluated on the WMT'14 test set (news-test-2014), while the concatenation of news-test-2012 and news-test-2013 was used for model selection (development set). Table 1 presents data coverage w.r.t. the vocabulary size, on the target side.<br />
<br />
==Setting==<br />
<br />
As a baseline for English→French translation, the authors used the RNNsearch model proposed by (Bahdanau et al., 2014), with 30,000 source and target words; another RNNsearch was trained for English→German translation with 50,000 source and target words. Using the proposed approach, a further set of RNNsearch models with much larger vocabularies of 500,000 source and target words was trained for each language pair. Different shortlist sizes were used during training: 15,000 and 30,000 for English→French, and 15,000 and 50,000 for English→German. Performance on the development set was evaluated and reported every twelve hours. For both language pairs, new models were trained with shortlist sizes of 15,000 and 50,000 by reshuffling the dataset at the beginning of each epoch. While this causes a non-negligible amount of overhead, such a change allows words to be contrasted with different sets of other words each epoch. Beam search was used to generate a translation given a source sentence. The authors keep a set of 12 hypotheses and normalize probabilities by the length of the candidate sentences, which was chosen to maximize the performance on the development set, for K ∈ {15k, 30k, 50k} and K′ ∈ {10, 20}. They test using a bilingual dictionary to accelerate decoding and to replace unknown words in translations.<br />
<br />
==Results==<br />
<br />
The results for English→French translation obtained by the models trained with very large target vocabularies, compared with the results of previous models, are reported in the table below.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Method<br />
! RNNsearch<br />
! RNNsearch-LV<br />
! Google<br />
! Phrase-based SMT (Cho et al.)<br />
! Phrase-based SMT (Durrani et al.)<br />
|-<br />
| Basic NMT<br />
| 29.97 (26.58)<br />
| 32.68 (28.76)<br />
| 30.6<br />
| 33.3<br />
| 37.03<br />
|-<br />
| + Candidate List / + UNK Replace<br />
| 33.08 (29.08)<br />
| 33.36 (29.32) / 34.11 (29.98)<br />
| - / 33.1<br />
| 33.3<br />
| 37.03<br />
|-<br />
| + Reshuffle (τ=50)<br />
| -<br />
| 34.6 (30.53)<br />
| -<br />
| 33.3<br />
| 37.03<br />
|-<br />
| + Ensemble<br />
| -<br />
| 37.19 (31.98)<br />
| 37.5<br />
| 33.3<br />
| 37.03<br />
|}<br />
<br />
<br />
The results for English→German translation are shown in the table below.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Method<br />
! RNNsearch<br />
! RNNsearch-LV<br />
! Phrase-based SMT<br />
|-<br />
| Basic NMT<br />
| 16.46 (17.13)<br />
| 16.95 (17.85)<br />
| 20.67<br />
|-<br />
| + Candidate List / + UNK Replace<br />
| 18.97 (19.16)<br />
| 17.46 (18.00) / 18.89 (19.03)<br />
| 20.67<br />
|-<br />
| + Reshuffle (τ=50)<br />
| -<br />
| 19.4<br />
| 20.67<br />
|-<br />
| + Ensemble<br />
| -<br />
| 21.59<br />
| 20.67<br />
|}<br />
<br />
It is clear that RNNsearch-LV outperforms the baseline RNNsearch. In the case of the English→French task, RNNsearch-LV approached the performance level of the previous best single neural machine translation (NMT) model even without any translation-specific techniques; with these, it outperformed that model. The performance of RNNsearch-LV is also better than that of a standard phrase-based translation system. <br />
For English→German, RNNsearch-LV outperformed the baseline before unknown word replacement, but after doing so, the two systems performed similarly. A higher single-model performance with the large vocabulary is achieved by reshuffling the dataset. In this case, the authors were able to surpass the previously reported best translation result on this task by building an ensemble of 8 models. With τ = 15,000, the RNNsearch-LV performance worsened a little, with best BLEU scores, without reshuffling, of 33.76 and 18.59 for English→French and English→German respectively.<br />
<br />
Timing information for decoding with the different models is presented in the table below. While decoding from RNNsearch-LV with the full target vocabulary is slowest, the speed improves substantially if a candidate list is used for decoding each translation. <br />
{| class="wikitable"<br />
|-<br />
! Method <br />
! CPU i7-4820k<br />
! GPU GTX TITAN black<br />
|-<br />
| RNNsearch<br />
| 0.09 s<br />
| 0.02 s<br />
|-<br />
| RNNsearch-LV <br />
| 0.80 s<br />
| 0.25 s<br />
|-<br />
| RNNsearch-LV<br />
+Candidate list<br />
| 0.12 s<br />
| 0.05 s<br />
|}<br />
<br />
The influence of the target vocabulary when translating the test sentences was evaluated for English→French by using the union of a fixed set of 30,000 common words and (at most) K′ likely candidates for each source word. The performance of the system is comparable to the baseline when unknown words are not replaced, but there is not as much improvement when they are.<br />
The authors found that K is inversely correlated with t. <br />
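The candidate-list construction described above (a fixed shortlist of common target words plus at most K′ dictionary candidates per source word) can be sketched as follows; the shortlist, dictionary, and sentence here are hypothetical, not from the paper:<br />

```python
def candidate_vocab(source_words, shortlist, dictionary, k_prime):
    """Union of a fixed shortlist of common target words and up to
    k_prime likely translation candidates for each source word."""
    vocab = set(shortlist)
    for w in source_words:
        # Candidate lists are assumed pre-sorted by translation probability.
        vocab.update(dictionary.get(w, [])[:k_prime])
    return vocab

# Toy example with a made-up bilingual dictionary.
shortlist = ["the", "a", "of"]
dictionary = {"chat": ["cat", "chat", "kitty"], "noir": ["black", "dark"]}
vocab = candidate_vocab(["chat", "noir"], shortlist, dictionary, k_prime=2)
```

Decoding then computes the output softmax only over this much smaller per-sentence vocabulary, which is what makes the large-vocabulary models fast at test time.<br />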
<br />
<br />
==Conclusion==<br />
<br />
Using importance sampling, an approach was proposed for neural machine translation with a very large target vocabulary without any substantial increase in computational complexity. The BLEU scores of the proposed model showed translation performance comparable to the state-of-the-art translation systems on both the English→French task and the English→German task.<br />
On English→French and English→German translation tasks, the neural machine translation models trained using the proposed method performed as well as, or better than, those using only limited sets of target words, even when replacing unknown words.<br />
<br />
<br />
== Bibliography ==<br />
<references /></div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations&diff=27341learning Phrase Representations2015-12-16T21:24:12Z<p>Trttse: /* RNN Encoder–Decoder */</p>
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
In this paper, researchers propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, . . . , y_{T'} | x_1, . . . , x_T )</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes. At each time step t, the hidden state <math>h_{t}</math> of the RNN is updated by:<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
where ''f'' is a non-linear activation function. ''f'' may be as simple as an element-wise logistic sigmoid function and as complex as a long short-term memory (LSTM) unit. After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state <math>h_t</math>. However, as shown in Figure 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by<br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,y_{t-1},\mathbf{c})</math> <br/><br />
<br />
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of model parameters and each <math>(\mathbf{y}_n,\mathbf{x}_n)</math> is an (input sequence, output sequence) pair from the training set. Since the output of the decoder, starting from the input, is differentiable, the model parameters can be estimated by a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, the model can be used in two ways: to generate a target sequence given an input sequence, or to score a given pair of input and output sequences.<br />
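A minimal NumPy sketch of the encoder and one decoder step follows; the dimensions, random weights, and helper names are illustrative assumptions, not the trained model:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 4, 5  # toy hidden size and vocabulary size (not the paper's dimensions)

# Randomly initialized parameters, for illustration only.
W_x = rng.normal(0, 0.1, (H, V))    # input-to-hidden
U_enc = rng.normal(0, 0.1, (H, H))  # encoder recurrence
U_dec = rng.normal(0, 0.1, (H, H))  # decoder recurrence
W_y = rng.normal(0, 0.1, (H, V))    # previous-output-to-hidden
W_c = rng.normal(0, 0.1, (H, H))    # summary-to-hidden
W_o = rng.normal(0, 0.1, (V, H))    # hidden-to-output

def one_hot(i):
    v = np.zeros(V)
    v[i] = 1.0
    return v

def encode(xs):
    """h_t = f(h_{t-1}, x_t); the final hidden state is the summary c."""
    h = np.zeros(H)
    for x in xs:
        h = np.tanh(W_x @ one_hot(x) + U_enc @ h)
    return h

def decoder_step(h_prev, y_prev, c):
    """h_t = f(h_{t-1}, y_{t-1}, c), then a softmax over the next symbol."""
    h = np.tanh(U_dec @ h_prev + W_y @ one_hot(y_prev) + W_c @ c)
    logits = W_o @ h
    p = np.exp(logits - logits.max())
    return h, p / p.sum()

c = encode([1, 2, 3])              # summarize a toy source sequence
h, p = decoder_step(np.zeros(H), 0, c)  # P(y_1 | y_0, c)
```

Scoring a phrase pair then amounts to multiplying (or summing the logs of) the per-symbol probabilities produced by `decoder_step`.<br />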
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
This paper also proposes a new type of hidden unit, inspired by the LSTM but much simpler to compute and implement. Fig. 2 shows a graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h˜. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically, the unit can be written as follows (<math>\sigma</math> is the logistic sigmoid function, <math>[.]_j</math> denotes the j-th element of a vector, and <math>\odot</math> denotes element-wise multiplication):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is later found to be irrelevant, allowing a more compact representation. The update gate, on the other hand, controls how much information from the previous hidden state carries over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref><br />
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks[C]//Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013: 8624-8628.<br />
</ref><br />
<br />
Because each hidden unit has separate gates, it is possible for each hidden unit to learn to capture dependencies over different lengths of time (determined by the frequency at which its reset and update gates are active).<br />
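The gate equations above can be sketched directly in NumPy; the dimensions and random weights below are illustrative assumptions:<br />

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_step(h_prev, x, Wr, Ur, Wz, Uz, W, U):
    """One update of the proposed hidden unit (the gating defined above)."""
    r = sigmoid(Wr @ x + Ur @ h_prev)            # reset gate
    z = sigmoid(Wz @ x + Uz @ h_prev)            # update gate
    h_tilde = np.tanh(W @ x + U @ (r * h_prev))  # candidate state
    return z * h_prev + (1 - z) * h_tilde

rng = np.random.default_rng(1)
H, D = 3, 2  # toy hidden and input sizes
# Wr, Ur, Wz, Uz, W, U: input matrices are (H, D), recurrent ones (H, H).
mats = [rng.normal(size=(H, D)) if i % 2 == 0 else rng.normal(size=(H, H))
        for i in range(6)]
h = gated_step(np.zeros(H), np.ones(D), *mats)
```

With `r` near 0 the candidate state ignores `h_prev`, and with `z` near 1 the old state is copied through unchanged, matching the reset/update behaviour described above.<br />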
<br />
=Scoring Phrase Pairs with RNN Encoder–Decoder =<br />
<br />
In a commonly used statistical machine translation system (SMT), the goal of the system (the decoder, specifically) is to find a translation <math>\mathbf{f}</math> given a source sentence <math>\mathbf{e}</math>, which maximizes<br />
<br />
<math>p(\mathbf{f} | \mathbf{e})\propto p(\mathbf{e} | \mathbf{f})p(\mathbf{f})</math><br />
<br />
where the first term on the right-hand side is called the translation model and the latter the language model (see, e.g., (Koehn, 2005)). In practice, however, most SMT systems model <math>\log p(\mathbf{f} | \mathbf{e})</math> as a log-linear model with additional features and corresponding weights:<br />
<br />
<math>\log p(\mathbf{f} | \mathbf{e})=\sum_{n=1}^Nw_nf_n(\mathbf{f},\mathbf{e})+\log Z(\mathbf{e})</math><br />
<br />
where <math>f_n</math> and <math>w_n</math> are the n-th feature and weight, respectively. <math>Z(\mathbf{e})</math> is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.<br />
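As a toy illustration of the log-linear scoring above, with hypothetical feature values and weights (the RNN Encoder–Decoder score is simply one more feature):<br />

```python
def loglinear_score(features, weights, log_Z):
    """log p(f|e) = sum_n w_n * f_n(f, e) + log Z(e)."""
    return sum(w * f for w, f in zip(weights, features)) + log_Z

# Hypothetical log-feature values: phrase translation score, language-model
# score, and the RNN Encoder-Decoder phrase score used as an extra feature.
features = [-2.0, -1.5, -0.7]
weights = [1.0, 0.8, 0.5]
score = loglinear_score(features, weights, log_Z=0.0)  # -3.55
```

The weights would then be tuned, e.g. to maximize BLEU on a development set, as described above.<br />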
<br />
Cho et al. propose to train the RNN Encoder–Decoder on a table of phrase pairs and use its scores as additional features in the log-linear model shown above when tuning the SMT decoder. For training the RNN Encoder–Decoder, phrase frequency is ignored for several reasons: to reduce computation time, to ensure the model does not simply rank phrases by frequency, and because frequency information is already encoded in the existing SMT features (so it is better not to use the capacity of the RNN Encoder–Decoder redundantly).<br />
<br />
=Alternative Models=<br />
The researchers noted a number of other potential translation models and their usability.<br />
<br />
The first model, by Schwenk, is an application of a variant of the continuous space language model to the task of machine translation. The model is essentially a feedforward neural network with a common projection for input words encoded as bag-of-words vectors. Schwenk fixed the input and output sentence lengths; for a fixed length, the neural network estimates the probability of the output sequence of words and scores potential translations. A major disadvantage, however, is that the input and output lengths are fixed, so the model cannot handle variable-length inputs or outputs.<br />
<br />
The model architecture is shown in the figure below<ref><br />
[Schwenk2012] Holger Schwenk. 2012. Continuous space translation models for phrase-based statistical machine translation. In Martin Kay and Christian Boitet, editors, Proceedings of the 24th International Conference on Computational Linguistics (COLING), pages 1071–1080.<br />
</ref>:<br />
<br />
[[File:CONTINUOUS.PNG]]<br />
<br />
Another model, similar to Schwenk's, is by Devlin and also uses a feedforward neural network. Rather than estimating the probability of the entire output sequence of words as in Schwenk's model, Devlin estimates only the probability of the next word, using both a portion of the input sentence and a portion of the output sentence. It reported impressive improvements but, like Schwenk's model, fixes the length of the input prior to training.<br />
<br />
Chandar et al. trained a feedforward neural network to learn a mapping from a bag-of-words representation of an input phrase to an output phrase.<ref><br />
Lauly, Stanislas, et al. "An autoencoder approach to learning bilingual word representations." Advances in Neural Information Processing Systems. 2014.<br />
</ref> This is closely related to both the proposed RNN Encoder–Decoder and the model<br />
proposed by Schwenk, except that their input representation of a phrase is a bag-of-words. A similar approach of using bag-of-words representations was proposed by Gao<ref><br />
Gao, Jianfeng, et al. "Learning semantic representations for the phrase translation model." arXiv preprint arXiv:1312.0482 (2013).<br />
</ref> as well. One important difference between the proposed RNN Encoder–Decoder and the above approaches is that the order of the words in source and target phrases is taken into account. The RNN Encoder–Decoder naturally distinguishes between sequences that have the same words but in a different order, whereas the aforementioned approaches effectively ignore order information.<br />
<br />
=Experiments =<br />
<br />
The model is evaluated on the English/French translation task of the WMT’14 workshop. In building the model, Cho et al. used a baseline phrase-based SMT system and a neural language model (CSLM)<ref><br />
Schwenk H, Costa-Jussa M R, Fonollosa J A R. Continuous space language models for the IWSLT 2006 task[C]//IWSLT. 2006: 166-173.<br />
</ref><br />
<br />
The following model combinations were tested:<br />
# Baseline configuration<br />
# Baseline + RNN<br />
# Baseline + CSLM + RNN<br />
# Baseline + CSLM + RNN + Word penalty<br />
<br />
The results are shown in Figure 3. The RNN Encoder–Decoder consisted of 1000 hidden units, with rank-100 matrices connecting the input to the hidden units. The "word penalty" penalizes words unknown to the neural network, which is accomplished by using the number of unknown words as a feature in the log-linear model above. <br />
<br />
<center><br />
[[File:encdec3.png |frame | center |Fig 3. BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, which penalizes the number of words unknown to the neural networks. ]]<br />
</center><br />
<br />
The best performance was achieved when both the CSLM and the phrase scores from the RNN Encoder–Decoder were used. This suggests that the contributions of the CSLM and the RNN Encoder–Decoder are not too correlated and that better results can be expected by improving each method independently.<br />
<br />
<br />
== Word and Phrase Representations ==<br />
<br />
As the presented model maps sentences into a continuous space vector and prior continuous space language models have been known to learn semantically meaningful embeddings, one could expect this to happen for the presented model, too. This is indeed the case. When projecting to a 2D space (with Barnes-Hut-SNE), semantically similar words are clearly clustered.<br />
<br />
[[File:Fig4.png]]<br />
<br />
Phrases are also clustered capturing both semantic and syntactic structures.<br />
<br />
[[File:Fig5.png]]<br />
<br />
= References=<br />
<references /></div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=generating_text_with_recurrent_neural_networks&diff=27319generating text with recurrent neural networks2015-12-15T05:10:19Z<p>Trttse: /* The RNN as a Generative Model */</p>
<hr />
<div>= Introduction =<br />
<br />
The goal of this paper is to introduce a new type of recurrent neural network for character-level language modelling that allows the input character at a given timestep to multiplicatively gate the connections that make up the hidden-to-hidden layer weight matrix. The paper also introduces a solution to the problem of vanishing and exploding gradients by applying a technique called Hessian-Free optimization to effectively train a recurrent network that, when unrolled in time, has approximately 500 layers. At the date of publication, this network was arguably the deepest neural network ever trained successfully. <br />
<br />
Strictly speaking, a language model is a probability distribution over sequences of words or characters, and such models are typically used to predict the next character or word in a sequence given some number of preceding characters or words. Recurrent neural networks are naturally applicable to this task, since they make predictions based on a current input and a hidden state whose value is determined by some number of previous inputs. Alternative methods that the authors compare their results to include a hierarchical Bayesian model called a 'sequence memoizer' <ref> Wood, F., C. Archambeau, J. Gasthaus, L. James, and Y.W. The. [http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/airg/readings/2012_02_28_a_stochastic_memoizer.pdf "A Stochastic Memoizer for Sequence Data"] ICML, (2009) </ref> and a mixture of context models referred to as PAQ <ref> Mahoney, M. [https://repository.lib.fit.edu/bitstream/handle/11141/154/cs-2005-16.pdf?sequence=1&isAllowed=y "Adaptive Weighing of Context Models for Lossless Data Compression"], Florida Institute of Technology Technical Report, (2005) </ref>, which actually includes word-level information (rather than strictly character-level information). The multiplicative RNN introduced in this paper improves on the state-of-the-art for solely character-level language modelling, but is somewhat worse than the state-of-the-art for text compression. <br />
<br />
To give a brief review, an ordinary recurrent neural network is parameterized by three weight matrices, <math>\ W_{hi} </math>, <math>\ W_{hh} </math>, and <math>\ W_{oh} </math>, and functions to map a sequence of <math> N </math> input states <math>\ [i_1, ... , i_N] </math> to a sequence of hidden states <math>\ [h_1, ... , h_N] </math> and a sequence of output states <math>\ [o_1, ... , o_N] </math>. The matrix <math>\ W_{hi} </math> parameterizes the mapping from the current input state to the current hidden state, while the matrix <math>\ W_{hh} </math> parameterizes the mapping from the previous hidden state to current hidden state, such that the current hidden state is function of the previous hidden state and the current input state. Finally, the matrix <math>\ W_{oh} </math> parameterizes the mapping from the current hidden state to the current output state. So, at a given timestep <math>\ t </math>, the values of the hidden state and output state are as follows:<br />
<br />
<br />
:<math>\ h_t = tanh(W_{hi}i_t + W_{hh}h_{t-1} + b_h) </math><br />
<br />
<br />
:<math>\ o_t = W_{oh}h_t + b_o </math> <br />
<br />
<br />
where <math>\ b_o</math> and <math>\ b_h</math> are bias vectors. Typically, the output state is converted into a probability distribution over characters or words using the softmax function. The network can then be treated as a generative model of text by sampling from this distribution and providing the sampled output as the input to the network at the next timestep.<br />
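A minimal sketch of this generative loop with toy, randomly initialized weights; the sizes and helper names are illustrative assumptions, not the trained model:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 5, 8  # toy vocabulary and hidden sizes

W_hi = rng.normal(0, 0.1, (H, V))
W_hh = rng.normal(0, 0.1, (H, H))
W_oh = rng.normal(0, 0.1, (V, H))
b_h, b_o = np.zeros(H), np.zeros(V)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def generate(first_char, steps):
    """Sample characters: o_t -> softmax -> sample -> next input."""
    h, c, out = np.zeros(H), first_char, [first_char]
    for _ in range(steps):
        i = np.eye(V)[c]                       # one-hot current input
        h = np.tanh(W_hi @ i + W_hh @ h + b_h)  # hidden update
        p = softmax(W_oh @ h + b_o)             # distribution over chars
        c = int(rng.choice(V, p=p))             # sample next character
        out.append(c)
    return out

seq = generate(0, steps=10)
```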
<br />
Recurrent networks are known to be very difficult to train due to the existence of a highly unstable relationship between a network's parameters and the gradient of its cost function. Intuitively, the surface of the cost function is intermittently punctuated by abrupt changes (giving rise to exploding gradients) and nearly flat plateaus (giving rise to vanishing gradients) that can effectively become poor local minima when a network is trained through gradient descent. Techniques for improving training include the use of Long Short-Term Memory networks <ref> Hochreiter, Sepp, and Jürgen Schmidhuber. [http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf "Long short-term memory."] Neural computation 9.8 (1997): 1735-1780. </ref>, in which memory units are used to selectively preserve information from previous states, and the use of Echo State networks, <ref> Jaeger, H. and H. Haas. [http://www.sciencemag.org/content/304/5667/78.short "Harnassing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication."] Science, 204.5667 (2004): 78-80. </ref> which learn only the output weights on a network with recurrent connections that implement a wide range of time-varying patterns. In this paper, the method of Hessian-free optimization is used instead of these alternatives. <br />
<br />
[[File:RNN.png | frame | centre | A depiction of a recurrent neural network unrolled through three time steps.]]<br />
<br />
= Hessian-Free Optimization = <br />
<br />
While this optimization technique is described elsewhere in Martens (2010) <ref> Martens, J. [http://icml2010.haifa.il.ibm.com/papers/458.pdf "Deep learning via Hessian-free optimization."] ICML, (2010) </ref><br />
, its use is essential to obtaining the successful results reported in this paper. In brief, the technique involves using information about the second derivatives of the cost function to perform more intelligent parameter updates. This information is helpful because in cases where the gradient is changing very slowly along a particular dimension, it is more efficient to take larger steps in the direction of descent along that dimension. Alternatively, if the gradient is changing very rapidly along a particular dimension, then it makes sense to take smaller steps to avoid 'bouncing' off of a steep incline in the cost function and moving to a less desirable location in parameter space. The relevant second-order information is computed using the method of finite differences to avoid computing the Hessian of the cost function. In fact, instead of computing and inverting the Hessian in the update equations, the Gauss-Newton approximation to the Hessian is used, which is a good approximation and much cheaper to compute in practice. <br />
<br />
What is important about this technique is that it provides a solution to the problem of vanishing and exploding gradients during the training of recurrent neural networks. Vanishing gradients are accommodated by descending much more rapidly along the cost function in areas of relatively low curvature (e.g., where the cost function is nearly flat), while exploding gradients are accommodated by descending much more slowly in areas of relatively high curvature (e.g., where there is a steep cliff). The figure below illustrates how Hessian-free optimization improves the training of neural networks in general. <br />
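The curvature information can be obtained from gradients alone via finite differences, avoiding an explicit Hessian. A toy quadratic example (not the paper's code; the cost and its gradient are made up for illustration):<br />

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # Hessian of a toy quadratic cost

def grad(theta):
    """Gradient of the toy cost f(theta) = 0.5 * theta^T A theta."""
    return A @ theta

def hessian_vector_product(grad_fn, theta, v, eps=1e-6):
    """Finite-difference curvature: H v ~ (g(theta + eps*v) - g(theta)) / eps."""
    return (grad_fn(theta + eps * v) - grad_fn(theta)) / eps

theta = np.array([0.5, -1.0])
v = np.array([1.0, 0.0])
Hv = hessian_vector_product(grad, theta, v)  # approximately A @ v
```

Hessian-free optimization needs only such Hessian-vector products (fed to a conjugate-gradient inner loop), never the full Hessian.<br />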
<br />
[[File:HFF.png | frame | centre | On the left is training with naive gradient descent, and on the right is training via the use of 2nd order information about the cost function.]]<br />
<br />
= Multiplicative Recurrent Neural Networks = <br />
<br />
The authors report that using a standard neural network trained via Hessian-free optimization produces only mediocre results. As such, they introduce a new architecture called a multiplicative recurrent neural network (MRNN). The motivating intuition behind this architecture is that the input at a given time step should both additively contribute to the hidden state (through the mapping performed by the input-to-hidden weights) and additionally determine the weights on the recurrent connections to the hidden state. This approach came from viewing an RNN as a model of a tree in which each node is a hidden state vector and each edge is labelled by a character that determines how the parent node gives rise to the child node. In other words, the idea is to define a unique weight matrix <math>\ W_{hh} </math> for each possible input. This design is hypothesized to improve the predictive adequacy of the model because the ''conjunction'' of the input at one time step and the hidden state at the previous time step is important. Capturing this conjunction requires the input to influence the contribution of the previous hidden state to the current hidden state. Otherwise, the previous hidden state and the current input will make entirely independent contributions to the calculation of the current hidden state. Formally, this changes the calculation of the hidden state at a given time step as follows:<br />
<br />
<br />
:<math>\ h_t = tanh(W_{hi}i_t + W^{i_t}_{hh}h_{t-1} + b_h) </math><br />
<br />
<br />
where <math>\ W^{i_t}_{hh} </math> is an input-specific hidden-to-hidden weight matrix. As a first approach to implementing this MRNN, the authors suggest using a tensor of rank 3 to store the hidden-to-hidden weights. The idea is that the tensor stores one weight matrix per possible input; when the input is provided as a one-hot vector, tensor contraction (i.e. a generalization of matrix multiplication) can be used to extract the 'slice' of the tensor that contains the appropriate set of weights. One problem with this approach is that it quickly becomes impractical to store the hidden-to-hidden weights as a tensor if the hidden state has a large number of dimensions. For instance, if a network's hidden layer encodes a vector with 1000 dimensions, then the number of parameters in the tensor that need to be learned will be equal to <math>\ 1000^2 * N </math>, where <math>\ N </math> is the vocabulary size. In short, this method will add many millions of parameters to a model for a non-trivially sized vocabulary. <br />
<br />
To fix this problem, the tensor is factored using a technique described in Taylor & Hinton (2009) <ref>Taylor, G. and G. Hinton. [http://www.cs.toronto.edu/~fritz/absps/fcrbm_icml.pdf "Factored Conditional Restricted Boltzmann Machines for Modeling Motion Style"] ICML (2009) </ref>. The idea is to define three matrices <math>\ W_{fh} </math>, <math>\ W_{fi} </math>, and <math>\ W_{hf} </math> that approximate the use of a tensor in determining the value of <math>\ W^{i_t}_{hh} </math> as follows:<br />
<br />
<br />
:<math>\ W^{i_t}_{hh} = W_{hf} \cdot diag(W_{fi}i_t) \cdot W_{fh} </math><br />
<br />
<br />
Intuitively, this factorization produces two vectors from the current input state and the previous hidden state, takes their element-wise product, and applies a linear transformation to produce the input to the hidden layer at the current timestep. The triangle units in the figure below indicate where the element-wise product occurs, and the connections into and out of these units are parameterized by the matrices <math>\ W_{fh} </math>, <math>\ W_{fi} </math>, and <math>\ W_{hf} </math>. The element-wise multiplication is implemented by diagonalizing the matrix-vector product <math>\ W_{fi}i_t </math>, and if the dimensionality of this matrix-vector product (i.e. the dimensionality of the layer of multiplicative units) is allowed to be arbitrarily large, then this factorization is just as expressive as using a tensor to store the hidden-to-hidden weights. <br />
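The equivalence between the factored matrix and the element-wise gating can be checked numerically with toy dimensions (the sizes and random weights below are illustrative):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
H, F, V = 4, 6, 3  # toy hidden units, factors, and vocabulary size

W_hf = rng.normal(size=(H, F))
W_fi = rng.normal(size=(F, V))
W_fh = rng.normal(size=(F, H))

i_t = np.eye(V)[1]            # one-hot current input
h_prev = rng.normal(size=H)   # previous hidden state

# Input-specific recurrent matrix: W_hh^{i_t} = W_hf diag(W_fi i_t) W_fh
W_hh_it = W_hf @ np.diag(W_fi @ i_t) @ W_fh
via_matrix = W_hh_it @ h_prev

# Equivalent gated computation: element-wise product at the factor layer.
via_gates = W_hf @ ((W_fi @ i_t) * (W_fh @ h_prev))
```

Both paths give the same result, while the factored form needs only <math>\ F(2H + V) </math> parameters rather than <math>\ H^2 V </math>.<br />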
<br />
[[File:MRNN.png | frame | centre | A depiction of a multiplicative recurrent neural network unrolled through three time steps.]]<br />
<br />
In the experiments described below, an MRNN is trained via Hessian-free optimization on sequences of 250 characters. The first 50 characters were used to condition the hidden state, so only 200 predictions are generated per sequence. 1500 hidden units were used, along with 1500 factors (i.e. multiplicative gates, or the triangles in the figure above), yielding an unrolled network of 500 layers if the multiplicative units are treated as forming a layer. Training was performed with a parallelized system consisting of 8 GPUs. A vocabulary of 86 characters was used in all cases.<br />
<br />
= The RNN as a Generative Model =<br />
The goal of the model is to predict the next character given a string of characters. More formally, given a training sequence <math>(x_1,...,x_T)</math>, the RNN uses its output vectors <math>(o_1,...,o_T)</math> to obtain a sequence of predictive distributions <math>P(x_{t+1}|x_{\le t}) = softmax(o_t)</math>.<br />
<br />
= Quantitative Experiments =<br />
<br />
To compare the performance of the MRNN to that of the sequence memoizer and PAQ, three 100 MB datasets were used: a selection of wikipedia articles, a selection of New York Times articles, and a corpus of all available articles published in NIPS and JMLR. The last 10 million characters in each dataset were held out for testing. Additionally, the MRNN was trained on the larger corpora from which the wikipedia text and NYT articles were drawn (i.e. all of wikipedia, and the entire set of NYT articles). <br />
<br />
The models were evaluated by calculating the number of bits per character achieved by each model on the 3 test sets. This metric is essentially a measure of model perplexity, which defines how well a given model predicts the data it is being tested on. If the number of bits per character is high, this means that the model is, on average, highly uncertain about the value of each character in the test set. If the number of bits per character is low, then the model is less uncertain about the value of each character in the test set. One way to think about this quantity is as the average amount of additional information (in bits) needed by the model to exactly identify the value of each character in the test set. So, a lower measure is better, indicating that the model achieves a good representation of the underlying data. (it is sometimes helpful to think of a language model as a compressed representation of a text corpus). <br />
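Concretely, bits per character is the average of <math>-\log_2 p(c)</math> over the test characters; a minimal sketch with made-up per-character probabilities:<br />

```python
import math

def bits_per_character(char_probs):
    """Average -log2 p(c) over the test characters; lower is better."""
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)

# Hypothetical probabilities a model assigns to three test characters.
bpc = bits_per_character([0.5, 0.25, 0.25])  # (1 + 2 + 2) / 3
```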
<br />
As illustrated in the table below, the MRNN achieves a lower number of bits per character than the hierarchical Bayesian model, but a higher number of bits per character than the PAQ model (which, recall, is not a strictly character-level model). The numbers in brackets indicate the bits per character achieved on the training data, and the column labelled 'Full Set' reports the results of training the MRNN on the full wikipedia and NYT corpora. <br />
<br />
[[File:bits.png | frame | centre | Bits per character achieved by each model on each dataset.]]<br />
<br />
These results indicate that the MRNN beat the existing state-of-the-art for pure character-level language modelling at the time of publication. <br />
<br />
= Qualitative Experiments =<br />
<br />
By examining the output of the MRNN, it is possible to see what kinds of linguistic patterns it is able to learn. Most striking is the fact that the model consistently produces correct words from a fairly sophisticated vocabulary. The model is also able to balance parentheses and quotation marks over many time steps, and it occasionally produces plausible non-words such as 'cryptoliation' and 'homosomalist'. The text in the figure below was produced by running the model in generative mode less than 10 times using the phrase 'The meaning of life is' as an initial input, and then selecting the most interesting output sequence. The model was trained on wikipedia to produce the results in the figure below. The character '?' indicates an unknown item, and some of the spacing and punctuation oddities are due to preprocessing and are apparently common in the dataset. <br />
<br />
[[File:text.png | frame | centre | A selection of text generated by an MRNN initialized with the sequence "The meaning of life is...".]]<br />
<br />
Another interesting qualitative demonstration of the model's abilities involves initializing the model with a more complicated sequence and seeing what sort of continuations it produces. In the figure below, a number of sampled continuations of the phrase 'England, Spain, France, Germany' are shown. Generally, the model is able to provide continuations that preserve the list-like structure of the phrase. Moreover, the model is also able to recognize that the list is a list of locations, and typically offers additional locations as its predicted continuation of the sequence. <br />
<br />
[[File:locations.png | frame | centre | Selections of text generated by an MRNN initialized with the sequence "England, Spain, France, Germany".]]<br />
<br />
What is particularly impressive about these results is the fact that the model is learning a distribution over sequences of characters only. From this distribution, a broad range of syntactic and lexical knowledge emerges. It is also worth noting that it is much more efficient to train a model with a small character-level vocabulary than it is to train a model with a word-level vocabulary (which can have tens of thousands of items). As such, the character-level MRNN is able to scale to large datasets quite well.<br />
<br />
Moreover, they find that the MRNN is sensitive to some notations like the initial bracket if such string doesn't occur in the training set. They claim that any method which is based on precise context matches is fundamentally incapable of utilizing long contexts, because the probability that a long context occurs more than once is very small.<br />
<br />
= Discussion =<br />
<br />
One aspect of this work that is worth considering concerns the degree to which the use of input-dependent gating of the information being passed from hidden state to hidden state actually improves the results over and above the use of a standard recurrent neural network. Presumably, the use of hessian free optimization allows one to successfully train such a network, so it would be helpful to see a comparison to the results obtained using an MRNN.MRNNs already learn surprisingly good language models<br />
using only 1500 hidden units, and unlike other approaches such as the sequence memoizer and PAQ, they are easy to extend along various dimensions. Otherwise, it is hard to discern the relative importance of the optimization technique and the network architecture in achieving the good language modelling results reported in this paper.<br />
The MRNN assigns probability to plausible words that do not exist in the training set. This is a good property, that enabled the MRNN to deal with real words that it did not see in the training set. one advantage of this model is that, this model avoids using a huge softmax over all known words by predicting the next word based on a sequence of character predictions, while some word-level language models actually make up binary spellings of words in a way that they can predict them one bit at each time.<br />
<br />
= Bibliography = <br />
<references /></div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=generating_text_with_recurrent_neural_networks&diff=27318generating text with recurrent neural networks2015-12-15T05:09:14Z<p>Trttse: /* The RNN as a Generative Model */</p>
<hr />
<div>= Introduction =<br />
<br />
The goal of this paper is to introduce a new type of recurrent neural network for character-level language modelling that allows the input character at a given timestep to multiplicatively gate the connections that make up the hidden-to-hidden layer weight matrix. The paper also introduces a solution to the problem of vanishing and exploding gradients by applying a technique called Hessian-Free optimization to effectively train a recurrent network that, when unrolled in time, has approximately 500 layers. At the date of publication, this network was arguably the deepest neural network ever trained successfully. <br />
<br />
Strictly speaking, a language model is a probability distribution over sequences of words or characters, and such models are typically used to predict the next character or word in a sequence given some number of preceding characters or words. Recurrent neural networks are naturally applicable to this task, since they make predictions based on a current input and a hidden state whose value is determined by some number of previous inputs. Alternative methods that the authors compare their results to include a hierarchical Bayesian model called a 'sequence memoizer' <ref> Wood, F., C. Archambeau, J. Gasthaus, L. James, and Y.W. Teh. [http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/airg/readings/2012_02_28_a_stochastic_memoizer.pdf "A Stochastic Memoizer for Sequence Data"] ICML, (2009) </ref> and a mixture of context models referred to as PAQ <ref> Mahoney, M. [https://repository.lib.fit.edu/bitstream/handle/11141/154/cs-2005-16.pdf?sequence=1&isAllowed=y "Adaptive Weighing of Context Models for Lossless Data Compression"], Florida Institute of Technology Technical Report, (2005) </ref>, which actually includes word-level information (rather than strictly character-level information). The multiplicative RNN introduced in this paper improves on the state-of-the-art for solely character-level language modelling, but is somewhat worse than the state-of-the-art for text compression. <br />
<br />
To give a brief review, an ordinary recurrent neural network is parameterized by three weight matrices, <math>\ W_{hi} </math>, <math>\ W_{hh} </math>, and <math>\ W_{oh} </math>, and functions to map a sequence of <math> N </math> input states <math>\ [i_1, ... , i_N] </math> to a sequence of hidden states <math>\ [h_1, ... , h_N] </math> and a sequence of output states <math>\ [o_1, ... , o_N] </math>. The matrix <math>\ W_{hi} </math> parameterizes the mapping from the current input state to the current hidden state, while the matrix <math>\ W_{hh} </math> parameterizes the mapping from the previous hidden state to the current hidden state, such that the current hidden state is a function of the previous hidden state and the current input state. Finally, the matrix <math>\ W_{oh} </math> parameterizes the mapping from the current hidden state to the current output state. So, at a given timestep <math>\ t </math>, the values of the hidden state and output state are as follows:<br />
<br />
<br />
:<math>\ h_t = tanh(W_{hi}i_t + W_{hh}h_{t-1} + b_h) </math><br />
<br />
<br />
:<math>\ o_t = W_{oh}h_t + b_o </math> <br />
<br />
<br />
where <math>\ b_o</math> and <math>\ b_h</math> are bias vectors. Typically, the output state is converted into a probability distribution over characters or words using the softmax function. The network can then be treated as a generative model of text by sampling from this distribution and providing the sampled output as the input to the network at the next timestep.<br />
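The two update equations above can be sketched directly in NumPy. This is an illustrative sketch only: the layer sizes, random initialization, and variable names are assumptions, not values from the paper. <br />

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hid = 86, 100          # illustrative sizes (86-character vocabulary)
W_hi = rng.normal(0, 0.1, (n_hid, n_in))
W_hh = rng.normal(0, 0.1, (n_hid, n_hid))
W_oh = rng.normal(0, 0.1, (n_in, n_hid))
b_h = np.zeros(n_hid)
b_o = np.zeros(n_in)

def rnn_step(i_t, h_prev):
    """One timestep of a vanilla RNN: returns (h_t, o_t)."""
    h_t = np.tanh(W_hi @ i_t + W_hh @ h_prev + b_h)
    o_t = W_oh @ h_t + b_o
    return h_t, o_t

def softmax(o):
    e = np.exp(o - o.max())    # subtract the max for numerical stability
    return e / e.sum()

# one-hot input for character index 3, zero initial hidden state
i_t = np.zeros(n_in); i_t[3] = 1.0
h_t, o_t = rnn_step(i_t, np.zeros(n_hid))
p = softmax(o_t)               # distribution over the next character
```

Sampling a character index from <code>p</code> and feeding it back as the next one-hot input is what turns this network into a generative model. <br />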
<br />
Recurrent networks are known to be very difficult to train due to the existence of a highly unstable relationship between a network's parameters and the gradient of its cost function. Intuitively, the surface of the cost function is intermittently punctuated by abrupt changes (giving rise to exploding gradients) and nearly flat plateaus (giving rise to vanishing gradients) that can effectively become poor local minima when a network is trained through gradient descent. Techniques for improving training include the use of Long Short-Term Memory networks <ref> Hochreiter, Sepp, and Jürgen Schmidhuber. [http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf "Long short-term memory."] Neural computation 9.8 (1997): 1735-1780. </ref>, in which memory units are used to selectively preserve information from previous states, and the use of Echo State networks <ref> Jaeger, H. and H. Haas. [http://www.sciencemag.org/content/304/5667/78.short "Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication."] Science 304.5667 (2004): 78-80. </ref>, which learn only the output weights on a network with recurrent connections that implement a wide range of time-varying patterns. In this paper, the method of Hessian-free optimization is used instead of these alternatives. <br />
<br />
[[File:RNN.png | frame | centre | A depiction of a recurrent neural network unrolled through three time steps.]]<br />
<br />
= Hessian-Free Optimization = <br />
<br />
While this optimization technique is described elsewhere in Martens (2010) <ref> Martens, J. [http://icml2010.haifa.il.ibm.com/papers/458.pdf "Deep learning via Hessian-free optimization."] ICML, (2010) </ref><br />
, its use is essential to obtaining the successful results reported in this paper. In brief, the technique uses information about the second derivatives of the cost function to perform more intelligent parameter updates. This information is helpful because in cases where the gradient is changing very slowly along a particular dimension, it is more efficient to take larger steps in the direction of descent along that dimension. Conversely, if the gradient is changing very rapidly along a particular dimension, then it makes sense to take smaller steps to avoid 'bouncing' off of a steep incline in the cost function and moving to a less desirable location in parameter space. The relevant second-order information is computed using the method of finite differences to avoid explicitly forming the Hessian of the cost function. In fact, instead of computing and inverting the Hessian matrix in the update equations, the Gauss-Newton approximation to the Hessian is used, which is quite a good approximation and much cheaper to compute in practice. <br />
<br />
What is important about this technique is that it provides a solution to the problem of vanishing and exploding gradients during the training of recurrent neural networks. Vanishing gradients are accommodated by descending much more rapidly along the cost function in areas where it has relatively low curvature (e.g., when the cost function is nearly flat), while exploding gradients are accommodated by descending much more slowly along the cost function in areas where it has relatively high curvature (e.g., when there is a steep cliff). The figure below illustrates how Hessian-free optimization improves the training of neural networks in general. <br />
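The key primitive behind such curvature-aware updates is a Hessian-vector product, which can be approximated by a finite difference of two gradient evaluations without ever forming the Hessian. The sketch below demonstrates the idea on a small quadratic function where the exact product is known; the function, sizes, and step size are illustrative assumptions, not the paper's implementation. <br />

```python
import numpy as np

def hessian_vector_product(grad_f, theta, v, eps=1e-5):
    """Approximate H(theta) @ v as (grad(theta + eps*v) - grad(theta)) / eps."""
    return (grad_f(theta + eps * v) - grad_f(theta)) / eps

# check on f(x) = 0.5 * x^T A x, whose gradient is A x and Hessian is A
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
grad_f = lambda x: A @ x

theta = np.array([0.5, -1.0])
v = np.array([1.0, 2.0])
hv = hessian_vector_product(grad_f, theta, v)   # close to A @ v
```

In practice, products like this are fed to a conjugate-gradient solver so that the second-order update can be computed without ever inverting a matrix. <br />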
<br />
[[File:HFF.png | frame | centre | On the left is training with naive gradient descent, and on the right is training via the use of 2nd order information about the cost function.]]<br />
<br />
= Multiplicative Recurrent Neural Networks = <br />
<br />
The authors report that using a standard neural network trained via Hessian-free optimization produces only mediocre results. As such, they introduce a new architecture called a multiplicative recurrent neural network (MRNN). The motivating intuition behind this architecture is that the input at a given time step should both additively contribute to the hidden state (through the mapping performed by the input-to-hidden weights) and additionally determine the weights on the recurrent connections to the hidden state. This approach came from viewing an RNN as a model of a tree in which each node is a hidden state vector and each edge is labelled by a character that determines how the parent node gives rise to the child node. In other words, the idea is to define a unique weight matrix <math>\ W_{hh} </math> for each possible input. The reason this design is hypothesized to improve the predictive adequacy of the model is due to the idea that the ''conjunction'' of the input at one time step and the hidden state at the previous time step is important. Capturing this conjunction requires the input to influence the contribution of the previous hidden state to the current hidden state. Otherwise, the previous hidden state and the current input will make entirely independent contributions to the calculation of the current hidden state. Formally, this changes the calculation of the hidden state at a given time step as follows:<br />
<br />
<br />
:<math>\ h_t = tanh(W_{hi}i_t + W^{i_t}_{hh}h_{t-1} + b_h) </math><br />
<br />
<br />
where <math>\ W^{i_t}_{hh} </math> is an input-specific hidden-to-hidden weight matrix. As a first approach to implementing this MRNN, the authors suggest using a tensor of rank 3 to store the hidden-to-hidden weights. The idea is that the tensor stores one weight matrix per possible input; when the input is provided as a one-hot vector, tensor contraction (i.e. a generalization of matrix multiplication) can be used to extract the 'slice' of the tensor that contains the appropriate set of weights. One problem with this approach is that it quickly becomes impractical to store the hidden-to-hidden weights as a tensor when the hidden state has a large number of dimensions. For instance, if a network's hidden layer encodes a vector with 1000 dimensions, then the number of parameters in the tensor that need to be learned will be equal to <math>\ 1000^2 * N </math>, where <math>\ N </math> is the vocabulary size. In short, this method will add many millions of parameters to a model for a non-trivially sized vocabulary. <br />
<br />
To fix this problem, the tensor is factored using a technique described in Taylor & Hinton (2009) <ref>Taylor, G. and G. Hinton. [http://www.cs.toronto.edu/~fritz/absps/fcrbm_icml.pdf "Factored Conditional Restricted Boltzmann Machines for Modeling Motion Style"] ICML (2009) </ref>. The idea is to define three matrices <math>\ W_{fh} </math>, <math>\ W_{fi} </math>, and <math>\ W_{hf} </math> that approximate the use of a tensor in determining the value of <math>\ W^{i_t}_{hh} </math> as follows:<br />
<br />
<br />
:<math>\ W^{i_t}_{hh} = W_{hf} \cdot diag(W_{fi}i_t) \cdot W_{fh} </math><br />
<br />
<br />
Intuitively, this factorization produces two vectors from the current input state and the previous hidden state, takes their element-wise product, and applies a linear transformation to produce the input to the hidden layer at the current timestep. The triangle units in the figure below indicate where the element-wise product occurs, and the connections into and out of these units are parameterized by the matrices <math>\ W_{fh} </math>, <math>\ W_{fi} </math>, and <math>\ W_{hf} </math>. The element-wise multiplication is implemented by diagonalizing the matrix-vector product <math>\ W_{fi}i_t </math>, and if the dimensionality of this matrix-vector product (i.e. the dimensionality of the layer of multiplicative units) is allowed to be arbitrarily large, then this factorization is just as expressive as using a tensor to store the hidden-to-hidden weights. <br />
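A minimal NumPy sketch of the factored update follows. The matrix names mirror the text, while the layer sizes and random initialization are illustrative assumptions, not values from the paper. <br />

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_fac = 86, 100, 120   # vocabulary, hidden, and factor sizes (illustrative)

W_hi = rng.normal(0, 0.1, (n_hid, n_in))
W_fi = rng.normal(0, 0.1, (n_fac, n_in))
W_fh = rng.normal(0, 0.1, (n_fac, n_hid))
W_hf = rng.normal(0, 0.1, (n_hid, n_fac))
b_h = np.zeros(n_hid)

def mrnn_step(i_t, h_prev):
    """Factored multiplicative update: the element-wise product of the
    input-factor and hidden-factor vectors stands in for the per-input
    hidden-to-hidden matrix W_hf @ diag(W_fi i_t) @ W_fh."""
    f = (W_fi @ i_t) * (W_fh @ h_prev)          # gated factor activations
    return np.tanh(W_hi @ i_t + W_hf @ f + b_h)

i_t = np.zeros(n_in); i_t[7] = 1.0              # one-hot character input
h_t = mrnn_step(i_t, np.zeros(n_hid))
```

With these illustrative sizes, an explicit rank-3 tensor would hold 100&sup2; &times; 86 = 860,000 hidden-to-hidden weights, while the three factor matrices hold only 120 &times; 86 + 120 &times; 100 + 100 &times; 120 = 34,320 parameters. <br />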
<br />
[[File:MRNN.png | frame | centre | A depiction of a multiplicative recurrent neural network unrolled through three time steps.]]<br />
<br />
In the experiments described below, an MRNN is trained via Hessian-free optimization on sequences of 250 characters. The first 50 characters are used to condition the hidden state, so only 200 predictions are generated per sequence. 1500 hidden units were used, along with 1500 factors (i.e. multiplicative gates, or the triangles in the figure above), yielding an unrolled network of 500 layers if the multiplicative units are treated as forming a layer. Training was performed with a parallelized system consisting of 8 GPUs. A vocabulary of 86 characters was used in all cases.<br />
<br />
= The RNN as a Generative Model =<br />
The goal of the model is to predict the next character given a string of characters. More formally, given a training sequence <math>(x_1,...,x_T)</math>, the RNN uses its output vectors <math>(o_1,...,o_T)</math> to obtain a sequence of predictive distributions <math>P(x_{t+1}|x_{\le t})</math>, where each distribution is produced by applying the softmax function to the corresponding output vector. New text is then generated by sampling a character from this distribution and feeding it back in as the next input.<br />
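The generative procedure can be sketched as follows, using a generic vanilla-RNN step function as a stand-in for the trained model; all sizes, weights, and names here are illustrative assumptions rather than the paper's implementation. <br />

```python
import numpy as np

rng = np.random.default_rng(0)
n_chars, n_hid = 86, 100                      # illustrative sizes

# stand-in for a trained model: h, logits = step(x_onehot, h)
W_hi = rng.normal(0, 0.1, (n_hid, n_chars))
W_hh = rng.normal(0, 0.1, (n_hid, n_hid))
W_oh = rng.normal(0, 0.1, (n_chars, n_hid))

def step(x, h):
    h = np.tanh(W_hi @ x + W_hh @ h)
    return h, W_oh @ h

def generate(seed_ids, n_steps):
    """Condition on a non-empty seed sequence, then sample n_steps indices."""
    h = np.zeros(n_hid)
    x = np.zeros(n_chars)
    for c in seed_ids:                        # condition on the seed text
        x[:] = 0; x[c] = 1.0
        h, logits = step(x, h)
    out = []
    for _ in range(n_steps):
        p = np.exp(logits - logits.max())
        p /= p.sum()                          # softmax over the next character
        c = int(rng.choice(n_chars, p=p))     # sample P(x_{t+1} | x_{<=t})
        out.append(c)
        x[:] = 0; x[c] = 1.0                  # feed the sample back in
        h, logits = step(x, h)
    return out

sample = generate([1, 2, 3], 20)
```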
<br />
= Quantitative Experiments =<br />
<br />
To compare the performance of the MRNN to that of the sequence memoizer and PAQ, three 100 MB datasets were used: a selection of wikipedia articles, a selection of New York Times articles, and a corpus of all available articles published in NIPS and JMLR. The last 10 million characters in each dataset were held out for testing. Additionally, the MRNN was trained on the larger corpora from which the wikipedia text and NYT articles were drawn (i.e. all of wikipedia, and the entire set of NYT articles). <br />
<br />
The models were evaluated by calculating the number of bits per character achieved by each model on the 3 test sets. This metric is essentially a measure of model perplexity, which reflects how well a given model predicts the data it is being tested on. If the number of bits per character is high, the model is, on average, highly uncertain about the value of each character in the test set; if it is low, the model is less uncertain. One way to think about this quantity is as the average amount of additional information (in bits) needed by the model to exactly identify the value of each character in the test set. So, a lower measure is better, indicating that the model achieves a good representation of the underlying data. (It is sometimes helpful to think of a language model as a compressed representation of a text corpus.) <br />
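Concretely, bits per character is the average negative base-2 log probability that a model assigns to each correct character of the test set. A minimal sketch (the probabilities below are made up for illustration): <br />

```python
import numpy as np

def bits_per_character(probs):
    """probs[t] = probability the model assigned to the character that
    actually occurred at position t of the test set."""
    probs = np.asarray(probs, dtype=float)
    return float(-np.mean(np.log2(probs)))

# a model that assigns probability 0.5 to every correct character
# needs exactly 1 bit per character
bpc = bits_per_character([0.5, 0.5, 0.5, 0.5])
```

This also makes the compression reading direct: a model achieving b bits per character could, with an arithmetic coder, compress the corpus to roughly b bits per character. <br />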
<br />
As illustrated in the table below, the MRNN achieves a lower number of bits per character than the hierarchical Bayesian model, but a higher number of bits per character than the PAQ model (which, recall, is not a strictly character-level model). The numbers in brackets indicate the bits per character achieved on the training data, and the column labelled 'Full Set' reports the results of training the MRNN on the full wikipedia and NYT corpora. <br />
<br />
[[File:bits.png | frame | centre | Bits per character achieved by each model on each dataset.]]<br />
<br />
These results indicate that the MRNN beat the existing state-of-the-art for pure character-level language modelling at the time of publication. <br />
<br />
= Qualitative Experiments =<br />
<br />
By examining the output of the MRNN, it is possible to see what kinds of linguistic patterns it is able to learn. Most striking is the fact that the model consistently produces correct words from a fairly sophisticated vocabulary. The model is also able to balance parentheses and quotation marks over many time steps, and it occasionally produces plausible non-words such as 'cryptoliation' and 'homosomalist'. The text in the figure below was produced by a model trained on wikipedia: the model was run in generative mode fewer than 10 times with the phrase 'The meaning of life is' as the initial input, and the most interesting output sequence was selected. The character '?' indicates an unknown item, and some of the spacing and punctuation oddities are due to preprocessing and are apparently common in the dataset. <br />
<br />
[[File:text.png | frame | centre | A selection of text generated by an MRNN initialized with the sequence "The meaning of life is...".]]<br />
<br />
Another interesting qualitative demonstration of the model's abilities involves initializing the model with a more complicated sequence and seeing what sort of continuations it produces. In the figure below, a number of sampled continuations of the phrase 'England, Spain, France, Germany' are shown. Generally, the model is able to provide continuations that preserve the list-like structure of the phrase. Moreover, the model is also able to recognize that the list is a list of locations, and typically offers additional locations as its predicted continuation of the sequence. <br />
<br />
[[File:locations.png | frame | centre | Selections of text generated by an MRNN initialized with the sequence "England, Spain, France, Germany".]]<br />
<br />
What is particularly impressive about these results is the fact that the model is learning a distribution over sequences of characters only. From this distribution, a broad range of syntactic and lexical knowledge emerges. It is also worth noting that it is much more efficient to train a model with a small character-level vocabulary than it is to train a model with a word-level vocabulary (which can have tens of thousands of items). As such, the character-level MRNN is able to scale to large datasets quite well.<br />
<br />
Moreover, the authors find that the MRNN is sensitive to notational cues such as an opening bracket, even when the exact surrounding string does not occur in the training set. They argue that any method based on precise context matches is fundamentally incapable of exploiting long contexts, because the probability that a long context occurs more than once is very small.<br />
<br />
= Discussion =<br />
<br />
One aspect of this work worth considering is the degree to which input-dependent gating of the information passed from hidden state to hidden state actually improves results over and above a standard recurrent neural network. Presumably, Hessian-free optimization allows one to successfully train such a standard network as well, so it would be helpful to see a direct quantitative comparison against the MRNN. MRNNs already learn surprisingly good language models using only 1500 hidden units, and unlike other approaches such as the sequence memoizer and PAQ, they are easy to extend along various dimensions. Without such a comparison, it is hard to discern the relative importance of the optimization technique and the network architecture in achieving the good language modelling results reported in this paper.<br />
The MRNN assigns probability to plausible words that do not exist in the training set. This is a useful property, as it enables the MRNN to deal with real words that it did not see during training. A further advantage of this model is that it avoids a huge softmax over all known words by predicting the next word through a sequence of character predictions, whereas some word-level language models instead construct binary spellings of words so that they can be predicted one bit at a time.<br />
<br />
= Bibliography = <br />
<references /></div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=generating_text_with_recurrent_neural_networks&diff=27317generating text with recurrent neural networks2015-12-15T05:07:23Z<p>Trttse: /* The RNN as a Generative Model */</p>
<hr />
<div>= Introduction =<br />
<br />
The goal of this paper is to introduce a new type of recurrent neural network for character-level language modelling that allows the input character at a given timestep to multiplicatively gate the connections that make up the hidden-to-hidden layer weight matrix. The paper also introduces a solution to the problem of vanishing and exploding gradients by applying a technique called Hessian-Free optimization to effectively train a recurrent network that, when unrolled in time, has approximately 500 layers. At the date of publication, this network was arguably the deepest neural network ever trained successfully. <br />
<br />
Strictly speaking, a language model is a probability distribution over sequences of words or characters, and such models are typically used to predict the next character or word in a sequence given some number of preceding characters or words. Recurrent neural networks are naturally applicable to this task, since they make predictions based on a current input and a hidden state whose value is determined by some number of previous inputs. Alternative methods that the authors compare their results to include a hierarchical Bayesian model called a 'sequence memoizer' <ref> Wood, F., C. Archambeau, J. Gasthaus, L. James, and Y.W. The. [http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/airg/readings/2012_02_28_a_stochastic_memoizer.pdf "A Stochastic Memoizer for Sequence Data"] ICML, (2009) </ref> and a mixture of context models referred to as PAQ <ref> Mahoney, M. [https://repository.lib.fit.edu/bitstream/handle/11141/154/cs-2005-16.pdf?sequence=1&isAllowed=y "Adaptive Weighing of Context Models for Lossless Data Compression"], Florida Institute of Technology Technical Report, (2005) </ref>, which actually includes word-level information (rather strictly character-level information). The multiplicative RNN introduced in this paper improves on the state-of-the-art for solely character-level language modelling, but is somewhat worse than the state-of-the-art for text compression. <br />
<br />
To give a brief review, an ordinary recurrent neural network is parameterized by three weight matrices, <math>\ W_{hi} </math>, <math>\ W_{hh} </math>, and <math>\ W_{oh} </math>, and functions to map a sequence of <math> N </math> input states <math>\ [i_1, ... , i_N] </math> to a sequence of hidden states <math>\ [h_1, ... , h_N] </math> and a sequence of output states <math>\ [o_1, ... , o_N] </math>. The matrix <math>\ W_{hi} </math> parameterizes the mapping from the current input state to the current hidden state, while the matrix <math>\ W_{hh} </math> parameterizes the mapping from the previous hidden state to current hidden state, such that the current hidden state is function of the previous hidden state and the current input state. Finally, the matrix <math>\ W_{oh} </math> parameterizes the mapping from the current hidden state to the current output state. So, at a given timestep <math>\ t </math>, the values of the hidden state and output state are as follows:<br />
<br />
<br />
:<math>\ h_t = tanh(W_{hi}i_t + W_{hh}h_{t-1} + b_h) </math><br />
<br />
<br />
:<math>\ o_t = W_{oh}h_t + b_o </math> <br />
<br />
<br />
where <math>\ b_o</math> and <math>\ b_h</math> are bias vectors. Typically, the output state is converted into a probability distribution over characters or words using the softmax function. The network can then be treated as a generative model of text by sampling from this distribution and providing the sampled output as the input to the network at the next timestep.<br />
<br />
Recurrent networks are known to be very difficult to train due to the existence a highly unstable relationship between a network's parameters and the gradient of its cost function. Intuitively, the surface of the cost function is intermittently punctuated by abrupt changes (giving rise to exploding gradients) and nearly flat plateaus (giving rise to vanishing gradients) that can effectively become poor local minima when a network is trained through gradient descent. Techniques for improving training include the use of Long Short-Term Memory networks <ref> Hochreiter, Sepp, and Jürgen Schmidhuber. [http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf "Long short-term memory."] Neural computation 9.8 (1997): 1735-1780. </ref>, in which memory units are used to selectively preserve information from previous states, and the use of Echo State networks, <ref> Jaeger, H. and H. Haas. [http://www.sciencemag.org/content/304/5667/78.short "Harnassing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication."] Science, 204.5667 (2004): 78-80. </ref> which learn only the output weights on a network with recurrent connections that implement a wide range of time-varying patterns. In this paper, the method of Hessian free optimization is used instead of these alternatives. <br />
<br />
[[File:RNN.png | frame | centre | A depiction of a recurrent neural network unrolled through three time steps.]]<br />
<br />
= Hessian-Free Optimization = <br />
<br />
While this optimization technique is described elsewhere in Martens (2010) <ref> Martens, J. [http://icml2010.haifa.il.ibm.com/papers/458.pdf "Deep learning via Hessian-free optimization."] ICML, (2010) </ref><br />
, its use is essential to obtaining the successful results reported in this paper. In brief, the technique involves uses information about the 2nd derivatives of the cost function to perform more intelligent parameter updates. This information is helpful because in cases where the gradient is changing very slowly on a particular dimension, it is more efficient to take larger steps in the direction of descent along this dimension. Alternatively, if the the gradient is changing very rapidly on a particular dimension, then it makes sense to take smaller steps to avoid 'bouncing' off of a step incline in the cost function and moving to a less desirable location in parameter space. The relevant 2nd order information is computed using the method of finite differences to avoid computing the Hessian of the cost function.In fact instead of computing and inverting the H matrix when updating equations, the Gauss-Newton approximation is used for the Hessian matrix which is quite good approximation to the Hessian and practically cheaper to compute. <br />
<br />
What is important about this technique is that it provides a solution to problem of vanishing and exploding gradients during the training of recurrent neural networks. Vanishing gradients are accommodated by descending much more rapidly along the cost function in areas where it has relatively low curvature (e.g., when the cost function is nearly flat), while exploding gradients are accommodated by descending much more slowly along the cost function in areas where it has relatively high curvature (e.g., when there is a steep cliff). The figure below illustrates how hessian free optimization improves the training of neural networks in general. <br />
<br />
[[File:HFF.png | frame | centre | On the left is training with naive gradient descent, and on the right is training via the use of 2nd order information about the cost function.]]<br />
<br />
= Multiplicative Recurrent Neural Networks = <br />
<br />
The authors report that using a standard neural network trained via Hessian-free optimization produces only mediocre results. As such, they introduce a new architecture called a multiplicative recurrent neural network (MRNN). The motivating intuition behind this architecture is that the input at a given time step should both additively contribute to the hidden state (though the mapping performed by the input-to-hidden weights) and additionally determine the weights on the recurrent connections to the hidden state. This approach came from viewing an RNN as a model of an tree in which each node is a hidden state vector and each edge is labelled by a character that determines how the parent node gives rise to the child node. In other words, the idea is to define a unique weight matrix <math>\ W_{hh} </math> for each possible input. The reason this design is hypothesized to the improve the predictive adequacy of the model is due to the idea that the ''conjunction'' of the input at one time step and the hidden state at the previous time step is important. Capturing this conjunction requires the input to influence the contribution of the previous hidden state to the current hidden state. Otherwise, the previous hidden state and the current input will make entirely independent contributions to the calculation of the current hidden state. Formally, this changes the calculation of the hidden state at a given time step as follows:<br />
<br />
<br />
:<math>\ h_t = tanh(W_{hi}i_t + W^{i_t}_{hh}h_{t-1} + b_h) </math><br />
<br />
<br />
where <math>\ W^{i_t}_{hh} </math> is an input-specific hidden-to-hidden weight matrix. As a first approach to implementing this MRNN, the authors suggest using a tensor of rank 3 to store the hidden-to-hidden weights. The idea is that the tensor stores one weight matrix per possible input; when the input is provided as a one-hot vector, tensor contraction (i.e. a generalization of matrix multiplication) can be used to extract the 'slice' of the tensor that contains the appropriate set of weights. One problem with this approach is that it quickly becomes impractical to store the hidden-to-hidden weights as a tensor if the dimensionality of the hidden state has a large number of dimensions. For instance, if a network's hidden layer encodes a vector with 1000 dimensions, then the number of parameters in the tensor that need to be learned will be equal to <math>\ 1000^2 * N </math>, where <math>\ N </math> is the vocabulary size. In short, this method will add many millions of parameters to a model for a non-trivially sized vocabulary. <br />
<br />
To fix this problem, the tensor is factored using a technique described in Taylor & Hinton (2009) <ref>Taylor, G. and G. Hinton. [http://www.cs.toronto.edu/~fritz/absps/fcrbm_icml.pdf "Factored Conditional Restricted Boltzmann Machines for Modeling Motion Style"] ICML (2009) </ref>. The idea is to define three matrices <math>\ W_{fh} </math>, <math>\ W_{fi} </math>, and <math>\ W_{hf} </math> that approximate the use of a tensor in determining the value of <math>\ W^{i_t}_{hh} </math> as follows:<br />
<br />
<br />
:<math>\ W^{i_t}_{hh} = W_{hf} \cdot diag(W_{fi}i_t) \cdot W_{fh} </math><br />
<br />
<br />
Intuitively, this factorization produces two vectors from the current input state and the previous hidden state, takes their element-wise product, and applies a linear transformation to produce the input to the hidden layer at the current timestep. The triangle units in the figure below indicate where the element-wise product occurs, and the connections into and out of these units are parameterized by the matrices <math>\ W_{fh} </math>, <math>\ W_{fi} </math>, and <math>\ W_{hf} </math>. The element-wise multiplication is implemented by diagonalizing the matrix-vector product <math>\ W_{fi}i_t </math>, and if the dimensionality of this matrix-vector product (i.e. the dimensionality of the layer of multiplicative units) is allowed to be arbitrarily large, then this factorization is just as expressive as using a tensor to store the hidden-to-hidden weights. <br />
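The factored hidden-state update can be sketched as follows (a toy NumPy illustration with made-up dimensions; <math>\ F </math> denotes the number of factors, i.e. multiplicative units):

```python
import numpy as np

H, N, F = 4, 6, 5  # hidden size, vocab size, number of factors (toy values)
rng = np.random.default_rng(1)
W_hi = rng.normal(size=(H, N)) * 0.1
W_fi = rng.normal(size=(F, N)) * 0.1
W_fh = rng.normal(size=(F, H)) * 0.1
W_hf = rng.normal(size=(H, F)) * 0.1
b_h = np.zeros(H)

def mrnn_step(i_t, h_prev):
    # Element-wise product of an input-dependent vector and a
    # hidden-state-dependent vector, then a linear map back to the
    # hidden layer: equivalent to W_hf . diag(W_fi i_t) . W_fh h_{t-1}.
    factor = (W_fi @ i_t) * (W_fh @ h_prev)
    return np.tanh(W_hi @ i_t + W_hf @ factor + b_h)

i_t = np.zeros(N)
i_t[3] = 1.0
h_prev = rng.normal(size=H)
h = mrnn_step(i_t, h_prev)

# Same result as materializing the input-specific weight matrix explicitly.
W_it = W_hf @ np.diag(W_fi @ i_t) @ W_fh
assert np.allclose(h, np.tanh(W_hi @ i_t + W_it @ h_prev + b_h))
```

Note that the factored form never builds the <math>\ H \times H </math> matrix; it only stores the three factor matrices, whose size is linear rather than quadratic in the hidden dimensionality.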
<br />
[[File:MRNN.png | frame | centre | A depiction of a multiplicative recurrent neural network unrolled through three time steps.]]<br />
<br />
In the experiments described below, an MRNN is trained via Hessian-free optimization on sequences of 250 characters. The first 50 characters are used to condition the hidden state, so only 200 predictions are generated per sequence. 1500 hidden units were used, along with 1500 factors (i.e. multiplicative gates, or the triangles in the figure above), yielding an unrolled network of 500 layers if the multiplicative units are treated as forming a layer. Training was performed with a parallelized system consisting of 8 GPUs. A vocabulary of 86 characters was used in all cases.<br />
<br />
= The RNN as a Generative Model =<br />
The goal of the model is to predict the next character given a string of characters. More formally, given a training sequence <math>(x_1,...,x_T)</math>, the RNN uses its output vectors <math>(o_1,...,o_T)</math> to obtain a sequence of predictive distributions <math>P(x_{t+1}|x_{\le t})</math>, each computed by applying the softmax function to the corresponding output vector <math>o_t</math>.<br />
<br />
= Quantitative Experiments =<br />
<br />
To compare the performance of the MRNN to that of the sequence memoizer and PAQ, three 100 MB datasets were used: a selection of Wikipedia articles, a selection of New York Times articles, and a corpus of all available articles published in NIPS and JMLR. The last 10 million characters in each dataset were held out for testing. Additionally, the MRNN was trained on the larger corpora from which the Wikipedia text and NYT articles were drawn (i.e. all of Wikipedia, and the entire set of NYT articles). <br />
<br />
The models were evaluated by calculating the number of bits per character achieved by each model on the 3 test sets. This metric is essentially a measure of model perplexity, which reflects how well a given model predicts the data it is being tested on. If the number of bits per character is high, the model is, on average, highly uncertain about the value of each character in the test set; if it is low, the model is less uncertain. One way to think about this quantity is as the average amount of additional information (in bits) needed by the model to exactly identify the value of each character in the test set. So, a lower measure is better, indicating that the model achieves a good representation of the underlying data (it is sometimes helpful to think of a language model as a compressed representation of a text corpus). <br />
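Bits per character is simply the average negative log (base 2) probability that the model assigns to the characters that actually occur in the test set. A minimal sketch, where the hypothetical `probs` array stands in for those model-assigned probabilities:

```python
import numpy as np

def bits_per_character(probs):
    # Average -log2 p over the probabilities the model assigned to the
    # characters that actually occurred in the test set.
    return float(-np.mean(np.log2(probs)))

# A perfect model (probability 1 on every character) needs 0 extra bits;
# a uniform guess over an 86-character vocabulary needs log2(86) ~ 6.43 bits.
print(bits_per_character(np.ones(100)))          # 0.0
print(bits_per_character(np.full(100, 1 / 86)))  # ~6.43
```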
<br />
As illustrated in the table below, the MRNN achieves a lower number of bits per character than the hierarchical Bayesian model, but a higher number of bits per character than the PAQ model (which, recall, is not strictly a character-level model). The numbers in brackets indicate the bits per character achieved on the training data, and the column labelled 'Full Set' reports the results of training the MRNN on the full Wikipedia and NYT corpora. <br />
<br />
[[File:bits.png | frame | centre | Bits per character achieved by each model on each dataset.]]<br />
<br />
These results indicate that the MRNN beat the existing state-of-the-art for pure character-level language modelling at the time of publication. <br />
<br />
= Qualitative Experiments =<br />
<br />
By examining the output of the MRNN, it is possible to see what kinds of linguistic patterns it is able to learn. Most striking is the fact that the model consistently produces correct words from a fairly sophisticated vocabulary. The model is also able to balance parentheses and quotation marks over many time steps, and it occasionally produces plausible non-words such as 'cryptoliation' and 'homosomalist'. The text in the figure below was produced by a model trained on Wikipedia: the model was run in generative mode fewer than 10 times using the phrase 'The meaning of life is' as an initial input, and the most interesting output sequence was then selected. The character '?' indicates an unknown item, and some of the spacing and punctuation oddities are due to preprocessing and are apparently common in the dataset. <br />
<br />
[[File:text.png | frame | centre | A selection of text generated by an MRNN initialized with the sequence "The meaning of life is...".]]<br />
<br />
Another interesting qualitative demonstration of the model's abilities involves initializing the model with a more complicated sequence and seeing what sort of continuations it produces. In the figure below, a number of sampled continuations of the phrase 'England, Spain, France, Germany' are shown. Generally, the model is able to provide continuations that preserve the list-like structure of the phrase. Moreover, the model is also able to recognize that the list is a list of locations, and typically offers additional locations as its predicted continuation of the sequence. <br />
<br />
[[File:locations.png | frame | centre | Selections of text generated by an MRNN initialized with the sequence "England, Spain, France, Germany".]]<br />
<br />
What is particularly impressive about these results is the fact that the model is learning a distribution over sequences of characters only. From this distribution, a broad range of syntactic and lexical knowledge emerges. It is also worth noting that it is much more efficient to train a model with a small character-level vocabulary than it is to train a model with a word-level vocabulary (which can have tens of thousands of items). As such, the character-level MRNN is able to scale to large datasets quite well.<br />
<br />
Moreover, the authors find that the MRNN is sensitive to cues such as an opening bracket even when the exact string containing it does not occur in the training set. They claim that any method based on precise context matches is fundamentally incapable of utilizing long contexts, because the probability that a long context occurs more than once is very small.<br />
<br />
= Discussion =<br />
<br />
One aspect of this work that is worth considering is the degree to which the input-dependent gating of the information passed from hidden state to hidden state actually improves results over and above a standard recurrent neural network. Presumably, Hessian-free optimization also allows one to successfully train such a standard network, so it would be helpful to see a direct comparison with the results obtained using an MRNN; otherwise, it is hard to discern the relative importance of the optimization technique and the network architecture in achieving the good language modelling results reported in this paper. That said, MRNNs already learn surprisingly good language models using only 1500 hidden units, and unlike other approaches such as the sequence memoizer and PAQ, they are easy to extend along various dimensions.<br />
The MRNN also assigns probability to plausible words that do not exist in the training set. This is a useful property, as it enables the MRNN to deal with real words that it did not see during training. Another advantage of the model is that it avoids using a huge softmax over all known words by predicting the next word through a sequence of character predictions, whereas some word-level language models instead make up binary spellings of words so that they can be predicted one bit at a time.<br />
<br />
= Bibliography = <br />
<references /></div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=generating_text_with_recurrent_neural_networks&diff=27315generating text with recurrent neural networks2015-12-15T05:02:19Z<p>Trttse: </p>
<hr />
<div>= Introduction =<br />
<br />
The goal of this paper is to introduce a new type of recurrent neural network for character-level language modelling that allows the input character at a given timestep to multiplicatively gate the connections that make up the hidden-to-hidden layer weight matrix. The paper also introduces a solution to the problem of vanishing and exploding gradients by applying a technique called Hessian-Free optimization to effectively train a recurrent network that, when unrolled in time, has approximately 500 layers. At the date of publication, this network was arguably the deepest neural network ever trained successfully. <br />
<br />
Strictly speaking, a language model is a probability distribution over sequences of words or characters, and such models are typically used to predict the next character or word in a sequence given some number of preceding characters or words. Recurrent neural networks are naturally applicable to this task, since they make predictions based on a current input and a hidden state whose value is determined by some number of previous inputs. Alternative methods that the authors compare their results to include a hierarchical Bayesian model called a 'sequence memoizer' <ref> Wood, F., C. Archambeau, J. Gasthaus, L. James, and Y.W. Teh. [http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/airg/readings/2012_02_28_a_stochastic_memoizer.pdf "A Stochastic Memoizer for Sequence Data"] ICML, (2009) </ref> and a mixture of context models referred to as PAQ <ref> Mahoney, M. [https://repository.lib.fit.edu/bitstream/handle/11141/154/cs-2005-16.pdf?sequence=1&isAllowed=y "Adaptive Weighing of Context Models for Lossless Data Compression"], Florida Institute of Technology Technical Report, (2005) </ref>, which actually includes word-level information (rather than strictly character-level information). The multiplicative RNN introduced in this paper improves on the state-of-the-art for solely character-level language modelling, but is somewhat worse than the state-of-the-art for text compression. <br />
<br />
To give a brief review, an ordinary recurrent neural network is parameterized by three weight matrices, <math>\ W_{hi} </math>, <math>\ W_{hh} </math>, and <math>\ W_{oh} </math>, and functions to map a sequence of <math> N </math> input states <math>\ [i_1, ... , i_N] </math> to a sequence of hidden states <math>\ [h_1, ... , h_N] </math> and a sequence of output states <math>\ [o_1, ... , o_N] </math>. The matrix <math>\ W_{hi} </math> parameterizes the mapping from the current input state to the current hidden state, while the matrix <math>\ W_{hh} </math> parameterizes the mapping from the previous hidden state to the current hidden state, such that the current hidden state is a function of the previous hidden state and the current input state. Finally, the matrix <math>\ W_{oh} </math> parameterizes the mapping from the current hidden state to the current output state. So, at a given timestep <math>\ t </math>, the values of the hidden state and output state are as follows:<br />
<br />
<br />
:<math>\ h_t = tanh(W_{hi}i_t + W_{hh}h_{t-1} + b_h) </math><br />
<br />
<br />
:<math>\ o_t = W_{oh}h_t + b_o </math> <br />
<br />
<br />
where <math>\ b_o</math> and <math>\ b_h</math> are bias vectors. Typically, the output state is converted into a probability distribution over characters or words using the softmax function. The network can then be treated as a generative model of text by sampling from this distribution and providing the sampled output as the input to the network at the next timestep.<br />
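This generative loop can be sketched as follows (toy dimensions and randomly initialized, untrained weights, purely for illustration of the sampling mechanics):

```python
import numpy as np

H, N = 8, 26  # hidden size and character-vocabulary size (toy values)
rng = np.random.default_rng(2)
W_hi = rng.normal(size=(H, N)) * 0.1
W_hh = rng.normal(size=(H, H)) * 0.1
W_oh = rng.normal(size=(N, H)) * 0.1
b_h, b_o = np.zeros(H), np.zeros(N)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def generate(seed_char, steps):
    # Sample a character sequence by feeding each sampled output
    # back in as the input at the next timestep.
    h = np.zeros(H)
    c = seed_char
    out = [c]
    for _ in range(steps):
        i_t = np.zeros(N)
        i_t[c] = 1.0
        h = np.tanh(W_hi @ i_t + W_hh @ h + b_h)  # hidden-state update
        p = softmax(W_oh @ h + b_o)               # distribution over characters
        c = int(rng.choice(N, p=p))               # sample the next character
        out.append(c)
    return out

seq = generate(seed_char=0, steps=20)
print(len(seq))  # 21: the seed character plus 20 sampled characters
```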
<br />
Recurrent networks are known to be very difficult to train due to the existence of a highly unstable relationship between a network's parameters and the gradient of its cost function. Intuitively, the surface of the cost function is intermittently punctuated by abrupt changes (giving rise to exploding gradients) and nearly flat plateaus (giving rise to vanishing gradients) that can effectively become poor local minima when a network is trained through gradient descent. Techniques for improving training include the use of Long Short-Term Memory networks <ref> Hochreiter, Sepp, and Jürgen Schmidhuber. [http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf "Long short-term memory."] Neural computation 9.8 (1997): 1735-1780. </ref>, in which memory units are used to selectively preserve information from previous states, and the use of Echo State networks <ref> Jaeger, H. and H. Haas. [http://www.sciencemag.org/content/304/5667/78.short "Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication."] Science 304.5667 (2004): 78-80. </ref>, which learn only the output weights on a network with recurrent connections that implement a wide range of time-varying patterns. In this paper, the method of Hessian-free optimization is used instead of these alternatives. <br />
<br />
[[File:RNN.png | frame | centre | A depiction of a recurrent neural network unrolled through three time steps.]]<br />
<br />
= Hessian-Free Optimization = <br />
<br />
While this optimization technique is described in detail elsewhere in Martens (2010) <ref> Martens, J. [http://icml2010.haifa.il.ibm.com/papers/458.pdf "Deep learning via Hessian-free optimization."] ICML, (2010) </ref>, its use is essential to obtaining the successful results reported in this paper. In brief, the technique uses information about the second derivatives of the cost function to perform more intelligent parameter updates. This information is helpful because in cases where the gradient is changing very slowly along a particular dimension, it is more efficient to take larger steps in the direction of descent along that dimension. Alternatively, if the gradient is changing very rapidly along a particular dimension, then it makes sense to take smaller steps to avoid 'bouncing' off of a steep incline in the cost function and moving to a less desirable location in parameter space. The relevant second-order information is computed using the method of finite differences to avoid computing the Hessian of the cost function explicitly. In fact, instead of computing and inverting the Hessian when performing updates, the Gauss-Newton approximation to the Hessian is used, which is quite a good approximation and considerably cheaper to compute. <br />
<br />
What is important about this technique is that it provides a solution to the problem of vanishing and exploding gradients during the training of recurrent neural networks. Vanishing gradients are accommodated by descending much more rapidly along the cost function in areas where it has relatively low curvature (e.g., when the cost function is nearly flat), while exploding gradients are accommodated by descending much more slowly along the cost function in areas where it has relatively high curvature (e.g., when there is a steep cliff). The figure below illustrates how Hessian-free optimization improves the training of neural networks in general. <br />
<br />
[[File:HFF.png | frame | centre | On the left is training with naive gradient descent, and on the right is training via the use of 2nd order information about the cost function.]]<br />
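The key computational trick is that second-order information along a direction can be obtained from gradients alone, without ever forming the Hessian. The toy example below illustrates the finite-difference Hessian-vector product <math>Hv \approx (\nabla f(\theta + \epsilon v) - \nabla f(\theta))/\epsilon</math> on a quadratic whose Hessian is known; this is a generic sketch of the idea, not the paper's implementation (which uses the Gauss-Newton approximation inside a conjugate-gradient loop).<br />

```python
import numpy as np

# f(theta) = 0.5 * theta^T A theta, so the true Hessian is simply A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def grad(theta):
    return A @ theta

def hessian_vector_product(theta, v, eps=1e-6):
    """Estimate Hv from two gradient evaluations, never forming H itself."""
    return (grad(theta + eps * v) - grad(theta)) / eps

theta = np.array([1.0, -1.0])
v = np.array([1.0, 0.0])
hv = hessian_vector_product(theta, v)
# For this quadratic, hv should match A @ v, i.e. the first column of A.
```

For a network with millions of parameters, each such product costs only a couple of gradient evaluations, which is what makes curvature-aware updates tractable.<br />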
<br />
= Multiplicative Recurrent Neural Networks = <br />
<br />
The authors report that using a standard recurrent neural network trained via Hessian-free optimization produces only mediocre results. As such, they introduce a new architecture called a multiplicative recurrent neural network (MRNN). The motivating intuition behind this architecture is that the input at a given time step should both additively contribute to the hidden state (through the mapping performed by the input-to-hidden weights) and additionally determine the weights on the recurrent connections to the hidden state. This approach came from viewing an RNN as a model of a tree in which each node is a hidden state vector and each edge is labelled by a character that determines how the parent node gives rise to the child node. In other words, the idea is to define a unique weight matrix <math>\ W_{hh} </math> for each possible input. The reason this design is hypothesized to improve the predictive adequacy of the model is due to the idea that the ''conjunction'' of the input at one time step and the hidden state at the previous time step is important. Capturing this conjunction requires the input to influence the contribution of the previous hidden state to the current hidden state. Otherwise, the previous hidden state and the current input will make entirely independent contributions to the calculation of the current hidden state. Formally, this changes the calculation of the hidden state at a given time step as follows:<br />
<br />
<br />
:<math>\ h_t = \tanh(W_{hi}i_t + W^{i_t}_{hh}h_{t-1} + b_h) </math><br />
<br />
<br />
where <math>\ W^{i_t}_{hh} </math> is an input-specific hidden-to-hidden weight matrix. As a first approach to implementing this MRNN, the authors suggest using a tensor of rank 3 to store the hidden-to-hidden weights. The idea is that the tensor stores one weight matrix per possible input; when the input is provided as a one-hot vector, tensor contraction (i.e. a generalization of matrix multiplication) can be used to extract the 'slice' of the tensor that contains the appropriate set of weights. One problem with this approach is that it quickly becomes impractical to store the hidden-to-hidden weights as a tensor if the hidden state has a large number of dimensions. For instance, if a network's hidden layer encodes a vector with 1000 dimensions, then the number of parameters in the tensor that need to be learned will be equal to <math>\ 1000^2 * N </math>, where <math>\ N </math> is the vocabulary size. In short, this method will add many millions of parameters to a model for a non-trivially sized vocabulary. <br />
<br />
To fix this problem, the tensor is factored using a technique described in Taylor & Hinton (2009) <ref>Taylor, G. and G. Hinton. [http://www.cs.toronto.edu/~fritz/absps/fcrbm_icml.pdf "Factored Conditional Restricted Boltzmann Machines for Modeling Motion Style"] ICML (2009) </ref>. The idea is to define three matrices <math>\ W_{fh} </math>, <math>\ W_{fi} </math>, and <math>\ W_{hf} </math> that approximate the use of a tensor in determining the value of <math>\ W^{i_t}_{hh} </math> as follows:<br />
<br />
<br />
:<math>\ W^{i_t}_{hh} = W_{hf} \cdot diag(W_{fi}i_t) \cdot W_{fh} </math><br />
<br />
<br />
Intuitively, this factorization produces two vectors from the current input state and the previous hidden state, takes their element-wise product, and applies a linear transformation to produce the input to the hidden layer at the current timestep. The triangle units in the figure below indicate where the element-wise product occurs, and the connections into and out of these units are parameterized by the matrices <math>\ W_{fh} </math>, <math>\ W_{fi} </math>, and <math>\ W_{hf} </math>. The element-wise multiplication is implemented by diagonalizing the matrix-vector product <math>\ W_{fi}i_t </math>, and if the dimensionality of this matrix-vector product (i.e. the dimensionality of the layer of multiplicative units) is allowed to be arbitrarily large, then this factorization is just as expressive as using a tensor to store the hidden-to-hidden weights. <br />
<br />
[[File:MRNN.png | frame | centre | A depiction of a multiplicative recurrent neural network unrolled through three time steps.]]<br />
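The factored multiplicative update can be sketched directly in NumPy. The check at the end confirms the key identity from the text: the factored form is exactly equivalent to building the input-specific matrix <math>W^{i_t}_{hh} = W_{hf} \cdot diag(W_{fi}i_t) \cdot W_{fh}</math> explicitly. All dimensions and random matrices are illustrative stand-ins, not trained parameters.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
n_vocab, n_hid, n_fac = 5, 6, 7  # illustrative sizes

W_hi = rng.normal(scale=0.1, size=(n_hid, n_vocab))
W_fi = rng.normal(scale=0.1, size=(n_fac, n_vocab))
W_fh = rng.normal(scale=0.1, size=(n_fac, n_hid))
W_hf = rng.normal(scale=0.1, size=(n_hid, n_fac))
b_h = np.zeros(n_hid)

def mrnn_step(i_t, h_prev):
    # Element-wise product of the input-driven and hidden-driven factor
    # activations (the 'triangle' units), then a linear map back to the
    # hidden layer.
    f = (W_fi @ i_t) * (W_fh @ h_prev)
    return np.tanh(W_hi @ i_t + W_hf @ f + b_h)

i_t = np.eye(n_vocab)[2]          # one-hot input character
h0 = rng.normal(size=n_hid)
h1 = mrnn_step(i_t, h0)

# Equivalence check against the explicit input-specific weight matrix.
W_hh_it = W_hf @ np.diag(W_fi @ i_t) @ W_fh
h1_explicit = np.tanh(W_hi @ i_t + W_hh_it @ h0 + b_h)
```

Note the parameter saving: the three factor matrices cost <math>O(n_{fac}(2 n_{hid} + n_{vocab}))</math> parameters rather than <math>O(n_{hid}^2 \, n_{vocab})</math> for the full tensor.<br />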
<br />
In the experiments described below, an MRNN is trained via Hessian-free optimization on sequences of 250 characters. The first 50 characters are used to condition the hidden state, so only 200 predictions are generated per sequence. 1500 hidden units were used, along with 1500 factors (i.e. multiplicative gates, or the triangles in the figure above), yielding an unrolled network of 500 layers if the multiplicative units are treated as forming a layer. Training was performed with a parallelized system consisting of 8 GPUs. A vocabulary of 86 characters was used in all cases.<br />
<br />
= The RNN as a Generative Model =<br />
<br />
= Quantitative Experiments =<br />
<br />
To compare the performance of the MRNN to that of the sequence memoizer and PAQ, three 100MB datasets were used: a selection of wikipedia articles, a selection of New York Times articles, and a corpus of all available articles published in NIPS and JMLR. The last 10 million characters in each dataset were held out for testing. Additionally, the MRNN was trained on the larger corpora from which the wikipedia text and NYT articles were drawn (i.e. all of wikipedia, and the entire set of NYT articles). <br />
<br />
The models were evaluated by calculating the number of bits per character achieved by each model on the three test sets. This metric is essentially a measure of model perplexity, which defines how well a given model predicts the data it is being tested on. If the number of bits per character is high, the model is, on average, highly uncertain about the value of each character in the test set; if it is low, the model is less uncertain. One way to think about this quantity is as the average amount of additional information (in bits) needed by the model to exactly identify the value of each character in the test set. So, a lower measure is better, indicating that the model achieves a good representation of the underlying data. (It is sometimes helpful to think of a language model as a compressed representation of a text corpus.) <br />
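Concretely, bits per character is the average negative base-2 log probability that the model assigns to each true character in the test set. A tiny sketch with hypothetical probabilities:<br />

```python
import numpy as np

def bits_per_character(char_probs):
    """Average -log2(p) over the probabilities the model assigned to
    the true characters of the test text."""
    return float(-np.mean(np.log2(char_probs)))

# A model that assigns probability 0.5 to every character needs exactly
# 1 bit per character; rarer predictions cost more bits.
assert bits_per_character([0.5, 0.5, 0.5]) == 1.0
confident = bits_per_character([0.9, 0.9])    # well under 1 bit/char
uncertain = bits_per_character([0.1, 0.1])    # over 3 bits/char
```

This also makes the compression reading literal: a model achieving 1.5 bits per character could, with an arithmetic coder, compress the test text to about 1.5 bits per symbol.<br />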
<br />
As illustrated in the table below, the MRNN achieves a lower number of bits per character than the hierarchical Bayesian model, but a higher number of bits per character than the PAQ model (which, recall, is not strictly a character-level model). The numbers in brackets indicate the bits per character achieved on the training data, and the column labelled 'Full Set' reports the results of training the MRNN on the full wikipedia and NYT corpora. <br />
<br />
[[File:bits.png | frame | centre | Bits per character achieved by each model on each dataset.]]<br />
<br />
These results indicate that the MRNN beat the existing state-of-the-art for pure character-level language modelling at the time of publication. <br />
<br />
= Qualitative Experiments =<br />
<br />
By examining the output of the MRNN, it is possible to see what kinds of linguistic patterns it is able to learn. Most striking is the fact that the model consistently produces correct words from a fairly sophisticated vocabulary. The model is also able to balance parentheses and quotation marks over many time steps, and it occasionally produces plausible non-words such as 'cryptoliation' and 'homosomalist'. The text in the figure below was produced by running the model in generative mode less than 10 times using the phrase 'The meaning of life is' as an initial input, and then selecting the most interesting output sequence. The model was trained on wikipedia to produce the results in the figure below. The character '?' indicates an unknown item, and some of the spacing and punctuation oddities are due to preprocessing and are apparently common in the dataset. <br />
<br />
[[File:text.png | frame | centre | A selection of text generated by an MRNN initialized with the sequence "The meaning of life is...".]]<br />
<br />
Another interesting qualitative demonstration of the model's abilities involves initializing the model with a more complicated sequence and seeing what sort of continuations it produces. In the figure below, a number of sampled continuations of the phrase 'England, Spain, France, Germany' are shown. Generally, the model is able to provide continuations that preserve the list-like structure of the phrase. Moreover, the model is also able to recognize that the list is a list of locations, and typically offers additional locations as its predicted continuation of the sequence. <br />
<br />
[[File:locations.png | frame | centre | Selections of text generated by an MRNN initialized with the sequence "England, Spain, France, Germany".]]<br />
<br />
What is particularly impressive about these results is the fact that the model is learning a distribution over sequences of characters only. From this distribution, a broad range of syntactic and lexical knowledge emerges. It is also worth noting that it is much more efficient to train a model with a small character-level vocabulary than it is to train a model with a word-level vocabulary (which can have tens of thousands of items). As such, the character-level MRNN is able to scale to large datasets quite well.<br />
<br />
Moreover, the authors find that the MRNN is sensitive to notation such as an opening bracket even when the exact string containing it does not occur in the training set. They claim that any method based on precise context matches is fundamentally incapable of utilizing long contexts, because the probability that a long context occurs more than once is very small.<br />
<br />
= Discussion =<br />
<br />
One aspect of this work worth considering is the degree to which the input-dependent gating of the information passed from hidden state to hidden state actually improves results over and above a standard recurrent neural network. Presumably, Hessian-free optimization allows one to successfully train a standard network as well, so it would be helpful to see a direct comparison against the MRNN under the same optimizer. MRNNs already learn surprisingly good language models using only 1500 hidden units, and unlike other approaches such as the sequence memoizer and PAQ, they are easy to extend along various dimensions. Without such a comparison, however, it is hard to discern the relative importance of the optimization technique and the network architecture in achieving the good language modelling results reported in this paper.<br />
The MRNN assigns probability to plausible words that do not exist in the training set. This is a useful property, as it enables the MRNN to deal with real words that it did not see during training. A further advantage of the model is that it avoids using a huge softmax over all known words by predicting the next word through a sequence of character predictions, whereas some word-level language models instead make up binary spellings of words so that they can be predicted one bit at a time.<br />
<br />
= Bibliography = <br />
<references /></div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=neural_Turing_Machines&diff=27111neural Turing Machines2015-12-06T22:53:09Z<p>Trttse: /* Experiments */</p>
<hr />
<div>= Neural Turing Machines =<br />
<br />
Even though recurrent neural networks (RNNs) are [https://en.wikipedia.org/wiki/Turing_completeness Turing complete] in theory, the control of logical flow and the usage of external memory have been largely ignored in the machine learning literature. This might be due to the fact that RNNs have to be wired properly to achieve Turing completeness, and this is not necessarily easy to achieve in practice. By adding an addressable memory, Graves et al. try to overcome this limitation and name their approach the Neural Turing Machine (NTM), as an analogy to [https://en.wikipedia.org/wiki/Turing_machine Turing machines], which are finite-state machines extended with an infinite memory. Furthermore, every component of an NTM is differentiable and can thus be learned.<br />
<br />
== Theoretical Background == <br />
<br />
The authors state that the design of the NTM is inspired by past research spanning the disciplines of neuroscience, psychology, cognitive science and linguistics, and that the NTM can be thought of as a working memory system of the sort described in various accounts of cognitive architecture. However, the authors propose to ignore the known capacity limitations of working memory, and to introduce sophisticated gating and memory addressing operations that are typically absent in models of the sort developed throughout the computational neuroscience literature. <br />
<br />
With respect to historical precedents in the cognitive science and linguistics literature, the authors situate their work in relation to a longstanding debate concerning the effectiveness of neural networks for cognitive modelling. They present their work as continuing and advancing a line of research on encoding recursively structured representations in neural networks that stemmed out of criticisms presented by Fodor and Pylyshyn in 1988 (though it is worth pointing out the authors give an incorrect summary of these criticisms - they state that Fodor and Pylyshyn argued that neural networks could not implement variable binding or perform tasks involving variable-length structures, when in fact they argued that successful models of cognition require representations with constituent structure and processing mechanisms that are strictly structure-sensitive - see [http://www.sciantaanalytics.com/sites/default/files/fodor-pylyshyn.pdf the paper] for details). The NTM is able to deal with variable-length inputs and arguably performs variable binding in the sense that the memory slots in the NTM can be treated as variables to which data is bound, but the authors do not revisit these issues in any detail after presenting the results of their simulations with the NTM.<br />
<br />
= Architecture =<br />
<br />
A Neural Turing Machine consists of a memory and a controller neural network. The controller receives input and produces output with help of the memory that is addressed with a content- and location based addressing mechanism. Figure 1 presents a high-level diagram of the NTM architecture.<br />
<br />
<center><br />
[[File:Pre_11.PNG | frame | center |Figure 1: Neural Turing Machine Architecture. During each update cycle, the controller network receives inputs from an external environment and emits outputs in response. It also reads to and writes from a memory matrix via a set of parallel read and write heads. The dashed line indicates the division between the NTM circuit and the outside world. ]]<br />
</center><br />
<br />
<br />
== Memory ==<br />
<br />
The memory at time <math>t</math> is given by an <math>N \times M</math> matrix <math>M_t</math>, where <math>N</math> is the number of memory locations and <math>M</math> the vector size at each memory location. To address memory locations for reading or writing, an <math>N</math>-element vector <math>w_t</math> is used. The elements in this vector need to satisfy <math>0 \leq w_t(i) \leq 1</math> and have to sum to 1. Thus, it gives a weighting over memory locations, and the access to a location might be blurry.<br />
<br />
=== Reading ===<br />
<br />
Given an address <math>w_t</math> the read vector is just the weighted sum of memory locations:<br />
<br />
<math>r_t \leftarrow \sum_i w_t(i) M_t(i)</math><br />
<br />
which is clearly differentiable with respect to both the memory and the weighting.<br />
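The blurry read is just a weighted sum of memory rows, as a small NumPy sketch makes concrete (the memory contents, <math>N=4</math> and <math>M=3</math>, are illustrative toy values):<br />

```python
import numpy as np

# Memory matrix M_t: N=4 locations, each holding an M=3 element vector.
M_t = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [1.0, 1.0, 1.0]])

# Address weighting: non-negative, sums to 1, mostly focused on row 0.
w_t = np.array([0.7, 0.3, 0.0, 0.0])

# r_t = sum_i w_t(i) M_t(i): a blend of the addressed rows.
r_t = w_t @ M_t
```

A perfectly sharp (one-hot) weighting recovers a single row exactly, while a spread-out weighting returns a mixture; differentiability with respect to both `w_t` and `M_t` is what lets gradients flow through the read.<br />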
<br />
=== Writing ===<br />
<br />
The write process is split up into an erase and an add operation (inspired by the input and forget gates in LSTM). This allows the NTM to both overwrite or add to a memory location in a single time step. Otherwise it would be necessary to do a read for one of the operations first before the updated result can be written.<br />
<br />
The erase update is given by<br />
<br />
<math>\tilde{M}_t(i) \leftarrow M_{t-1}(i) [1 - w_t(i) e_t]</math><br />
<br />
with an <math>M</math>-element ''erase vector'' <math>e_t</math> with elements in the range <math>(0, 1)</math> selecting which vector elements to reset at the memory locations selected by <math>w_t</math>.<br />
<br />
Afterwards an ''add vector'' <math>a_t</math> is added according to<br />
<br />
<math>M_t(i) \leftarrow \tilde{M}_t(i) + w_t(i) a_t.</math><br />
<br />
The order in which the adds are performed by multiple heads is irrelevant. The combined erase and add operations of all the write heads produce the final content of the memory at time ''t''.<br />
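The erase-then-add write can be sketched with toy values: the weighting <math>w_t</math> selects where to write, the erase vector <math>e_t</math> selects which elements to wipe there, and the add vector <math>a_t</math> supplies the new content.<br />

```python
import numpy as np

M_prev = np.ones((3, 2))               # N=3 locations, M=2 elements, all ones
w_t = np.array([1.0, 0.0, 0.0])        # write fully to location 0 only
e_t = np.array([1.0, 0.0])             # erase the first element there
a_t = np.array([0.0, 5.0])             # add 5 to the second element there

# Erase: M~(i) = M_{t-1}(i) * (1 - w_t(i) e_t)
M_tilde = M_prev * (1 - np.outer(w_t, e_t))
# Add:   M_t(i) = M~(i) + w_t(i) a_t
M_t = M_tilde + np.outer(w_t, a_t)
```

Location 0 ends up as `[0, 6]` (first element erased, 5 added to the second), while the other locations are untouched because their weighting is zero; with a soft weighting the write would be correspondingly partial.<br />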
<br />
== Addressing Mechanisms ==<br />
<br />
Two methods, content-based addressing and location-based addressing, are employed to generate the read/write weightings <math>w_t</math>. Depending on the task either mechanism can be more appropriate. These two methods of addressing are summarized in the figure below.<br />
<br />
[[File:flow_diagram_addressing_mechanism.JPG | center]]<br />
<br />
=== Content-based addressing ===<br />
<br />
For content-addressing, each head (whether employed for reading or writing) first produces a length <math>M</math> key vector <math>k_t</math> that is compared to each vector <math>M_t (i)</math> by a similarity measure <math>K[.,.]</math>. The content-based system produces a normalised weighting <math>w_t^c</math> based on the similarity and a positive key strength, <math>\beta_t</math>, which can amplify or attenuate the precision of the focus:<br />
<br />
<br />
<math><br />
w_t^c(i) \leftarrow \frac{\exp(\beta_t K[k_t,M_t(i)])}{\sum_{j} \exp(\beta_t K[k_t,M_t(j)])}<br />
</math><br />
<br />
In this current implementation, the similarity measure is cosine similarity:<br />
<br />
<math><br />
K[u,v] = \frac{u \cdot v}{\|u\| \, \|v\|}<br />
</math><br />
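Content addressing is thus a softmax over cosine similarities, with <math>\beta_t</math> controlling how sharply the weighting concentrates on the best match. A small sketch with a toy memory and key:<br />

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def content_weighting(k_t, M_t, beta_t):
    """Softmax over beta-scaled cosine similarities between key and rows."""
    sims = np.array([cosine(k_t, row) for row in M_t])
    e = np.exp(beta_t * sims)
    return e / e.sum()

M_t = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])
k_t = np.array([1.0, 0.0])   # exactly matches row 0

w_soft = content_weighting(k_t, M_t, beta_t=1.0)    # spread-out focus
w_sharp = content_weighting(k_t, M_t, beta_t=50.0)  # near one-hot on row 0
```

A small <math>\beta_t</math> blends information from several similar rows, while a large <math>\beta_t</math> makes the read behave almost like an exact lookup.<br />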
<br />
=== Location-based addressing ===<br />
<br />
The location-based addressing mechanism is designed to facilitate both simple iterations across the locations of the memory and random-access jumps. It does so by implementing a rotational shift of a weighting. Prior to rotation, each head emits a scalar interpolation gate <math>g_t</math> in the range (0, 1). The value of <math>g</math> is used to blend between the weighting <math>w_{t-1}</math> produced by the head at the previous time-step and the weighting <math>w_t^c</math> produced by the content system at the current time-step, yielding the gated weighting <math>w_t^g</math> :<br />
<br />
<math><br />
w_t^g \leftarrow g_t w_t^c + (1-g_t) w_{t-1}<br />
</math><br />
<br />
After interpolation, each head emits a shift weighting <math>s_t</math> that defines a normalised distribution over the allowed integer shifts. Each element in this vector gives the degree by which different integer shifts are performed. For example, if shifts of -1, 0, 1 are allowed a (0, 0.3, 0.7) shift vector would denote a shift of 1 with strength 0.7 and a shift of 0 (no-shift) with strength 0.3. The actual shift is performed with a circular convolution<br />
<br />
<math>\tilde{w}_t(i) \leftarrow \sum_{j=0}^{N-1} w_t^g(j) s_t(i - j)</math><br />
<br />
where all index arithmetic is modulo N. This circular convolution can lead to blurring of the weights, so <math>\tilde{w}_t</math> is sharpened with<br />
<br />
<math>w_t(i) \leftarrow \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j<br />
\tilde{w}_t(j)^{\gamma_t}}</math><br />
<br />
where <math>_t 1</math> is an additional scalar outputted by the write head.<br />
<br />
== Controller ==<br />
<br />
The controller receives the external input and read head output and produces the addressing vectors and related values (for example shift weighting) for the read and write heads. It also produces an external output.<br />
<br />
Different types of controllers can be used. The paper discusses feed-forward and LSTM controllers. Feed-forward controllers are simpler, but are more limited than LSTM controllers, since the types of operations they can perform are limited by the number of concurrent read and write heads. The LSTM controller, given its internal register-like memory, does not suffer from this limitation.<br />
<br />
= Experiments =<br />
The authors wanted to see whether a network trained to copy sequences of length up to 20 could copy a sequence of length 100 with no further training. For all of the experiments, three architectures were compared: an NTM with a feedforward controller, an NTM with an LSTM controller, and a standard LSTM network. All the tasks were supervised learning problems with binary targets; all networks had logistic sigmoid output layers and were trained with the cross-entropy objective function. Sequence prediction errors are reported in bits-per-sequence.<br />
<br />
= Results =<br />
<br />
The authors tested the NTM with a feed-forward and an LSTM controller against a pure LSTM on multiple tasks:<br />
<br />
* Copy Task: An input sequence has to be reproduced.<br />
* Repeat Copy Task: An input sequence has to be reproduced multiple times.<br />
* Associative Recall: After providing an input sequence the network is queried with one item of the sequence and has to produce the next.<br />
* Dynamic N-Grams: Predict the probability of the next bit being 0 or 1 given the last six bits.<br />
* Priority Sort: Sort an input sequence according to given priorities.<br />
<br />
<br />
[[File:copy_convergence.png|frame|center|Copy Task Learning Curve]]<br />
[[File:repeat_copy_convergence.png|frame|center|Repeat Copy Task Learning Curve]]<br />
[[File:recall_convergence.png|frame|center|Associative Recall Learning Curve]]<br />
[[File:ngrams_convergence.png|frame|center|Dynamic N-Grams Learning Curve]]<br />
[[File:sort_convergence.png|frame|center|Priority Sort Learning Curve]]<br />
<br />
[[File:ntm_feedforward_settings.png|frame|center|NTM with Feedforward Controller Experimental Settings]]<br />
[[File:ntm_ltsm_settings.png|frame|center|NTM with LSTM Controller Experimental Settings]]<br />
[[File:ltsm_settings.png|frame|center|LSTM Controller Experimental Settings]]<br />
<br />
In all tasks the NTM, with either a feedforward or an LSTM controller, converges faster and obtains better generalization than a pure LSTM.<br />
<br />
<br />
= Discussion =<br />
* While the experimental results show great promise for the NTM architecture, the paper would benefit from a more in-depth discussion of the experimental results, in particular why the NTM performs so well with either a feedforward or an LSTM controller compared to a pure LSTM.<br />
<br />
* The convergence performance difference between choosing a feedforward vs. an LSTM controller for the NTM appears to hinge on whether the task requires the LSTM's internal memory or the NTM's external memory as an effective way to store data. Otherwise the two controllers are comparable in performance.<br />
<br />
* One might be a bit skeptical about the effort put into tuning the LSTM baseline; the paper gives the feeling that the authors spent a lot of time tuning the NTM with different numbers of heads and controller sizes in order to achieve the desired results for publication.<br />
<br />
* Interested in knowing quantitatively how it would compare against other algorithms such as [https://en.wikipedia.org/wiki/Genetic_programming Genetic Programming] to evolve a turing machines <ref>Naidoo, Amashini, and Nelishia Pillay. "Using genetic programming for turing machine induction." Genetic Programming. Springer Berlin Heidelberg, 2008. 350-361.</ref>, where by its output is a "Program" which theoretically should be better because it doesn't use weights, the "Program" should be more robust, and will require a lot less parameters.</div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=neural_Turing_Machines&diff=27110neural Turing Machines2015-12-06T22:49:30Z<p>Trttse: /* Experiments */</p>
<hr />
<div>= Neural Turing Machines =<br />
<br />
Even though recurrent neural networks (RNNs) are [https://en.wikipedia.org/wiki/Turing_completeness Turing complete] in theory, the control of logical flow and usage of external memory have been largely ignored in the machine learning literature. This might be due to the fact that the RNNs have to be wired properly to achieve the Turing completeness and this is not necessarily easy to achieve in practice. By adding an addressable memory Graves et al. try to overcome this limitation and name their approach Neural Turing Machine (NTM) as analogy to [https://en.wikipedia.org/wiki/Turing_machine Turing machines] that are finite-state machines extended with an infinite memory. Furthermore, every component of an NTM is differentiable and can, thus, be learned.<br />
<br />
== Theoretical Background == <br />
<br />
The authors state that the design of the NTM is inspired by past research spanning the disciplines of neuroscience, psychology, cognitive science and linguistics, and that the NTM can be thought of as a working memory system of the sort described in various accounts of cognitive architecture. However, the authors propose to ignore the known capacity limitations of working memory, and to introduce sophisticated gating and memory addressing operations that are typically absent in models of sort developed throughout the computational neuroscience literature. <br />
<br />
With respect to historical precedents in the cognitive science and linguistics literature, the authors situate their work in relation to a longstanding debate concerning the effectiveness of neural networks for cognitive modelling. They present their work as continuing and advancing a line of research on encoding recursively structured representations in neural networks that stemmed out of criticisms presented by Fodor and Pylyshyn in 1988 (though it is worth pointing out the authors give an incorrect summary of these criticisms - they state that Fodor and Pylyshyn argued that neural networks could not implement variable binding or perform tasks involving variable length structures, when in fact they argued that successful models of cognition require representations with constituent structure and processing mechanisms that strictly structure sensitive - see [http://www.sciantaanalytics.com/sites/default/files/fodor-pylyshyn.pdf the paper] for details). The NTM is able to deal variable length inputs and arguably performs variable binding in the sense that the memory slots in the NTM can be treated as variables to which data is bound, but the authors do not revisit these issues in any detail after presenting the results of their simulations with the NTM.<br />
<br />
= Architecture =<br />
<br />
A Neural Turing Machine consists of a memory and a controller neural network. The controller receives input and produces output with help of the memory that is addressed with a content- and location based addressing mechanism. Figure 1 presents a high-level diagram of the NTM architecture.<br />
<br />
<center><br />
[[File:Pre_11.PNG | frame | center |Figure 1: Neural Turing Machine Architecture. During each update cycle, the controller network receives inputs from an external environment and emits outputs in response. It also reads to and writes from a memory matrix via a set of parallel read and write heads. The dashed line indicates the division between the NTM circuit and the outside world. ]]<br />
</center><br />
<br />
<br />
== Memory ==<br />
<br />
The memory at time <math>t</math> is given by an <math>N \times M</math> matrix <math>M_t</math>, where <math>N</math> is the number of memory locations and <math>M</math> the vector size at each memory location. To address memory locations for reading or writing an <math>N</math>-element vector <math>w_t</math> is used. The elements in this vector need to satisfy <math>0 \leq w_t(i) \leq 1</math> and have to sum to 1. Thus, it gives weighting of memory locations and the access to a location might be blurry.<br />
<br />
=== Reading ===<br />
<br />
Given an address <math>w_t</math> the read vector is just the weighted sum of memory locations:<br />
<br />
<math>r_t \leftarrow \sum_i w_t(i) M_t(i)</math><br />
<br />
which is clearly differentiable with respect to both the memory and the weighting.<br />
<br />
=== Writing ===<br />
<br />
The write process is split up into an erase and an add operation (inspired by the input and forget gates in LSTM). This allows the NTM to both overwrite or add to a memory location in a single time step. Otherwise it would be necessary to do a read for one of the operations first before the updated result can be written.<br />
<br />
The erase update is given by<br />
<br />
<math>\tilde{M}_t(i) \leftarrow M_{t-1}(i) [1 - w_t(i) e_t]</math><br />
<br />
with an <math>M</math>-element ''erase vector'' <math>e_t</math> with elements in the range <math>(0, 1)</math> selecting which vector elements to reset at the memory locations selected by <math>w_t</math>.<br />
<br />
Afterwords an ''add vector'' <math>a_t</math> is added according to<br />
<br />
<math>M_t(i) \leftarrow \tilde{M}_t(i) + w_t(i) a_t.</math><br />
<br />
The order in which the adds are performed by multiple heads is irrelevant. The combined erase and add operations of all the write heads produced the final content of the memory at time ''t''.<br />
<br />
== Addressing Mechanisms ==<br />
<br />
Two methods, content-based addressing and location-based addressing, are employed to generate the read/write weightings <math>w_t</math>. Depending on the task either mechanism can be more appropriate. These two methods of addressing are summarized in the figure below.<br />
<br />
[[File:flow_diagram_addressing_mechanism.JPG | center]]<br />
<br />
=== Content-based addressing ===<br />
<br />
For content-based addressing, each head (whether employed for reading or writing) first produces a length-<math>M</math> key vector <math>k_t</math> that is compared to each vector <math>M_t(i)</math> by a similarity measure <math>K[\cdot,\cdot]</math>. The content-based system produces a normalised weighting <math>w_t^c</math> based on the similarity and a positive key strength <math>\beta_t</math>, which can amplify or attenuate the precision of the focus:<br />
<br />
<math><br />
w_t^c(i) \leftarrow \frac{\exp(\beta_t K[k_t,M_t(i)])}{\sum_{j} \exp(\beta_t K[k_t,M_t(j)])}<br />
</math><br />
<br />
In the current implementation, the similarity measure is cosine similarity:<br />
<br />
<math><br />
K[u,v] = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}<br />
</math><br />
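The content-addressing step can be sketched as a softmax over <math>\beta_t</math>-scaled cosine similarities (the variable names and example values here are illustrative):<br />

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity K[u, v] between two vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def content_weighting(memory, key, beta):
    """Softmax over beta-scaled cosine similarity between the key and each memory row."""
    scores = beta * np.array([cosine(key, row) for row in memory])
    scores -= scores.max()              # shift for numerical stability
    w = np.exp(scores)
    return w / w.sum()

memory = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
key = np.array([1.0, 0.05])             # most similar to row 0

w_lo = content_weighting(memory, key, beta=1.0)
w_hi = content_weighting(memory, key, beta=50.0)

# A larger key strength beta sharpens the focus on the best-matching row.
assert np.argmax(w_hi) == 0 and w_hi[0] > w_lo[0]
```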
<br />
=== Location-based addressing ===<br />
<br />
The location-based addressing mechanism is designed to facilitate both simple iterations across the locations of the memory and random-access jumps. It does so by implementing a rotational shift of a weighting. Prior to rotation, each head emits a scalar interpolation gate <math>g_t</math> in the range (0, 1). The value of <math>g_t</math> is used to blend between the weighting <math>w_{t-1}</math> produced by the head at the previous time-step and the weighting <math>w_t^c</math> produced by the content system at the current time-step, yielding the gated weighting <math>w_t^g</math> :<br />
<br />
<math><br />
w_t^g \leftarrow g_t w_t^c + (1-g_t) w_{t-1}<br />
</math><br />
<br />
After interpolation, each head emits a shift weighting <math>s_t</math> that defines a normalised distribution over the allowed integer shifts. Each element of this vector gives the degree to which the corresponding integer shift is performed. For example, if shifts of -1, 0 and 1 are allowed, the shift vector (0, 0.3, 0.7) denotes a shift of 1 with strength 0.7 and a shift of 0 (no shift) with strength 0.3. The actual shift is performed with a circular convolution<br />
<br />
<math>\tilde{w}_t(i) \leftarrow \sum_{j=0}^{N-1} w_t^g(j) s_t(i - j)</math><br />
<br />
where all index arithmetic is modulo <math>N</math>. This circular convolution can lead to blurring of the weights, so <math>\tilde{w}_t</math> is sharpened with<br />
<br />
<math>w_t(i) \leftarrow \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j<br />
\tilde{w}_t(j)^{\gamma_t}}</math><br />
<br />
where <math>\gamma_t \geq 1</math> is an additional scalar emitted by the head.<br />
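Putting the three location-based steps together (interpolation, circular shift, sharpening), a hedged sketch with shifts restricted to {-1, 0, 1} — the function name and example values are illustrative:<br />

```python
import numpy as np

def location_weighting(w_prev, w_content, g, s, gamma):
    # 1) Interpolation gate blends the previous and content-based weightings.
    w_g = g * w_content + (1.0 - g) * w_prev
    # 2) Circular convolution with the shift distribution s over offsets -1, 0, +1.
    #    np.roll(w_g, k) realises w~(i) = w_g(i - k) with index arithmetic modulo N.
    offsets = (-1, 0, 1)
    w_shift = sum(s_k * np.roll(w_g, k) for k, s_k in zip(offsets, s))
    # 3) Sharpening counteracts the blur introduced by the convolution.
    w_sharp = w_shift ** gamma
    return w_sharp / w_sharp.sum()

w_prev = np.array([1.0, 0.0, 0.0, 0.0])
w_content = np.array([0.25, 0.25, 0.25, 0.25])

# g = 0 ignores the content weighting; s = (0, 0, 1) rotates forward by one;
# gamma = 1 leaves the result unsharpened.
w = location_weighting(w_prev, w_content, g=0.0, s=(0.0, 0.0, 1.0), gamma=1.0)
assert np.allclose(w, [0.0, 1.0, 0.0, 0.0])
```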
<br />
== Controller ==<br />
<br />
The controller receives the external input and read head output and produces the addressing vectors and related values (for example shift weighting) for the read and write heads. It also produces an external output.<br />
<br />
Different types of controllers can be used. The paper discusses feed-forward and LSTM controllers. Feed-forward controllers are simpler but more limited than LSTM controllers, since the number of concurrent read and write heads bounds the type of operations they can perform in a single time step. The LSTM controller, given its internal register-like memory, does not suffer from this limitation.<br />
<br />
= Experiments =<br />
The authors wanted to test generalization: whether a network trained to copy sequences of length up to 20 could copy a sequence of length 100 with no further training.<br />
<br />
= Results =<br />
<br />
The authors tested the NTM with a feed-forward and an LSTM controller against a pure LSTM on multiple tasks:<br />
<br />
* Copy Task: An input sequence has to be reproduced.<br />
* Repeat Copy Task: An input sequence has to be reproduced multiple times.<br />
* Associative Recall: After providing an input sequence the network is queried with one item of the sequence and has to produce the next.<br />
* Dynamic N-Grams: Predict the probability of the next bit being 0 or 1 given the last six bits.<br />
* Priority Sort: Sort an input sequence according to given priorities.<br />
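As an illustration of what a copy-task training pair might look like, here is a hedged sketch; the exact encoding (bit width, delimiter channel) is an assumption for illustration, not taken from the paper:<br />

```python
import numpy as np

def make_copy_example(seq_len, n_bits=8, rng=None):
    """Input: random binary sequence plus a delimiter flag; target: the same sequence."""
    if rng is None:
        rng = np.random.default_rng()
    seq = rng.integers(0, 2, size=(seq_len, n_bits)).astype(float)
    # One extra channel acts as an end-of-sequence delimiter, raised on its own step.
    inp = np.zeros((seq_len + 1, n_bits + 1))
    inp[:seq_len, :n_bits] = seq
    inp[seq_len, n_bits] = 1.0
    target = seq.copy()
    return inp, target

inp, target = make_copy_example(seq_len=5)
assert inp.shape == (6, 9) and target.shape == (5, 8)
assert inp[-1, -1] == 1.0               # delimiter flag set on the final step
```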
<br />
<br />
[[File:copy_convergence.png|frame|center|Copy Task Learning Curve]]<br />
[[File:repeat_copy_convergence.png|frame|center|Repeat Copy Task Learning Curve]]<br />
[[File:recall_convergence.png|frame|center|Associative Recall Learning Curve]]<br />
[[File:ngrams_convergence.png|frame|center|Dynamic N-Grams Learning Curve]]<br />
[[File:sort_convergence.png|frame|center|Priority Sort Learning Curve]]<br />
<br />
[[File:ntm_feedforward_settings.png|frame|center|NTM with Feedforward Controller Experimental Settings]]<br />
[[File:ntm_ltsm_settings.png|frame|center|NTM with LSTM Controller Experimental Settings]]<br />
[[File:ltsm_settings.png|frame|center|LSTM Controller Experimental Settings]]<br />
<br />
In all tasks, the NTM with either a feedforward or an LSTM controller converges faster and generalizes better than a pure LSTM.<br />
<br />
<br />
= Discussion =<br />
* While the experimental results show great promise for the NTM architecture, the paper would benefit from a more in-depth discussion of the experimental results, in particular why the NTM performs so well with either a feedforward or an LSTM controller compared to a pure LSTM.<br />
<br />
* The difference in convergence performance between the feedforward and LSTM controllers appears to hinge on whether the task can be solved with the NTM's external memory alone or also requires the LSTM's internal memory as an effective way to store data. Otherwise, the two controllers perform comparably.<br />
<br />
* One might be skeptical about the effort spent tuning the LSTM baseline; the paper gives the impression that the authors spent considerable time tuning the NTM's number of heads and controller size in order to achieve the desired results for publication.<br />
<br />
* It would be interesting to know quantitatively how the NTM compares against other approaches, such as using [https://en.wikipedia.org/wiki/Genetic_programming Genetic Programming] to evolve Turing machines <ref>Naidoo, Amashini, and Nelishia Pillay. "Using genetic programming for turing machine induction." Genetic Programming. Springer Berlin Heidelberg, 2008. 350-361.</ref>, whose output is a "program" rather than a set of weights; in theory such a program should be more robust and require far fewer parameters.</div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=neural_Turing_Machines&diff=27109neural Turing Machines2015-12-06T22:47:43Z<p>Trttse: </p>
<hr />
<div>= Neural Turing Machines =<br />
<br />
Even though recurrent neural networks (RNNs) are [https://en.wikipedia.org/wiki/Turing_completeness Turing complete] in theory, the control of logical flow and usage of external memory have been largely ignored in the machine learning literature. This might be due to the fact that the RNNs have to be wired properly to achieve the Turing completeness and this is not necessarily easy to achieve in practice. By adding an addressable memory Graves et al. try to overcome this limitation and name their approach Neural Turing Machine (NTM) as analogy to [https://en.wikipedia.org/wiki/Turing_machine Turing machines] that are finite-state machines extended with an infinite memory. Furthermore, every component of an NTM is differentiable and can, thus, be learned.<br />
<br />
== Theoretical Background == <br />
<br />
The authors state that the design of the NTM is inspired by past research spanning the disciplines of neuroscience, psychology, cognitive science and linguistics, and that the NTM can be thought of as a working memory system of the sort described in various accounts of cognitive architecture. However, the authors propose to ignore the known capacity limitations of working memory, and to introduce sophisticated gating and memory addressing operations that are typically absent in models of the sort developed throughout the computational neuroscience literature. <br />
<br />
With respect to historical precedents in the cognitive science and linguistics literature, the authors situate their work in relation to a longstanding debate concerning the effectiveness of neural networks for cognitive modelling. They present their work as continuing and advancing a line of research on encoding recursively structured representations in neural networks that stemmed from criticisms presented by Fodor and Pylyshyn in 1988 (though it is worth pointing out that the authors give an incorrect summary of these criticisms - they state that Fodor and Pylyshyn argued that neural networks could not implement variable binding or perform tasks involving variable-length structures, when in fact they argued that successful models of cognition require representations with constituent structure and processing mechanisms that are strictly structure-sensitive - see [http://www.sciantaanalytics.com/sites/default/files/fodor-pylyshyn.pdf the paper] for details). The NTM is able to deal with variable-length inputs and arguably performs variable binding in the sense that its memory slots can be treated as variables to which data is bound, but the authors do not revisit these issues in any detail after presenting the results of their simulations with the NTM.<br />
<br />
= Architecture =<br />
<br />
A Neural Turing Machine consists of a memory and a controller neural network. The controller receives input and produces output with the help of the memory, which is addressed with content- and location-based addressing mechanisms. Figure 1 presents a high-level diagram of the NTM architecture.<br />
<br />
<center><br />
[[File:Pre_11.PNG | frame | center |Figure 1: Neural Turing Machine Architecture. During each update cycle, the controller network receives inputs from an external environment and emits outputs in response. It also reads to and writes from a memory matrix via a set of parallel read and write heads. The dashed line indicates the division between the NTM circuit and the outside world. ]]<br />
</center><br />
<br />
<br />
</div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Flow_diagram_addressing_mechanism.JPG&diff=27108File:Flow diagram addressing mechanism.JPG2015-12-06T22:44:55Z<p>Trttse: uploaded a new version of &quot;File:Flow diagram addressing mechanism.JPG&quot;</p>
<hr />
<div></div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=neural_Turing_Machines&diff=27105neural Turing Machines2015-12-06T22:35:44Z<p>Trttse: /* Writing */</p>
<hr />
<div>= Neural Turing Machines =<br />
<br />
Even though recurrent neural networks (RNNs) are [https://en.wikipedia.org/wiki/Turing_completeness Turing complete] in theory, the control of logical flow and usage of external memory have been largely ignored in the machine learning literature. This might be due to the fact that the RNNs have to be wired properly to achieve the Turing completeness and this is not necessarily easy to achieve in practice. By adding an addressable memory Graves et al. try to overcome this limitation and name their approach Neural Turing Machine (NTM) as analogy to [https://en.wikipedia.org/wiki/Turing_machine Turing machines] that are finite-state machines extended with an infinite memory. Furthermore, every component of an NTM is differentiable and can, thus, be learned.<br />
<br />
== Theoretical Background == <br />
<br />
The authors state that the design of the NTM is inspired by past research spanning the disciplines of neuroscience, psychology, cognitive science and linguistics, and that the NTM can be thought of as a working memory system of the sort described in various accounts of cognitive architecture. However, the authors propose to ignore the known capacity limitations of working memory, and to introduce sophisticated gating and memory addressing operations that are typically absent in models of sort developed throughout the computational neuroscience literature. <br />
<br />
With respect to historical precedents in the cognitive science and linguistics literature, the authors situate their work in relation to a longstanding debate concerning the effectiveness of neural networks for cognitive modelling. They present their work as continuing and advancing a line of research on encoding recursively structured representations in neural networks that stemmed out of criticisms presented by Fodor and Pylyshyn in 1988 (though it is worth pointing out the authors give an incorrect summary of these criticisms - they state that Fodor and Pylyshyn argued that neural networks could not implement variable binding or perform tasks involving variable-length structures, when in fact they argued that successful models of cognition require representations with constituent structure and processing mechanisms that are strictly structure sensitive - see [http://www.sciantaanalytics.com/sites/default/files/fodor-pylyshyn.pdf the paper] for details). The NTM is able to deal with variable-length inputs and arguably performs variable binding in the sense that the memory slots in the NTM can be treated as variables to which data is bound, but the authors do not revisit these issues in any detail after presenting the results of their simulations with the NTM.<br />
<br />
= Architecture =<br />
<br />
A Neural Turing Machine consists of a memory and a controller neural network. The controller receives input and produces output with help of the memory that is addressed with a content- and location based addressing mechanism. Figure 1 presents a high-level diagram of the NTM architecture.<br />
<br />
<center><br />
[[File:Pre_11.PNG | frame | center |Figure 1: Neural Turing Machine Architecture. During each update cycle, the controller network receives inputs from an external environment and emits outputs in response. It also reads from and writes to a memory matrix via a set of parallel read and write heads. The dashed line indicates the division between the NTM circuit and the outside world. ]]<br />
</center><br />
<br />
<br />
== Memory ==<br />
<br />
The memory at time <math>t</math> is given by an <math>N \times M</math> matrix <math>M_t</math>, where <math>N</math> is the number of memory locations and <math>M</math> is the vector size at each memory location. To address memory locations for reading or writing, an <math>N</math>-element vector <math>w_t</math> is used. The elements of this vector need to satisfy <math>0 \leq w_t(i) \leq 1</math> and have to sum to 1. Thus, it gives a weighting over memory locations, and the access to a location may be blurry.<br />
<br />
=== Reading ===<br />
<br />
Given an address <math>w_t</math> the read vector is just the weighted sum of memory locations:<br />
<br />
<math>r_t \leftarrow \sum_i w_t(i) M_t(i)</math><br />
<br />
which is clearly differentiable with respect to both the memory and the weighting.<br />
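As a minimal NumPy sketch (not the authors' code; the variable names and sizes are illustrative), reading is a single weighted combination of the memory rows:<br />

```python
import numpy as np

N, M = 8, 4                      # number of locations, vector size per location
memory = np.random.randn(N, M)   # memory matrix M_t

# Address weighting w_t: non-negative and summing to 1 (here: a uniform blur).
w = np.full(N, 1.0 / N)

# Read vector r_t = sum_i w_t(i) M_t(i), i.e. a convex combination of rows.
r = w @ memory
```

With a uniform weighting the read vector is simply the mean of all memory rows; a one-hot weighting would read out a single location exactly.<br />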
<br />
=== Writing ===<br />
<br />
The write process is split into an erase and an add operation (inspired by the input and forget gates in LSTM). This allows the NTM to either overwrite or add to a memory location in a single time step; otherwise it would be necessary to first read the location before the updated result could be written.<br />
<br />
The erase update is given by<br />
<br />
<math>\tilde{M}_t(i) \leftarrow M_{t-1}(i) [1 - w_t(i) e_t]</math><br />
<br />
with an <math>M</math>-element ''erase vector'' <math>e_t</math> with elements in the range <math>(0, 1)</math> selecting which vector elements to reset at the memory locations selected by <math>w_t</math>.<br />
<br />
Afterwards an ''add vector'' <math>a_t</math> is added according to<br />
<br />
<math>M_t(i) \leftarrow \tilde{M}_t(i) + w_t(i) a_t.</math><br />
<br />
The order in which the adds are performed by multiple heads is irrelevant. The combined erase and add operations of all the write heads produce the final content of the memory at time ''t''.<br />
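The erase-then-add update can be sketched in NumPy as follows (illustrative toy values, not the authors' code; with a one-hot address and a full-strength erase vector, a single location is completely rewritten):<br />

```python
import numpy as np

N, M = 8, 4
memory = np.random.randn(N, M)            # M_{t-1}
w = np.zeros(N); w[2] = 1.0               # sharply focused address w_t
e = np.ones(M)                            # erase vector e_t in (0,1)^M (here: full erase)
a = np.array([1.0, 2.0, 3.0, 4.0])        # add vector a_t

# Erase: M~_t(i) = M_{t-1}(i) * (1 - w_t(i) e_t)   (elementwise per row)
memory_tilde = memory * (1.0 - np.outer(w, e))
# Add:   M_t(i) = M~_t(i) + w_t(i) a_t
memory_new = memory_tilde + np.outer(w, a)
```

Here row 2 is fully erased and then set to <code>a</code>, while all other rows are untouched because their weighting is zero.<br />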
<br />
== Addressing Mechanisms ==<br />
<br />
Two methods, content-based addressing and location-based addressing, are employed to generate the read/write weightings <math>w_t</math>. Depending on the task either mechanism can be more appropriate.<br />
<br />
=== Content-based addressing ===<br />
<br />
For content-addressing, each head (whether employed for reading or writing) first produces a length <math>M</math> key vector <math>k_t</math> that is compared to each vector <math>M_t (i)</math> by a similarity measure <math>K[.,.]</math>. The content-based system produces a normalised weighting <math>w_t^c</math> based on the similarity and a positive key strength, <math>\beta_t</math>, which can amplify or attenuate the precision of the focus:<br />
<br />
<math><br />
w_t^c(i) \leftarrow \frac{\exp(\beta_t K[k_t,M_t(i)])}{\sum_{j} \exp(\beta_t K[k_t,M_t(j)])}<br />
</math><br />
<br />
In this current implementation, the similarity measure is cosine similarity:<br />
<br />
<math><br />
K[u,v] = \frac{u \cdot v}{\|u\| \, \|v\|}<br />
</math><br />
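A minimal sketch of content-based addressing, assuming cosine similarity and a numerically stable softmax (names are illustrative, not from the paper):<br />

```python
import numpy as np

def content_weighting(key, memory, beta):
    """Softmax over cosine similarities, sharpened by key strength beta."""
    eps = 1e-8
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps)
    logits = beta * sims
    logits -= logits.max()          # numerical stability before exponentiating
    w = np.exp(logits)
    return w / w.sum()

memory = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w = content_weighting(np.array([1.0, 0.0]), memory, beta=10.0)
```

A larger <code>beta</code> concentrates the weighting on the best-matching row; a smaller one spreads it out.<br />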
<br />
=== Location-based addressing ===<br />
<br />
The location-based addressing mechanism is designed to facilitate both simple iterations across the locations of the memory and random-access jumps. It does so by implementing a rotational shift of a weighting. Prior to rotation, each head emits a scalar interpolation gate <math>g_t</math> in the range (0, 1). The value of <math>g</math> is used to blend between the weighting <math>w_{t-1}</math> produced by the head at the previous time-step and the weighting <math>w_t^c</math> produced by the content system at the current time-step, yielding the gated weighting <math>w_t^g</math> :<br />
<br />
<math><br />
w_t^g \leftarrow g_t w_t^c + (1-g_t) w_{t-1}<br />
</math><br />
<br />
After interpolation, each head emits a shift weighting <math>s_t</math> that defines a normalised distribution over the allowed integer shifts. Each element in this vector gives the degree to which the corresponding integer shift is performed. For example, if shifts of -1, 0, 1 are allowed, a (0, 0.3, 0.7) shift vector would denote a shift of 1 with strength 0.7 and a shift of 0 (no shift) with strength 0.3. The actual shift is performed with a circular convolution<br />
<br />
<math>\tilde{w}_t(i) \leftarrow \sum_{j=0}^{N-1} w_t^g(j) s_t(i - j)</math><br />
<br />
where all index arithmetic is modulo <math>N</math>. This circular convolution can lead to blurring of the weights, so <math>\tilde{w}_t</math> is sharpened with<br />
<br />
<math>w_t(i) \leftarrow \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j<br />
\tilde{w}_t(j)^{\gamma_t}}</math><br />
<br />
where <math>\gamma_t \geq 1</math> is an additional sharpening scalar emitted by the head.<br />
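The full location-based pipeline (interpolation, circular convolution, sharpening) can be sketched as follows. One assumption to note: the shift distribution <code>s</code> here is defined over all <math>N</math> modular shifts, whereas the paper typically restricts it to a small set such as {-1, 0, 1}:<br />

```python
import numpy as np

def location_weighting(w_prev, w_content, g, s, gamma):
    # Interpolation: w^g = g * w^c + (1 - g) * w_{t-1}
    w_g = g * w_content + (1.0 - g) * w_prev
    # Circular convolution: w~(i) = sum_j w^g(j) * s((i - j) mod N)
    N = len(w_g)
    w_tilde = np.array([sum(w_g[j] * s[(i - j) % N] for j in range(N))
                        for i in range(N)])
    # Sharpening with gamma >= 1, then renormalising
    w_sharp = w_tilde ** gamma
    return w_sharp / w_sharp.sum()

w_prev = np.array([1.0, 0.0, 0.0, 0.0])
w_content = np.array([0.0, 1.0, 0.0, 0.0])
s = np.zeros(4); s[1] = 1.0   # all mass on a shift of +1 position
w = location_weighting(w_prev, w_content, g=1.0, s=s, gamma=2.0)
```

With <code>g = 1</code> the content weighting is taken as-is, the one-hot shift moves its focus one slot forward, and sharpening leaves a one-hot weighting unchanged.<br />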
<br />
== Controller ==<br />
<br />
The controller receives the external input and read head output and produces the addressing vectors and related values (for example shift weighting) for the read and write heads. It also produces an external output.<br />
<br />
Different types of controllers can be used. The paper discusses feed-forward and LSTM controllers. Feed-forward controllers are simpler, but more limited than LSTM controllers, since the types of operations they can perform are constrained by the number of concurrent read and write heads. The LSTM controller, given its internal register-like memory, does not suffer from this limitation.<br />
<br />
= Results =<br />
<br />
The authors tested the NTM with a feed-forward and an LSTM controller against a pure LSTM on multiple tasks:<br />
<br />
* Copy Task: An input sequence has to be reproduced.<br />
* Repeat Copy Task: An input sequence has to be reproduced multiple times.<br />
* Associative Recall: After providing an input sequence the network is queried with one item of the sequence and has to produce the next.<br />
* Dynamic N-Grams: Predict the probability of the next bit being 0 or 1 given the last six bits.<br />
* Priority Sort: Sort an input sequence according to given priorities.<br />
<br />
<br />
[[File:copy_convergence.png|frame|center|Copy Task Learning Curve]]<br />
[[File:repeat_copy_convergence.png|frame|center|Repeat Copy Task Learning Curve]]<br />
[[File:recall_convergence.png|frame|center|Associative Recall Learning Curve]]<br />
[[File:ngrams_convergence.png|frame|center|Dynamic N-Grams Learning Curve]]<br />
[[File:sort_convergence.png|frame|center|Priority Sort Learning Curve]]<br />
<br />
[[File:ntm_feedforward_settings.png|frame|center|NTM with Feedforward Controller Experimental Settings]]<br />
[[File:ntm_ltsm_settings.png|frame|center|NTM with LSTM Controller Experimental Settings]]<br />
[[File:ltsm_settings.png|frame|center|LSTM Controller Experimental Settings]]<br />
<br />
In all tasks the NTM, with either a feedforward or an LSTM controller, converges faster and generalizes better than a pure LSTM.<br />
<br />
<br />
= Discussion =<br />
* While the experimental results show great promise for the NTM architecture, the paper would benefit from a more in-depth discussion of why the NTM performs so well with either a feedforward or an LSTM controller compared to a pure LSTM.<br />
<br />
* The convergence performance difference between choosing a feedforward versus an LSTM controller for the NTM appears to hinge on whether the task requires the LSTM's internal memory or the NTM's external memory as an effective way to store data. Otherwise the two controllers perform comparably.<br />
<br />
* One can be a bit skeptical about the effort spent tuning the LSTM baseline: the paper gives the feeling that the authors spent a lot of time tuning the NTM with different numbers of heads and controller sizes in order to achieve the desired results for publication.<br />
<br />
* Interested in knowing quantitatively how the NTM would compare against other algorithms, such as [https://en.wikipedia.org/wiki/Genetic_programming Genetic Programming] used to evolve Turing machines <ref>Naidoo, Amashini, and Nelishia Pillay. "Using genetic programming for turing machine induction." Genetic Programming. Springer Berlin Heidelberg, 2008. 350-361.</ref>, whose output is a "program": in theory, because it does not use weights, the program should be more robust and require far fewer parameters.</div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_the_difficulty_of_training_recurrent_neural_networks&diff=27055on the difficulty of training recurrent neural networks2015-12-03T18:13:08Z<p>Trttse: /* From a geometric perspective */</p>
<hr />
<div>= Introduction =<br />
<br />
Training Recurrent Neural Networks (RNNs) is difficult; one of the most prominent problems in training RNNs has been the vanishing and exploding gradient problem <ref name="yoshua1993">Yoshua Bengio, Paolo Frasconi, and Patrice Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pages 1183–1188. IEEE, 1993.</ref>, which prevents neural networks from learning and fitting the data. In this paper the authors propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradient problem.<br />
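The clipping rule itself is simple: whenever the gradient norm exceeds a threshold, rescale the gradient down to that threshold. A minimal sketch (the threshold value is an arbitrary illustration, not a recommendation from the paper):<br />

```python
import numpy as np

def clip_gradient_norm(grad, threshold):
    """Norm-clipping rule: if ||g|| > threshold, rescale g to
    g * threshold / ||g||; otherwise leave it unchanged."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([3.0, 4.0])                  # ||g|| = 5
clipped = clip_gradient_norm(g, threshold=1.0)
```

The clipped gradient keeps its direction but has its norm capped, which is what prevents a single exploding-gradient step from derailing training.<br />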
<br />
= Background =<br />
<br />
[[Image:rnn_2.png|frame|center|400px|alt=| Recurrent Neural Network Unrolled in time <ref name="pascanu">Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.</ref>]]<br />
A generic recurrent neural network has the form:<br />
<br />
<math>\mathbf{x}_{t} = F(\mathbf{x}_{t-1}, \mathbf{u}_{t}, \theta)</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{x}_{t}</math> is the state at time <math>t</math></span><br />
* <span><math>\mathbf{u}_{t}</math> is the input at time <math>t</math></span><br />
* <span><math>\theta\,</math> are the parameters</span><br />
* <span><math>F()\,</math> is the function that represents a neuron</span><br />
<br />
In the theoretical sections the authors make use of a specific parametrization:<br />
<br />
<math>\mathbf{x}_{t} = \mathbf{W}_{rec} \sigma(\mathbf{x}_{t - 1}) + \mathbf{W}_{in} \mathbf{u}_{t} + b</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{W}_{rec}</math> is the RNN weights matrix</span><br />
* <span><math>\sigma()\,</math> is an element wise function</span><br />
* <span><math>b\,</math> is the bias</span><br />
* <span><math>\mathbf{W}_{in}</math> is the input weights matrix</span><br />
<br />
The following are the gradient equations for the Back-Propagation Through Time (BPTT) algorithm; the authors rewrote the equations to highlight the exploding gradient problem:<br />
<br />
<math>\frac{\partial \varepsilon}{\partial \theta} = <br />
\sum_{1 \leq t \leq T} \frac{\partial \varepsilon_t}{\partial \theta}</math><br />
<br />
<math>\frac{\partial \varepsilon_{t}}{\partial \theta} = <br />
\sum_{1 \leq k \leq t} <br />
\left(<br />
\frac{\partial \varepsilon_{t}}{\partial x_{t}}<br />
\frac{\partial x_{t}}{\partial x_{k}}<br />
\frac{\partial^{+} x_{k}}{\partial \theta}<br />
\right)</math><br />
<br />
<math>\frac{\partial x_{t}}{\partial x_{k}} =<br />
\prod_{k < i \leq t} \frac{\partial x_{i}}{\partial x_{i - 1}} =<br />
\prod_{k < i \leq t} <br />
\mathbf{W}^{T}_{rec} \textit{diag}(\sigma^{\prime}(x_{i - 1}))</math><br />
<br />
Where:<br />
<br />
* <span><math>\varepsilon_{t}</math> is the error obtained from output at time <math>t</math></span><br />
* <span><math>\frac{\partial^{+} \mathbf{x}_{k}}{\partial \theta}</math> is the immediate partial derivative of state <math>\mathbf{x}_{k}</math></span>. For the parametrization above, <math>\frac{\partial^+ \mathbf{x}_k}{\partial \mathbf{W}_{rec}} = \sigma(\mathbf{x}_{k-1})</math>.<br />
<br />
The authors of this paper also distinguish between ''long-term'' and ''short-term'' contributions to the gradient with respect to <math>\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k}</math>. The contribution is ''long-term'' if <math>k \ll t</math>, and ''short-term'' otherwise.<br />
<br />
== Exploding and Vanishing Gradients ==<br />
=== The Mechanics ===<br />
<br />
It is known that <math>|\sigma'(x)|</math> is bounded (e.g., <math>\gamma = 1</math> for <math>\tanh</math> and <math>\gamma = 1/4</math> for the sigmoid). Let <math>\left|\left|diag(\sigma'(x_k))\right|\right| \leq \gamma \in R</math>.<br />
<br />
The paper first proves that it is sufficient for <math> \lambda_1 < \frac{1}{\gamma} </math>, where <math> \lambda_1 </math> is the largest singular value of <math> \mathbf{W}_{rec} </math>, for the vanishing gradient problem to occur. The Jacobian matrix <math> \frac{\partial x_{k+1}}{\partial x_k} </math> is given by <math> \mathbf{W}_{rec}^{T}diag(\sigma'(x_k)) </math>. Then, the 2-norm of this Jacobian is bounded by the product of the norms of the two matrices. This leads to <math> \forall k, ||\frac{\partial{x_{k+1}}}{\partial x_k}|| \leq ||\mathbf{W}_{rec}^T|| \, ||diag(\sigma'(x_k))|| < \frac{1}{\gamma}\gamma = 1</math><br />
<br />
Let <math>\eta \in R</math> be such that <math>\forall k, ||\frac{\partial {x_{k+1}}}{\partial x_k}|| \leq \eta < 1</math>. By induction over <math>i</math>, we can show that <math>||\frac{\partial \varepsilon_t}{\partial x_t}(\prod_{i=k}^{t-1}{\frac{\partial x_{i+1}}{\partial x_i}})|| \leq \eta^{t-k}||\frac{\partial \varepsilon_t}{\partial x_t}||</math>. Since <math> \eta < 1 </math>, as <math> t-k </math> goes larger, the gradient goes to 0.<br />
<br />
By inverting this proof, it also follows that <math>\lambda_1 > \frac{1}{\gamma}</math> is a necessary condition for exploding gradients (otherwise the long-term components would vanish instead of exploding).<br />
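A quick numerical check of the sufficient condition (toy values, not from the paper; <math>\gamma = 1</math> for tanh): rescaling <math>\mathbf{W}_{rec}</math> so its largest singular value is below 1 forces the product of Jacobians to shrink geometrically:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
W = rng.standard_normal((n, n))
# Rescale so the largest singular value is 0.5 < 1/gamma (gamma = 1 for tanh).
W_small = 0.5 * W / np.linalg.svd(W, compute_uv=False)[0]

def jacobian_product_norm(W_rec, x0, steps):
    """2-norm of prod_i W_rec^T diag(sigma'(x_{i-1})) along a tanh trajectory."""
    x, J = x0, np.eye(len(x0))
    for _ in range(steps):
        J = (W_rec.T * (1.0 - np.tanh(x) ** 2)) @ J   # W^T @ diag(1 - tanh(x)^2)
        x = W_rec @ np.tanh(x)
    return np.linalg.norm(J, 2)

vanish = jacobian_product_norm(W_small, rng.standard_normal(n), steps=30)
```

Each factor has norm at most <math>0.5</math>, so after 30 steps the Jacobian product norm is at most <math>0.5^{30} \approx 10^{-9}</math>: the long-term gradient contribution has effectively vanished.<br />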
<br />
=== From a dynamical systems perspective ===<br />
<br />
Drawing from a dynamical systems perspective similar to <ref name="yoshua1993"></ref><ref name="doya1993">Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on neural networks, 1:75–80, 1993.</ref>, dynamical systems theory states that as “<math>\theta</math> changes the asymptotic behaviour changes smoothly almost everywhere except for certain crucial points where drastic changes occur” <ref name="pascanu"></ref>; crossing these bifurcation points has the potential to cause gradients to explode <ref name="doya1993"></ref>.<br />
<br />
The authors of this paper argue, however, that crossing these bifurcation points does not guarantee a sudden change in gradients. Their idea is that a change to the model parameters can alter the attractor landscape in such a way that the basin of attraction corresponding to the current model state is unaltered. For example, a change to the model parameters might eliminate a basin of attraction in a portion of the model's state space that is very far from its current state. In this case, the bifurcation will have no effect on the asymptotic behaviour of the model, and there will accordingly be no gradient explosion. On the other hand, if a change to the model parameters substantially alters the final basin of attraction given the current state, then there will be a considerable effect on the asymptotic behaviour of the model, and the gradients will accordingly explode. <br />
<br />
[[Image:dynamic_perspective.png|frame|center|400px|alt=|Bifurcation diagram of Single Hidden Unit RNN <ref name="pascanu"></ref>]]<br />
<br />
The figure above depicts a bifurcation diagram for a single-unit RNN, where the x-axis is the parameter <math>b</math> (bias), the y-axis is the asymptotic state <math>x_{\infty}</math> (i.e. the equilibrium activation value of the unit), and the plot line is the movement of the final point attractor <math>x_{\infty}</math> as <math>b</math> changes. What this figure represents is the presence of two attractors, one emerging from <math>b_1</math> and another disappearing at <math>b_2</math>, as the value of <math>b</math> is decreased. Note that only one attractor exists when the value of <math>b</math> is outside of the interval between <math>b_1</math> and <math>b_2</math>, and that when two attractors exist, the attractor state towards which the unit ultimately gravitates is determined by its initial starting state. The boundary between these two basins of attraction is denoted with the dashed line; starting states on opposite sides of this boundary will gravitate towards different attractor states. The blue filled circles indicate a bifurcation point at which a small change to the value of <math>b</math> can have a drastic effect on the attractor landscape over the unit's state space. In short, the landscape shifts to include a single attractor state for a low value of <math>x</math>. The unfilled green circles represent Pascanu's (2013) extension of Doya's hypothesis: if the model is in the boundary region at time <math>0</math>, a small change in <math>b</math> can result in a sudden large change in <math>x_{t}</math>.<br />
<br />
Overall, these remarks indicate that, when treated as dynamical system, the behaviour of a RNN can be analyzed with respect to both changes to the parameter values that determine an attractor landscape over its state space (assuming a fixed starting state), and with respect to changes to the starting state (assuming a fixed attractor landscape). <br />
<br />
=== From a geometric perspective ===<br />
<br />
Aside from the dynamical systems perspective on exploding and vanishing gradients, the authors also consider a geometric perspective, using a simple one-hidden-unit RNN.<br />
<br />
[[Image:geometric_perspective.png|frame|center|400px|alt=|Error Loss surface of single hidden unit RNN <ref name="pascanu"></ref>]]<br />
<br />
Reusing the RNN state equation the authors defined:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
By assuming no input, with <math>b = 0</math> and initial state <math>x_{0}</math>, the equation simplifies to:<br />
<br />
<math>x_{t} = W_{rec}^{t} x_{0}</math><br />
<br />
Since the network has a single hidden unit, <math>W_{rec}</math> is a scalar; differentiating the above equation to first and second order gives:<br />
<br />
<math>\frac{\partial x_{t}}{\partial W_{rec}} = t W_{rec}^{t - 1} x_{0}</math><br />
<br />
<math>\frac{\partial^{2} x_{t}}{\partial W_{rec}^{2}} = t (t - 1) W_{rec}^{t - 2} x_{0}</math><br />
<br />
Which implies if the first order derivative explodes so will the second derivative. In the general case, when they gradients explode they do so along some directions '''v'''. If this bound is tight, it is hypothesized that ''when gradients explode so does the curvature along'' '''v''''', leading to a wall in the error surface'', like the one seen above. If both the gradient and the leading eigenvector of the curvature are aligned with the exploding direction '''v''', it follows that the error surface has a steep wall perpendicular to '''v''' (and consequently to the gradient). This means when the Stochastic Gradient decent (SGD) approaches the loss error surface and attempts to step into it, it will be deflected away, possibly hindering the learning process. (See figure above). Note that this solution assumes that the valley bordered by a steep cliff in the value of the loss function is wide enough with respect the clip being applied to the gradient - otherwise, the deflection caused by an update of SGD would still hinder the learning process, even when clipping is used. The practical effectiveness of clipping provides some evidence in support of this assumption.<br />
<br />
= Dealing with Exploding and Vanishing Gradient =<br />
<br />
== Related Work ==<br />
<br />
* <span>'''L1/L2 Regularization''': Helps with exploding gradients, but limits the RNN to a single point attractor at the origin, and prevents the model to learn generator models or exhibit long term memory traces.</span><br />
* <span>'''Teacher Forcing''': Proposed by <ref name="doya1993"></ref>, performs a cross bifurcation boundary if the model does not exhibit asymptotic behavior towards a desired target. This assumes the user knows what the behaviour might look like or how to intialize the model to reduce exploding gradients.</span><br />
* <span>'''LTSM''': The Long-Short Term Memory architecture by <ref name="graves2009">Alex Graves, Marcus Liwicki, Santiago Fern ́andez, Roman Bertolami, Horst Bunke, and Jurgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009</ref><ref name="Hochreiter">Sepp Hochreiter and Jurgen Schmidhuber. 9(8):1735–1780, 1997. Long short-term memory.Neural computation,</ref> is an attempt to deal with the vanishing gradient problem by using a linear unit that feedbacks to itself with a weight of <math>1</math>. This solution however does not deal with the exploding gradient</span><br />
* <span>'''Hessian-Free optimizer with structural damping''': Proposed by <ref>Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011</ref> addresses the vanishing and exploding gradient problem. <ref name="pascanu"></ref> reasons that this approach solves the vanishing gradient problem because of the high dimensionality of the spaces gives rise to a high probability for the long term components to be orthogonal to short term components. Additionally for exploding gradient the curvature of the gradient is taken into account.</span><br />
* <span>'''Echo State Networks''': <ref>Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.</ref> avoids the exploding and vanishing gradient problem by not learning the input and recurrent weights, they are instead hand crafted distributions that prevent information from getting loss, since a spectral radius for the recurrent weights matrix is usually smaller than 1.</span><br />
<br />
== Scaling Down the Gradients ==<br />
<br />
[[Image:gradient_clipping.png|frame|center|400px]]<br />
<br />
The intuition of this gradient clipping algorithm is simple, obtain the norm of the gradients, if it is larger than the set threshold then scale the gradients by a constant defined as the treshold divided by the norm of gradients. <ref name="pascanu"></ref> suggests using a threshold value from half to ten times the norm.<br />
<br />
== Vanishing Gradient Regularization ==<br />
<br />
The vanishing gradient regularizer is as follows:<br />
<br />
<math>\Omega <br />
= \sum_{k} \Omega_{k} <br />
= \sum_{k} <br />
\left( <br />
\frac{<br />
\| <br />
\frac{\delta \varepsilon}{\delta x_{k + 1}} <br />
\frac{\delta x_{k + 1}}{\delta x_{k}}<br />
\|<br />
}<br />
{<br />
\|<br />
\frac{\delta \varepsilon}{\delta x_{k + 1}}<br />
\| <br />
} - 1<br />
\right)^{2}</math><br />
<br />
The vanishing gradient problem occurs when at time <math>t</math> the inputs <math>u</math> may be irrelevant and noisy and the network starts to learn to ignore them. However this is not desirable as the model will end up not learning anything. The authors found that the sensitivity to all inputs <math>u_{t} \dots u_{k}</math> could be increased by increasing the norm of <math>\frac{\delta x_t}{\delta x_{t}}</math>. This imples that in order to increses the <math>\frac{\delta x_t}{\delta x_{t}}</math> norm the error must remain large, this however would prevent the model from converging, thus the authors argue a regularizer is a more natural choice. The regularizer is a soft constraint that forces the Jacobian matrices <math>\frac{\delta x_{k + 1}}{\delta x_{k}}</math> to preserve norm in the direction of the error <math>\frac{\delta \varepsilon}{\delta x_{k + 1}}</math>.<br />
<br />
= Experimental Results =<br />
<br />
== The Temporal Order Problem ==<br />
<br />
The authors repeated the temporal order problem as the prototypical pathological problem for validating the cliping and regularizer devised. The temporal order problem involves generating a long sequence of discrete symbols, and at the beginning an <math>A</math> or a <math>B</math> symbol is placed at the beginning and middle of the sequence. The task is to correctly classify the order of <math>A</math> and <math>B</math> at the end of the sequence.<br />
<br />
Three different RNN intializations were performed for the experiment:<br />
<br />
* <span>'''sigmoid unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
* <span>'''basic tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.1)</math></span><br />
* <span>'''smart tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
<br />
Of the three RNN networks three different optimizer configurations were used:<br />
<br />
* <span>'''MSGD''': Mini-batch Stochastic Gradient Decent</span><br />
* <span>'''MSGD-C''': MSGD with Gradient Clipping</span><br />
* <span>'''MSGD-CR''': MSGD-C with Regularization</span><br />
<br />
Additional model parameters include:<br />
<br />
* <span>Spectral Radius of 0.95</span><br />
* <span><math>b = b_{out} = 0</math></span><br />
* <span>50 hidden neurons</span><br />
* <span>constant learning rate of 0.01</span><br />
* <span>clipping threshold of 1 (only for MSGD-C and MSGD-CR)</span><br />
* <span>regularization weight of 4 (MSGD-CR)</span><br />
<br />
The experiment was performed 5 times, from the figure below we can observe the importance of gradient cliping and the regularizer, in all cases the combination of the two methods yielded the best results regardless of which unit network was used. Furthermore this experiment provided empirical evidence that exploding graidents correlates to tasks that require long memory traces, as can be seen as the sequence length of the problem increases clipping and regularization becomes more important.<br />
<br />
[[Image:experimental_results.png|frame|center|400px]]<br />
<br />
== Other Pathological Problems ==<br />
<br />
The author repeated other pathological problems from <ref name="Hochreiter"></ref>. The results are listed below:<br />
<br />
[[Image:experimental_results_2.png|image]]<br />
<br />
The authors did not discuss the experimental results in detail here.<br />
<br />
= Summary =<br />
<br />
The paper explores two different perspectives in explaining the exploding and vanishing gradient problems in training RNNs via the dynamical systems and geometric approach. The authors devised methods to mitigate the corresponding problems by introducing a gradient clipping and a gradient vanishing regularizer, their experimental results show that in all cases except for the Penn Treebank dataset, that cliping and regularizer has bested the state of the art for RNN in their respective experiment performances.</div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_the_difficulty_of_training_recurrent_neural_networks&diff=27054on the difficulty of training recurrent neural networks2015-12-03T18:12:46Z<p>Trttse: /* From a geometric perspective */</p>
<hr />
<div>= Introduction =<br />
<br />
Training Recurrent Neural Networks (RNNs) is difficult; one of the most prominent problems in training RNNs has been the vanishing and exploding gradient problem <ref name="yoshua1993">Yoshua Bengio, Paolo Frasconi, and Patrice Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993, IEEE International Conference on, pages 1183–1188. IEEE, 1993.</ref>, which prevents neural networks from learning and fitting the data. In this paper the authors propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradient problem.<br />
<br />
= Background =<br />
<br />
[[Image:rnn_2.png|frame|center|400px|alt=| Recurrent Neural Network unrolled in time <ref name="pascanu">Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.</ref>]]<br />
A generic recurrent neural network has the form:<br />
<br />
<math>\mathbf{x}_{t} = F(\mathbf{x}_{t -1}, \mathbf{u}_{t}, \theta)</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{x}_{t}</math> is the state at time <math>t</math></span><br />
* <span><math>\mathbf{u}_{t}</math> is the input at time <math>t</math></span><br />
* <span><math>\theta\,</math> are the parameters</span><br />
* <span><math>F()\,</math> is the function that represents a neuron</span><br />
<br />
In the theoretical sections, the authors make use of the specific parameterization:<br />
<br />
<math>\mathbf{x}_{t} = \mathbf{W}_{rec} \sigma(\mathbf{x}_{t - 1}) + \mathbf{W}_{in} \mathbf{u}_{t} + b</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{W}_{rec}</math> is the RNN weights matrix</span><br />
* <span><math>\sigma()\,</math> is an element-wise activation function (e.g. sigmoid or tanh)</span><br />
* <span><math>b\,</math> is the bias</span><br />
* <span><math>\mathbf{W}_{in}</math> is the input weights matrix</span><br />
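<br />
This parameterization can be sketched in a few lines of NumPy (a minimal illustration; tanh is one common choice for <math>\sigma</math>, and the function name is ours):<br />

```python
import numpy as np

def rnn_step(x_prev, u_t, W_rec, W_in, b):
    # One step of x_t = W_rec * sigma(x_{t-1}) + W_in * u_t + b, with sigma = tanh
    return W_rec @ np.tanh(x_prev) + W_in @ u_t + b
```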
<br />
The following are the gradient equations for the Back-Propagation Through Time (BPTT) algorithm; the authors rewrote the equations in order to highlight the exploding gradients problem:<br />
<br />
<math>\frac{\partial \varepsilon}{\partial \theta} = <br />
\sum_{1 \leq t \leq T} \frac{\partial \varepsilon_t}{\partial \theta}</math><br />
<br />
<math>\frac{\partial \varepsilon_{t}}{\partial \theta} = <br />
\sum_{1 \leq k \leq t} <br />
\left(<br />
\frac{\partial \varepsilon_{t}}{\partial x_{t}}<br />
\frac{\partial x_{t}}{\partial x_{k}}<br />
\frac{\partial^{+} x_{k}}{\partial \theta}<br />
\right)</math><br />
<br />
<math>\frac{\partial x_{t}}{\partial x_{k}} =<br />
\prod_{k < i \leq t} \frac{\partial x_{i}}{\partial x_{i - 1}} =<br />
\prod_{k < i \leq t} <br />
\mathbf{W}^{T}_{rec} \textit{diag}(\sigma^{\prime}(x_{i - 1}))</math><br />
<br />
Where:<br />
<br />
* <span><math>\varepsilon_{t}</math> is the error obtained from output at time <math>t</math></span><br />
* <span><math>\frac{\partial^{+} \mathbf{x}_{k}}{\partial \theta}</math> is the immediate partial derivative of state <math>\mathbf{x}_{k}</math></span>. For the parametrization above, <math>\frac{\partial^+ \mathbf{x}_k}{\partial \mathbf{W}_{rec}} = \sigma(\mathbf{x}_{k-1})</math>.<br />
<br />
The authors of this paper also distinguish between ''long-term'' and ''short-term'' contributions to the gradient with respect to <math>\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k}</math>. The contribution is ''long-term'' if <math>k \ll t</math>, and ''short-term'' otherwise.<br />
<br />
== Exploding and Vanishing Gradients ==<br />
=== The Mechanics ===<br />
<br />
It is known that <math> |\sigma'(x)| </math> is bounded. Let <math>\left\| \textit{diag}(\sigma'(x_k)) \right\| \leq \gamma, \gamma \in \mathbb{R}</math>.<br />
<br />
The paper first proves that <math> \lambda_1 < \frac{1}{\gamma} </math>, where <math> \lambda_1 </math> is the largest singular value of <math> \mathbf{W}_{rec} </math>, is a sufficient condition for the vanishing gradient problem to occur. The Jacobian matrix <math> \frac{\partial x_{k+1}}{\partial x_k} </math> is given by <math> \mathbf{W}_{rec}^{T} \textit{diag}(\sigma'(x_k)) </math>, and the 2-norm of this Jacobian is bounded by the product of the norms of the two matrices. This leads to <math> \forall k, \left\| \frac{\partial{x_{k+1}}}{\partial x_k} \right\| \leq \left\| \mathbf{W}_{rec}^T \right\| \left\| \textit{diag}(\sigma'(x_k)) \right\| < \frac{1}{\gamma} \gamma = 1</math>.<br />
<br />
Let <math>\eta \in \mathbb{R}</math> be such that <math>\forall k, \left\| \frac{\partial {x_{k+1}}}{\partial x_k} \right\| \leq \eta < 1</math>. By induction over <math>i</math>, we can show that <math>\left\| \frac{\partial \varepsilon_t}{\partial x_t} \left( \prod_{i=k}^{t-1}{\frac{\partial x_{i+1}}{\partial x_i}} \right) \right\| \leq \eta^{t-k} \left\| \frac{\partial \varepsilon_t}{\partial x_t} \right\|</math>. Since <math> \eta < 1 </math>, this bound decays exponentially as <math> t-k </math> grows, so the long-term contributions to the gradient go to 0.<br />
<br />
Reversing this argument shows that a largest singular value <math>\lambda_1</math> larger than <math> \frac{1}{\gamma}</math> is ''necessary'' for gradients to explode (otherwise the long-term components would vanish instead of exploding).<br />
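<br />
This bound is easy to check numerically (a sketch of ours, not from the paper): for tanh units <math>\gamma = 1</math>, and for a small random <math>W_{rec}</math> with <math>\lambda_1 < \frac{1}{\gamma}</math>, the norm of the accumulated product of Jacobians decays towards zero.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
W_rec = rng.normal(0, 0.01, (n, n))   # small weights, so the largest singular value is << 1
gamma = 1.0                           # |tanh'(x)| <= 1 for all x

lambda_1 = np.linalg.svd(W_rec, compute_uv=False)[0]  # largest singular value

# Accumulate the product of Jacobians dx_t/dx_k over 50 steps (no input, b = 0)
x = rng.normal(0, 1, n)
prod = np.eye(n)
for _ in range(50):
    J = W_rec.T @ np.diag(1 - np.tanh(x) ** 2)  # Jacobian of one tanh step
    prod = J @ prod
    x = W_rec @ np.tanh(x)
```

Here <math>\lambda_1 < \frac{1}{\gamma}</math> holds, and the 2-norm of the product is vanishingly small after 50 steps.<br />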
<br />
=== From a dynamical systems perspective ===<br />
<br />
Drawing from a dynamical systems perspective similar to <ref name="yoshua1993"></ref><ref name="doya1993">Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on neural networks, 1:75–80, 1993.</ref>, dynamical systems theory states that as “<math>\theta</math> changes the asymptotic behaviour changes smoothly almost everywhere except for certain crucial points where drastic changes occur” <ref name="pascanu"></ref>; this is because crossing these bifurcation boundaries has the potential to cause gradients to explode <ref name="doya1993"></ref>.<br />
<br />
The authors of this paper argue, however, that crossing these bifurcation points does not guarantee a sudden change in gradients. Their idea is that a change to the model parameters can alter the attractor landscape in such a way that the basin of attraction corresponding to the current model state is unaltered. For example, a change to the model parameters might eliminate a basin of attraction in a portion of the model's state space that is very far from its current state. In this case, the bifurcation will have no effect on the asymptotic behaviour of the model, and there will accordingly be no gradient explosion. On the other hand, if a change to the model parameters substantially alters the final basin of attraction given the current state, then there will be a considerable effect on the asymptotic behaviour of the model, and the gradients will accordingly explode. <br />
<br />
[[Image:dynamic_perspective.png|frame|center|400px|alt=|Bifurcation diagram of Single Hidden Unit RNN <ref name="pascanu"></ref>]]<br />
<br />
The figure above depicts a bifurcation diagram for a single-unit RNN, where the x-axis is the parameter <math>b</math> (bias), the y-axis is the asymptotic state <math>x_{\infty}</math> (i.e. the equilibrium activation value of the unit), and the plot line traces the movement of the final point attractor <math>x_{\infty}</math> as <math>b</math> changes. The figure shows the presence of two attractors, one emerging at <math>b_1</math> and another disappearing at <math>b_2</math>, as the value of <math>b</math> is decreased. Note that only one attractor exists when the value of <math>b</math> is outside of the interval between <math>b_1</math> and <math>b_2</math>, and that when two attractors exist, the attractor state towards which the unit ultimately gravitates is determined by its initial starting state. The boundary between these two basins of attraction is denoted with the dashed line; starting states on opposite sides of this boundary will gravitate towards different attractor states. The blue filled circles indicate a bifurcation point at which a small change to the value of <math>b</math> can have a drastic effect on the attractor landscape over the unit's state space. In short, the landscape shifts to include a single attractor state for a low value of <math>x</math>. The unfilled green circles represent Pascanu's (2013) extension of Doya's hypothesis: if the model's state is in the boundary region at time <math>0</math>, a small change in <math>b</math> results in a sudden large change in <math>x_{t}</math>.<br />
<br />
Overall, these remarks indicate that, when treated as dynamical system, the behaviour of a RNN can be analyzed with respect to both changes to the parameter values that determine an attractor landscape over its state space (assuming a fixed starting state), and with respect to changes to the starting state (assuming a fixed attractor landscape). <br />
<br />
=== From a geometric perspective ===<br />
<br />
Aside from the dynamical systems perspective on exploding and vanishing gradients, the authors also considered a geometric perspective, using a simple RNN with one hidden unit.<br />
<br />
[[Image:geometric_perspective.png|frame|center|400px|alt=|Error Loss surface of single hidden unit RNN <ref name="pascanu"></ref>]]<br />
<br />
Reusing the RNN state equation, the authors defined:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
For a single hidden unit, <math>W_{rec}</math> reduces to a scalar weight <math>\omega</math>. By assuming no input, a linear unit (<math>\sigma</math> the identity), <math>b = 0</math>, and initial state <math>x_{0}</math>, the equation simplifies to:<br />
<br />
<math>x_{t} = \omega^{t} x_{0}</math><br />
<br />
Differentiating the above equation to the first and second order with respect to <math>\omega</math> gives:<br />
<br />
<math>\frac{\partial x_{t}}{\partial \omega} = t \omega^{t - 1} x_{0}</math><br />
<br />
<math>\frac{\partial^{2} x_{t}}{\partial \omega^{2}} = t (t - 1) \omega^{t - 2} x_{0}</math><br />
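<br />
These derivatives are easy to verify numerically (an illustrative sketch; the values chosen here are ours):<br />

```python
# For x_t = w**t * x0, check dx_t/dw = t * w**(t-1) * x0 against a central
# finite difference, and note how both derivatives explode for w > 1.
w, x0, t = 1.2, 0.5, 50
d1 = t * w ** (t - 1) * x0            # first derivative
d2 = t * (t - 1) * w ** (t - 2) * x0  # second derivative

eps = 1e-6
fd = ((w + eps) ** t - (w - eps) ** t) * x0 / (2 * eps)  # finite-difference estimate
```

The analytic derivative and the finite-difference estimate agree closely, and both the first and second derivatives are already enormous at <math>t = 50</math>, illustrating that when the first derivative explodes, so does the second.<br />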
<br />
This implies that if the first-order derivative explodes, so does the second-order derivative. In the general case, when the gradients explode they do so along some direction '''v'''. If this bound is tight, it is hypothesized that ''when gradients explode so does the curvature along'' '''v''', ''leading to a wall in the error surface'', like the one seen above. If both the gradient and the leading eigenvector of the curvature are aligned with the exploding direction '''v''', it follows that the error surface has a steep wall perpendicular to '''v''' (and consequently to the gradient).<br />
<br />
This means that when stochastic gradient descent (SGD) reaches the wall of the error surface and attempts to step into it, it will be deflected away, possibly hindering the learning process (see figure above). Note that this explanation assumes that the valley bordered by the steep wall in the loss function is wide enough relative to the clipping threshold applied to the gradient; otherwise, the deflection caused by an SGD update would still hinder learning, even when clipping is used. The practical effectiveness of clipping provides some evidence in support of this assumption.<br />
<br />
= Dealing with Exploding and Vanishing Gradient =<br />
<br />
== Related Work ==<br />
<br />
* <span>'''L1/L2 Regularization''': Helps with exploding gradients, but limits the RNN to a single point attractor at the origin, and prevents the model from learning generator models or exhibiting long-term memory traces.</span><br />
* <span>'''Teacher Forcing''': Proposed by <ref name="doya1993"></ref>, this approach pushes the model across a bifurcation boundary if it does not exhibit asymptotic behaviour towards a desired target. It assumes the user knows what the target behaviour looks like, or how to initialize the model, in order to reduce exploding gradients.</span><br />
* <span>'''LSTM''': The Long Short-Term Memory architecture <ref name="graves2009">Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009.</ref><ref name="Hochreiter">Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.</ref> is an attempt to deal with the vanishing gradient problem by using a linear unit that feeds back to itself with a weight of <math>1</math>. This solution, however, does not deal with the exploding gradient problem.</span><br />
* <span>'''Hessian-Free optimizer with structural damping''': Proposed by <ref>Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.</ref>, this addresses both the vanishing and exploding gradient problems. <ref name="pascanu"></ref> reasons that this approach helps with the vanishing gradient problem because the high dimensionality of the space gives rise to a high probability that the long-term components are orthogonal to the short-term components. Additionally, for the exploding gradient problem, the curvature of the error surface is taken into account.</span><br />
* <span>'''Echo State Networks''': <ref>Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.</ref> avoid the exploding and vanishing gradient problem by not learning the input and recurrent weights; these are instead sampled from hand-crafted distributions that prevent information from getting lost, since the spectral radius of the recurrent weight matrix is usually kept smaller than 1.</span><br />
<br />
== Scaling Down the Gradients ==<br />
<br />
[[Image:gradient_clipping.png|frame|center|400px]]<br />
<br />
The intuition behind this gradient clipping algorithm is simple: obtain the norm of the gradients, and if it is larger than the set threshold, scale the gradients by a constant defined as the threshold divided by the norm of the gradients. <ref name="pascanu"></ref> suggests setting the threshold to between half and ten times the average gradient norm.<br />
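<br />
The clipping rule can be sketched as follows (a minimal NumPy version; the function name is ours):<br />

```python
import numpy as np

def clip_gradient(grad, threshold):
    # If the gradient norm exceeds the threshold, rescale the gradient so that
    # its norm equals the threshold; otherwise return it unchanged.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad
```

Note that clipping changes the magnitude of the update but preserves its direction.<br />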
<br />
== Vanishing Gradient Regularization ==<br />
<br />
The vanishing gradient regularizer is as follows:<br />
<br />
<math>\Omega <br />
= \sum_{k} \Omega_{k} <br />
= \sum_{k} <br />
\left( <br />
\frac{<br />
\left\| <br />
\frac{\partial \varepsilon}{\partial x_{k + 1}} <br />
\frac{\partial x_{k + 1}}{\partial x_{k}}<br />
\right\|<br />
}<br />
{<br />
\left\|<br />
\frac{\partial \varepsilon}{\partial x_{k + 1}}<br />
\right\| <br />
} - 1<br />
\right)^{2}</math><br />
<br />
The vanishing gradient problem occurs when, at time <math>t</math>, inputs from earlier steps appear irrelevant or noisy and the network starts to learn to ignore them. This is not desirable, as the model will then fail to learn long-term dependencies. The authors found that the sensitivity to all inputs <math>u_{t} \dots u_{k}</math> could be increased by increasing the norm of <math>\frac{\partial x_t}{\partial x_{k}}</math>. Enforcing this directly would require the error to remain large, which would prevent the model from converging; the authors therefore argue that a regularizer is a more natural choice. The regularizer is a soft constraint that forces the Jacobian matrices <math>\frac{\partial x_{k + 1}}{\partial x_{k}}</math> to preserve norm in the direction of the error <math>\frac{\partial \varepsilon}{\partial x_{k + 1}}</math>.<br />
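<br />
For a tanh RNN, one term <math>\Omega_k</math> of the regularizer can be computed as follows (a sketch under our conventions: the error gradient is treated as a row vector, and the function name is ours):<br />

```python
import numpy as np

def omega_k(d, W_rec, x_k):
    # d: gradient of the error with respect to x_{k+1} (row vector)
    # Jacobian of x_{k+1} = W_rec @ tanh(x_k) + ... with respect to x_k:
    J = W_rec @ np.diag(1 - np.tanh(x_k) ** 2)
    ratio = np.linalg.norm(d @ J) / np.linalg.norm(d)
    return (ratio - 1.0) ** 2
```

When the Jacobian preserves the norm of the error gradient (e.g. <math>W_{rec} = I</math> and <math>x_k = 0</math>, so the Jacobian is the identity), the penalty is zero; any shrinking of the error signal is penalized.<br />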
<br />
= Experimental Results =<br />
<br />
== The Temporal Order Problem ==<br />
<br />
The authors used the temporal order problem as the prototypical pathological problem for validating the clipping and regularization methods devised. The temporal order problem involves generating a long sequence of discrete symbols in which an <math>A</math> or a <math>B</math> symbol is placed near the beginning and another near the middle of the sequence. The task is to correctly classify the order of the two symbols (<math>AA</math>, <math>AB</math>, <math>BA</math>, or <math>BB</math>) at the end of the sequence.<br />
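<br />
The task can be made concrete with a small generator (an illustrative sketch; the filler alphabet and exact marker positions are our assumptions, not values from the paper):<br />

```python
import random

def temporal_order_example(length):
    # Fill the sequence with random distractor symbols, then place an A/B
    # marker near the beginning and another near the middle; the target is
    # the ordered pair of markers.
    seq = [random.choice('cdef') for _ in range(length)]
    p1 = random.randint(0, length // 10)                          # near the beginning
    p2 = random.randint(length // 2 - length // 10, length // 2)  # near the middle
    m1, m2 = random.choice('AB'), random.choice('AB')
    seq[p1], seq[p2] = m1, m2
    return seq, m1 + m2  # target is one of 'AA', 'AB', 'BA', 'BB'
```

A classifier must remember the first marker across the whole distractor-filled gap, which is what makes this task pathological for plain RNNs.<br />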
<br />
Three different RNN initializations were used for the experiment:<br />
<br />
* <span>'''sigmoid unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
* <span>'''basic tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.1)</math></span><br />
* <span>'''smart tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
<br />
For each of the three networks, three different optimizer configurations were used:<br />
<br />
* <span>'''MSGD''': Mini-batch Stochastic Gradient Descent</span><br />
* <span>'''MSGD-C''': MSGD with Gradient Clipping</span><br />
* <span>'''MSGD-CR''': MSGD-C with Regularization</span><br />
<br />
Additional model parameters include:<br />
<br />
* <span>Spectral Radius of 0.95</span><br />
* <span><math>b = b_{out} = 0</math></span><br />
* <span>50 hidden neurons</span><br />
* <span>constant learning rate of 0.01</span><br />
* <span>clipping threshold of 1 (only for MSGD-C and MSGD-CR)</span><br />
* <span>regularization weight of 4 (MSGD-CR)</span><br />
<br />
The experiment was performed 5 times. From the figure below we can observe the importance of gradient clipping and the regularizer: in all cases, the combination of the two methods yielded the best results, regardless of which unit network was used. Furthermore, this experiment provides empirical evidence that exploding gradients correlate with tasks that require long memory traces: as the sequence length of the problem increases, clipping and regularization become more important.<br />
<br />
[[Image:experimental_results.png|frame|center|400px]]<br />
<br />
== Other Pathological Problems ==<br />
<br />
The authors repeated other pathological problems from <ref name="Hochreiter"></ref>. The results are listed below:<br />
<br />
[[Image:experimental_results_2.png|image]]<br />
<br />
The authors did not discuss the experimental results in detail here.<br />
<br />
= Summary =<br />
<br />
The paper explores two different perspectives, one from dynamical systems and one from geometry, to explain the exploding and vanishing gradient problems in training RNNs. The authors devised methods to mitigate these problems by introducing gradient clipping and a vanishing-gradient regularizer; their experimental results show that, in all cases except for the Penn Treebank dataset, clipping and the regularizer bested the state of the art for RNNs on the respective experiments.</div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_the_difficulty_of_training_recurrent_neural_networks&diff=27053on the difficulty of training recurrent neural networks2015-12-03T18:11:31Z<p>Trttse: /* From a geometric perspective */</p>
<hr />
<div>= Introduction =<br />
<br />
Training Recurrent Neural Networks (RNNs) is difficult, one of the most prominent problem in training RNNs has been the vanishing and exploding gradient problem <ref name="yoshua1993">Yoshua Bengio, Paolo Frasconi, and Patrice Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pages<br />
1183–1188. IEEE, 1993.</ref> which prevents nerual networks from learning and fitting the data. In this paper the authors propose a gradient norm cliping stragtegy to deal with exploding gradients and a soft constraint for the vanishing gradients problem.<br />
<br />
= Background =<br />
<br />
[[Image:rnn_2.png|frame|center|400px|alt=| Recurrent Neural Network Unrolled in time <ref name="pascanu">Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.></ref>]]<br />
A generic recurrent neural network has the form:<br />
<br />
<math>x_{t} = F(\mathbf{x}_{t -1}, \mathbf{u}_{t}, \theta)</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{x}_{t}</math> is the state at time <math>t</math></span><br />
* <span><math>\mathbf{u}_{t}</math> is the input at time <math>t</math></span><br />
* <span><math>\theta\,</math> are the parameters</span><br />
* <span><math>F()\,</math> is the function that represents a neuron</span><br />
<br />
In the theoreical sections the authors made use of specific parameterization:<br />
<br />
<math>\mathbf{x}_{t} = \mathbf{W}_{rec} \sigma(\mathbf{x}_{t - 1}) + \mathbf{W}_{in} \mathbf{u}_{t} + b</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{W}_{rec}</math> is the RNN weights matrix</span><br />
* <span><math>\sigma()\,</math> is an element wise function</span><br />
* <span><math>b\,</math> is the bias</span><br />
* <span><math>\mathbf{W}_{in}</math> is the input weights matrix</span><br />
<br />
The following are gradients equations for using the Back Propagation Through Time (BPTT) algorithm, the authors rewrote the equations in order to highlight the exploding gradents problem:<br />
<br />
<math>\frac{\partial \varepsilon}{\partial \theta} = <br />
\sum_{1 \leq t \leq T} \frac{\partial \varepsilon_t}{\partial \theta}</math><br />
<br />
<math>\frac{\partial \varepsilon_{t}}{\partial \theta} = <br />
\sum_{1 \leq k \leq T} <br />
\left(<br />
\frac{\partial \varepsilon_{t}}{\partial x_{t}}<br />
\frac{\partial x_{t}}{\partial x_{k}}<br />
\frac{\partial^{+} x_{k}}{\partial \theta}<br />
\right)</math><br />
<br />
<math>\frac{\partial x_{t}}{\partial x_{k}} =<br />
\prod_{t \leq i \leq k} \frac{\partial x_{i}}{\partial x_{i - 1}} =<br />
\prod_{t \leq i \leq k} <br />
\mathbf{W}^{T}_{rec} \textit{diag}(\sigma^{\prime}(x_{i - 1}))</math><br />
<br />
Where:<br />
<br />
* <span><math>\varepsilon_{t}</math> is the error obtained from output at time <math>t</math></span><br />
* <span><math>\frac{\partial^{+} \mathbf{x}_{k}}{\partial \theta}</math> is the immediate partial derivative of state <math>\mathbf{x}_{k}</math></span>. For the parametrization above, <math>\frac{\partial^+ \mathbf{x}_k}{\partial \mathbf{W}_{rec}} = \sigma(\mathbf{x}_{k-1})</math>.<br />
<br />
The authors of this paper also distinguish between ''long-term'' and ''short-term'' contributions to the gradient with respect to <math>\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k}</math>. The contribution is ''long-term'' if <math>k \ll t</math>, and ''short-term'' otherwise.<br />
<br />
== Exploding and Vanishing Gradients ==<br />
=== The Mechanics ===<br />
<br />
It's known that <math> |\sigma^'(x)| </math> is bounded. Let <math>\left|\left|diag(\sigma^'(x_k))\right|\right| \leq \gamma \in R</math>.<br />
<br />
The paper first proves that it is sufficient for <math> \lambda_1 < \frac{1}{\gamma} </math>, where <math> \lambda_1 </math> is the largest singular value of <math> \bold{W}_{rec} </math>, for the vanishing gradient problem to occur. The Jacobian matrix <math> \frac{\partial x_{k+1}}{\partial x_k} </math> is given by <math> \bold{ W}_{rec}^{T}diag(\sigma^'(x_k)) </math>. Then, the 2-norm of this Jacobian is bounded by the product of the norms of the two matrices. This leads to <math> \forall k, ||\frac{\partial{x_{k+1}}}{\partial x_k}|| \leq ||\bold{W}_{rec}^T||||diag(\sigma^'(x_k))|| < \frac{1}{\gamma}\gamma < 1</math><br />
<br />
Let <math>\eta \in \mathbb{R}</math> be such that <math>\forall k, ||\frac{\partial {x_{k+1}}}{\partial x_k}|| \leq \eta < 1</math>. By induction over <math>i</math>, we can show that <math>||\frac{\partial \varepsilon_t}{\partial x_t}(\prod_{i=k}^{t-1}{\frac{\partial x_{i+1}}{\partial x_i}})|| \leq \eta^{t-k}||\frac{\partial \varepsilon_t}{\partial x_t}||</math>. Since <math> \eta < 1 </math>, this bound decays exponentially as <math> t-k </math> grows, so the long-term contribution to the gradient goes to 0.<br />
<br />
Inverting this proof yields a necessary condition for exploding gradients: the largest singular value <math>\lambda_1 </math> must be larger than <math> \frac{1}{\gamma}</math> (otherwise the long-term components would vanish instead of exploding).<br />
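The vanishing-gradient bound above can be checked with a small numerical sketch (not from the paper; it assumes a sigmoid nonlinearity, for which <math>\gamma = 1/4</math>, and rescales a random <math>\mathbf{W}_{rec}</math> so that <math>\lambda_1 < 1/\gamma</math>):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
gamma = 0.25  # |sigmoid'(x)| <= 1/4 everywhere

# Random recurrent weights rescaled so the largest singular value is 0.9 / gamma,
# which puts lambda_1 strictly below 1 / gamma.
W = rng.standard_normal((n, n))
W *= 0.9 / (gamma * np.linalg.norm(W, 2))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Propagate a state (no input, b = 0) and accumulate the Jacobian product
# prod_i W^T diag(sigmoid'(x_{i-1})) from the equations above.
x = rng.standard_normal(n)
J = np.eye(n)
norms = []
for _ in range(60):
    s = sigmoid(x)
    J = W.T @ np.diag(s * (1.0 - s)) @ J
    x = W @ s
    norms.append(np.linalg.norm(J, 2))

# Each factor has 2-norm at most 0.9, so the product's norm decays geometrically.
print(norms[0], norms[-1])
```

Since every factor's 2-norm is at most <math>\lambda_1 \gamma = 0.9</math>, the accumulated Jacobian norm is driven towards zero, which is exactly the vanishing-gradient behaviour.<br />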
<br />
=== From a dynamical systems perspective ===<br />
<br />
Drawing from a dynamical systems perspective similar to <ref name="yoshua1993"></ref><ref name="doya1993">Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on neural networks, 1:75–80, 1993.</ref>, dynamical systems theory states that as “<math>\theta</math> changes the asymptotic behaviour changes smoothly almost everywhere except for certain crucial points where drastic changes occur” <ref name="pascanu"></ref>; crossing these bifurcation points has the potential to cause gradients to explode <ref name="doya1993"></ref>.<br />
<br />
The authors of this paper argue, however, that crossing these bifurcation points does not guarantee a sudden change in gradients. Their idea is that a change to the model parameters can alter the attractor landscape in such a way that the basin of attraction corresponding to the current model state is unaltered. For example, a change to the model parameters might eliminate a basin of attraction in a portion of the model's state space that is very far from its current state. In this case, the bifurcation will have no effect on the asymptotic behaviour of the model, and there will accordingly be no gradient explosion. On the other hand, if a change to the model parameters substantially alters the final basin of attraction given the current state, then there will be a considerable effect on the asymptotic behaviour of the model, and the gradients will accordingly explode. <br />
<br />
[[Image:dynamic_perspective.png|frame|center|400px|alt=|Bifurcation diagram of Single Hidden Unit RNN <ref name="pascanu"></ref>]]<br />
<br />
The figure above depicts a bifurcation diagram for a single-unit RNN, where the x-axis is the parameter <math>b</math> (bias), the y-axis is the asymptotic state <math>x_{\infty}</math> (i.e. the equilibrium activation value of the unit), and the plot line traces the movement of the final point attractor <math>x_{\infty}</math> as <math>b</math> changes. The figure shows the presence of two attractors, one emerging from <math>b_1</math> and another disappearing at <math>b_2</math>, as the value of <math>b</math> is decreased. Note that only one attractor exists when the value of <math>b</math> is outside of the interval between <math>b_1</math> and <math>b_2</math>, and that when two attractors exist, the attractor state towards which the unit ultimately gravitates is determined by its initial starting state. The boundary between these two basins of attraction is denoted with the dashed line; starting states on opposite sides of this boundary will gravitate towards different attractor states. The blue filled circles indicate a bifurcation point at which a small change to the value of <math>b</math> can have a drastic effect on the attractor landscape over the unit's state space. In short, the landscape shifts to include a single attractor state for a low value of <math>x</math>. The unfilled green circles represent Pascanu's (2013) extension of Doya's hypothesis: if the model state is in the boundary region at time <math>0</math>, a small change in <math>b</math> results in a sudden large change in <math>x_{t}</math>.<br />
<br />
Overall, these remarks indicate that, when treated as a dynamical system, the behaviour of an RNN can be analyzed with respect to both changes to the parameter values that determine an attractor landscape over its state space (assuming a fixed starting state), and changes to the starting state (assuming a fixed attractor landscape). <br />
<br />
=== From a geometric perspective ===<br />
<br />
Aside from the dynamical systems perspective on exploding and vanishing gradients, the authors also considered a geometric perspective, using a simple single-hidden-unit RNN.<br />
<br />
[[Image:geometric_perspective.png|frame|center|400px|alt=|Error Loss surface of single hidden unit RNN <ref name="pascanu"></ref>]]<br />
<br />
Reusing the RNN state equation the authors defined:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
By assuming no input, a linear activation (<math>\sigma</math> taken to be the identity), <math>b = 0</math>, and initial state <math>x_{0}</math>, the equation simplifies to:<br />
<br />
<math>x_{t} = W_{rec}^{t} x_{0}</math><br />
<br />
Differentiating the above equation to first and second order with respect to the recurrent weight (a scalar <math>\omega</math> in the single-unit case) gives:<br />
<br />
<math>\frac{\delta x_{t}}{\delta \omega} = t W_{rec}^{t - 1} x_{0}</math><br />
<br />
<math>\frac{\delta^{2} x_{t}}{\delta \omega^{2}} = t (t - 1) W_{rec}^{t - 2} x_{0}</math><br />
<br />
This implies that if the first-order derivative explodes, so does the second. In the general case, when the gradients explode they do so along some direction <math>\mathbf{v}</math>. If this bound is tight, it is hypothesized that when gradients explode so does the curvature along <math>\mathbf{v}</math>, leading to a wall in the error surface, like the one seen above. If both the gradient and the leading eigenvector of the curvature are aligned with the exploding direction <math>\mathbf{v}</math>, it follows that the error surface has a steep wall perpendicular to <math>\mathbf{v}</math> (and consequently to the gradient).<br />
<br />
<br />
<br />
This means that when Stochastic Gradient Descent (SGD) approaches such a wall in the error surface and attempts to step into it, it will be deflected away, possibly hindering the learning process (see figure above). Note that this explanation assumes that the valley bordered by the steep cliff in the loss surface is wide enough with respect to the clipping threshold applied to the gradient; otherwise, the deflection caused by an SGD update would still hinder the learning process, even when clipping is used. The practical effectiveness of clipping provides some evidence in support of this assumption.<br />
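As a quick arithmetic sketch of the derivative formulas above (illustrative only, using made-up values <math>w = 1.2</math> and <math>x_{0} = 0.5</math>), the scalar case shows the two derivatives exploding together:<br />

```python
# Scalar single-unit case with no input and b = 0: x_t = w**t * x0.
w, x0 = 1.2, 0.5

def first_deriv(t):
    # d(x_t)/dw = t * w**(t-1) * x0
    return t * w ** (t - 1) * x0

def second_deriv(t):
    # d^2(x_t)/dw^2 = t * (t-1) * w**(t-2) * x0
    return t * (t - 1) * w ** (t - 2) * x0

for t in (10, 30, 50):
    print(t, first_deriv(t), second_deriv(t))
```

Once <math>|w| > 1</math>, both quantities grow without bound in <math>t</math>, so an exploding first derivative is accompanied by exploding curvature.<br />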
<br />
= Dealing with Exploding and Vanishing Gradient =<br />
<br />
== Related Work ==<br />
<br />
* <span>'''L1/L2 Regularization''': Helps with exploding gradients, but limits the RNN to a single point attractor at the origin, and prevents the model from learning generator models or exhibiting long-term memory traces.</span><br />
* <span>'''Teacher Forcing''': Proposed by <ref name="doya1993"></ref>; pushes the model across a bifurcation boundary if it does not exhibit asymptotic behaviour towards a desired target. This assumes the user knows what the desired behaviour looks like, or how to initialize the model, in order to reduce exploding gradients.</span><br />
* <span>'''LSTM''': The Long Short-Term Memory architecture by <ref name="graves2009">Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009</ref><ref name="Hochreiter">Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.</ref> is an attempt to deal with the vanishing gradient problem by using a linear unit that feeds back to itself with a weight of <math>1</math>. This solution, however, does not deal with the exploding gradient problem.</span><br />
* <span>'''Hessian-Free optimizer with structural damping''': Proposed by <ref>Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011</ref> to address the vanishing and exploding gradient problems. <ref name="pascanu"></ref> reasons that this approach addresses the vanishing gradient problem because the high dimensionality of the space gives rise to a high probability that the long-term components are orthogonal to the short-term components. Additionally, for exploding gradients, the curvature is taken into account.</span><br />
* <span>'''Echo State Networks''': <ref>Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.</ref> avoid the exploding and vanishing gradient problems by not learning the input and recurrent weights at all; these are instead drawn from hand-crafted distributions that prevent information from getting lost, since the spectral radius of the recurrent weight matrix is usually kept smaller than 1.</span><br />
<br />
== Scaling Down the Gradients ==<br />
<br />
[[Image:gradient_clipping.png|frame|center|400px]]<br />
<br />
The intuition behind this gradient clipping algorithm is simple: obtain the norm of the gradients, and if it is larger than a set threshold, scale the gradients by a constant defined as the threshold divided by the norm of the gradients. <ref name="pascanu"></ref> suggests using a threshold value from half to ten times the average norm of the gradients.<br />
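The clipping step described above can be sketched in a few lines (a minimal NumPy rendering of the idea; the paper presents the same rule as pseudocode):<br />

```python
import numpy as np

def clip_gradient(grad, threshold):
    """If the gradient norm exceeds the threshold, rescale the gradient
    so its norm equals the threshold; otherwise leave it untouched."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad
```

For example, `clip_gradient(np.array([3.0, 4.0]), 1.0)` returns `[0.6, 0.8]`: a vector of norm 1 pointing in the same direction as the original gradient.<br />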
<br />
== Vanishing Gradient Regularization ==<br />
<br />
The vanishing gradient regularizer is as follows:<br />
<br />
<math>\Omega <br />
= \sum_{k} \Omega_{k} <br />
= \sum_{k} <br />
\left( <br />
\frac{<br />
\| <br />
\frac{\delta \varepsilon}{\delta x_{k + 1}} <br />
\frac{\delta x_{k + 1}}{\delta x_{k}}<br />
\|<br />
}<br />
{<br />
\|<br />
\frac{\delta \varepsilon}{\delta x_{k + 1}}<br />
\| <br />
} - 1<br />
\right)^{2}</math><br />
<br />
The vanishing gradient problem occurs when, at time <math>t</math>, the inputs <math>u</math> are irrelevant and noisy and the network starts to learn to ignore them; this is not desirable, as the model can end up not learning anything at all. The authors found that the sensitivity to the inputs <math>u_{k} \dots u_{t}</math> could be increased by increasing the norm of <math>\frac{\delta x_t}{\delta x_{k}}</math>. Enforcing a large <math>\frac{\delta x_t}{\delta x_{k}}</math> norm directly would require the error to remain large, which would prevent the model from converging; the authors therefore argue that a regularizer is a more natural choice. The regularizer is a soft constraint that forces the Jacobian matrices <math>\frac{\delta x_{k + 1}}{\delta x_{k}}</math> to preserve norm in the direction of the error <math>\frac{\delta \varepsilon}{\delta x_{k + 1}}</math>.<br />
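The regularizer above can be evaluated numerically as follows (an illustrative sketch; `err_grads[k]` stands for <math>\frac{\delta \varepsilon}{\delta x_{k+1}}</math> and `jacobians[k]` for <math>\frac{\delta x_{k+1}}{\delta x_{k}}</math>, both assumed precomputed during BPTT):<br />

```python
import numpy as np

def vanishing_gradient_regularizer(err_grads, jacobians):
    """Sum of (||g J|| / ||g|| - 1)^2 over time steps: penalizes Jacobians
    that shrink (or grow) the back-propagated error norm in the direction
    of the error signal g."""
    total = 0.0
    for g, J in zip(err_grads, jacobians):
        ratio = np.linalg.norm(g @ J) / np.linalg.norm(g)
        total += (ratio - 1.0) ** 2
    return total
```

A norm-preserving Jacobian (e.g. the identity) contributes 0 to <math>\Omega</math>, while one that halves the error norm contributes <math>(0.5 - 1)^{2} = 0.25</math>.<br />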
<br />
= Experimental Results =<br />
<br />
== The Temporal Order Problem ==<br />
<br />
The authors used the temporal order problem as the prototypical pathological problem for validating the clipping and regularization methods devised. The problem involves generating a long sequence of discrete distractor symbols, with an <math>A</math> or a <math>B</math> symbol placed at the beginning and at the middle of the sequence. The task is to correctly classify, at the end of the sequence, the order in which the two symbols appeared.<br />
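A generator for this task might look as follows (a sketch based on the description above; the distractor alphabet `'cdef'` and the exact placement of the two relevant symbols are assumptions, as the paper's precise setup may differ):<br />

```python
import random

def temporal_order_example(length, distractors="cdef"):
    """Build one sequence: random distractor symbols everywhere, with an
    'A' or 'B' placed at the beginning and at the middle. The label is
    the ordered pair of the two placed symbols (4 possible classes)."""
    seq = [random.choice(distractors) for _ in range(length)]
    first, second = random.choice("AB"), random.choice("AB")
    seq[0] = first
    seq[length // 2] = second
    return "".join(seq), first + second
```

The classifier only sees the label at the end of the sequence, so solving the task requires remembering the first symbol across the entire distractor span; this is what makes the problem pathological for plain RNN training.<br />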
<br />
Three different RNN initializations were used for the experiment:<br />
<br />
* <span>'''sigmoid unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
* <span>'''basic tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.1)</math></span><br />
* <span>'''smart tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
<br />
For each of the three networks, three different optimizer configurations were used:<br />
<br />
* <span>'''MSGD''': Mini-batch Stochastic Gradient Descent</span><br />
* <span>'''MSGD-C''': MSGD with Gradient Clipping</span><br />
* <span>'''MSGD-CR''': MSGD-C with Regularization</span><br />
<br />
Additional model parameters include:<br />
<br />
* <span>Spectral Radius of 0.95</span><br />
* <span><math>b = b_{out} = 0</math></span><br />
* <span>50 hidden neurons</span><br />
* <span>constant learning rate of 0.01</span><br />
* <span>clipping threshold of 1 (only for MSGD-C and MSGD-CR)</span><br />
* <span>regularization weight of 4 (MSGD-CR)</span><br />
<br />
The experiment was performed 5 times. From the figure below we can observe the importance of gradient clipping and the regularizer: in all cases the combination of the two methods yielded the best results, regardless of which unit network was used. Furthermore, this experiment provided empirical evidence that exploding gradients correlate with tasks that require long memory traces: as the sequence length of the problem increases, clipping and regularization become more important.<br />
<br />
[[Image:experimental_results.png|frame|center|400px]]<br />
<br />
== Other Pathological Problems ==<br />
<br />
The authors repeated other pathological problems from <ref name="Hochreiter"></ref>. The results are listed below:<br />
<br />
[[Image:experimental_results_2.png|image]]<br />
<br />
The authors did not discuss the experimental results in detail here.<br />
<br />
= Summary =<br />
<br />
The paper explores the exploding and vanishing gradient problems that arise in training RNNs from two different perspectives, dynamical-systems and geometric. The authors devised methods to mitigate the corresponding problems by introducing gradient clipping and a vanishing-gradient regularizer; their experimental results show that, in all cases except the Penn Treebank dataset, clipping and the regularizer bested the state of the art for RNNs in the respective experiments.</div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=memory_Networks&diff=26914memory Networks2015-11-27T01:33:34Z<p>Trttse: /* Related work */</p>
<hr />
<div>= Introduction =<br />
<br />
Most supervised machine learning models are designed to approximate a function that maps input data to a desirable output (e.g. a class label for an image or a translation of a sentence from one language to another). In this sense, <br />
such models perform inference using a 'fixed' memory in the form of a set of parameters learned during training. For example, the memory of a recurrent neural network is constituted largely by the weights on the recurrent connections to its hidden layer (along with the layer's activities). As is well known, this form of memory is inherently limited given the fixed dimensionality of the weights in question. It is largely for this reason that recurrent nets have difficulty learning long-range dependencies in sequential data. Note that learning such dependencies requires ''remembering'' items in a sequence for a large number of time steps. <br />
<br />
For an interesting class of problems, it is essential for a model to be able to learn long-term dependencies, and to more generally be able to learn to perform inferences using an arbitrarily large memory. Question-answering tasks are paradigmatic of this class of problems, since performing well on such tasks requires remembering all of the information that constitutes a possible answer to the questions being posed. In principle, a recurrent network such as an LSTM could learn to perform QA tasks, but in practice, the amount of information that can be retained by the weights and the hidden states in the LSTM is simply insufficient. <br />
<br />
Given this need for a model architecture that combines inference and memory in a sophisticated manner, the authors of this paper propose what they refer to as a "Memory Network". In brief, a memory network is a model that learns to read and write data to an arbitrarily large long-term memory, while also using the data in this memory to perform inferences. The rest of this summary describes the components of a memory network in greater detail, along with some experiments describing its application to a question answering task involving short stories. Below is an example illustrating the model's ability to answer simple questions after being presented with short, multi-sentence stories. <br />
<br />
[[File:QA_example.png | frame | centre | Example answers (in red) using a memory network for question answering. ]]<br />
<br />
= Model Architecture =<br />
<br />
A memory network is composed of a memory <math>\ m</math> (in the form of a collection of vectors or strings, indexed individually as <math>\ m_i</math>), and four possibly learned functions <math>\ I</math>, <math>\ G</math>, <math>\ O</math>, and <math>\ R</math>. The functions are defined as follows:<br />
*<math>\ I</math> maps a natural language expression onto an 'input' feature representation (e.g., a real-valued vector). The input can either be a fact to be added to the memory <math>\ m</math> (e.g. 'John is at the university'), or a question for which an answer is being sought (e.g. 'Where is John?'). <br />
*<math>\ G</math> updates the contents of the memory <math>\ m</math> on the basis of an input. The updating can involve simply writing the input to new memory location, or it can involve the modification or compression of existing memories to perform a kind of generalization on the state of the memory. <br />
*<math>\ O</math> produces an 'output' feature representation given a new input and the current state of the memory. The input and output feature representations reside in the same embedding space. <br />
*<math>\ R</math> produces a response given an output feature representation. This response is usually a word or a sentence, but in principle it could also be an action of some kind (e.g. the movement of a robot).<br />
<br />
To give a quick overview of how the model operates, an input ''x'' will first be mapped to a feature representation <math>\ I(x)</math>. Then, for all memories ''i'', the following update is applied: <math>\ m_i = G(m_i, I(x), m) </math>. This means that each memory is updated on the basis of the input ''x'' and the current state of the memory <math>\ m</math>. In the case where each input is simply written to memory, <math>\ G</math> might function to simply select an index that is currently unused and write <math>\ I(x)</math> to the memory location corresponding to this index. Next, an output feature representation is computed as <math>\ o=O(I(x), m)</math>, and a response, <math>\ r</math>, is computed directly from this feature representation as <math>\ r=R(o)</math>. <math>\ O</math> can be interpreted as retrieving a small selection of memories that are relevant to producing a good response, and <math>\ R</math> actually produces the response given the feature representation produced from the relevant memories by <math>\ O</math>.<br />
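The control flow just described can be sketched as follows. The four component functions passed in are trivial stand-ins chosen for illustration, not the paper's learned models: the identity for I, list-append for G, and simple pass-throughs for O and R:<br />

```python
def memory_network_step(x, memory, I, G, O, R):
    """One step of the I -> G -> O -> R pipeline described above."""
    features = I(x)               # map input to a feature representation
    memory = G(memory, features)  # write/update memory on the basis of the input
    o = O(features, memory)       # produce output features from input + memory
    return R(o), memory           # produce a response from the output features

# Simplest instantiation: write every input to a new slot, echo the input back.
response, memory = memory_network_step(
    "Where is John?", ["John is at the university"],
    I=lambda x: x,
    G=lambda m, f: m + [f],
    O=lambda f, m: (f, m),
    R=lambda o: o[0],
)
```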
<br />
= A Basic Implementation =<br />
<br />
In a simple version of the memory network, input text is just written to memory in unaltered form. Or in other words, <math>\ I(x) </math> simply returns ''x'', and <math>\ G </math> writes this text to a new memory slot <math>\ m_{N+1} </math>, where <math>\ N </math> is the number of currently filled slots. The memory is accordingly an array of strings, and the inclusion of a new string does nothing to modify existing strings. <br />
<br />
Given as much, most of the work being done by the model is performed by the functions <math>\ O </math> and <math>\ R </math>. The job of <math>\ O </math> is to produce an output feature representation by selecting <math>\ k </math> supporting memories from <math>\ m </math> on the basis of the input ''x''. In the experiments described in this paper, <math>\ k </math> is set to either 1 or 2. In the case that <math>\ k=1 </math>, the function <math>\ O </math> behaves as follows: <br />
<br />
<br />
:<math>\ o_1 = O_1(x, m) = argmax_{i = 1 ... N} S_O(x, m_i) </math><br />
<br />
<br />
where <math>\ S_O </math> is a function that scores a candidate memory for its compatibility with ''x''. Essentially, one 'supporting' memory is selected from <math>\ m </math> as being most likely to contain the information needed to answer the question posed in <math>\ x </math>. In this case, the output is <math>\ o_1 = [x, m_{o_1}] </math>, or a list containing the input question and one supporting memory. Alternatively, in the case that <math>\ k=2 </math>, a second supporting memory is selected on the basis of the input and the first supporting memory, as follows: <br />
<br />
<br />
:<math>\ o_2 = O_2(x, m) = argmax_{i = 1 ... N} S_O([x, m_{o_1}], m_i) </math><br />
<br />
<br />
Now, the overall output is <math>\ o_2 = [x, m_{o_1}, m_{o_2}] </math>. (These lists are translated into feature representations as described below). Finally, the result of <math>\ O </math> is used to produce a response in the form of a single word via <math>\ R </math> as follows:<br />
<br />
<br />
:<math>\ r = argmax_{w \in W} S_R([x, m_{o_1}, m_{o_2}], w) </math><br />
<br />
<br />
In short, a response is produced by scoring each word in a set of candidate words against the representation produced by the combination of the input and the two supporting memories. The highest scoring candidate word is then chosen as the model's output. The learned portions of <math>\ O </math> and <math>\ R </math> are the parameters of the functions <math>\ S_O </math> and <math>\ S_R </math>, which perform embeddings of the raw text constituting each function argument, and then return the dot product of these two embeddings as a score. Formally, the function <math>\ S_O </math> can be defined as follows; <math>\ S_R </math> is defined analogously:<br />
<br />
<br />
:<math>\ S_O(x, y) = \Phi_x(x)^T U^T U \Phi_y(y) </math><br />
<br />
<br />
In this equation, <math>\ U </math> is an <math>\ n \times D </math> matrix, where ''n'' is the dimension of the embedding space, and ''D'' is the number of features used to represent each function argument. <math>\ \Phi_x</math> and <math>\ \Phi_y </math> are functions that map each argument (which are strings) into the feature space. In the implementations considered in this paper, the feature space makes use of a bag-of-words representation, such that there are 3 binary features for each word in the model's vocabulary. The first feature corresponds to the presence of the word in the input ''x'', the second feature corresponds to the presence of the word in the first supporting memory that is being used to select a second supporting memory, and the third feature corresponds to the presence of the word in a candidate memory being scored (i.e. either the first or second supporting memory retrieved by the model). Having these different features allows the model to learn distinct representations for the same word depending on whether the word is present in an input question or in a string stored in memory. <br />
<br />
Intuitively, it helps to think of the columns of <math>\ U </math> containing distributed representations of each word in the vocabulary (specifically, there are 3 representations and hence 3 columns devoted to each word). The binary feature representation <math>\ \Phi_x(x)</math> maps the text in ''x'' onto a binary feature vector, where 1's in the vector indicate the presence of a particular word in ''x'', and 0's indicate the absence of this word. Note that different elements of the vector will be set to 1 depending on whether the word occurs in the input ''x'' or in a supporting memory (i.e. when ''x'' is a list containing the input and a supporting memory). The matrix-vector multiplications in the above equation effectively extract and sum the distributed representations corresponding to each of the inputs, ''x'' and ''y''. Thus, a single distributed representation is produced for each input, and the resulting score is the dot product of these two vectors (which in turn is the cosine of the angle between the vectors scaled by the product of the vector norms). In the case where ''x'' is the input query, and ''y'' is a candidate memory, a high dot product indicates that the model thinks that the candidate in question is very relevant to answering the input query. In the case where ''x'' is the output of <math>\ O</math> and ''y'' is a candidate response word, a high dot product indicates that the model thinks that the response word is an appropriate answer given the output feature representation produced by <math>\ O</math>. Distinct embedding matrices <math>\ U_O </math> and <math>\ U_R </math> are used to compute the output feature representation and the response. <br />
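A toy NumPy sketch of this scoring scheme, using a hypothetical 5-word vocabulary and a random (untrained) matrix standing in for the learned embedding matrix U:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["john", "university", "where", "is", "garden"]  # hypothetical vocabulary
n = 4                             # embedding dimension
D = 3 * len(vocab)                # 3 binary features per vocabulary word
U = rng.standard_normal((n, D))   # random stand-in for the learned matrix U

def phi(words, role):
    """Bag-of-words features; role 0/1/2 selects which of a word's three
    feature copies to switch on (input, first supporting memory, candidate)."""
    v = np.zeros(D)
    for w in words:
        v[3 * vocab.index(w) + role] = 1.0
    return v

def score(x_words, y_words):
    """S_O(x, y) = phi_x(x)^T U^T U phi_y(y): a dot product in embedding space."""
    return phi(x_words, 0) @ U.T @ U @ phi(y_words, 2)

s = score(["where", "is", "john"], ["john", "is", "university"])
```

With a trained U, a high value of <code>s</code> would indicate that the candidate memory is relevant to the query.<br />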
<br />
The goal of learning is to find embedding matrices in which the representations produced for queries, supporting memories, and responses are spatially related such that representations of relevant supporting memories are close to the representations of a query, and such that representations of individual words are close to the output feature representations of the questions they answer. The method used to perform this learning is described in the next section. <br />
<br />
= The Training Procedure =<br />
<br />
Learning is conducted in a supervised manner; the correct responses and supporting memories for each query are provided during training. The following margin-ranking loss function is used in tandem with stochastic gradient descent to learn the parameters of <math>\ U_O </math> and <math>\ U_R </math>, given an input ''x'', a desired response ''r'', and desired supporting memories, <math>\ m_{o_1}</math> and <math>\ m_{o_2}</math>:<br />
<br />
:<math> \sum_{f \neq m_{o_1}} max(0, \gamma + S_O (x, f) - S_O (x, m_{o_1})) + \sum_{f' \neq m_{o_2}} max(0, \gamma + S_O ([x, m_{o_1}], f') - S_O ([x, m_{o_1}], m_{o_2})) + </math><br />
:<math> \sum_{r' \neq r} max(0, \gamma + S_R ([x, m_{o_1}, m_{o_2}], r') - S_R ([x, m_{o_1}, m_{o_2}], r)) </math><br />
<br />
where <math>\ f</math>, <math>\ f'</math> and <math>\ r'</math> correspond to incorrect candidates for the first supporting memory, the second supporting memory, and the output response, and <math> \gamma</math> corresponds to the margin. Intuitively, each term in the sum penalizes the current parameters in proportion to the number of incorrect memories and responses that get assigned a score within the margin of the score of the correct memories and responses. Or in other words, if the score of a correct candidate memory / response is higher than the score of every incorrect candidate by at least <math> \gamma </math>, the cost is 0. Otherwise, the cost is the sum over all of the differences between the incorrect scores (plus gamma) and the correct score. In fact, this is just the standard hinge loss function. Weston et al. speed up gradient descent by sampling incorrect candidates instead of using all incorrect candidates in the calculation of the gradient for each training example. <br />
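A minimal sketch of one term of this margin-ranking (hinge) loss; the other two terms have the same shape with different score functions and candidate sets. The margin value is illustrative:<br />

```python
def margin_loss(correct_score, incorrect_scores, gamma=0.1):
    """Sum a hinge penalty for every incorrect candidate whose score comes
    within the margin gamma of the correct candidate's score."""
    return sum(max(0.0, gamma + s_wrong - correct_score)
               for s_wrong in incorrect_scores)

# No loss when the correct candidate beats every incorrect one by the margin;
# a positive loss when an incorrect candidate scores inside the margin.
no_loss = margin_loss(1.0, [0.2, 0.1], gamma=0.1)
some_loss = margin_loss(1.0, [0.95], gamma=0.1)
```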
<br />
= Extensions to the Basic Implementation = <br />
<br />
Some limitations of the basic implementation are that it can only output single word responses, can only accept strings (rather than sequences) as input, and cannot use its memory in efficient or otherwise interesting ways. The authors propose a series of extensions to the basic implementation described in the previous section that are designed to overcome these limitations. First, they propose a segmenting function that learns when to segment an input sequence into discrete chunks that get written to individual memory slots. The segmenter is modeled similarly to other components, as an embedding model of the form:<br />
<br />
<math><br />
seg(c)=W^T_{seg}U_s\Phi_{seg}(c)<br />
</math><br />
<br />
where <math>W_{seg}</math> is a vector (effectively the parameters of a linear classifier in embedding space), and <math>c</math> is the sequence of input words represented as a bag of words using a separate dictionary. If <math>seg(c) > \gamma</math>, where <math>\gamma</math> is the margin, then this sequence is recognized as a segment.<br />
<br />
Second, they propose the use of hashing to avoid scoring a prohibitively large number of candidate memories. Each input corresponding to a query is hashed into some number of buckets, and only candidates within these buckets are scored during the selection of supporting memories. Hashing is done either by making a bucket per word in the model's vocabulary, or by clustering the learned word embeddings and creating a bucket per cluster. <br />
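The word-bucket variant amounts to an inverted index: each memory is placed in one bucket per word it contains, and only memories sharing at least one word with the query are scored. A sketch with illustrative example sentences:<br />

```python
from collections import defaultdict

def build_word_buckets(memories):
    """One bucket per word; each bucket holds the indices of memories containing it."""
    buckets = defaultdict(set)
    for i, text in enumerate(memories):
        for word in text.lower().split():
            buckets[word].add(i)
    return buckets

def candidate_indices(query, buckets):
    """Only memories sharing at least one word with the query get scored."""
    ids = set()
    for word in query.lower().split():
        ids |= buckets.get(word, set())
    return ids

memories = ["John went to the garden", "Mary picked up the milk"]
buckets = build_word_buckets(memories)
cands = candidate_indices("Where is John", buckets)
```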
<br />
The most important extension proposed by the authors involves incorporating information about the time at which a memory was written into the scoring function <math>\ S_O </math>. The model needs to be able to make use of such information to correctly answer questions such as "Where was John before the university?" (assuming the model has been told some story about John). To handle temporal information, the feature space is extended to include features that indicate the relative time between when two items were written to memory. Formally, this yields the following revised scoring function:<br />
<br />
<br />
<math>\ S_{O_t}(x, y, y') = \Phi_x(x)^T U^T U (\Phi_y(y)-\Phi_y(y')+\Phi_t(x,y,y'))</math><br />
<br />
<br />
The novelty here lies in the feature mapping function <math> \Phi_t </math>, which takes an input and two candidate supporting memories, and returns a binary feature vector as before, but with the addition of three features that indicate whether <math>x</math> is older than <math>y</math>, whether <math>x</math> is older than <math>y'</math>, and whether <math>y</math> is older than <math>y'</math>. The model loops over all candidate memories, comparing candidates <math>y</math> and <math>y'</math>. If <math> S_{O_t}(x, y, y') </math> is greater than 0, then <math>y</math> is preferred over <math>y'</math>; otherwise, <math>y'</math> is preferred. If <math>y'</math> is preferred, <math>y</math> is replaced by <math>y'</math> and the loop continues to the next candidate memory (i.e. the new <math>y'</math>). Once the loop finishes iterating over the entire memory, the winning candidate <math>y</math> is chosen as the supporting memory. <br />
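This winner-take-all loop can be sketched as follows. The pairwise score used here is a toy word-overlap stand-in for the learned temporal scoring function, chosen only so the loop has something concrete to compare:<br />

```python
def select_supporting_memory(x, memories, s_ot):
    """Keep a running winner y; each candidate y2 replaces it when
    s_ot(x, y, y2) <= 0, i.e. when y2 is preferred over y."""
    y = memories[0]
    for y2 in memories[1:]:
        if s_ot(x, y, y2) <= 0:   # y2 preferred
            y = y2
    return y

# Toy score: prefer whichever candidate shares more words with the query.
overlap = lambda a, b: len(set(a.split()) & set(b.split()))
s_ot = lambda x, y, y2: overlap(x, y) - overlap(x, y2)

best = select_supporting_memory(
    "where is john",
    ["mary is here", "john is in the garden"],
    s_ot,
)
```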
<br />
Some further extensions concern allowing the model to deal with words not included in its vocabulary, and to more effectively take advantage of exact word matches between input queries and candidate supporting memories.<br />
<br />
Embedding models cannot efficiently use exact word matches due to the low dimensionality <math>n</math>. One solution is to score a pair <math>x,y</math> with <math>\ \Phi_x(x)^T U^T U\Phi_y(y)+\lambda\Phi_x(x)^T\Phi_y(y) </math> instead. That is, add the “bag of words” matching score to the learned embedding score (with a mixing parameter λ). Another related way is to stay in the n-dimensional embedding space, but to extend the feature representation D with matching features, e.g., one per word. A matching feature indicates if a word occurs in both x and y. That is, we score with <math>\ \Phi_x(x)^T U^T U\Phi_y(y,x)</math> where <math>\ \Phi_y</math> is actually built conditionally on x: if some of the words in y match the words in x we set those matching features to 1. Unseen words can be modeled similarly by using matching features on their context words. This then gives a feature space of D = 8|W|.<br />
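The first variant (adding a lambda-weighted exact-match score to the learned embedding score) can be sketched as below; the feature vectors and the random matrix standing in for the learned U are illustrative:<br />

```python
import numpy as np

def mixed_score(phi_x, phi_y, U, lam=0.5):
    """Learned embedding score plus a lambda-weighted bag-of-words
    matching score, per the extension described above."""
    embed = phi_x @ U.T @ U @ phi_y
    match = phi_x @ phi_y          # counts features active in both x and y
    return embed + lam * match

rng = np.random.default_rng(1)
U = rng.standard_normal((4, 6))           # stand-in for the learned matrix
x = np.array([1., 0., 1., 0., 0., 0.])    # words 0 and 2 active in the query
y = np.array([1., 0., 0., 0., 1., 0.])    # words 0 and 4 active in the memory
s = mixed_score(x, y, U, lam=0.5)
```

The λ term rewards exact word overlap even when the low-dimensional embeddings happen to place the shared word's representations far apart.<br />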
<br />
= Related work =<br />
<br />
There are two general approaches to performing question answering that have been developed in the literature. The first makes use of a technique known as semantic parsing to map a query expressed in natural language onto a representation in some formal language that directly extracts information from some external memory such as a knowledge base<ref>J. Berant, A. Chou, R. Frostig, and P. Liang. [http://cs.stanford.edu/~pliang/papers/freebase-emnlp2013.pdf "Semantic parsing on Freebase from question-answer pairs."]. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref>P. Liang, M. Jordan, and D. Klein. [http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00127 "Learning dependency-based compositional semantics"]. In Computational Linguistics, 39.2, p. 389-446.</ref>. The second makes use of embedding methods to represent queries and candidate answers (typically extracted from a knowledge base) as high-dimensional vectors. Learning involves producing embeddings that place query vectors close to the vectors that correspond to their answers<ref>Bordes, A., S. Chopra, and J. Weston. [http://www.thespermwhale.com/jaseweston/papers/fbqa.pdf "Question Answering with Subgraph Embeddings"]. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, (2014)</ref>. Memory networks fall under the latter approach, and existing variants of this approach can be seen as special cases of the memory network architecture (e.g., <ref>Bordes, A., S. Chopra, and J. Weston. [http://www.thespermwhale.com/jaseweston/papers/fbqa.pdf "Question Answering with Subgraph Embeddings"]. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, (2014)</ref>).<br />
<br />
Compared to recent knowledge-base approaches, memory networks differ in that they do not apply a two-stage strategy: (i) apply information extraction methods to first build the KB; followed by (ii) inference over the KB. Classical neural network models have also been explored. Typically, this type of memory is distributed across the whole network of weights of the model rather than being compartmentalized into memory locations. Memory networks combine compartmentalized memory with neural network modules that can learn how to read and write to that memory, e.g., to perform reasoning they can interactively read salient facts from the memory.<br />
<br />
= Experimental Results = <br />
<br />
The authors first test a simple memory network (i.e. <math>\ k=1 </math>) on a large-scale question answering task involving a dataset consisting of 14 million subject-relation-object triplets. Each triplet is stored as an item in memory, and the answers to particular questions are a single entity (i.e. a subject or object) in one of these triplets. The results in the table below indicate that memory networks perform quite well on this task. Note that the memory network with 'bag of words' features includes the extension designed to indicate the presence of exact matches of words in a query and a candidate answer. This seems to contribute significantly to improved performance. <br />
<br />
[[File:largescale.png | frame | centre | Results on a large-scale QA task.]]<br />
<br />
Scoring a query against all 14 million candidate memories is slow, so the authors also test their hashing techniques and report the resulting speed-accuracy tradeoffs. As shown in the figure below, the use of cluster-based hashing results in a negligible drop in performance while considering only 1/80th of the complete set of items stored in memory. <br />
<br />
[[File:hash.png | frame | centre | Memory hashing results on a large-scale QA task.]]<br />
<br />
To test their model on more complex tasks that require chains of inference, the authors create a synthetic dataset consisting of approximately 7 thousand statements and 3 thousand questions focused on a toy environment comprising 4 people, 3 objects, and 5 rooms. Stories involving multiple statements describing actions performed by these people (e.g. moving an object from one room to another) are used to define the question answering tasks. Questions are focused on a single entity mentioned in a story, and the difficulty of the task is controlled by varying how long ago the most recent mention of this entity is in the story (e.g. the most recent statement in the story vs. the 5th most recent statement in the story). The figure at the top of this page gives an example of these tasks being performed. <br />
<br />
In the results below, 'Difficulty 1' tasks are those in which the entity being asked about was mentioned in the most recent statement of the story, while 'Difficulty 5' tasks are those in which the entity being asked about was mentioned in one of the 5 most recent statements. Questions about an 'actor' concern a statement that mentions a person but not an object (e.g. "John went to the garden"). The questions may ask for the current location of the person (e.g. "Where is John?") or the previous location of the person (e.g. "Where was John before the garden?") (the column labelled "actor w/o before" in the figure below excludes this latter type of question). More complex questions involve asking about the object in a statement that mentions both a person and an object (e.g. given "John dropped the milk", the question might be "Where is the milk?"). Note that this task is more challenging, since it requires using multiple pieces of information (i.e. where John was, and what he did while he was there). Comparisons using RNNs and LSTMs are also reported, and for multiword responses as in the first figure above, an LSTM is used in place of <math>\ R </math>.<br />
<br />
[[File:toyqa.png | frame | centre | Test accuracy on a simulated world QA task.]]<br />
<br />
What is most notable about these results is that the inclusion of time features in the MemNN seems to be responsible for most of the improvement over RNNs and LSTMs. <br />
<br />
<br />
= Discussion = <br />
<br />
One potential concern about the memory network architecture is its generalizability to large values of <math>\ k </math>. To explain, each additional supporting memory increases the number of columns in the embedding matrices by the size of the model's vocabulary. This could become impractical for standard vocabularies with tens of thousands of terms. <br />
<br />
A second concern is that the memory network, as described, is engineered to answer very particular kinds of questions (i.e. questions in which the order of events is important). To handle different kinds of questions, different features would likely need to be added (e.g. quantificational features to handle statements involving quantifiers such as 'some', 'many', etc.). This sort of ad-hoc design calls into question whether the architecture is capable of performing scalable, general-purpose question answering. <br />
<br />
= Resources =<br />
<br />
Memory Network implementations on [https://github.com/facebook/MemNN Github] <br />
<br />
= Bibliography =<br />
<br />
<references /></div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=memory_Networks&diff=26913memory Networks2015-11-27T01:16:32Z<p>Trttse: /* Related work */</p>
<hr />
<div>= Introduction =<br />
<br />
Most supervised machine learning models are designed to approximate a function that maps input data to a desirable output (e.g. a class label for an image or a translation of a sentence from one language to another). In this sense, <br />
such models perform inference using a 'fixed' memory in the form of a set of parameters learned during training. For example, the memory of a recurrent neural network is constituted largely by the weights on the recurrent connections to its hidden layer (along with the layer's activities). As is well known, this form of memory is inherently limited given the fixed dimensionality of the weights in question. It is largely for this reason that recurrent nets have difficulty learning long-range dependencies in sequential data. Learning such dependencies, note, requires ''remembering'' items in a sequence for a large number of time steps. <br />
<br />
For an interesting class of problems, it is essential for a model to be able to learn long-term dependencies, and to more generally be able to learn to perform inferences using an arbitrarily large memory. Question-answering tasks are paradigmatic of this class of problems, since performing well on such tasks requires remembering all of the information that constitutes a possible answer to the questions being posed. In principle, a recurrent network such as an LSTM could learn to perform QA tasks, but in practice, the amount of information that can be retained by the weights and the hidden states in the LSTM is simply insufficient. <br />
<br />
Given this need for a model architecture the combines inference and memory in a sophisticated manner, the authors of this paper propose what they refer to as a "Memory Network". In brief, a memory network is a model that learns to read and write data to an arbitrarily large long-term memory, while also using the data in this memory to perform inferences. The rest of this summary describes the components of a memory network in greater detail, along with some experiments describing its application to a question answering task involving short stories. Below is an example illustrating the model's ability to answer simple questions after being presented with short, multi-sentence stories. <br />
<br />
[[File:QA_example.png | frame | centre | Example answers (in red) using a memory network for question answering. ]]<br />
<br />
= Model Architecture =<br />
<br />
A memory network is composed of a memory <math>\ m</math> (in the form of a collection of vectors or strings, indexed individually as <math>\ m_i</math>), and four possibly learned functions <math>\ I</math>, <math>\ G</math>, <math>\ O</math>, and <math>\ R</math>. The functions are defined as follows:<br />
*<math>\ I</math> maps a natural language expression onto an 'input' feature representation (e.g., a real-valued vector). The input can either be a fact to be added to the memory <math>\ m</math> (e.g. 'John is at the university') , or a question for which an answer is being sought (e.g. 'Where is John?'). <br />
*<math>\ G</math> updates the contents of the memory <math>\ m</math> on the basis of an input. The updating can involve simply writing the input to new memory location, or it can involve the modification or compression of existing memories to perform a kind of generalization on the state of the memory. <br />
*<math>\ O</math> produces an 'output' feature representation given a new input and the current state of the memory. The input and output feature representations reside in the same embedding space. <br />
*<math>\ R</math> produces a response given an output feature representation. This response is usually a word or a sentence, but in principle it could also be an action of some kind (e.g. the movement of a robot)<br />
<br />
To give a quick overview of how the model operates, an input ''x'' will first be mapped to a feature representation <math>\ I(x)</math> Then, for all memories ''i'', the following update is applied: <math>\ m_i = G(m_i, I(x), m) </math>. This means that each memory is updated on the basis of the input ''x'' and the current state of the memory <math>\ m</math>. In the case where each input is simply written to memory, <math>\ G</math> might function to simply select an index that is currently unused and write <math>\ I(x)</math> to the memory location corresponding to this index. Next, an output feature representation is computed as <math>\ o=O(I(x), m)</math>, and a response, <math>\ r</math>, is computed directly from this feature representation as <math>\ r=R(o)</math>. <math>\ O</math> can be interpreted as retrieving a small selection of memories that are relevant to producing a good response, and <math>\ R</math> actually produces the response given the feature representation produced from the relevant memories by <math>\ O</math>.<br />
<br />
= A Basic Implementation =<br />
<br />
In a simple version of the memory network, input text is just written to memory in unaltered form. Or in other words, <math>\ I(x) </math> simply returns ''x'', and <math>\ G </math> writes this text to a new memory slot <math>\ m_{N+1} </math> if <math>\ N </math> is the number of currently filled slots. The memory is accordingly an array of strings, and the inclusion of a new string does nothing to modify existing strings. <br />
<br />
Given as much, most of the work being done by the model is performed by the functions <math>\ O </math> and <math>\ R </math>. The job of <math>\ O </math> is to produce an output feature representation by selecting <math>\ k </math> supporting memories from <math>\ m </math> on the basis of the input ''x''. In the experiments described in this paper, <math>\ k </math> is set to either 1 or 2. In the case that <math>\ k=1 </math>, the function <math>\ O </math> behaves as follows: <br />
<br />
<br />
:<math>\ o_1 = O_1(x, m) = argmax_{i = 1 ... N} S_O(x, m_i) </math><br />
<br />
<br />
where <math>\ S_O </math> is a function that scores a candidate memory for its compatibility with ''x''. Essentially, one 'supporting' memory is selected from <math>\ m </math> as being most likely to contain the information needed to answer the question posed in <math>\ x </math>. In this case, the output is <math>\ o_1 = [x, m_{o_1}] </math>, or a list containing the input question and one supporting memory. Alternatively, in the case that <math>\ k=2 </math>', a second supporting memory is selected on the basis of the input and the first supporting memory, as follows: <br />
<br />
<br />
:<math>\ o_2 = O_2(x, m) = argmax_{i = 1 ... N} S_O([x, m_{o_1}], m_i) </math><br />
<br />
<br />
Now, the overall output is <math>\ o_2 = [x, m_{o_1}, m_{o_2}] </math>. (These lists are translated into feature representations as described below). Finally, the result of <math>\ O </math> is used to produce a response in the form of a single word via <math>\ R </math> as follows:<br />
<br />
<br />
:<math>\ r = argmax_{w \epsilon W} S_R([x, m_{o_1}, m_{o_2}], w) </math><br />
<br />
<br />
In short, a response is produced by scoring each word in a set of candidate words against the representation produced by the combination of the input and the two supporting memories. The highest scoring candidate word is then chosen as the model's output. The learned portions of <math>\ O </math> and <math>\ R </math> are the parameters of the functions <math>\ S_O </math> and <math>\ S_R </math>, which perform embeddings of the raw text constituting each function argument, and then return the dot product of these two embeddings as a score. Formally, the function <math>\ S_O </math> can be defined as follows; <math>\ S_R </math> is defined analogously:<br />
<br />
<br />
:<math>\ S_O(x, y) = \Phi_x(x)^T U^T U \Phi_y(y) </math><br />
<br />
<br />
In this equation, <math>\ U </math> is an <math>\ n \times D </math> matrix, where ''n'' is the dimension of the embedding space, and ''D'' is the number of features used to represent each function argument. <math>\ \Phi_x</math> and <math>\ \Phi_y </math> are functions that map each argument (each of which is a string) into the feature space. In the implementations considered in this paper, the feature space makes use of a bag-of-words representation, such that there are 3 binary features for each word in the model's vocabulary. The first feature corresponds to the presence of the word in the input ''x'', the second corresponds to the presence of the word in the first supporting memory (used when selecting a second supporting memory), and the third corresponds to the presence of the word in a candidate memory being scored (i.e. either the first or second supporting memory retrieved by the model). Having these different features allows the model to learn distinct representations for the same word depending on whether the word is present in an input question or in a string stored in memory. <br />
<br />
Intuitively, it helps to think of the columns of <math>\ U </math> as containing distributed representations of each word in the vocabulary (specifically, there are 3 representations, and hence 3 columns, devoted to each word). The binary feature representation <math>\ \Phi_x(x)</math> maps the text in ''x'' onto a binary feature vector, where 1's indicate the presence of a particular word in ''x'', and 0's indicate its absence. Note that different elements of the vector will be set to 1 depending on whether the word occurs in the input ''x'' or in a supporting memory (i.e. when ''x'' is a list containing the input and a supporting memory). The matrix-vector multiplications in the above equation effectively extract and sum the distributed representations corresponding to each of the inputs, ''x'' and ''y''. Thus, a single distributed representation is produced for each input, and the resulting score is the dot product of these two vectors (which in turn is the cosine of the angle between the vectors scaled by the product of the vector norms). In the case where ''x'' is the input query and ''y'' is a candidate memory, a high dot product indicates that the model considers the candidate in question very relevant to answering the input query. In the case where ''x'' is the output of <math>\ O</math> and ''y'' is a candidate response word, a high dot product indicates that the model considers the response word an appropriate answer given the output feature representation produced by <math>\ O</math>. Distinct embedding matrices <math>\ U_O </math> and <math>\ U_R </math> are used to compute the output feature representation and the response. <br />
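The scoring computation <math>\ S_O(x, y) = \Phi_x(x)^T U^T U \Phi_y(y) </math> can be sketched as follows; the vocabulary, embedding dimension, and the random matrix <math>\ U </math> (which would be learned in practice) are illustrative assumptions: <br />

```python
import numpy as np

# Sketch of the embedding score S_O(x, y) = Phi_x(x)^T U^T U Phi_y(y),
# with 3 binary feature slots per vocabulary word as described above.
vocab = {"where": 0, "is": 1, "john": 2, "garden": 3, "milk": 4}
W = len(vocab)
D = 3 * W          # three binary features per word (input / 1st memory / candidate)
n = 8              # embedding dimension (illustrative)

rng = np.random.default_rng(0)
U = rng.normal(size=(n, D))   # would be learned in practice

def phi(words, slot):
    # Bag-of-words binary features; `slot` selects which copy of the
    # vocabulary the words light up (0 = input x, 2 = candidate memory y).
    v = np.zeros(D)
    for w in words:
        v[slot * W + vocab[w]] = 1.0
    return v

def score(x_words, y_words):
    # Embed both sides with U and return the dot product of the embeddings.
    return phi(x_words, 0) @ U.T @ U @ phi(y_words, 2)

print(float(score(["where", "is", "john"], ["john", "garden"])))
```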
<br />
The goal of learning is to find embedding matrices in which the representations produced for queries, supporting memories, and responses are spatially related, such that representations of relevant supporting memories are close to the representation of a query, and representations of individual words are close to the output feature representations of the questions they answer. The method used to perform this learning is described in the next section. <br />
<br />
= The Training Procedure =<br />
<br />
Learning is conducted in a supervised manner; the correct responses and supporting memories for each query are provided during training. The following margin-ranking loss function is used in tandem with stochastic gradient descent to learn the parameters of <math>\ U_O </math> and <math>\ U_R </math>, given an input ''x'', a desired response ''r'', and desired supporting memories, <math>\ m_{o_1}</math> and <math>\ m_{o_2}</math>:<br />
<br />
:<math> \sum_{f \neq m_{o_1}} \max(0, \gamma + S_O (x, f) - S_O (x, m_{o_1})) + \sum_{f' \neq m_{o_2}} \max(0, \gamma + S_O ([x, m_{o_1}], f') - S_O ([x, m_{o_1}], m_{o_2})) + </math><br />
:<math> \sum_{r' \neq r} \max(0, \gamma + S_R ([x, m_{o_1}, m_{o_2}], r') - S_R ([x, m_{o_1}, m_{o_2}], r)) </math><br />
<br />
where <math>\ f</math>, <math>\ f'</math> and <math>\ r'</math> correspond to incorrect candidates for the first supporting memory, the second supporting memory, and the output response, and <math> \gamma</math> is the margin. Intuitively, each term in the sum penalizes the current parameters in proportion to the number of incorrect memories and responses that are assigned a score within the margin of the score of the correct memories and responses. In other words, if the score of a correct candidate memory or response is higher than the score of every incorrect candidate by at least <math> \gamma </math>, the cost is 0; otherwise, the cost is the sum, over all violating candidates, of the difference between the incorrect score (plus <math> \gamma </math>) and the correct score. This is just the standard hinge loss. Weston et al. speed up gradient descent by sampling incorrect candidates instead of using all incorrect candidates in the calculation of the gradient for each training example. <br />
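A minimal sketch of one term of this margin-ranking loss, with scores given as plain numbers rather than computed by <math>\ S_O </math>: <br />

```python
# Sketch of one term of the margin-ranking loss: each incorrect candidate
# scoring within `gamma` of the correct one contributes its violation;
# otherwise it contributes nothing.
def margin_loss(correct_score, incorrect_scores, gamma=0.1):
    return sum(max(0.0, gamma + s_f - correct_score)
               for s_f in incorrect_scores)

# The correct memory beats all negatives by more than the margin: zero loss.
print(margin_loss(2.0, [0.5, 1.0], gamma=0.1))   # -> 0.0
# A negative within the margin incurs a penalty (0.05 up to rounding).
print(margin_loss(2.0, [1.95], gamma=0.1))
```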
<br />
= Extensions to the Basic Implementation = <br />
<br />
Some limitations of the basic implementation are that it can only output single word responses, can only accept strings (rather than sequences) as input, and cannot use its memory in efficient or otherwise interesting ways. The authors propose a series of extensions to the basic implementation described in the previous section that are designed to overcome these limitations. First, they propose a segmenting function that learns when to segment an input sequence into discrete chunks that get written to individual memory slots. The segmenter is modeled similarly to other components, as an embedding model of the form:<br />
<br />
<math><br />
seg(c)=W^T_{seg}U_s\Phi_{seg}(c)<br />
</math><br />
<br />
where <math>W_{seg}</math> is a vector (effectively the parameters of a linear classifier in embedding space), and <math>c</math> is the sequence of input words represented as a bag of words using a separate dictionary. If <math>seg(c) > \gamma</math>, where <math>\gamma</math> is the margin, then this sequence is recognized as a segment.<br />
<br />
Second, they propose the use of hashing to avoid scoring a prohibitively large number of candidate memories. Each input corresponding to a query is hashed into some number of buckets, and only candidates within these buckets are scored during the selection of supporting memories. Hashing is done either by creating a bucket per word in the model's vocabulary, or by clustering the learned word embeddings and creating a bucket per cluster. <br />
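The word-based variant of this hashing scheme can be sketched as follows (the helper names are hypothetical): <br />

```python
from collections import defaultdict

# Sketch of hashing-via-words: each memory goes into one bucket per word it
# contains, and a query only scores memories drawn from its own words' buckets.
def build_buckets(memories):
    buckets = defaultdict(set)
    for i, mem in enumerate(memories):
        for word in mem.split():
            buckets[word].add(i)
    return buckets

def candidates_for(query, buckets):
    # Union of the buckets touched by the query's words.
    ids = set()
    for word in query.split():
        ids |= buckets.get(word, set())
    return ids

memories = ["john went to the garden", "mary took the milk", "the cat slept"]
buckets = build_buckets(memories)
print(sorted(candidates_for("where is john", buckets)))  # -> [0]
```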
<br />
The most important extension proposed by the authors involves incorporating information about the time at which a memory was written into the scoring function <math>\ S_O </math>. The model needs to be able to make use of such information to correctly answer questions such as "Where was John before the university?" (assuming the model has been told some story about John). To handle temporal information, the feature space is extended to include features that indicate the relative time at which two items were written to memory. Formally, this yields the following revised scoring function:<br />
<br />
<br />
<math>\ S_{O_t}(x, y, y') = \Phi_x(x)^T U^T U (\Phi_y(y)-\Phi_y(y')+\Phi_t(x,y,y'))</math><br />
<br />
<br />
The novelty here lies in the feature mapping function <math> \Phi_t </math>, which takes an input and two candidate supporting memories, and returns a binary feature vector as before, but with the addition of three features that indicate whether <math>x</math> is older than <math>y</math>, whether <math>x</math> is older than <math>y'</math>, and whether <math>y</math> is older than <math>y'</math>. The model loops over all candidate memories, comparing candidates <math>y</math> and <math>y'</math>. If <math> S_{O_t}(x, y, y') </math> is greater than 0, then <math>y</math> is preferred over <math>y'</math>; otherwise, <math>y'</math> is preferred. If <math>y'</math> is preferred, <math>y</math> is replaced by <math>y'</math> and the loop continues to the next candidate memory (i.e. the new <math>y'</math>). Once the loop finishes iterating over the entire memory, the winning candidate <math>y</math> is chosen as the supporting memory. <br />
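This pairwise winner-takes-all loop can be sketched as follows, with a toy overlap-plus-recency score standing in for the learned <math>\ S_{O_t} </math> (the tie-breaking rule is an illustrative simplification): <br />

```python
# Sketch of the pairwise comparison loop used with time features: the model
# keeps a current winner y and challenges it with each candidate y'. The toy
# score below prefers the candidate with more word overlap, breaking ties in
# favour of the more recently written memory (a stand-in for S_{O_t}).
def s_ot(x_words, y, yp, t_y, t_yp):
    # Positive -> keep y; non-positive -> replace y with y'.
    overlap_y = len(set(x_words) & set(y.split()))
    overlap_yp = len(set(x_words) & set(yp.split()))
    if overlap_y != overlap_yp:
        return overlap_y - overlap_yp
    return 1 if t_y > t_yp else -1   # tie-break: prefer the newer memory

def select_with_time(x, memories):
    winner = 0                        # memory index doubles as write time
    for i in range(1, len(memories)):
        if s_ot(x.split(), memories[winner], memories[i], winner, i) <= 0:
            winner = i
    return winner

memories = ["john went to the kitchen", "john went to the garden"]
print(select_with_time("where is john", memories))  # -> 1 (the newer memory)
```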
<br />
Some further extensions concern allowing the model to deal with words not included in its vocabulary, and to more effectively take advantage of exact word matches between input queries and candidate supporting memories.<br />
<br />
Embedding models cannot efficiently use exact word matches due to the low dimensionality <math>n</math>. One solution is to instead score a pair <math>x,y</math> with <math>\ \Phi_x(x)^T U^T U \Phi_y(y)+\lambda\Phi_x(x)^T\Phi_y(y) </math>. That is, the "bag of words" matching score is added to the learned embedding score (with a mixing parameter <math>\lambda</math>). Another, related approach is to stay in the ''n''-dimensional embedding space, but to extend the feature representation ''D'' with matching features, e.g., one per word. A matching feature indicates whether a word occurs in both ''x'' and ''y''. That is, we score with <math>\ \Phi_x(x)^T U^T U \Phi_y(y,x)</math>, where <math>\ \Phi_y</math> is built conditionally on ''x'': if some of the words in ''y'' match the words in ''x'', those matching features are set to 1. Unseen words can be modeled similarly by using matching features on their context words. This then gives a feature space of D = 8|W|.<br />
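The first solution, mixing the raw bag-of-words match into the embedding score, can be sketched as follows (the vocabulary, the random <math>\ U </math>, and the value of <math>\lambda</math> are illustrative assumptions): <br />

```python
import numpy as np

# Sketch of mixing the learned embedding score with a raw bag-of-words match
# term: score = Phi_x(x)^T U^T U Phi_y(y) + lambda * Phi_x(x)^T Phi_y(y).
vocab = {"john": 0, "milk": 1, "garden": 2, "where": 3}
D = len(vocab)
rng = np.random.default_rng(1)
U = rng.normal(size=(4, D))   # would be learned in practice

def bow(words):
    v = np.zeros(D)
    for w in words:
        v[vocab[w]] = 1.0
    return v

def mixed_score(x_words, y_words, lam=0.5):
    px, py = bow(x_words), bow(y_words)
    # The lambda term rewards exact word overlap on top of the
    # embedding similarity.
    return px @ U.T @ U @ py + lam * (px @ py)

print(float(mixed_score(["where", "john"], ["john", "garden"])))
```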
<br />
= Related work =<br />
<br />
There are two general approaches to performing question answering that have been developed in the literature. The first makes use of a technique known as semantic parsing to map a query expressed in natural language onto a representation in some formal language that directly extracts information from some external memory such as a knowledge base<ref>J. Berant, A. Chou, R. Frostig, and P. Liang. [http://cs.stanford.edu/~pliang/papers/freebase-emnlp2013.pdf "Semantic parsing on Freebase from question-answer pairs."] In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref>P. Liang, M. Jordan, and D. Klein. [http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00127 "Learning dependency-based compositional semantics"]. In Computational Linguistics, 39.2, p. 389-446. </ref>. Neural networks and embedding approaches have also been recently explored. Compared to such knowledge-base approaches, memory networks differ in that they do not apply a two-stage strategy of first parsing the query into a formal representation and then executing it against the external memory. <br />
<br />
The second makes use of embedding methods to represent queries and candidate answers (typically extracted from a knowledge base) as high-dimensional vectors. Learning involves producing embeddings that place query vectors close to the vectors that correspond to their answers<ref>Bordes, A., S. Chopra, and J. Weston. [http://www.thespermwhale.com/jaseweston/papers/fbqa.pdf "Question Answering with Subgraph Embeddings"]. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, (2014)</ref>. Memory networks fall under the latter approach, and existing variants of this approach can be seen as special cases of the memory network architecture.<br />
<br />
= Experimental Results = <br />
<br />
The authors first test a simple memory network (i.e. <math>\ k=1 </math>) on a large-scale question answering task involving a dataset consisting of 14 million subject-relation-object triplets. Each triplet is stored as an item in memory, and the answer to a particular question is a single entity (i.e. a subject or object) in one of these triplets. The results in the table below indicate that memory networks perform quite well on this task. Note that the memory network with 'bag of words' features includes the extension designed to indicate the presence of exact matches of words in a query and a candidate answer. This seems to contribute significantly to improved performance. <br />
<br />
[[File:largescale.png | frame | centre | Results on a large-scale QA task.]]<br />
<br />
Scoring a query against all 14 million candidate memories is slow, so the authors also test their hashing techniques and report the resulting speed-accuracy tradeoffs. As shown in the figure below, the use of cluster-based hashing results in a negligible drop in performance while considering only 1/80th of the complete set of items stored in memory. <br />
<br />
[[File:hash.png | frame | centre | Memory hashing results on a large-scale QA task.]]<br />
<br />
To test their model on more complex tasks that require chains of inference, the authors create a synthetic dataset consisting of approximately 7 thousand statements and 3 thousand questions focused on a toy environment comprising 4 people, 3 objects, and 5 rooms. Stories involving multiple statements describing actions performed by these people (e.g. moving an object from one room to another) are used to define the question answering tasks. Questions are focused on a single entity mentioned in a story, and the difficulty of the task is controlled by varying how long ago the most recent mention of this entity is in the story (e.g. the most recent statement in the story vs. the 5th most recent statement in the story). The figure at the top of this page gives an example of these tasks being performed. <br />
<br />
In the results below, 'Difficulty 1' tasks are those in which the entity being asked about was mentioned in the most recent statement of the story, while 'Difficulty 5' tasks are those in which the entity being asked about was mentioned in one of the 5 most recent statements. Questions about an 'actor' concern a statement that mentions a person but not an object (e.g. "John went to the garden"). The questions may ask for the current location of the person (e.g. "Where is John?") or the previous location of the person (e.g. "Where was John before the garden?"); the column labelled "actor w/o before" in the figure below excludes this latter type of question. More complex questions involve asking about the object in a statement that mentions both a person and an object (e.g. given "John dropped the milk", the question might be "Where is the milk?"). Note that this task is more challenging, since it requires using multiple pieces of information (i.e. where John was, and what he did while he was there). Comparisons against RNNs and LSTMs are also reported, and for multiword responses, as in the first figure above, an LSTM is used in place of <math>\ R </math>.<br />
<br />
[[File:toyqa.png | frame | centre | Test accuracy on a simulated world QA task.]]<br />
<br />
What is most notable about these results is that the inclusion of time features in the MemNN seems to be responsible for most of the improvement over RNNs and LSTMs. <br />
<br />
<br />
= Discussion = <br />
<br />
One potential concern about the memory network architecture concerns its generalizability to large values of <math>\ k </math>. To explain, each additional supporting memory increases the number of columns in the embedding matrices by the size of the model's vocabulary. This could become impractical for standard vocabularies with tens of thousands of terms. <br />
<br />
A second concern is that the memory network, as described, is engineered to answer very particular kinds of questions (i.e. questions in which the order of events is important). To handle different kinds of questions, different features would likely need to be added (e.g. quantificational features to handle statements involving quantifiers such as 'some', 'many', etc.). This sort of ad-hoc design calls into question whether the architecture is capable of performing scalable, general-purpose question answering. <br />
<br />
= Resources =<br />
<br />
Memory Network implementations on [https://github.com/facebook/MemNN Github] <br />
<br />
= Bibliography =<br />
<br />
<references /></div>
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=memory_Networks&diff=26912 memory Networks 2015-11-27T01:14:35Z<p>Trttse: /* Related work */</p>
<hr />
<div>= Introduction =<br />
<br />
Most supervised machine learning models are designed to approximate a function that maps input data to a desirable output (e.g. a class label for an image or a translation of a sentence from one language to another). In this sense, <br />
such models perform inference using a 'fixed' memory in the form of a set of parameters learned during training. For example, the memory of a recurrent neural network is constituted largely by the weights on the recurrent connections to its hidden layer (along with the layer's activities). As is well known, this form of memory is inherently limited given the fixed dimensionality of the weights in question. It is largely for this reason that recurrent nets have difficulty learning long-range dependencies in sequential data. Note that learning such dependencies requires ''remembering'' items in a sequence for a large number of time steps. <br />
<br />
For an interesting class of problems, it is essential for a model to be able to learn long-term dependencies, and to more generally be able to learn to perform inferences using an arbitrarily large memory. Question-answering tasks are paradigmatic of this class of problems, since performing well on such tasks requires remembering all of the information that constitutes a possible answer to the questions being posed. In principle, a recurrent network such as an LSTM could learn to perform QA tasks, but in practice, the amount of information that can be retained by the weights and the hidden states in the LSTM is simply insufficient. <br />
<br />
Given this need for a model architecture that combines inference and memory in a sophisticated manner, the authors of this paper propose what they refer to as a "Memory Network". In brief, a memory network is a model that learns to read and write data to an arbitrarily large long-term memory, while also using the data in this memory to perform inferences. The rest of this summary describes the components of a memory network in greater detail, along with some experiments describing its application to a question answering task involving short stories. Below is an example illustrating the model's ability to answer simple questions after being presented with short, multi-sentence stories. <br />
<br />
[[File:QA_example.png | frame | centre | Example answers (in red) using a memory network for question answering. ]]<br />
<br />
= Model Architecture =<br />
<br />
A memory network is composed of a memory <math>\ m</math> (in the form of a collection of vectors or strings, indexed individually as <math>\ m_i</math>), and four possibly learned functions <math>\ I</math>, <math>\ G</math>, <math>\ O</math>, and <math>\ R</math>. The functions are defined as follows:<br />
*<math>\ I</math> maps a natural language expression onto an 'input' feature representation (e.g., a real-valued vector). The input can either be a fact to be added to the memory <math>\ m</math> (e.g. 'John is at the university') , or a question for which an answer is being sought (e.g. 'Where is John?'). <br />
*<math>\ G</math> updates the contents of the memory <math>\ m</math> on the basis of an input. The updating can involve simply writing the input to new memory location, or it can involve the modification or compression of existing memories to perform a kind of generalization on the state of the memory. <br />
*<math>\ O</math> produces an 'output' feature representation given a new input and the current state of the memory. The input and output feature representations reside in the same embedding space. <br />
*<math>\ R</math> produces a response given an output feature representation. This response is usually a word or a sentence, but in principle it could also be an action of some kind (e.g. the movement of a robot).<br />
<br />
To give a quick overview of how the model operates, an input ''x'' is first mapped to a feature representation <math>\ I(x)</math>. Then, for all memories ''i'', the following update is applied: <math>\ m_i = G(m_i, I(x), m) </math>. This means that each memory is updated on the basis of the input ''x'' and the current state of the memory <math>\ m</math>. In the case where each input is simply written to memory, <math>\ G</math> might simply select an index that is currently unused and write <math>\ I(x)</math> to the memory location corresponding to this index. Next, an output feature representation is computed as <math>\ o=O(I(x), m)</math>, and a response, <math>\ r</math>, is computed directly from this feature representation as <math>\ r=R(o)</math>. <math>\ O</math> can be interpreted as retrieving a small selection of memories that are relevant to producing a good response, and <math>\ R</math> actually produces the response given the feature representation produced from the relevant memories by <math>\ O</math>.<br />
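This control flow can be sketched as follows, with trivial stand-ins for each of the four components (in the real model, <math>\ O </math> and <math>\ R </math> are learned embedding models): <br />

```python
# Minimal sketch of the memory-network control flow I -> G -> O -> R,
# with toy stand-ins for each component.
class MemoryNetwork:
    def __init__(self):
        self.m = []                      # the growing memory array

    def I(self, x):                      # identity feature map (basic model)
        return x

    def G(self, feat):                   # write to the next free slot
        self.m.append(feat)

    def O(self, feat):                   # toy retrieval: best word-overlap memory
        return max(self.m,
                   key=lambda mi: len(set(feat.split()) & set(mi.split())))

    def R(self, o):                      # toy response: last word of the memory
        return o.split()[-1]

    def answer(self, question):
        return self.R(self.O(self.I(question)))

net = MemoryNetwork()
for fact in ["john is at the university", "mary is at the garden"]:
    net.G(net.I(fact))
print(net.answer("where is john"))   # -> university
```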
<br />
= A Basic Implementation =<br />
<br />
In a simple version of the memory network, input text is just written to memory in unaltered form. Or in other words, <math>\ I(x) </math> simply returns ''x'', and <math>\ G </math> writes this text to a new memory slot <math>\ m_{N+1} </math> if <math>\ N </math> is the number of currently filled slots. The memory is accordingly an array of strings, and the inclusion of a new string does nothing to modify existing strings. <br />
<br />
Given as much, most of the work being done by the model is performed by the functions <math>\ O </math> and <math>\ R </math>. The job of <math>\ O </math> is to produce an output feature representation by selecting <math>\ k </math> supporting memories from <math>\ m </math> on the basis of the input ''x''. In the experiments described in this paper, <math>\ k </math> is set to either 1 or 2. In the case that <math>\ k=1 </math>, the function <math>\ O </math> behaves as follows: <br />
<br />
<br />
:<math>\ o_1 = O_1(x, m) = argmax_{i = 1 ... N} S_O(x, m_i) </math><br />
<br />
<br />
where <math>\ S_O </math> is a function that scores a candidate memory for its compatibility with ''x''. Essentially, one 'supporting' memory is selected from <math>\ m </math> as being most likely to contain the information needed to answer the question posed in <math>\ x </math>. In this case, the output is <math>\ o_1 = [x, m_{o_1}] </math>, or a list containing the input question and one supporting memory. Alternatively, in the case that <math>\ k=2 </math>', a second supporting memory is selected on the basis of the input and the first supporting memory, as follows: <br />
<br />
<br />
:<math>\ o_2 = O_2(x, m) = argmax_{i = 1 ... N} S_O([x, m_{o_1}], m_i) </math><br />
<br />
<br />
Now, the overall output is <math>\ o_2 = [x, m_{o_1}, m_{o_2}] </math>. (These lists are translated into feature representations as described below). Finally, the result of <math>\ O </math> is used to produce a response in the form of a single word via <math>\ R </math> as follows:<br />
<br />
<br />
:<math>\ r = argmax_{w \epsilon W} S_R([x, m_{o_1}, m_{o_2}], w) </math><br />
<br />
<br />
In short, a response is produced by scoring each word in a set of candidate words against the representation produced by the combination of the input and the two supporting memories. The highest scoring candidate word is then chosen as the model's output. The learned portions of <math>\ O </math> and <math>\ R </math> are the parameters of the functions <math>\ S_O </math> and <math>\ S_R </math>, which perform embeddings of the raw text constituting each function argument, and then return the dot product of these two embeddings as a score. Formally, the function <math>\ S_O </math> can be defined as follows; <math>\ S_R </math> is defined analogously:<br />
<br />
<br />
:<math>\ S_O(x, y) = \Phi_x(x)^T U^T U \Phi_y(y) </math><br />
<br />
<br />
In this equation, <math>\ U </math> is an <math>\ n \times D </math> matrix, where ''n'' is the dimension of the embedding space, and ''D'' is the number of features used to represent each function argument. <math>\ \Phi_x</math> and <math>\ \Phi_y </math> are functions that map each argument (which are strings) into the feature space. In the implementations considered in this paper, the feature space makes use of a bag-of-words representation, such that there are 3 binary features for each word in the model's vocabulary. The first feature corresponds to the presence of the word in the input ''x'', the second feature corresponds to the presence of the word in first supporting memory that is being used to select a second supporting memory, and the third feature representation corresponds to the presence of the word in a candidate memory being scored (i.e. either the first or second supporting memory retrieved by the model). Having these different features allows the model to learn distinct representations for the same word depending on whether the word is present in an input question or in a string stored in memory. <br />
<br />
Intuitively, it helps to think of the columns of <math>\ U </math> containing distributed representations of each word in the vocabulary (specifically, there are 3 representations and hence 3 columns devoted to each word). The binary feature representation <math>\ \Phi_x(x)</math> maps the text in ''x'' onto a binary feature vector, where 1's in the vector indicate the presence of a particular word in ''x'', and 0's indicate the absence of this word. Note that different elements of the vector will be set to 1 depending on whether the word occurs in the input ''x'' or in a supporting memory (i.e. when ''x'' is a list containing the input and a supporting memory). The matrix-vector multiplications in the above equation effectively extract and sum the distributed representations corresponding to each of the inputs, ''x'' and ''y''. Thus, a single distributed representation is produced for each input, and the resulting score is the dot product of these two vectors (which in turn is the cosine of the angle between the vectors scaled by the product of the vector norms). In the case where ''x'' is the input query, and ''y'' is a candidate memory, a high dot product indicates that the model thinks that the candidate in question is very relevant to answering the input query. In the case where ''x'' is the output of <math>\ O</math> and ''y'' is a candidate response word, a high dot product indicates that the model thinks that the response word is an appropriate answer given the output feature representation produced by <math>\ O</math>. Distinct embedding matrices <math>\ U_O </math> and <math>\ U_R </math> are used to compute the output feature representation and the response. <br />
<br />
The goal of learning is find embedding matrices in which the representations produced for queries, supporting memories, and responses are spatially related such that representations of relevant supporting memories are close to the representations of a query, and such that representations of individual words are close to the output feature representations of the questions they answer. The method used to perform this learning is described in the next section. <br />
<br />
= The Training Procedure =<br />
<br />
Learning is conducted in a supervised manner; the correct responses and supporting memories for each query are provided during training. The following margin-ranking loss function is used in tandem with stochastic gradient descent to learn the parameters of <math>\ U_O </math> and <math>\ U_R </math>, given an input ''x'', a desired response ''r'', and desired supporting memories, <math>\ m_{o_1}</math> and <math>\ m_{o_2}</math>:<br />
<br />
:<math> \sum_{f \neq m_{o_1}} max(0, \gamma + S_O (x, f) - S_O (x, m_{o_1})) + \sum_{f^' \neq m_{o_2}} max(0, \gamma + S_O ([x, m_{o_1}], f^') - S_O ([x, m_{o_1}], m_{o_2})) + </math><br />
:<math> \sum_{r^' \neq r} max(0, \gamma + S_R ([x, m_{o_1}, m_{o_2}], r^') - S_R ([x, m_{o_1}, m_{o_2}], r)) </math><br />
<br />
where <math>\ f</math>, <math>\ f^'</math> and <math>\ r^'</math> correspond to incorrect candidates for the first supporting memory, the second supporting memory, and the output response, and <math> \gamma</math> corresponds to the margin. Intuitively, each term in the sum penalizes the current parameters in proportion to the number of incorrect memories and responses that get assigned a score within the margin of the score of the correct memories and responses. Or in other words, if the score of a correct candidate memory / response is higher than the score of every incorrect candidate by at least <math> \gamma </math>, the cost is 0. Otherwise, the cost is the sum over all of the differences between the incorrect scores (plus gamma) and the correct score. In fact, this is just the standard hinge loss function. Weston et al. speed up gradient descent by sampling incorrect candidates instead of using all incorrect candidates in the calculation of the gradient for each training example. <br />
<br />
= Extensions to the Basic Implementation = <br />
<br />
Some limitations of the basic implementation are that it can only output single word responses, can only accept strings (rather than sequences) as input, and cannot use its memory in efficient or otherwise interesting ways. The authors propose a series of extensions to the basic implementation described in the previous section that are designed to overcome these limitations. First, they propose a segmenting function that learns when to segment an input sequence into discrete chunks that get written to individual memory slots. The segmenter is modeled similarly to other components, as an embedding model of the form:<br />
<br />
<math><br />
seg(c)=W^T_{seg}U_s\Phi_{seg}(c)<br />
</math><br />
<br />
where <math>W_{seg}</math> is a vector (effectively the parameters of a linear classifier in embedding space), and <math>c</math> is the sequence of input words represented as a bag of words using a separate dictionary. If <math>seg(c) > \gamma</math>, where <math>\gamma</math> is the margin, then this sequence is recognized as a segment.<br />
<br />
Second, they propose the use of hashing to avoid scoring a prohibitively large number of candidate memories. Each input corresponding to a query is hashed into some number of buckets, and only candidates within these buckets are scored during the selection of supporting memories. Hashing is done either by making a bucket per word in the model's vocabulary, or by clustering the learning word embeddings, and creating a bucket per cluster. <br />
<br />
The most important extension proposed by the authors involves incorporating information about the time at which a memory was written into the scoring function <math>/ S_O </math>. The model needs to be able to make use of such information to correctly answer questions such as "Where was John before the university" (assuming the model has been told some story about John). To handle temporal information, the feature space is extend to include features that indicate the relative time between when two items where written to memory. Formally, this yields the following revised scoring function:<br />
<br />
<br />
<math>\ S_{O_t}(x, y, y^') = \Phi_x(x)^T U^T U (\Phi_y(y)-\Phi_y(y^')+\Phi_t(x,y,y^'))</math><br />
<br />
<br />
The novelty here lies in the feature mapping function <math> \Phi_t </math>, which takes an input and two candidate supporting memories, and returns a binary feature vector as before, but with the addition of three features that indicate whether <math>x</math> is older than <math>y</math>, whether <math>x</math> is older than <math>y^'</math>, and whether <math>y</math> is older than <math>y'</math>. The model loops over all candidate memories, comparing candidates <math>y</math> and <math>y^'</math>. If <math> S_{O_t}(x, y, y^') </math> is greater than 0, then <math>y</math> is preferred over <math>y^'</math>; otherwise, <math>y'</math> is preferred. If <math>y'</math> is preferred, <math>y</math> is replaced by <math>y'</math> and the loop continues to the next candidate memory (i.e. the new <math>y^'</math>. Once the loop finishes iterating over the entire memory, the winning candidate <math>y</math> is chosen as the supporting memory. <br />
<br />
Some further extensions concern allowing the model to deal with words not included in its vocabulary, and to more effectively take advantage of exact word matches between input queries and candidate supporting memories.<br />
<br />
Embedding models cannot efficiently use exact word matches due to the low dimensionality <math>n</math>. One solution is to instead score a pair <math>x,y</math> with <math>\ \Phi_x(x)^T U^T U\Phi_y(y)+\lambda\Phi_x(x)^T\Phi_y(y) </math>. That is, the "bag of words" matching score is added to the learned embedding score (with a mixing parameter <math>\lambda</math>). Another related approach is to stay in the <math>n</math>-dimensional embedding space, but to extend the feature representation <math>D</math> with matching features, e.g., one per word. A matching feature indicates whether a word occurs in both <math>x</math> and <math>y</math>. That is, we score with <math>\ \Phi_x(x)^T U^T U\Phi_y(y,x)</math>, where <math>\ \Phi_y</math> is actually built conditionally on <math>x</math>: if some of the words in <math>y</math> match the words in <math>x</math>, those matching features are set to 1. Unseen words can be modeled similarly by using matching features on their context words. This then gives a feature space of <math>D = 8|W|</math>.<br />
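The first variant, the mixed embedding plus bag-of-words score, can be sketched as follows. The dimensions, feature vectors, and matrix <math>U</math> are invented for illustration only.<br />

```python
import numpy as np

# Toy setup: 6 binary word features, 3-dim embedding space.
rng = np.random.default_rng(1)
D, n = 6, 3                      # feature dimension, embedding dimension
U = rng.normal(size=(n, D))      # stand-in for the learned embedding matrix

def mixed_score(phi_x, phi_y, lam=0.5):
    """Embedding score plus a bag-of-words exact-match score weighted by lam."""
    emb = (U @ phi_x) @ (U @ phi_y)   # Phi_x^T U^T U Phi_y
    bow = phi_x @ phi_y               # exact word-overlap count
    return emb + lam * bow

phi_x = np.array([1, 0, 1, 0, 0, 0], dtype=float)   # words present in x
phi_y = np.array([1, 0, 0, 1, 0, 0], dtype=float)   # words present in y
s = mixed_score(phi_x, phi_y)
```

Raising `lam` increases the contribution of exact word matches relative to the learned embedding similarity.<br />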
<br />
= Related work =<br />
<br />
There are two general approaches to performing question answering that have been developed in the literature. The first makes use of a technique known as semantic parsing to map a query expressed in natural language onto a representation in some formal language that directly extracts information from some external memory such as a knowledge base<ref>J. Berant, A. Chou, R. Frostig, and P. Liang. [http://cs.stanford.edu/~pliang/papers/freebase-emnlp2013.pdf "Semantic parsing on Freebase from question-answer pairs."] In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref>P. Liang, M. Jordan, and D. Klein. [http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00127 "Learning dependency-based compositional semantics"]. In Computational Linguistics, 39.2, p. 389-446. </ref>. <br />
<br />
The second makes use of embedding methods to represent queries and candidate answers (typically extracted from a knowledge base) as high-dimensional vectors. Learning involves producing embeddings that place query vectors close to the vectors that correspond to their answers <ref>Bordes, A., S. Chopra, and J. Weston. [http://www.thespermwhale.com/jaseweston/papers/fbqa.pdf "Question Answering with Subgraph Embeddings"]. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, (2014)</ref>. Memory networks fall under the latter approach, and existing variants of this approach can be seen as special cases of the memory network architecture (e.g., <ref>Bordes, A., S. Chopra, and J. Weston. [http://www.thespermwhale.com/jaseweston/papers/fbqa.pdf "Question Answering with Subgraph Embeddings"]. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, (2014)</ref>).<br />
<br />
= Experimental Results = <br />
<br />
The authors first test a simple memory network (i.e., <math>\ k=1 </math>) on a large-scale question answering task involving a dataset consisting of 14 million subject-relation-object triplets. Each triplet is stored as an item in memory, and the answer to a particular question is a single entity (i.e. a subject or object) in one of these triplets. The results in the table below indicate that memory networks perform quite well on this task. Note that the memory network with 'bag of words' features includes the extension designed to indicate the presence of exact matches of words in a query and a candidate answer. This seems to contribute significantly to improved performance. <br />
<br />
[[File:largescale.png | frame | centre | Results on a large-scale QA task.]]<br />
<br />
Scoring a query against all 14 million candidate memories is slow, so the authors also test their hashing techniques and report the resulting speed-accuracy tradeoffs. As shown in the figure below, the use of cluster-based hashing results in a negligible drop in performance while considering only 1/80th of the complete set of items stored in memory. <br />
<br />
[[File:hash.png | frame | centre | Memory hashing results on a large-scale QA task.]]<br />
<br />
To test their model on more complex tasks that require chains of inference, the authors create a synthetic dataset consisting of approximately 7 thousand statements and 3 thousand questions focused on a toy environment comprised of 4 people, 3 objects, and 5 rooms. Stories involving multiple statements describing actions performed by these people (e.g. moving an object from one room to another) are used to define the question answering tasks. Questions are focused on a single entity mentioned in a story, and the difficulty of the task is controlled by varying how long ago the most recent mention of this entity occurred in the story (e.g. the most recent statement in the story vs. the 5th most recent statement). The figure at the top of this page gives an example of these tasks being performed. <br />
<br />
In the results below, 'Difficulty 1' tasks are those in which the entity being asked about was mentioned in the most recent statement of the story, while 'Difficulty 5' tasks are those in which the entity being asked about was mentioned in one of the 5 most recent statements. Questions about an 'actor' concern a statement that mentions a person but not an object (e.g. "John went to the garden"). The questions may ask for the current location of the person (e.g. "Where is John?") or the previous location of the person (e.g. "Where was John before the garden?") (the column labelled "actor w/o before" in the figure below excludes this latter type of question). More complex questions involve asking about the object in a statement that mentions both a person and an object (e.g. given "John dropped the milk", the question might be "Where is the milk?"). Note that this task is more challenging, since it requires using multiple pieces of information (i.e. where John was, and what he did while he was there). Comparisons with RNNs and LSTMs are also reported, and for multiword responses, as in the first figure above, an LSTM is used in place of <math>\ R </math>.<br />
<br />
[[File:toyqa.png | frame | centre | Test accuracy on a simulated world QA task.]]<br />
<br />
What is most notable about these results is that the inclusion of time features in the MemNN seems to be responsible for most of the improvement over RNNs and LSTMs. <br />
<br />
<br />
= Discussion = <br />
<br />
One potential concern about the memory network architecture is its generalizability to large values of <math>\ k </math>: each additional supporting memory increases the number of columns in the embedding matrices by the size of the model's vocabulary, which could become impractical for standard vocabularies with tens of thousands of terms. <br />
<br />
A second concern is that the memory network, as described, is engineered to answer very particular kinds of questions (i.e. questions in which the order of events is important). To handle different kinds of questions, different features would likely need to be added (e.g. quantificational features to handle statements involving quantifiers such as 'some', 'many', etc.). This sort of ad-hoc design calls into question whether the architecture is capable of performing scalable, general-purpose question answering. <br />
<br />
= Resources =<br />
<br />
Memory Network implementations on [https://github.com/facebook/MemNN Github] <br />
<br />
= Bibliography =<br />
<br />
<references /></div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=question_Answering_with_Subgraph_Embeddings&diff=26094question Answering with Subgraph Embeddings2015-11-10T06:14:48Z<p>Trttse: </p>
<hr />
<div>==Introduction==<br />
Teaching machines to answer questions automatically in natural language has been a long-standing goal in AI. There has been a rise in large-scale structured knowledge bases (KBs), such as Freebase <ref name=one>K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. [http://arxiv.org/pdf/1406.3676v3.pdf "Freebase: a collaboratively created graph database for structuring human knowledge."] In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008.</ref>, to tackle the problem known as open-domain question answering (or open QA). However, the scale of these KBs and the difficulty machines have in interpreting natural language still make this problem challenging.<br />
<br />
Open QA techniques can be classified into two main categories:<br />
*Information retrieval based: retrieve a broad set of candidate answers by first querying the API of the KB, then narrow down the answer using heuristics<ref name=two>O. Kolomiyets and M.-F. Moens. [https://lirias.kuleuven.be/bitstream/123456789/313539/1/KolomiyetsMoensIS2011.pdf "A survey on question answering technology from an information retrieval perspective."] Information Sciences, 181(24):5412–5434, 2011.</ref><ref>C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano [http://liris.cnrs.fr/~pchampin/enseignement/semweb/_static/articles/unger_2012.pdf "Template-based Question Answering over RDF Data"] In Proceedings of the 21st international conference on World Wide Web, 2012.</ref><ref name=four>X. Yao and B. Van Durme. [http://cs.jhu.edu/~xuchen/paper/yao-jacana-freebase-acl2014.pdf "Information extraction over structured data: Question answering with freebase."] In Proceedings of the 52nd Annual Meeting of the ACL, 2014.</ref>.<br />
*Semantic parsing based: focus on the correct interpretation of the query. Querying the interpreted question from the KB should return the correct answer <ref>J. Berant, A. Chou, R. Frostig, and P. Liang. [http://cs.stanford.edu/~pliang/papers/freebase-emnlp2013.pdf "Semantic parsing on Freebase from question-answer pairs."] In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref name=six>T. Kwiatkowski, E. Choi, Y. Artzi, and L. Zettlemoyer. [http://yoavartzi.com/pub/kcaz-emnlp.2013.pdf "Scaling Semantic Parsers with On-the-fly Ontology Matching."] In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref>J. Berant and P. Liang. [http://cs.stanford.edu/~pliang/papers/paraphrasing-acl2014.pdf "Semantic parsing via paraphrasing."] In Proceedings of the 52nd Annual Meeting of the ACL, 2014.</ref><ref>A. Fader, L. Zettlemoyer, and O. Etzioni. [https://homes.cs.washington.edu/~lsz/papers/fze-kdd14.pdf "Open Question Answering Over Curated and Extracted Knowledge Bases."] In Proceedings of KDD’14. ACM, 2014.</ref><br />
<br />
Both of these approaches require non-negligible human intervention (hand-crafted lexicons, grammars, and KB schemas) to be effective.<br />
<br />
<ref name=five>A. Bordes, J. Weston, and N. Usunier. [http://arxiv.org/pdf/1404.4326v1.pdf "Open question answering with weakly supervised embedding models."] In Proceedings of ECML-PKDD’14. Springer, 2014.</ref> proposed a vectorial feature representation model for this problem. The goal of this paper is to provide an improved model over <ref name=five/>, specifically with the following contributions:<br />
*A more sophisticated inference procedure that is more efficient and can consider longer paths.<br />
*A richer representation of the answers, which encodes the question-answer path and the surrounding subgraph of the KB.<br />
<br />
==Task Definition==<br />
The motivation is to provide a system for open QA that can be trained as long as the following are available:<br />
*A training set of questions paired with answers.<br />
*A KB providing a structure among answers.<br />
<br />
WebQuestions <ref name=one/> was used as the evaluation benchmark. WebQuestions contains relatively few samples, so it was not possible to train the system on this dataset alone. The following describes the data sources used for training.<br />
*WebQuestions: the dataset built using Freebase as the KB, containing 5810 question-answer pairs. It was created by crawling questions through the Google Suggest API and then obtaining answers using Amazon Mechanical Turk (Turkers were only allowed to use Freebase as the querying tool).<br />
*Freebase: a huge database of general facts organized as triplets (<code>subject</code>, <code>type1.type2.predicate</code>, <code>object</code>). The form of the data from Freebase does not correspond to a structure found in natural language, so the triplets were converted into questions of the following format: "What is the <code>predicate</code> of the <code>type2 subject</code>?" Note that all data from Freebase will have a fixed format, which is not realistic in terms of natural language. <br />
*ClueWeb Extractions: The team also used ClueWeb extractions as per <ref name=one/> and <ref name=ten>T. Lin, O. Etzioni, et al. [http://aiweb.cs.washington.edu/research/projects/aiweb/media/papers/elaws.pdf "Entity Linking at Web Scale."] In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 84–88. Association for Computational Linguistics, 2012.</ref>. ClueWeb has the format (<code>subject</code>, "text string", <code>object</code>), and it was ensured that both the <code>subject</code> and <code>object</code> were linked to Freebase. These triples were also converted into questions using simple patterns and Freebase types.<br />
*Paraphrases: automatically generated sentences have a rigid format and semi-automatic wording, which does not provide a satisfactory model of natural language. To overcome this, the team supplemented their data with paraphrases collected from WikiAnswers. Users on WikiAnswers can tag sentences as rephrasings of each other: <ref name=six/> harvested 2M distinct questions from WikiAnswers, which were grouped into 350k paraphrase clusters.<br />
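The Freebase triplet-to-question conversion described above can be sketched as follows. The template follows the wording in the text; the example triplet and function name are our own invented illustration, not the paper's code.<br />

```python
def triplet_to_question(subject, relation, obj):
    """Turn a (subject, type1.type2.predicate, object) triplet into a
    question/answer pair using the fixed template from the text."""
    type1, type2, predicate = relation.split(".")
    question = f"what is the {predicate} of the {type2} {subject}?"
    return question, obj   # the object is the answer

q, a = triplet_to_question("obama", "people.person.nationality", "usa")
# q == "what is the nationality of the person obama?", a == "usa"
```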
<br />
Table 2 shows some example sentences from each dataset category.<br />
<br />
[[File:qaembedding_table2.JPG | center]]<br />
<br />
==Embedding Questions and Answers==<br />
We wish to train our model such that representations of questions and their corresponding answers are close to each other in the joint embedding space. Let ''q'' denote a question and ''a'' denote an answer. Learning embeddings is achieved by learning a score function ''S''(''q'', ''a'') so that ''S'' generates a high score if ''a'' is the correct answer to ''q'', and a low score otherwise.<br />
<br />
:<math> S(q, a) = f(q)^\mathrm{T} g(a) \,</math><br />
<br />
<br />
Let <math>\mathbf{W}</math> be a matrix in <math>\mathbb{R}^{k \times N}</math>, where ''k'' is the dimension of the embedding space and ''N'' is the size of the dictionary of embeddings to be learned. The function <math>f(\cdot)</math>, which maps the questions into the embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>f(q) = \mathbf{W}\phi(q)</math>, where <math>\phi(q) \in \mathbb{N}^N</math> is a sparse vector indicating the number of times each word appears in the question ''q''. Likewise, the function <math>g(\cdot)</math>, which maps the answers to the same embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>g(a) = \mathbf{W}\psi(a)</math>, where <math>\psi(a) \in \mathbb{N}^N</math> is a sparse vector representation of the answer ''a''. Figure 1 below depicts the subgraph embedding model.<br />
<br />
[[File:qaembedding_figure1.JPG | center]]<br />
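The scoring model above can be sketched in a few lines. This is a toy illustration only: the tiny vocabulary, dimensions, and random <math>\mathbf{W}</math> are invented stand-ins for the learned quantities.<br />

```python
import numpy as np

# Toy dictionary mixing question words and a KB entity, as in the paper's setup.
rng = np.random.default_rng(42)
vocab = ["who", "directed", "the", "matrix", "some_entity"]
N, k = len(vocab), 4
W = rng.normal(size=(k, N))     # stand-in embedding matrix: one column per symbol

def phi(tokens):
    """Sparse count vector over the dictionary (words and KB entities alike)."""
    v = np.zeros(N)
    for t in tokens:
        v[vocab.index(t)] += 1.0
    return v

def score(q_tokens, a_tokens):
    """S(q, a) = f(q)^T g(a) = (W phi(q))^T (W psi(a))."""
    return (W @ phi(q_tokens)) @ (W @ phi(a_tokens))

s = score(["who", "directed", "the", "matrix"], ["some_entity"])
```

Training (next subsection) adjusts `W` so that correct question-answer pairs receive higher scores than incorrect ones.<br />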
<br />
<br />
===Representing Candidate Answers===<br />
Let us now consider possible feature representations for a single candidate answer. We consider three different representations corresponding to different subgraphs of Freebase around it.<br />
<br />
:(i) Single Entity: The answer is represented as a single entity from Freebase. <math>\psi(a)</math> is a 1-of-<math>N_S</math> coded vector with 1 corresponding to the entity of the answer, and 0 elsewhere.<br />
:(ii) Path Representation: The answer is represented as a path from the entity in the question to the answer entity. Only 1- or 2-hop paths were considered in the experiments, which results in a <math>\psi(a)</math> that is 3-of-<math>N_S</math> or 4-of-<math>N_S</math> coded.<br />
:(iii) Subgraph Representation: We encode both the path representation from (ii), and the entire subgraph of entities that connect to the answer entity.<br />
<br />
The hypothesis is that the more information that we include about the answer in its representation space, the better the results, and hence, we adopted the subgraph approach.<br />
<br />
===Training and Loss Function===<br />
The model was trained using a margin-based ranking loss function. Let <math>D = \{(q_i, a_i) : i = 1, \ldots, |D|\}</math> be the training set of questions <math>q_i</math> paired with their correct answers <math>a_i</math>. The loss function we minimize is<br />
<br />
:<math>\sum_{i=1}^{|D|} \sum_{\overline{a} \in \overline{A}(a_i)} \max\{0, m - S(q_i, a_i) + S(q_i, \overline{a})\},</math><br />
<br />
where ''m'' is the margin (fixed to 0.1). Minimizing the loss function learns the embedding matrix <math>\mathbf W</math> so that the score of a question paired with its correct answer is greater than its score with any incorrect answer <math>\overline{a}</math> by at least ''m''.<br />
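The hinge terms of this loss, for one training question, can be sketched as follows. The toy `score` function here is a stand-in for the embedding model's <math>S</math>, purely for illustration.<br />

```python
def ranking_loss(score, q, a_correct, negatives, m=0.1):
    """Sum of hinge terms max(0, m - S(q, a) + S(q, a_bar)) over negatives a_bar."""
    return sum(max(0.0, m - score(q, a_correct) + score(q, a_bar))
               for a_bar in negatives)

score = lambda q, a: q * a   # toy score, purely for illustration
loss = ranking_loss(score, q=1.0, a_correct=2.0, negatives=[1.0, 3.0], m=0.1)
# terms: max(0, 0.1 - 2 + 1) = 0 and max(0, 0.1 - 2 + 3) = 1.1, so loss = 1.1
```

A negative that already scores at least `m` below the correct answer contributes nothing; only margin violations are penalized.<br />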
<br />
===Multitask Training of Embeddings===<br />
Since many of the questions in the training set were synthetically created, they do not adequately cover the range of syntax used in natural language. Hence, we also multi-task the training of our model with the task of paraphrase prediction. We do this by alternating the training of ''S'' with another scoring function defined as <math>S_{prp}(q_1, q_2) = f(q_1)^\mathrm{T} f(q_2)</math>, which uses the same embedding matrix <math>\mathbf{W}</math> and encourages a pair of questions to have similar embeddings if they are paraphrases of each other, and dissimilar embeddings otherwise.<br />
<br />
===Inference===<br />
Once <math>\mathbf{W}</math> is trained, at test time, for a given question ''q'' the model predicts the answer with:<br />
<br />
:<math>\hat{a} = \operatorname{argmax}_{a' \in A(q)} S(q, a')</math><br />
<br />
where <math>A(q)</math> is the candidate answer set. For reasons of speed and precision, we create a restricted candidate set <math>A(q)</math> for each question.<br />
<br />
<math>A(q)</math> is first populated with all Freebase triples involving the entity identified in the question. This allows us to answer simple questions whose answer is directly connected to that entity. Let us denote this strategy as <math>C_1</math>.<br />
<br />
A system that answers only such questions would be limited, so we also consider 2-hop candidates, with 1-hop candidates weighted by 1.5. This strategy, denoted <math>C_2</math>, is used by default.<br />
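The inference step with the <math>C_2</math> candidate set can be sketched as below, including the 1.5 weight on 1-hop candidates mentioned above. The score function and candidate tuples are invented stand-ins, not the paper's code.<br />

```python
def predict(q, candidates, score):
    """candidates: list of (answer, n_hops); return the best weighted answer.

    1-hop candidates get the 1.5 boost described in the text.
    """
    def weighted(answer, hops):
        return (1.5 if hops == 1 else 1.0) * score(q, answer)
    return max(candidates, key=lambda c: weighted(*c))[0]

# Toy example: the 2-hop answer scores higher raw, but the 1-hop boost wins.
score = lambda q, a: {"paris": 1.0, "lyon": 1.2}[a]
best = predict("some question", [("paris", 1), ("lyon", 2)], score)
# paris: 1.5 * 1.0 = 1.5 ; lyon: 1.0 * 1.2 = 1.2, so best == "paris"
```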
<br />
==Experiments==<br />
Table 3 below indicates that their approach outperformed <ref name=four/>, <ref name=one/> and <ref name=five/>, and performs similarly to <ref name=two/>.<br />
<br />
[[File:qaembedding_table3.JPG | center]]<br />
<br />
==Conclusion==<br />
This paper presents an embedding model that learns to perform open QA through training data of question and answer pairs with a KB to provide logical structure among answers. The results have shown that the model can achieve promising performance on the competitive WebQuestions benchmark.<br />
<br />
==Bibliography==<br />
<references /></div>
<hr />
<div></div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=question_Answering_with_Subgraph_Embeddings&diff=26090question Answering with Subgraph Embeddings2015-11-10T06:12:32Z<p>Trttse: /* Embedding Questions and Answers */</p>
<hr />
<div>==Introduction==<br />
Teaching machines to answer questions automatically in natural language has been a long-standing goal in AI. There has been a rise in large-scale structured knowledge bases (KBs), such as Freebase <ref name=one>K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. [http://arxiv.org/pdf/1406.3676v3.pdf "Freebase: a collaboratively created graph database for structuring human knowledge."] In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008.</ref>, to tackle the problem known as open-domain question answering (or open QA). However, the scale of these KBs and the difficulty machines have in interpreting natural language still make this problem challenging.<br />
<br />
open QA techniques can be classified into two main categories:<br />
*Information retrieval based: retrieve a broad set of answers be first query the API of the KBs then narrow down the answer using heuristics<ref name=two>O. Kolomiyets and M.-F. Moens. [https://lirias.kuleuven.be/bitstream/123456789/313539/1/KolomiyetsMoensIS2011.pdf "A survey on question answering technology from an information retrieval perspective."] Information Sciences, 181(24):5412–5434, 2011.</ref><ref>C. Unger, L. B¨uhmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano [http://liris.cnrs.fr/~pchampin/enseignement/semweb/_static/articles/unger_2012.pdf "Template-based Question Answering over RDF Data"] In Proceedings of the 21st international conference on World Wide Web, 2012.</ref><ref name=four>X. Yao and B. Van Durme. [http://cs.jhu.edu/~xuchen/paper/yao-jacana-freebase-acl2014.pdf "Information extraction over structured data: Question answering with freebase."] In Proceedings of the 52nd Annual Meeting of the ACL, 2014.</ref>.<br />
*Semantic parsing based: focus on the correct interpretation of the query. Querying the interpreted question from the KB should return the correct answer <ref>J. Berant, A. Chou, R. Frostig, and P. Liang. [http://cs.stanford.edu/~pliang/papers/freebase-emnlp2013.pdf "Semantic parsing on Freebase from<br />
question-answer pairs."] . In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref name=six>T. Kwiatkowski, E. Choi, Y. Artzi, and L. Zettlemoyer. [http://yoavartzi.com/pub/kcaz-emnlp.2013.pdf "Scaling Semantic Parsers with On-the-fly Ontology Matching."] In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref>J. Berant and P. Liang. [http://cs.stanford.edu/~pliang/papers/paraphrasing-acl2014.pdf "Semantic parsing via paraphrasing."] In Proceedings of the 52nd Annual Meeting of the ACL, 2014.</ref><ref>A. Fader, L. Zettlemoyer, and O. Etzioni. [https://homes.cs.washington.edu/~lsz/papers/fze-kdd14.pdf "Open Question Answering Over Curated and Extracted Knowledge Bases."] In Proceedings of KDD’14. ACM, 2014.</ref><br />
<br />
Both of these approaches require negligible interventions (hand-craft lexicons, grammars and KB schemas) to be effective.<br />
<br />
<ref name=five>A. Bordes, J. Weston, and N. Usunier. [http://arxiv.org/pdf/1404.4326v1.pdf "Open question answering with weakly supervised embedding models."] In Proceedings of ECML-PKDD’14. Springer, 2014.</ref> proposed a vectorial feature representation model to this problem. The goal of this paper is to provide an improved model of <ref name=five/> specifically with the contributions of:<br />
*A more sophisticated inference procedure that is more efficient and can consider longer paths.<br />
*A richer representation of of the answers which encodes the question-answer path and surround subgraph of the KB.<br />
<br />
==Task Definition==<br />
Motivation is to provide a system for open QA able to be trained as long as:<br />
*A training set of questions paired with answers.<br />
*A KB providing a structure among answers.<br />
<br />
WebQuestions <ref name=one/> was used for evaluation benchmark. WebQuestions only contains a few samples, so it was not possible to train the system on only this dataset. The following describes the data sources used for training.<br />
*WebQuestions: the dataset built using Freebase as the KB and contains 5810 question-answer pairs. It was created by crawling questions through the Google Suggest API and then obtaining answers using Amazon Mechanical Turk (Turkers was allowed to only use Freebase as the querying tool).<br />
*Freebase: is a huge database of general facts that are organized in triplets (<code>subject</code>, <code>type1.type2.predicate</code>, <code>object</code>). The form of the data from Freebase does not correspond to a structure found in natural language and so the questions were converted using the following format: "What is the <code>predicate</code> of the <code>type2 subject</code>"? Note that all data from Freebase will have a fixed format and this is not realistic (in terms of a NL). <br />
*ClubWeb Extractions: The team also used ClueWeb extractions as per <ref name=one/> and <ref name=ten>T. Lin, O. Etzioni, et al. [http://aiweb.cs.washington.edu/research/projects/aiweb/media/papers/elaws.pdf "Entity Linking at Web Scale."] In Proceedings of the Joint<br />
Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 84–88. Association for Computational Linguistics, 2012.</ref>. ClueWeb has the format (<code>subject</code>, "text string", <code>object</code>) and it was ensured that both the <code>subject</code> and <code>object</code> was linked to Freebase. These triples were also converted into questions using simple patters and Freebase types.<br />
*Paraphrases: automatically generated sentences have a rigid format and semi-automatic wording which does not provide a satisfactory modelling of natural language. To overcome this, the team made supplemented their data with paraphrases collected from WikiAnswers. Users on WikiAnswers can tag sentences as a rephrasing of each other: <ref name=six/> harvest 2M distinct questions from WikiAnswers which were grouped into 350k paraphrase clusters.<br />
<br />
Table 2 shows some examples sentences from each dataset category.<br />
<br />
[[File:qaembedding_table2.JPG | center]]<br />
<br />
==Embedding Questions and Answers==<br />
We wish to train our model such that representations of questions and their corresponding answers are close to each other in the joint embedding space. Let ''q'' denote a question and ''a'' denote an answer. Learning embeddings is achieved by learning a score function ''S''(''q'', ''a'') so that ''S'' generates a high score if ''a'' is the correct answer to ''q'', and a low score otherwise.<br />
<br />
:<math> S(q, a) = f(q)^\mathrm{T} g(a) \,</math><br />
<br />
<br />
Let <math>\mathbf{W}</math> be a matrix of <math>\mathbb{R}^{k \times N}</math>, where ''k'' is the dimension of the embedding space and ''N'' is the dictionary of embedding to be learned. The function <math>f(\cdot)</math> which maps the questions into the embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>f(q) = \mathbf{W}\phi(q)</math>, where <math>\phi(q) \in \mathbb{N}^N</math>, is a sparse vector indicating the number of times each word appears in the question ''q''. Likewise, the function <math>g(\cdot)</math> which maps the answers to the same embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>g(a) = \mathbf{W}\psi(a)</math>, where <math>\psi(a) \in \mathbb{N}^N</math>, is a sparse vector representation of the answer ''a''. Figure 1 below depicts the subgraph embedding model.<br />
<br />
[[File:qaembedding_figure1.JPG | center]]<br />
<br />
<br />
===Representing Candidate Answers===<br />
Let us now consider possible feature representations for a single candidate answer. We consider three different representations corresponding to different subgraphs of Freebase around it.<br />
<br />
:(i) Single Entity: The answer is represented as a single entity from Freebase. <math>\psi(a)</math> is a 1-of-<math>N_S</math> coded vector with 1 corresponding to the entity of the answer, and 0 elsewhere.<br />
:(ii) Path Representation: The answer is represented as a path from the entity in the question to the answer entity. Only 1- or 2-hops paths were considered in the experiments which resulted in a <math>\psi(a)</math> which is 3-of-<math>N_S</math> or 4-of-<math>N_S</math>.<br />
:(iii) Subgraph Representation: We encode both the path representation from (ii), and the entire subgraph of entities that connect to the answer entity.<br />
<br />
The hypothesis is that the more information that we include about the answer in its representation space, the better the results, and hence, we adopted the subgraph approach.<br />
<br />
===Training and Loss Function===<br />
The model was trained using a margin-based ranking loss function. Let <math>D = {(q_i, a_i) : i = 1,..., |D|}</math> be the training set of questions <math>q_i</math> paired with their correct answer <math>a_i</math>. The loss function we minimize is<br />
<br />
:<math>\sum_{i \mathop =1}^{|D|} \sum_{\overline{a} \in \overline{A} (a_i)} max\{0,m - S(q_i, a_i) + S(q_i, \overline{a})\},</math><br />
<br />
where ''m'' is the margin (fixed to 0.1). Minimizing the loss function learns the embedding matrix <math>\mathbf W</math> so the score of a question paired with a correct answer is greater than any incorrect answer <math>\overline{a}</math> by at least ''m''.<br />
<br />
===Multitask Training of Embeddings===<br />
Since many of the questions in the training cases were synthetically created, they do not adequately cover the range of syntax used in natural language. Hence, we also multi-task the training of our model with task of phrase prediction. We do this by alternating the training of ''S'' with another scoring function defined as <math>S_{prp}(q_1, q_2) = f(q_1)^\mathrm{T} f(q_2)</math> which uses the same embedding matrix <math>\mathbf{W}</math> and makes the same embeddings of a pair of questions if they are similar to each other if they are paraphrases and make them different otherwise.<br />
<br />
===Inference===<br />
Once <math>\mathbf{W}</math> is trained, at test time, for a given question ''q'' the model predicts the answer with:<br />
<br />
:<math>\hat{a} = argmax_{a^' \in A(q)} S(q, a')</math><br />
<br />
where <math>A(q)</math> is the candidate answer set. For speed and precision issues, we create a candidate set <math>A(q)</math> for each question.<br />
<br />
<math>A(q)</math> is first populated with all triples involving this entity in Freebase. This allows us to answer simple questions which are directly related to the answer. Let us denote this strategy as <math>C_1</math>.<br />
<br />
A system that answer only such questions would be limited so we also consider 2-hops candidates. 1-hop candidates are weighted by 1.5. This strategy denoted <math>C_2</math> is used by default.<br />
<br />
==Experiments==<br />
Table 3 below indicates that their approach outperformed <ref name=four/>, <ref name=one/> and <ref name=five/>, and performs similarly as <ref name=two/>.<br />
<br />
[[File:table3.JPG | center]]<br />
<br />
==Conclusion==<br />
This paper presents an embedding model that learns to perform open QA through training data of question and answer pairs with a KB to provide logical structure among answers. The results have shown that the model can achieve promising performance on the competitive WebQuestions benchmark.</div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Qaembedding_table2.JPG&diff=26089File:Qaembedding table2.JPG2015-11-10T06:11:50Z<p>Trttse: </p>
<hr />
<div></div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=question_Answering_with_Subgraph_Embeddings&diff=26088question Answering with Subgraph Embeddings2015-11-10T06:10:49Z<p>Trttse: /* Task Definition */</p>
<hr />
<div>==Introduction==<br />
Teaching machines are you answer questions automatically in a natural language has been a long standing goal in AI. There has been a rise in large scale structured knowledge bases (KBs), such as Freebase <ref name=one>K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. [http://arxiv.org/pdf/1406.3676v3.pdf "Freebase: a collaboratively created graph database for structuring human knowledge."] In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008.</ref>, to tackle the problem known as open-domain question answers (or open QA). However, the scale and difficulty for machines to interpret natural language still makes this problem challenging.<br />
<br />
open QA techniques can be classified into two main categories:<br />
*Information retrieval based: retrieve a broad set of answers be first query the API of the KBs then narrow down the answer using heuristics<ref name=two>O. Kolomiyets and M.-F. Moens. [https://lirias.kuleuven.be/bitstream/123456789/313539/1/KolomiyetsMoensIS2011.pdf "A survey on question answering technology from an information retrieval perspective."] Information Sciences, 181(24):5412–5434, 2011.</ref><ref>C. Unger, L. B¨uhmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano [http://liris.cnrs.fr/~pchampin/enseignement/semweb/_static/articles/unger_2012.pdf "Template-based Question Answering over RDF Data"] In Proceedings of the 21st international conference on World Wide Web, 2012.</ref><ref name=four>X. Yao and B. Van Durme. [http://cs.jhu.edu/~xuchen/paper/yao-jacana-freebase-acl2014.pdf "Information extraction over structured data: Question answering with freebase."] In Proceedings of the 52nd Annual Meeting of the ACL, 2014.</ref>.<br />
*Semantic parsing based: focus on the correct interpretation of the query. Querying the interpreted question from the KB should return the correct answer <ref>J. Berant, A. Chou, R. Frostig, and P. Liang. [http://cs.stanford.edu/~pliang/papers/freebase-emnlp2013.pdf "Semantic parsing on Freebase from<br />
question-answer pairs."] . In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref name=six>T. Kwiatkowski, E. Choi, Y. Artzi, and L. Zettlemoyer. [http://yoavartzi.com/pub/kcaz-emnlp.2013.pdf "Scaling Semantic Parsers with On-the-fly Ontology Matching."] In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref>J. Berant and P. Liang. [http://cs.stanford.edu/~pliang/papers/paraphrasing-acl2014.pdf "Semantic parsing via paraphrasing."] In Proceedings of the 52nd Annual Meeting of the ACL, 2014.</ref><ref>A. Fader, L. Zettlemoyer, and O. Etzioni. [https://homes.cs.washington.edu/~lsz/papers/fze-kdd14.pdf "Open Question Answering Over Curated and Extracted Knowledge Bases."] In Proceedings of KDD’14. ACM, 2014.</ref><br />
<br />
Both of these approaches require negligible interventions (hand-craft lexicons, grammars and KB schemas) to be effective.<br />
<br />
<ref name=five>A. Bordes, J. Weston, and N. Usunier. [http://arxiv.org/pdf/1404.4326v1.pdf "Open question answering with weakly supervised embedding models."] In Proceedings of ECML-PKDD’14. Springer, 2014.</ref> proposed a vectorial feature representation model to this problem. The goal of this paper is to provide an improved model of <ref name=five/> specifically with the contributions of:<br />
*A more sophisticated inference procedure that is more efficient and can consider longer paths.<br />
*A richer representation of of the answers which encodes the question-answer path and surround subgraph of the KB.<br />
<br />
==Task Definition==<br />
Motivation is to provide a system for open QA able to be trained as long as:<br />
*A training set of questions paired with answers.<br />
*A KB providing a structure among answers.<br />
<br />
WebQuestions <ref name=one/> was used for evaluation benchmark. WebQuestions only contains a few samples, so it was not possible to train the system on only this dataset. The following describes the data sources used for training.<br />
*WebQuestions: the dataset built using Freebase as the KB and contains 5810 question-answer pairs. It was created by crawling questions through the Google Suggest API and then obtaining answers using Amazon Mechanical Turk (Turkers was allowed to only use Freebase as the querying tool).<br />
*Freebase: is a huge database of general facts that are organized in triplets (<code>subject</code>, <code>type1.type2.predicate</code>, <code>object</code>). The form of the data from Freebase does not correspond to a structure found in natural language and so the questions were converted using the following format: "What is the <code>predicate</code> of the <code>type2 subject</code>"? Note that all data from Freebase will have a fixed format and this is not realistic (in terms of a NL). <br />
*ClubWeb Extractions: The team also used ClueWeb extractions as per <ref name=one/> and <ref name=ten>T. Lin, O. Etzioni, et al. [http://aiweb.cs.washington.edu/research/projects/aiweb/media/papers/elaws.pdf "Entity Linking at Web Scale."] In Proceedings of the Joint<br />
Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 84–88. Association for Computational Linguistics, 2012.</ref>. ClueWeb has the format (<code>subject</code>, "text string", <code>object</code>) and it was ensured that both the <code>subject</code> and <code>object</code> was linked to Freebase. These triples were also converted into questions using simple patters and Freebase types.<br />
*Paraphrases: automatically generated sentences have a rigid format and semi-automatic wording which does not provide a satisfactory modelling of natural language. To overcome this, the team made supplemented their data with paraphrases collected from WikiAnswers. Users on WikiAnswers can tag sentences as a rephrasing of each other: <ref name=six/> harvest 2M distinct questions from WikiAnswers which were grouped into 350k paraphrase clusters.<br />
<br />
Table 2 shows some examples sentences from each dataset category.<br />
<br />
[[File:qaembedding_table2.JPG | center]]<br />
<br />
==Embedding Questions and Answers==<br />
We wish to train our model such that representations of questions and their corresponding answers are close to each other in the joint embedding space. Let ''q'' denote a question and ''a'' denote an answer. Learning embeddings is achieved by learning a score function ''S''(''q'', ''a'') so that ''S'' generates a high score if ''a'' is the correct answer to ''q'', and a low score otherwise.<br />
<br />
:<math> S(q, a) = f(q)^\mathrm{T} g(a) \,</math><br />
<br />
<br />
Let <math>\mathbf{W}</math> be a matrix of <math>\mathbb{R}^{k \times N}</math>, where ''k'' is the dimension of the embedding space and ''N'' is the dictionary of embedding to be learned. The function <math>f(\cdot)</math> which maps the questions into the embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>f(q) = \mathbf{W}\phi(q)</math>, where <math>\phi(q) \in \mathbb{N}^N</math>, is a sparse vector indicating the number of times each word appears in the question ''q''. Likewise, the function <math>g(\cdot)</math> which maps the answers to the same embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>g(a) = \mathbf{W}\psi(a)</math>, where <math>\psi(a) \in \mathbb{N}^N</math>, is a sparse vector representation of the answer ''a''. Figure 1 below depicts the subgraph embedding model.<br />
<br />
[[File:embedding_model.JPG | center]]<br />
<br />
<br />
===Representing Candidate Answers===<br />
Let us now consider possible feature representations for a single candidate answer. We consider three different representations corresponding to different subgraphs of Freebase around it.<br />
<br />
:(i) Single Entity: The answer is represented as a single entity from Freebase. <math>\psi(a)</math> is a 1-of-<math>N_S</math> coded vector with 1 corresponding to the entity of the answer, and 0 elsewhere.<br />
:(ii) Path Representation: The answer is represented as a path from the entity in the question to the answer entity. Only 1- or 2-hops paths were considered in the experiments which resulted in a <math>\psi(a)</math> which is 3-of-<math>N_S</math> or 4-of-<math>N_S</math>.<br />
:(iii) Subgraph Representation: We encode both the path representation from (ii), and the entire subgraph of entities that connect to the answer entity.<br />
<br />
The hypothesis is that the more information that we include about the answer in its representation space, the better the results, and hence, we adopted the subgraph approach.<br />
<br />
===Training and Loss Function===<br />
The model was trained using a margin-based ranking loss function. Let <math>D = {(q_i, a_i) : i = 1,..., |D|}</math> be the training set of questions <math>q_i</math> paired with their correct answer <math>a_i</math>. The loss function we minimize is<br />
<br />
:<math>\sum_{i \mathop =1}^{|D|} \sum_{\overline{a} \in \overline{A} (a_i)} max\{0,m - S(q_i, a_i) + S(q_i, \overline{a})\},</math><br />
<br />
where ''m'' is the margin (fixed to 0.1). Minimizing the loss function learns the embedding matrix <math>\mathbf W</math> so the score of a question paired with a correct answer is greater than any incorrect answer <math>\overline{a}</math> by at least ''m''.<br />
<br />
===Multitask Training of Embeddings===<br />
Since many of the questions in the training cases were synthetically created, they do not adequately cover the range of syntax used in natural language. Hence, we also multi-task the training of our model with task of phrase prediction. We do this by alternating the training of ''S'' with another scoring function defined as <math>S_{prp}(q_1, q_2) = f(q_1)^\mathrm{T} f(q_2)</math> which uses the same embedding matrix <math>\mathbf{W}</math> and makes the same embeddings of a pair of questions if they are similar to each other if they are paraphrases and make them different otherwise.<br />
<br />
===Inference===<br />
Once <math>\mathbf{W}</math> is trained, at test time, for a given question ''q'' the model predicts the answer with:<br />
<br />
:<math>\hat{a} = argmax_{a^' \in A(q)} S(q, a')</math><br />
<br />
where <math>A(q)</math> is the candidate answer set. For speed and precision issues, we create a candidate set <math>A(q)</math> for each question.<br />
<br />
<math>A(q)</math> is first populated with all triples involving this entity in Freebase. This allows us to answer simple questions which are directly related to the answer. Let us denote this strategy as <math>C_1</math>.<br />
<br />
A system that answer only such questions would be limited so we also consider 2-hops candidates. 1-hop candidates are weighted by 1.5. This strategy denoted <math>C_2</math> is used by default.<br />
<br />
==Experiments==<br />
Table 3 below indicates that their approach outperformed <ref name=four/>, <ref name=one/> and <ref name=five/>, and performs similarly as <ref name=two/>.<br />
<br />
[[File:table3.JPG | center]]<br />
<br />
==Conclusion==<br />
This paper presents an embedding model that learns to perform open QA through training data of question and answer pairs with a KB to provide logical structure among answers. The results have shown that the model can achieve promising performance on the competitive WebQuestions benchmark.</div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=question_Answering_with_Subgraph_Embeddings&diff=26087question Answering with Subgraph Embeddings2015-11-10T06:06:13Z<p>Trttse: /* Experiments */</p>
<hr />
<div>==Introduction==<br />
Teaching machines are you answer questions automatically in a natural language has been a long standing goal in AI. There has been a rise in large scale structured knowledge bases (KBs), such as Freebase <ref name=one>K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. [http://arxiv.org/pdf/1406.3676v3.pdf "Freebase: a collaboratively created graph database for structuring human knowledge."] In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008.</ref>, to tackle the problem known as open-domain question answers (or open QA). However, the scale and difficulty for machines to interpret natural language still makes this problem challenging.<br />
<br />
open QA techniques can be classified into two main categories:<br />
*Information retrieval based: retrieve a broad set of answers be first query the API of the KBs then narrow down the answer using heuristics<ref name=two>O. Kolomiyets and M.-F. Moens. [https://lirias.kuleuven.be/bitstream/123456789/313539/1/KolomiyetsMoensIS2011.pdf "A survey on question answering technology from an information retrieval perspective."] Information Sciences, 181(24):5412–5434, 2011.</ref><ref>C. Unger, L. B¨uhmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano [http://liris.cnrs.fr/~pchampin/enseignement/semweb/_static/articles/unger_2012.pdf "Template-based Question Answering over RDF Data"] In Proceedings of the 21st international conference on World Wide Web, 2012.</ref><ref name=four>X. Yao and B. Van Durme. [http://cs.jhu.edu/~xuchen/paper/yao-jacana-freebase-acl2014.pdf "Information extraction over structured data: Question answering with freebase."] In Proceedings of the 52nd Annual Meeting of the ACL, 2014.</ref>.<br />
*Semantic parsing based: focus on the correct interpretation of the query. Querying the interpreted question from the KB should return the correct answer <ref>J. Berant, A. Chou, R. Frostig, and P. Liang. [http://cs.stanford.edu/~pliang/papers/freebase-emnlp2013.pdf "Semantic parsing on Freebase from<br />
question-answer pairs."] . In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref name=six>T. Kwiatkowski, E. Choi, Y. Artzi, and L. Zettlemoyer. [http://yoavartzi.com/pub/kcaz-emnlp.2013.pdf "Scaling Semantic Parsers with On-the-fly Ontology Matching."] In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref>J. Berant and P. Liang. [http://cs.stanford.edu/~pliang/papers/paraphrasing-acl2014.pdf "Semantic parsing via paraphrasing."] In Proceedings of the 52nd Annual Meeting of the ACL, 2014.</ref><ref>A. Fader, L. Zettlemoyer, and O. Etzioni. [https://homes.cs.washington.edu/~lsz/papers/fze-kdd14.pdf "Open Question Answering Over Curated and Extracted Knowledge Bases."] In Proceedings of KDD’14. ACM, 2014.</ref><br />
<br />
Both of these approaches require negligible interventions (hand-craft lexicons, grammars and KB schemas) to be effective.<br />
<br />
<ref name=five>A. Bordes, J. Weston, and N. Usunier. [http://arxiv.org/pdf/1404.4326v1.pdf "Open question answering with weakly supervised embedding models."] In Proceedings of ECML-PKDD’14. Springer, 2014.</ref> proposed a vectorial feature representation model to this problem. The goal of this paper is to provide an improved model of <ref name=five/> specifically with the contributions of:<br />
*A more sophisticated inference procedure that is more efficient and can consider longer paths.<br />
*A richer representation of of the answers which encodes the question-answer path and surround subgraph of the KB.<br />
<br />
==Task Definition==<br />
Motivation is to provide a system for open QA able to be trained as long as:<br />
*A training set of questions paired with answers.<br />
*A KB providing a structure among answers.<br />
<br />
WebQuestions <ref name=one/> was used for evaluation benchmark. WebQuestions only contains a few samples, so it was not possible to train the system on only this dataset. The following describes the data sources used for training.<br />
*WebQuestions: the dataset built using Freebase as the KB and contains 5810 question-answer pairs. It was created by crawling questions through the Google Suggest API and then obtaining answers using Amazon Mechanical Turk (Turkers was allowed to only use Freebase as the querying tool).<br />
*Freebase: is a huge database of general facts that are organized in triplets (<code>subject</code>, <code>type1.type2.predicate</code>, <code>object</code>). The form of the data from Freebase does not correspond to a structure found in natural language and so the questions were converted using the following format: "What is the <code>predicate</code> of the <code>type2 subject</code>"? Note that all data from Freebase will have a fixed format and this is not realistic (in terms of a NL). <br />
*ClubWeb Extractions: The team also used ClueWeb extractions as per <ref name=one/> and <ref name=ten>T. Lin, O. Etzioni, et al. [http://aiweb.cs.washington.edu/research/projects/aiweb/media/papers/elaws.pdf "Entity Linking at Web Scale."] In Proceedings of the Joint<br />
Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 84–88. Association for Computational Linguistics, 2012.</ref>. ClueWeb has the format (<code>subject</code>, "text string", <code>object</code>) and it was ensured that both the <code>subject</code> and <code>object</code> was linked to Freebase. These triples were also converted into questions using simple patters and Freebase types.<br />
*Paraphrases: automatically generated sentences have a rigid format and semi-automatic wording which does not provide a satisfactory modelling of natural language. To overcome this, the team made supplemented their data with paraphrases collected from WikiAnswers. Users on WikiAnswers can tag sentences as a rephrasing of each other: <ref name=six/> harvest 2M distinct questions from WikiAnswers which were grouped into 350k paraphrase clusters.<br />
<br />
Table 2 shows some examples sentences from each dataset category.<br />
<br />
[[File:table2.JPG | center]]<br />
<br />
==Embedding Questions and Answers==<br />
We wish to train our model such that representations of questions and their corresponding answers are close to each other in the joint embedding space. Let ''q'' denote a question and ''a'' denote an answer. Learning embeddings is achieved by learning a score function ''S''(''q'', ''a'') so that ''S'' generates a high score if ''a'' is the correct answer to ''q'', and a low score otherwise.<br />
<br />
:<math> S(q, a) = f(q)^\mathrm{T} g(a) \,</math><br />
<br />
<br />
Let <math>\mathbf{W}</math> be a matrix of <math>\mathbb{R}^{k \times N}</math>, where ''k'' is the dimension of the embedding space and ''N'' is the dictionary of embedding to be learned. The function <math>f(\cdot)</math> which maps the questions into the embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>f(q) = \mathbf{W}\phi(q)</math>, where <math>\phi(q) \in \mathbb{N}^N</math>, is a sparse vector indicating the number of times each word appears in the question ''q''. Likewise, the function <math>g(\cdot)</math> which maps the answers to the same embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>g(a) = \mathbf{W}\psi(a)</math>, where <math>\psi(a) \in \mathbb{N}^N</math>, is a sparse vector representation of the answer ''a''. Figure 1 below depicts the subgraph embedding model.<br />
<br />
[[File:embedding_model.JPG | center]]<br />
<br />
<br />
===Representing Candidate Answers===<br />
Let us now consider possible feature representations for a single candidate answer. We consider three different representations corresponding to different subgraphs of Freebase around it.<br />
<br />
:(i) Single Entity: The answer is represented as a single entity from Freebase. <math>\psi(a)</math> is a 1-of-<math>N_S</math> coded vector with 1 corresponding to the entity of the answer, and 0 elsewhere.<br />
:(ii) Path Representation: The answer is represented as a path from the entity in the question to the answer entity. Only 1- or 2-hops paths were considered in the experiments which resulted in a <math>\psi(a)</math> which is 3-of-<math>N_S</math> or 4-of-<math>N_S</math>.<br />
:(iii) Subgraph Representation: We encode both the path representation from (ii), and the entire subgraph of entities that connect to the answer entity.<br />
<br />
The hypothesis is that the more information that we include about the answer in its representation space, the better the results, and hence, we adopted the subgraph approach.<br />
<br />
===Training and Loss Function===<br />
The model was trained using a margin-based ranking loss function. Let <math>D = {(q_i, a_i) : i = 1,..., |D|}</math> be the training set of questions <math>q_i</math> paired with their correct answer <math>a_i</math>. The loss function we minimize is<br />
<br />
:<math>\sum_{i \mathop =1}^{|D|} \sum_{\overline{a} \in \overline{A} (a_i)} max\{0,m - S(q_i, a_i) + S(q_i, \overline{a})\},</math><br />
<br />
where ''m'' is the margin (fixed to 0.1) and <math>\overline{A}(a_i)</math> is a set of incorrect answers. Minimizing the loss function learns the embedding matrix <math>\mathbf W</math> so that the score of a question paired with its correct answer is greater than its score with any incorrect answer <math>\overline{a}</math> by at least ''m''.<br />
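A direct transcription of this hinge loss for one training pair, given the scores (a sketch; in practice the loss is minimized over <math>\mathbf{W}</math> with stochastic gradient descent):

```python
def margin_ranking_loss(s_correct, s_negatives, m=0.1):
    """Sum over incorrect answers of max{0, m - S(q, a) + S(q, abar)}."""
    return sum(max(0.0, m - s_correct + s_neg) for s_neg in s_negatives)

# Zero loss once the correct answer beats every negative by at least m:
print(margin_ranking_loss(1.0, [0.5, 0.8]))    # 0.0
# Otherwise each violating negative contributes m - gap:
print(margin_ranking_loss(1.0, [0.95, 0.99]))
```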
<br />
===Multitask Training of Embeddings===<br />
Since many of the questions in the training set were synthetically created, they do not adequately cover the range of syntax used in natural language. Hence, we also multi-task the training of our model with the task of paraphrase prediction. We do this by alternating the training of ''S'' with that of another scoring function, defined as <math>S_{prp}(q_1, q_2) = f(q_1)^\mathrm{T} f(q_2)</math>, which uses the same embedding matrix <math>\mathbf{W}</math> and pushes the embeddings of a pair of questions to be similar if they are paraphrases of each other, and dissimilar otherwise.<br />
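The paraphrase score reuses the question mapping <math>f</math> on both sides; a minimal sketch with toy dimensions and a random stand-in for the shared matrix <math>\mathbf{W}</math>:

```python
import numpy as np

k, N = 4, 6                      # toy embedding and dictionary sizes
rng = np.random.default_rng(1)
W = rng.normal(size=(k, N))      # shared embedding matrix (learned in practice)

def f(bow):                      # bow: count vector over the N words
    return W @ bow

def S_prp(bow1, bow2):
    """S_prp(q1, q2) = f(q1)^T f(q2): both sides use the same f, hence W."""
    return f(bow1) @ f(bow2)

q1 = np.zeros(N); q1[0] = 1      # stand-in for a question
q2 = np.zeros(N); q2[1] = 1      # stand-in for its paraphrase
print(S_prp(q1, q2) == S_prp(q2, q1))  # True: the score is symmetric
```

Because both arguments go through the same mapping, a gradient step on a paraphrase pair updates the same rows of <math>\mathbf{W}</math> used by the QA score ''S''.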
<br />
===Inference===<br />
Once <math>\mathbf{W}</math> is trained, at test time, for a given question ''q'' the model predicts the answer with:<br />
<br />
:<math>\hat{a} = \operatorname{argmax}_{a' \in A(q)} S(q, a')</math><br />
<br />
where <math>A(q)</math> is the candidate answer set. For reasons of speed and precision, we restrict scoring to a candidate set <math>A(q)</math> built for each question.<br />
<br />
<math>A(q)</math> is first populated with all Freebase triples involving the entity identified in the question. This allows us to answer simple questions whose answer is directly connected to that entity. Let us denote this strategy as <math>C_1</math>.<br />
<br />
A system that answers only such questions would be limited, so we also consider 2-hop candidates. 1-hop candidates are weighted by 1.5. This strategy, denoted <math>C_2</math>, is used by default.<br />
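Putting the inference rule and the <math>C_2</math> weighting together (a sketch; the scoring function and candidate tuples below are hypothetical placeholders):

```python
def predict(question, candidates, S, one_hop_weight=1.5):
    """argmax over A(q): each candidate is an (answer, n_hops) pair;
    1-hop candidates get their score multiplied by 1.5, per strategy C2."""
    best_answer, best_score = None, float("-inf")
    for answer, n_hops in candidates:
        weight = one_hop_weight if n_hops == 1 else 1.0
        score = weight * S(question, answer)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer

S_toy = lambda q, a: {"paris": 0.4, "lyon": 0.5}[a]
# 1-hop "paris" wins: 1.5 * 0.4 = 0.6 beats 2-hop "lyon" at 0.5.
print(predict("q", [("paris", 1), ("lyon", 2)], S_toy))
```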
<br />
==Experiments==<br />
Table 3 below indicates that their approach outperformed <ref name=four/>, <ref name=one/> and <ref name=five/>, and performs similarly to <ref name=two/>.<br />
<br />
[[File:table3.JPG | center]]<br />
<br />
==Conclusion==<br />
This paper presents an embedding model that learns to perform open QA from training data of question-answer pairs, with a KB providing a logical structure among answers. The results show that the model achieves promising performance on the competitive WebQuestions benchmark.</div>
<hr />
<div>==Introduction==<br />
Teaching machines to answer questions automatically in natural language has been a long-standing goal in AI. There has been a rise in large-scale structured knowledge bases (KBs), such as Freebase <ref name=one>K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. [http://arxiv.org/pdf/1406.3676v3.pdf "Freebase: a collaboratively created graph database for structuring human knowledge."] In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008.</ref>, to tackle the problem known as open-domain question answering (open QA). However, the scale of KBs and the difficulty machines have in interpreting natural language still make this problem challenging.<br />
<br />
Open QA techniques can be classified into two main categories:<br />
*Information retrieval based: retrieve a broad set of candidate answers by first querying the API of the KB, then narrow down the answers using heuristics<ref name=two>O. Kolomiyets and M.-F. Moens. [https://lirias.kuleuven.be/bitstream/123456789/313539/1/KolomiyetsMoensIS2011.pdf "A survey on question answering technology from an information retrieval perspective."] Information Sciences, 181(24):5412–5434, 2011.</ref><ref>C. Unger, L. B¨uhmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano [http://liris.cnrs.fr/~pchampin/enseignement/semweb/_static/articles/unger_2012.pdf "Template-based Question Answering over RDF Data"] In Proceedings of the 21st international conference on World Wide Web, 2012.</ref><ref>X. Yao and B. Van Durme. [http://cs.jhu.edu/~xuchen/paper/yao-jacana-freebase-acl2014.pdf "Information extraction over structured data: Question answering with freebase."] In Proceedings of the 52nd Annual Meeting of the ACL, 2014.</ref>.<br />
*Semantic parsing based: focus on the correct interpretation of the query; querying the KB with the interpreted question should return the correct answer <ref>J. Berant, A. Chou, R. Frostig, and P. Liang. [http://cs.stanford.edu/~pliang/papers/freebase-emnlp2013.pdf "Semantic parsing on Freebase from question-answer pairs."] In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref name=six>T. Kwiatkowski, E. Choi, Y. Artzi, and L. Zettlemoyer. [http://yoavartzi.com/pub/kcaz-emnlp.2013.pdf "Scaling Semantic Parsers with On-the-fly Ontology Matching."] In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref>J. Berant and P. Liang. [http://cs.stanford.edu/~pliang/papers/paraphrasing-acl2014.pdf "Semantic parsing via paraphrasing."] In Proceedings of the 52nd Annual Meeting of the ACL, 2014.</ref><ref>A. Fader, L. Zettlemoyer, and O. Etzioni. [https://homes.cs.washington.edu/~lsz/papers/fze-kdd14.pdf "Open Question Answering Over Curated and Extracted Knowledge Bases."] In Proceedings of KDD’14. ACM, 2014.</ref><br />
<br />
Both of these approaches require non-negligible human intervention (hand-crafted lexicons, grammars, and KB schemas) to be effective.<br />
<br />
<ref name=five>A. Bordes, J. Weston, and N. Usunier. [http://arxiv.org/pdf/1404.4326v1.pdf "Open question answering with weakly supervised embedding models."] In Proceedings of ECML-PKDD’14. Springer, 2014.</ref> proposed a vectorial feature representation model for this problem. The goal of this paper is to improve on the model of <ref name=five/>, specifically with the following contributions:<br />
*A more sophisticated inference procedure that is more efficient and can consider longer paths.<br />
*A richer representation of the answers which encodes the question-answer path and the surrounding subgraph of the KB.<br />
<br />
==Task Definition==<br />
The motivation is to provide a system for open QA that can be trained as long as the following are available:<br />
*A training set of questions paired with answers.<br />
*A KB providing a structure among answers.<br />
<br />
WebQuestions <ref name=one/> was used as the evaluation benchmark. WebQuestions contains relatively few samples, so it was not possible to train the system on this dataset alone. The following describes the data sources used for training.<br />
*WebQuestions: this dataset was built using Freebase as the KB and contains 5810 question-answer pairs. It was created by crawling questions through the Google Suggest API and then obtaining answers using Amazon Mechanical Turk (Turkers were only allowed to use Freebase as the querying tool).<br />
*Freebase: a huge database of general facts organized as triples (<code>subject</code>, <code>type1.type2.predicate</code>, <code>object</code>). Since this form does not correspond to a structure found in natural language, the triples were converted into questions using the following template: "What is the <code>predicate</code> of the <code>type2</code> <code>subject</code>?" Note that all data generated from Freebase has this fixed format, which is not realistic in terms of natural language.<br />
*ClueWeb Extractions: The team also used ClueWeb extractions as per <ref name=one/> and <ref name=ten>T. Lin, O. Etzioni, et al. [http://aiweb.cs.washington.edu/research/projects/aiweb/media/papers/elaws.pdf "Entity Linking at Web Scale."] In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 84–88. Association for Computational Linguistics, 2012.</ref>. ClueWeb triples have the format (<code>subject</code>, "text string", <code>object</code>), and it was ensured that both the <code>subject</code> and <code>object</code> were linked to Freebase. These triples were also converted into questions using simple patterns and Freebase types.<br />
*Paraphrases: automatically generated questions have a rigid format and semi-automatic wording, which does not satisfactorily model natural language. To overcome this, the team supplemented their data with paraphrases collected from WikiAnswers, where users can tag pairs of questions as rephrasings of each other: <ref name=six/> harvested 2M distinct questions from WikiAnswers, grouped into 350k paraphrase clusters.<br />
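The triple-to-question conversion described for the Freebase and ClueWeb sources can be sketched as follows (the entity and relation names are hypothetical examples, not actual Freebase identifiers):

```python
def triple_to_question(subject, predicate_path, obj):
    """Turn a (subject, type1.type2.predicate, object) triple into the fixed
    template 'What is the <predicate> of the <type2> <subject>?'."""
    _type1, type2, predicate = predicate_path.split(".")
    question = (f"what is the {predicate.replace('_', ' ')} "
                f"of the {type2} {subject.replace('_', ' ')}?")
    return question, obj

q, a = triple_to_question("blade_runner", "film.film.directed_by", "ridley_scott")
print(q)  # what is the directed by of the film blade runner?
```

The rigid output of this template is precisely why the paraphrase data above is needed as a complement.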
<br />
Table 2 shows some example sentences from each dataset category.<br />
<br />
[[File:table2.JPG | center]]<br />
<br />
==Embedding Questions and Answers==<br />
We wish to train our model such that representations of questions and their corresponding answers are close to each other in the joint embedding space. Let ''q'' denote a question and ''a'' denote an answer. Learning embeddings is achieved by learning a score function ''S''(''q'', ''a'') so that ''S'' generates a high score if ''a'' is the correct answer to ''q'', and a low score otherwise.<br />
<br />
:<math> S(q, a) = f(q)^\mathrm{T} g(a) \,</math><br />
<br />
<br />
Let <math>\mathbf{W}</math> be a matrix in <math>\mathbb{R}^{k \times N}</math>, where ''k'' is the dimension of the embedding space and ''N'' is the size of the dictionary of embeddings to be learned. The function <math>f(\cdot)</math>, which maps questions into the embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>f(q) = \mathbf{W}\phi(q)</math>, where <math>\phi(q) \in \mathbb{N}^N</math> is a sparse vector indicating the number of times each word appears in the question ''q''. Likewise, the function <math>g(\cdot)</math>, which maps answers to the same embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>g(a) = \mathbf{W}\psi(a)</math>, where <math>\psi(a) \in \mathbb{N}^N</math> is a sparse vector representation of the answer ''a''. Figure 1 below depicts the subgraph embedding model.<br />
<br />
[[File:embedding_model.JPG | center]]<br />
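The mappings above can be sketched with a toy bag-of-words vocabulary (the vocabulary contents and dimensions are illustrative, not from the paper):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint dictionary of question words and KB symbols (contents illustrative).
vocab = {"who": 0, "directed": 1, "avatar": 2,
         "james_cameron": 3, "film.film.directed_by": 4}
N = len(vocab)
k = 8                                    # embedding dimension
W = rng.normal(scale=0.1, size=(k, N))   # the single shared embedding matrix W

def counts(tokens):
    """Sparse count vector in N^N (serves as both phi(q) and psi(a))."""
    v = np.zeros(N)
    for t in tokens:
        v[vocab[t]] += 1
    return v

def f(question_tokens):                  # f(q) = W phi(q)
    return W @ counts(question_tokens)

def g(answer_symbols):                   # g(a) = W psi(a)
    return W @ counts(answer_symbols)

def S(question_tokens, answer_symbols):  # S(q, a) = f(q)^T g(a)
    return float(f(question_tokens) @ g(answer_symbols))

score = S(["who", "directed", "avatar"],
          ["james_cameron", "film.film.directed_by", "avatar"])
```

Because ''W'' is shared between <math>f</math> and <math>g</math>, words and KB symbols live in the same ''k''-dimensional space.<br />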
<br />
<br />
===Representing Candidate Answers===<br />
Let us now consider possible feature representations for a single candidate answer. We consider three different representations corresponding to different subgraphs of Freebase around it.<br />
<br />
:(i) Single Entity: The answer is represented as a single entity from Freebase. <math>\psi(a)</math> is a 1-of-<math>N_S</math> coded vector with 1 corresponding to the entity of the answer, and 0 elsewhere.<br />
:(ii) Path Representation: The answer is represented as a path from the entity mentioned in the question to the answer entity. Only 1- or 2-hop paths were considered in the experiments, resulting in a <math>\psi(a)</math> which is 3-of-<math>N_S</math> or 4-of-<math>N_S</math> coded (the two entities plus the one or two relation types along the path).<br />
:(iii) Subgraph Representation: We encode both the path representation from (ii), and the entire subgraph of entities that connect to the answer entity.<br />
<br />
The hypothesis is that the more information that we include about the answer in its representation space, the better the results, and hence, we adopted the subgraph approach.<br />
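The three encodings can be illustrated as multi-hot vectors over a toy symbol dictionary (the symbol names are hypothetical):<br />

```python
import numpy as np

# Toy dictionary of N_S Freebase symbols: entities and relation types (illustrative).
symbols = {"avatar": 0, "film.film.directed_by": 1, "james_cameron": 2,
           "people.person.nationality": 3, "canada": 4}
N_S = len(symbols)

def multi_hot(names):
    v = np.zeros(N_S)
    for name in names:
        v[symbols[name]] = 1.0
    return v

# (i) single entity: 1-of-N_S coding of the answer entity alone
psi_entity = multi_hot(["james_cameron"])

# (ii) path: question entity, relation, answer entity (3-of-N_S for a 1-hop path)
psi_path = multi_hot(["avatar", "film.film.directed_by", "james_cameron"])

# (iii) subgraph: the path plus every symbol connected to the answer entity
psi_subgraph = multi_hot(["avatar", "film.film.directed_by", "james_cameron",
                          "people.person.nationality", "canada"])
```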
<br />
===Training and Loss Function===<br />
The model was trained using a margin-based ranking loss function. Let <math>D = \{(q_i, a_i) : i = 1, \ldots, |D|\}</math> be the training set of questions <math>q_i</math> paired with their correct answers <math>a_i</math>. The loss function we minimize is<br />
<br />
:<math>\sum_{i=1}^{|D|} \sum_{\overline{a} \in \overline{A}(a_i)} \max\{0,\, m - S(q_i, a_i) + S(q_i, \overline{a})\},</math><br />
<br />
where ''m'' is the margin (fixed to 0.1) and <math>\overline{A}(a_i)</math> is a set of incorrect answers associated with <math>a_i</math>. Minimizing this loss learns the embedding matrix <math>\mathbf W</math> so that the score of a question paired with its correct answer is greater than its score with any incorrect answer <math>\overline{a}</math> by at least ''m''.<br />
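As a toy sketch of training on this loss (only the margin ''m'' = 0.1 comes from the paper; the dimensions, learning rate, and single-negative SGD scheme are our assumptions), one subgradient step can be written as:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
k, N = 4, 6                       # toy embedding dimension and dictionary size
W = rng.normal(scale=0.1, size=(k, N))
m, lr = 0.1, 0.05                 # margin from the paper; learning rate is ours

def score(W, phi_q, psi_a):       # S(q, a) = f(q)^T g(a) = (W phi)^T (W psi)
    return (W @ phi_q) @ (W @ psi_a)

def hinge(W, phi_q, psi_pos, psi_neg):
    return max(0.0, m - score(W, phi_q, psi_pos) + score(W, phi_q, psi_neg))

def sgd_step(W, phi_q, psi_pos, psi_neg):
    """One subgradient step on max{0, m - S(q,a) + S(q,a_bar)} for a single
    (question, correct answer, incorrect answer) triple."""
    if hinge(W, phi_q, psi_pos, psi_neg) == 0.0:
        return W                  # correct answer already wins by the margin
    d = psi_neg - psi_pos
    # gradient of S = phi^T W^T W psi with respect to W is W (psi phi^T + phi psi^T)
    grad = W @ (np.outer(d, phi_q) + np.outer(phi_q, d))
    return W - lr * grad

phi_q   = np.eye(N)[0]            # toy one-hot question features
psi_pos = np.eye(N)[1]            # correct answer representation
psi_neg = np.eye(N)[2]            # a sampled incorrect answer

for _ in range(500):
    W = sgd_step(W, phi_q, psi_pos, psi_neg)
```

Updates stop once the correct answer outscores the negative by the margin, at which point the hinge term is zero.<br />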
<br />
===Multitask Training of Embeddings===<br />
Since many of the questions in the training set were synthetically created, they do not adequately cover the range of syntax used in natural language. Hence, we also multi-task the training of our model with the task of paraphrase prediction. We do this by alternating the training of ''S'' with another scoring function, <math>S_{prp}(q_1, q_2) = f(q_1)^\mathrm{T} f(q_2)</math>, which uses the same embedding matrix <math>\mathbf{W}</math> and makes the embeddings of a pair of questions similar to each other if they are paraphrases, and dissimilar otherwise.<br />
<br />
===Inference===<br />
Once <math>\mathbf{W}</math> is trained, at test time, for a given question ''q'' the model predicts the answer with:<br />
<br />
:<math>\hat{a} = \operatorname{argmax}_{a' \in A(q)}\, S(q, a')</math><br />
<br />
where <math>A(q)</math> is the candidate answer set. For both speed and precision, we construct a candidate set <math>A(q)</math> for each question rather than ranking every entity in the KB.<br />
<br />
The candidate set <math>A(q)</math> is first populated using all triples that involve the Freebase entity identified in the question. This allows us to answer simple questions whose answer is directly connected to the question entity. Let us denote this strategy as <math>C_1</math>.<br />
<br />
A system that answers only such questions would be limited, so we also consider 2-hop candidates. 1-hop candidates are given a weight of 1.5. This strategy, denoted <math>C_2</math>, is used by default.<br />
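A minimal sketch of the <math>C_1</math>/<math>C_2</math> candidate generation and the argmax inference, assuming a toy KB stored as an adjacency list (the KB contents and the way the 1.5 weight enters the final ranking are our simplifications):<br />

```python
# Toy KB as adjacency: entity -> list of (relation, neighbour) pairs (illustrative).
kb = {
    "avatar": [("directed_by", "james_cameron")],
    "james_cameron": [("nationality", "canada"), ("directed", "titanic")],
}

def candidates(question_entity, strategy="C2", hop1_weight=1.5):
    """Generate weighted candidate answers around the question entity.

    C1: entities one hop away.  C2: additionally include 2-hop entities,
    with 1-hop candidates weighted by 1.5."""
    cands = {}
    for rel, e1 in kb.get(question_entity, []):
        cands[e1] = hop1_weight if strategy == "C2" else 1.0
        if strategy == "C2":
            for rel2, e2 in kb.get(e1, []):
                cands.setdefault(e2, 1.0)
    return cands

def answer(question_entity, score_fn, strategy="C2"):
    """a_hat = argmax over the candidate set of weight * S(q, a)."""
    cands = candidates(question_entity, strategy)
    return max(cands, key=lambda a: cands[a] * score_fn(a))
```

With a constant scoring function, the 1.5 weighting makes the system prefer the directly connected ("james_cameron") answer over 2-hop candidates.<br />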
<br />
==Experiments==<br />
Table 3 below indicates that their approach outperformed [14], <ref name=one/> and <ref name=five/>, and performs similarly to <ref name=two/>.<br />
<br />
[[File:table3.JPG | center]]<br />
<br />
==Conclusion==<br />
This paper presents an embedding model that learns to perform open QA from training data of question-answer pairs, with a KB providing logical structure among answers. The results show that the model achieves promising performance on the competitive WebQuestions benchmark.</div>
<hr />
<div>==Introduction==<br />
Teaching machines are you answer questions automatically in a natural language has been a long standing goal in AI. There has been a rise in large scale structured knowledge bases (KBs), such as Freebase <ref name=one>K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. [http://arxiv.org/pdf/1406.3676v3.pdf "Freebase: a collaboratively created graph database for structuring human knowledge."] In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008.</ref>, to tackle the problem known as open-domain question answers (or open QA). However, the scale and difficulty for machines to interpret natural language still makes this problem challenging.<br />
<br />
open QA techniques can be classified into two main categories:<br />
*Information retrieval based: retrieve a broad set of answers be first query the API of the KBs then narrow down the answer using heuristics<ref>O. Kolomiyets and M.-F. Moens. [https://lirias.kuleuven.be/bitstream/123456789/313539/1/KolomiyetsMoensIS2011.pdf "A survey on question answering technology from an information retrieval perspective."] Information Sciences, 181(24):5412–5434, 2011.</ref><ref>C. Unger, L. B¨uhmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano [http://liris.cnrs.fr/~pchampin/enseignement/semweb/_static/articles/unger_2012.pdf "Template-based Question Answering over RDF Data"] In Proceedings of the 21st international conference on World Wide Web, 2012.</ref><ref>X. Yao and B. Van Durme. [http://cs.jhu.edu/~xuchen/paper/yao-jacana-freebase-acl2014.pdf "Information extraction over structured data: Question answering with freebase."] In Proceedings of the 52nd Annual Meeting of the ACL, 2014.</ref>.<br />
*Semantic parsing based: focus on the correct interpretation of the query. Querying the interpreted question from the KB should return the correct answer <ref>J. Berant, A. Chou, R. Frostig, and P. Liang. [http://cs.stanford.edu/~pliang/papers/freebase-emnlp2013.pdf "Semantic parsing on Freebase from<br />
question-answer pairs."] . In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref name=six>T. Kwiatkowski, E. Choi, Y. Artzi, and L. Zettlemoyer. [http://yoavartzi.com/pub/kcaz-emnlp.2013.pdf "Scaling Semantic Parsers with On-the-fly Ontology Matching."] In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref>J. Berant and P. Liang. [http://cs.stanford.edu/~pliang/papers/paraphrasing-acl2014.pdf "Semantic parsing via paraphrasing."] In Proceedings of the 52nd Annual Meeting of the ACL, 2014.</ref><ref>A. Fader, L. Zettlemoyer, and O. Etzioni. [https://homes.cs.washington.edu/~lsz/papers/fze-kdd14.pdf "Open Question Answering Over Curated and Extracted Knowledge Bases."] In Proceedings of KDD’14. ACM, 2014.</ref><br />
<br />
Both of these approaches require negligible interventions (hand-craft lexicons, grammars and KB schemas) to be effective.<br />
<br />
<ref name=five>A. Bordes, J. Weston, and N. Usunier. [http://arxiv.org/pdf/1404.4326v1.pdf "Open question answering with weakly supervised embedding models."] In Proceedings of ECML-PKDD’14. Springer, 2014.</ref> proposed a vectorial feature representation model to this problem. The goal of this paper is to provide an improved model of <ref name=five/> specifically with the contributions of:<br />
*A more sophisticated inference procedure that is more efficient and can consider longer paths.<br />
*A richer representation of the answers, which encodes the question-answer path and the surrounding subgraph of the KB.<br />
<br />
==Task Definition==<br />
The motivation is to provide a system for open QA that can be trained as long as the following are available:<br />
*A training set of questions paired with answers.<br />
*A KB providing a structure among answers.<br />
<br />
WebQuestions <ref name=one/> was used as the evaluation benchmark. Since WebQuestions contains only a few thousand samples, it was not possible to train the system on this dataset alone. The following describes the data sources used for training.<br />
*WebQuestions: this dataset was built using Freebase as the KB and contains 5,810 question-answer pairs. It was created by crawling questions through the Google Suggest API and then obtaining answers using Amazon Mechanical Turk (Turkers were allowed to use only Freebase as the querying tool).<br />
*Freebase: a huge database of general facts organized as triples (<code>subject</code>, <code>type1.type2.predicate</code>, <code>object</code>). Since the form of the data in Freebase does not correspond to a structure found in natural language, the triples were converted into questions of the form "What is the <code>predicate</code> of the <code>type2</code> <code>subject</code>?". Note that all questions generated from Freebase share this fixed format, which is not realistic natural language.<br />
*ClueWeb Extractions: the team also used ClueWeb extractions as per <ref name=one/> and [10]. ClueWeb triples have the format (<code>subject</code>, "text string", <code>object</code>), and it was ensured that both the <code>subject</code> and the <code>object</code> were linked to Freebase. These triples were also converted into questions using simple patterns and Freebase types.<br />
*Paraphrases: automatically generated sentences have a rigid format and semi-automatic wording, which does not provide satisfactory modelling of natural language. To overcome this, the team supplemented their data with paraphrases collected from WikiAnswers, where users can tag pairs of questions as rephrasings of each other: [6] harvested 2M distinct questions from WikiAnswers, which were grouped into 350k paraphrase clusters.<br />
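The Freebase triple-to-question conversion described above can be sketched as follows. The helper name and the example triple are hypothetical; only the output template comes from the text.

```python
# Sketch of the template-based conversion of a Freebase triple
# (subject, type1.type2.predicate, object) into a question-answer pair.
# The triple below is a hypothetical example.
def triple_to_question(subject, predicate_path, obj):
    type1, type2, predicate = predicate_path.split(".")
    question = f"What is the {predicate} of the {type2} {subject}?"
    return question, obj  # the object becomes the answer

q, a = triple_to_question("barack_obama", "people.person.place_of_birth", "honolulu")
# q == "What is the place_of_birth of the person barack_obama?", a == "honolulu"
```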
<br />
Table 2 shows some example sentences from each dataset category.<br />
<br />
[[File:table2.JPG | center]]<br />
<br />
==Embedding Questions and Answers==<br />
We wish to train our model such that representations of questions and their corresponding answers are close to each other in the joint embedding space. Let ''q'' denote a question and ''a'' denote an answer. Learning embeddings is achieved by learning a score function ''S''(''q'', ''a'') so that ''S'' generates a high score if ''a'' is the correct answer to ''q'', and a low score otherwise.<br />
<br />
:<math> S(q, a) = f(q)^\mathrm{T} g(a) \,</math><br />
<br />
<br />
Let <math>\mathbf{W}</math> be a matrix in <math>\mathbb{R}^{k \times N}</math>, where ''k'' is the dimension of the embedding space and ''N'' is the size of the dictionary of embeddings to be learned. The function <math>f(\cdot)</math>, which maps the questions into the embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>f(q) = \mathbf{W}\phi(q)</math>, where <math>\phi(q) \in \mathbb{N}^N</math> is a sparse vector indicating the number of times each word appears in the question ''q''. Likewise, the function <math>g(\cdot)</math>, which maps the answers into the same embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>g(a) = \mathbf{W}\psi(a)</math>, where <math>\psi(a) \in \mathbb{N}^N</math> is a sparse vector representation of the answer ''a''. Figure 1 below depicts the subgraph embedding model.<br />
<br />
[[File:embedding_model.JPG | center]]<br />
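As an illustration, the embedding and scoring computation above can be sketched in NumPy. The dimensions ''k'' and ''N'', the random <code>W</code>, and the symbol ids are assumptions for the sake of the example, not the paper's actual configuration.

```python
import numpy as np

# Illustrative sketch of the scoring function S(q, a) = f(q)^T g(a).
k, N = 64, 1000
rng = np.random.default_rng(0)
W = rng.normal(size=(k, N))          # embedding matrix, one column per dictionary symbol

def bag_of_symbols(ids, n=N):
    """Sparse count vector phi(q) / psi(a): how often each symbol occurs."""
    v = np.zeros(n)
    for i in ids:
        v[i] += 1
    return v

def score(q_ids, a_ids):
    f_q = W @ bag_of_symbols(q_ids)  # f(q) = W * phi(q)
    g_a = W @ bag_of_symbols(a_ids)  # g(a) = W * psi(a)
    return float(f_q @ g_a)          # S(q, a) = f(q)^T g(a)

s = score([1, 5, 7], [42])           # hypothetical word/entity ids
```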
<br />
<br />
===Representing Candidate Answers===<br />
Let us now consider possible feature representations for a single candidate answer. We consider three different representations, corresponding to different subgraphs of Freebase around the candidate.<br />
<br />
:(i) Single Entity: The answer is represented as a single entity from Freebase. <math>\psi(a)</math> is a 1-of-<math>N_S</math> coded vector with 1 corresponding to the entity of the answer, and 0 elsewhere.<br />
:(ii) Path Representation: The answer is represented as a path from the entity mentioned in the question to the answer entity. Only 1- or 2-hop paths were considered in the experiments, which results in a <math>\psi(a)</math> that is 3-of-<math>N_S</math> or 4-of-<math>N_S</math> coded.<br />
:(iii) Subgraph Representation: We encode both the path representation from (ii), and the entire subgraph of entities that connect to the answer entity.<br />
<br />
The hypothesis is that the more information about the answer we include in its representation, the better the results; hence, the subgraph approach was adopted.<br />
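A minimal sketch of the three representations over a toy dictionary of <math>N_S</math> symbols; all indices below are hypothetical.

```python
import numpy as np

# Sketch of the three candidate-answer representations psi(a) described above.
N_S = 10

def k_of_n(indices, n=N_S):
    """Binary vector with a 1 at each given symbol index."""
    v = np.zeros(n)
    v[list(indices)] = 1.0
    return v

answer = 3                              # index of the answer entity
path = [0, 7, 3]                        # question entity, relation, answer (1 hop)
subgraph = [2, 8, 9]                    # entities/relations around the answer

psi_single = k_of_n([answer])           # (i)  1-of-N_S
psi_path = k_of_n(path)                 # (ii) 3-of-N_S for a 1-hop path
psi_subgraph = k_of_n(path + subgraph)  # (iii) path plus surrounding subgraph
```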
<br />
===Training and Loss Function===<br />
The model was trained using a margin-based ranking loss function. Let <math>D = \{(q_i, a_i) : i = 1, \ldots, |D|\}</math> be the training set of questions <math>q_i</math> paired with their correct answer <math>a_i</math>. The loss function we minimize is<br />
<br />
:<math>\sum_{i=1}^{|D|} \sum_{\overline{a} \in \overline{A}(a_i)} \max\{0,\, m - S(q_i, a_i) + S(q_i, \overline{a})\},</math><br />
<br />
where ''m'' is the margin (fixed to 0.1) and <math>\overline{A}(a_i)</math> is a set of incorrect answers sampled for <math>q_i</math>. Minimizing this loss learns the embedding matrix <math>\mathbf W</math> so that the score of a question paired with its correct answer is greater than its score with any sampled incorrect answer <math>\overline{a}</math> by at least ''m''.<br />
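The loss above can be sketched as follows; the stand-in score function and question/answer labels are hypothetical.

```python
# Sketch of the margin-based ranking loss.
m = 0.1  # margin, fixed to 0.1 as in the paper

def ranking_loss(S, pairs, negatives):
    """Sum over pairs (q_i, a_i) and sampled incorrect answers a_bar of
    max(0, m - S(q_i, a_i) + S(q_i, a_bar))."""
    total = 0.0
    for (q, a), negs in zip(pairs, negatives):
        for a_bar in negs:
            total += max(0.0, m - S(q, a) + S(q, a_bar))
    return total

# Hypothetical scores: correct answer 1.0, incorrect 0.5, so each negative
# contributes max(0, 0.1 - 1.0 + 0.5) = 0 and the loss is zero.
S = lambda q, a: 1.0 if a == "correct" else 0.5
loss = ranking_loss(S, [("q1", "correct")], [["wrong1", "wrong2"]])
```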
<br />
===Multitask Training of Embeddings===<br />
Since many of the questions in the training set were synthetically created, they do not adequately cover the range of syntax used in natural language. Hence, we also multi-task the training of our model with the task of paraphrase prediction. We do this by alternating the training of ''S'' with another scoring function, defined as <math>S_{prp}(q_1, q_2) = f(q_1)^\mathrm{T} f(q_2)</math>, which uses the same embedding matrix <math>\mathbf{W}</math> and is trained to give a pair of questions similar embeddings if they are paraphrases of each other and dissimilar embeddings otherwise.<br />
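A sketch of the paraphrase score sharing the embedding matrix with the QA score; the dimensions, the stand-in <code>W</code>, and the bag-of-words vectors are illustrative assumptions.

```python
import numpy as np

# Sketch of the paraphrase scoring function S_prp(q1, q2) = f(q1)^T f(q2),
# reusing the same embedding matrix W as the question-answer score.
k, N = 4, 6
W = np.arange(k * N, dtype=float).reshape(k, N) / 10.0  # stand-in for a learned W

def f(phi_q):
    return W @ phi_q                      # f(q) = W * phi(q)

def s_prp(phi_q1, phi_q2):
    return float(f(phi_q1) @ f(phi_q2))   # S_prp(q1, q2) = f(q1)^T f(q2)

phi_q1 = np.array([1, 0, 1, 0, 0, 0], dtype=float)  # bag of words of question 1
phi_q2 = np.array([0, 1, 1, 0, 0, 0], dtype=float)  # bag of words of a paraphrase
score_pair = s_prp(phi_q1, phi_q2)
```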
<br />
===Inference===<br />
Once <math>\mathbf{W}</math> is trained, at test time, for a given question ''q'' the model predicts the answer with:<br />
<br />
:<math>\hat{a} = \operatorname{argmax}_{a' \in A(q)} S(q, a')</math><br />
<br />
where <math>A(q)</math> is the candidate answer set. For reasons of speed and precision, we create a candidate set <math>A(q)</math> for each question rather than scoring every entity in Freebase.<br />
<br />
<math>A(q)</math> is first populated with all triples in Freebase involving the entity identified in the question. This allows us to answer simple questions whose answer is directly connected to that entity. Let us denote this strategy as <math>C_1</math>.<br />
<br />
A system that answers only such questions would be limited, so we also consider 2-hop candidates, with the scores of 1-hop candidates weighted by 1.5. This strategy, denoted <math>C_2</math>, is used by default.<br />
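One reading of this candidate-generation and weighting scheme can be sketched as follows; the score function and the candidate list are hypothetical.

```python
# Sketch of inference with the C_2 candidate strategy: among candidates
# in A(q), scores of 1-hop candidates are weighted by 1.5.
def predict(q, candidates, S):
    """candidates: list of (answer, n_hops); returns the highest-scoring answer."""
    def weighted_score(item):
        a, hops = item
        weight = 1.5 if hops == 1 else 1.0
        return weight * S(q, a)
    return max(candidates, key=weighted_score)[0]

S = lambda q, a: {"paris": 0.8, "france": 0.9}.get(a, 0.0)
best = predict("hypothetical question", [("paris", 1), ("france", 2)], S)
# "paris" wins: 0.8 * 1.5 = 1.2 beats 0.9
```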
<br />
==Experiments==<br />
Table 3 below indicates that the approach outperformed [14], [1] and [5], and performs similarly to [2].<br />
<br />
[[File:table3.JPG | center]]<br />
<br />
==Conclusion==<br />
This paper presents an embedding model that learns to perform open QA from training data of question-answer pairs, with a KB providing logical structure among the answers. The results show that the model achieves promising performance on the competitive WebQuestions benchmark.</div>
<hr />
<div>==Introduction==<br />
Teaching machines to answer questions automatically in natural language has been a long-standing goal in AI. There has been a rise in large-scale structured knowledge bases (KBs), such as Freebase <ref>K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. [http://arxiv.org/pdf/1406.3676v3.pdf "Freebase: a collaboratively created graph database for structuring human knowledge."] In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008.</ref>, to tackle the problem known as open-domain question answering (or open QA). However, the scale of KBs and the difficulty for machines to interpret natural language still make this problem challenging.<br />
<br />
open QA techniques can be classified into two main categories:<br />
*Information retrieval based: retrieve a broad set of answers be first query the API of the KBs then narrow down the answer using heuristics<ref>O. Kolomiyets and M.-F. Moens. [https://lirias.kuleuven.be/bitstream/123456789/313539/1/KolomiyetsMoensIS2011.pdf "A survey on question answering technology from an information retrieval perspective."] Information Sciences, 181(24):5412–5434, 2011.</ref><ref>C. Unger, L. B¨uhmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano [http://liris.cnrs.fr/~pchampin/enseignement/semweb/_static/articles/unger_2012.pdf "Template-based Question Answering over RDF Data"] In Proceedings of the 21st international conference on World Wide Web, 2012.</ref><ref>X. Yao and B. Van Durme. [http://cs.jhu.edu/~xuchen/paper/yao-jacana-freebase-acl2014.pdf "Information extraction over structured data: Question answering with freebase."] In Proceedings of the 52nd Annual Meeting of the ACL, 2014.</ref>.<br />
*Semantic parsing based: focus on the correct interpretation of the query; querying the KB with the correctly interpreted question should return the correct answer <ref>J. Berant, A. Chou, R. Frostig, and P. Liang. [http://cs.stanford.edu/~pliang/papers/freebase-emnlp2013.pdf "Semantic parsing on Freebase from question-answer pairs."] In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref>T. Kwiatkowski, E. Choi, Y. Artzi, and L. Zettlemoyer. [http://yoavartzi.com/pub/kcaz-emnlp.2013.pdf "Scaling Semantic Parsers with On-the-fly Ontology Matching."] In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref>J. Berant and P. Liang. [http://cs.stanford.edu/~pliang/papers/paraphrasing-acl2014.pdf "Semantic parsing via paraphrasing."] In Proceedings of the 52nd Annual Meeting of the ACL, 2014.</ref><ref>A. Fader, L. Zettlemoyer, and O. Etzioni. [https://homes.cs.washington.edu/~lsz/papers/fze-kdd14.pdf "Open Question Answering Over Curated and Extracted Knowledge Bases."] In Proceedings of KDD’14. ACM, 2014.</ref><br />
<br />
Both of these approaches require non-negligible human intervention (hand-crafted lexicons, grammars, and KB schemas) to be effective.<br />
<br />
<ref name=five>A. Bordes, J. Weston, and N. Usunier. [http://arxiv.org/pdf/1404.4326v1.pdf "Open question answering with weakly supervised embedding models."] In Proceedings of ECML-PKDD’14. Springer, 2014.</ref> proposed a vectorial feature representation model for this problem. The goal of this paper is to provide an improved model over <ref name=five/>, specifically with the following contributions:<br />
*A more sophisticated inference procedure that is more efficient and can consider longer paths.<br />
*A richer representation of the answers which encodes the question-answer path and the surrounding subgraph of the KB.<br />
<br />
==Task Definition==<br />
The motivation is to provide a system for open QA that can be trained as long as it has access to:<br />
*A training set of questions paired with answers.<br />
*A KB providing a structure among answers.<br />
<br />
WebQuestions [1] was used as the evaluation benchmark. Since WebQuestions contains relatively few samples, it was not possible to train the system on this dataset alone. The following data sources were used for training.<br />
*WebQuestions: this dataset was built using Freebase as the KB and contains 5,810 question-answer pairs. It was created by crawling questions through the Google Suggest API and then obtaining answers using Amazon Mechanical Turk (Turkers were allowed to use only Freebase as the querying tool).<br />
*Freebase: a huge database of general facts organized as triplets (<code>subject</code>, <code>type1.type2.predicate</code>, <code>object</code>). Since the form of the data in Freebase does not correspond to structures found in natural language, each triple was converted into a question of the format "What is the <code>predicate</code> of the <code>type2</code> <code>subject</code>?". Note that all questions generated from Freebase share this fixed format, which is not realistic natural language.<br />
*ClueWeb Extractions: the team also used ClueWeb extractions as per [1] and [10]. ClueWeb triples have the format (<code>subject</code>, "text string", <code>object</code>), and it was ensured that both the <code>subject</code> and <code>object</code> were linked to Freebase. These triples were also converted into questions using simple patterns and Freebase types.<br />
*Paraphrases: the automatically generated questions above have a rigid format and semi-automatic wording, which does not satisfactorily model natural language. To overcome this, the team supplemented their data with paraphrases collected from WikiAnswers, where users can tag pairs of questions as rephrasings of each other: [6] harvested 2M distinct questions from WikiAnswers, grouped into 350k paraphrase clusters.<br />
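The triple-to-question templating described above can be sketched as follows; the helper name and the example triple are illustrative, not taken from the paper's actual data:

```python
# Hypothetical sketch of the triple-to-question templating; names and
# the example triple are illustrative, not from the paper's datasets.
def freebase_triple_to_qa(subject, predicate_path, obj):
    """(subject, type1.type2.predicate, object) ->
    ("what is the <predicate> of the <type2> <subject>?", object)."""
    type2, predicate = predicate_path.split(".")[-2:]
    question = "what is the {} of the {} {}?".format(
        predicate.replace("_", " "), type2.replace("_", " "), subject)
    return question, obj

q, a = freebase_triple_to_qa("barack obama", "people.person.place_of_birth", "honolulu")
# q -> "what is the place of birth of the person barack obama?", a -> "honolulu"
```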
<br />
Table 2 shows some example sentences from each dataset category.<br />
<br />
[[File:table2.JPG | center]]<br />
<br />
==Embedding Questions and Answers==<br />
We wish to train our model such that representations of questions and their corresponding answers are close to each other in the joint embedding space. Let ''q'' denote a question and ''a'' denote an answer. Learning embeddings is achieved by learning a score function ''S''(''q'', ''a'') so that ''S'' generates a high score if ''a'' is the correct answer to ''q'', and a low score otherwise.<br />
<br />
:<math> S(q, a) = f(q)^\mathrm{T} g(a) \,</math><br />
<br />
<br />
Let <math>\mathbf{W}</math> be a matrix in <math>\mathbb{R}^{k \times N}</math>, where ''k'' is the dimension of the embedding space and ''N'' is the size of the dictionary of embeddings to be learned. The function <math>f(\cdot)</math>, which maps questions into the embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>f(q) = \mathbf{W}\phi(q)</math>, where <math>\phi(q) \in \mathbb{N}^N</math> is a sparse vector indicating the number of times each word appears in the question ''q''. Likewise, the function <math>g(\cdot)</math>, which maps answers into the same embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>g(a) = \mathbf{W}\psi(a)</math>, where <math>\psi(a) \in \mathbb{N}^N</math> is a sparse vector representation of the answer ''a''. Figure 1 below depicts the subgraph embedding model.<br />
<br />
[[File:embedding_model.JPG | center]]<br />
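A minimal numerical sketch of the scoring function <math>S(q, a) = f(q)^\mathrm{T} g(a)</math>, with the sparse count vectors stored densely for clarity; the dictionary size, embedding dimension, and random initialization of <math>\mathbf{W}</math> below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

N, k = 6, 4                      # toy dictionary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(k, N))      # embedding matrix W, one column per dictionary symbol

def phi(word_ids):
    """Count vector phi(q) (or psi(a)) over the N dictionary symbols."""
    v = np.zeros(N)
    for i in word_ids:
        v[i] += 1
    return v

def score(question_ids, answer_ids):
    f_q = W @ phi(question_ids)  # f(q) = W phi(q)
    g_a = W @ phi(answer_ids)    # g(a) = W psi(a)
    return float(f_q @ g_a)      # S(q, a) = f(q)^T g(a)
```

Because the representations are bags of words, the score is invariant to word order within a question.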
<br />
<br />
===Representing Candidate Answers===<br />
Let us now consider possible feature representations for a single candidate answer. We consider three different representations corresponding to different subgraphs of Freebase around it.<br />
<br />
:(i) Single Entity: The answer is represented as a single entity from Freebase. <math>\psi(a)</math> is a 1-of-<math>N_S</math> coded vector with 1 corresponding to the entity of the answer, and 0 elsewhere.<br />
:(ii) Path Representation: The answer is represented as a path from the entity in the question to the answer entity. Only 1- or 2-hop paths were considered in the experiments, resulting in a <math>\psi(a)</math> which is 3-of-<math>N_S</math> or 4-of-<math>N_S</math> coded, respectively.<br />
:(iii) Subgraph Representation: We encode both the path representation from (ii), and the entire subgraph of entities that connect to the answer entity.<br />
<br />
The hypothesis is that the more information that we include about the answer in its representation space, the better the results, and hence, we adopted the subgraph approach.<br />
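The path representation in (ii) can be sketched as a multi-hot vector over the Freebase symbols; the symbol ids and dictionary size below are illustrative:

```python
import numpy as np

N_S = 10                          # toy number of Freebase symbols (entities + relations)

def path_psi(symbol_ids, n_symbols=N_S):
    """Multi-hot psi(a) with a 1 at each symbol on the question-answer path."""
    v = np.zeros(n_symbols)
    v[list(symbol_ids)] = 1.0
    return v

# a 1-hop path (question entity, relation, answer entity) -> a 3-of-N_S vector
psi = path_psi([2, 5, 7])
```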
<br />
===Training and Loss Function===<br />
The model was trained using a margin-based ranking loss function. Let <math>D = \{(q_i, a_i) : i = 1, \ldots, |D|\}</math> be the training set of questions <math>q_i</math> paired with their correct answer <math>a_i</math>. The loss function we minimize is<br />
<br />
:<math>\sum_{i=1}^{|D|} \sum_{\overline{a} \in \overline{A}(a_i)} \max\{0, m - S(q_i, a_i) + S(q_i, \overline{a})\},</math><br />
<br />
where ''m'' is the margin (fixed to 0.1) and <math>\overline{A}(a_i)</math> is a set of incorrect answers sampled for <math>q_i</math>. Minimizing the loss function learns the embedding matrix <math>\mathbf W</math> such that the score of a question paired with its correct answer is greater than its score with any incorrect answer <math>\overline{a}</math> by at least ''m''.<br />
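The loss for a single training pair can be sketched as follows, with plain floats standing in for the model scores <math>S(q, a)</math>:

```python
def ranking_loss(score_correct, scores_incorrect, m=0.1):
    """Margin-based ranking loss for one (q_i, a_i) against sampled incorrect
    answers: each term is max{0, m - S(q_i, a_i) + S(q_i, a_bar)}."""
    return sum(max(0.0, m - score_correct + s) for s in scores_incorrect)
```

With margin m = 0.1, an incorrect answer scoring 0.8 against a correct answer scoring 1.0 contributes nothing, while one scoring 0.95 contributes 0.05.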
<br />
===Multitask Training of Embeddings===<br />
Since many of the questions in the training data were synthetically created, they do not adequately cover the range of syntax used in natural language. Hence, we also multi-task the training of our model with the task of paraphrase prediction. We do this by alternating the training of ''S'' with that of another scoring function <math>S_{prp}(q_1, q_2) = f(q_1)^\mathrm{T} f(q_2)</math>, which uses the same embedding matrix <math>\mathbf{W}</math> and makes the embeddings of a pair of questions similar to each other if they are paraphrases, and different otherwise.<br />
<br />
===Inference===<br />
Once <math>\mathbf{W}</math> is trained, at test time, for a given question ''q'' the model predicts the answer with:<br />
<br />
:<math>\hat{a} = \operatorname{argmax}_{a' \in A(q)} S(q, a')</math><br />
<br />
where <math>A(q)</math> is the candidate answer set. For both speed and precision, <math>A(q)</math> is restricted to a subset of entities rather than the whole KB; it is constructed for each question as follows.<br />
<br />
<math>A(q)</math> is first populated with all answer entities from Freebase triples involving the entity identified in the question. This allows us to answer simple questions whose answer is directly connected to the question entity. Let us denote this strategy as <math>C_1</math>.<br />
<br />
A system that answers only such questions would be limited, so we also consider 2-hop candidates; 1-hop candidates are weighted by 1.5. This strategy, denoted <math>C_2</math>, is used by default.<br />
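The inference step itself is a plain argmax over the candidate set; the toy scoring function below is a hypothetical stand-in for the learned <math>S(q, a)</math>:

```python
def predict(question, candidates, score):
    """Return a_hat = argmax over a in A(q) of S(q, a)."""
    return max(candidates, key=lambda a: score(question, a))

# toy usage with a hypothetical scoring function (not the learned model)
toy_score = lambda q, a: {"paris": 0.9, "lyon": 0.4}[a]
best = predict("where is the eiffel tower?", ["paris", "lyon"], toy_score)  # "paris"
```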
<br />
==Experiments==<br />
Table 3 below indicates that this approach outperformed [14], [1] and [5], and performs similarly to [2].<br />
<br />
[[File:table3.JPG | center]]<br />
<br />
==Conclusion==<br />
This paper presents an embedding model that learns to perform open QA from training data of question-answer pairs, together with a KB providing a logical structure among answers. The results show that the model achieves promising performance on the competitive WebQuestions benchmark.</div>
<hr />
<div>==Introduction==<br />
Teaching machines are you answer questions automatically in a natural language has been a long standing goal in AI. There has been a rise in large scale structured knowledge bases (KBs), such as Freebase <ref>K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. [http://arxiv.org/pdf/1406.3676v3.pdf "Freebase: a collaboratively created graph database for structuring human knowledge."] In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008.</ref>, to tackle the problem known as open-domain question answers (or open QA). However, the scale and difficulty for machines to interpret natural language still makes this problem challenging.<br />
<br />
Open QA techniques can be classified into two main categories:<br />
*Information retrieval based: retrieve a broad set of candidate answers by first querying the search API of the KBs, then narrow down the answers using heuristics<ref>O. Kolomiyets and M.-F. Moens. [https://lirias.kuleuven.be/bitstream/123456789/313539/1/KolomiyetsMoensIS2011.pdf "A survey on question answering technology from an information retrieval perspective."] Information Sciences, 181(24):5412–5434, 2011.</ref><ref>C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano [http://liris.cnrs.fr/~pchampin/enseignement/semweb/_static/articles/unger_2012.pdf "Template-based Question Answering over RDF Data"] In Proceedings of the 21st international conference on World Wide Web, 2012.</ref><ref>X. Yao and B. Van Durme. [http://cs.jhu.edu/~xuchen/paper/yao-jacana-freebase-acl2014.pdf "Information extraction over structured data: Question answering with freebase."] In Proceedings of the 52nd Annual Meeting of the ACL, 2014.</ref>.<br />
*Semantic parsing based: focus on the correct interpretation of the query. Querying the KB with the interpreted question should return the correct answer [1,9,2,7].<br />
<br />
Both of these approaches require non-negligible human intervention (hand-crafted lexicons, grammars and KB schemas) to be effective.<br />
<br />
[5] proposed a vectorial feature representation model for this problem. The goal of this paper is to provide an improved model over [5], specifically with the following contributions:<br />
*A more sophisticated inference procedure that is more efficient and can consider longer paths.<br />
*A richer representation of the answers which encodes the question-answer path and the surrounding subgraph of the KB.<br />
<br />
==Task Definition==<br />
The motivation is to provide a system for open QA that can be trained as long as it has access to:<br />
*A training set of questions paired with answers.<br />
*A KB providing a structure among answers.<br />
<br />
WebQuestions [1] was used as the evaluation benchmark. Since WebQuestions contains relatively few samples, it was not possible to train the system on this dataset alone. The following describes the data sources used for training.<br />
*WebQuestions: this dataset was built using Freebase as the KB and contains 5,810 question-answer pairs. It was created by crawling questions through the Google Suggest API and then obtaining answers using Amazon Mechanical Turk (Turkers were allowed to use only Freebase as the querying tool).<br />
*Freebase: a huge database of general facts organized as triples (<code>subject</code>, <code>type1.type2.predicate</code>, <code>object</code>). Since the form of the Freebase data does not correspond to structures found in natural language, the triples were converted into questions using the following template: "What is the <code>predicate</code> of the <code>type2 subject</code>?" Note that all questions generated from Freebase have this fixed format, which is not realistic in terms of natural language. <br />
*ClueWeb Extractions: The team also used ClueWeb extractions as per [1] and [10]. ClueWeb triples have the format (<code>subject</code>, "text string", <code>object</code>), and it was ensured that both the <code>subject</code> and <code>object</code> were linked to Freebase. These triples were also converted into questions using simple patterns and Freebase types.<br />
*Paraphrases: automatically generated questions have a rigid format and semi-automatic wording, which does not provide a satisfactory model of natural language. To overcome this, the team supplemented their data with paraphrases collected from WikiAnswers, where users can tag sentences as rephrasings of each other: [6] harvested 2M distinct questions from WikiAnswers, grouped into 350k paraphrase clusters.<br />
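As a toy illustration of the templated conversion described above, the sketch below (in Python, with a hypothetical helper name and a made-up example triple, not taken from the actual pipeline) turns a Freebase-style triple into a question-answer pair:<br />

```python
# Hypothetical sketch: convert a Freebase-style triple
# (subject, type1.type2.predicate, object) into the templated question
# "What is the predicate of the type2 subject?". The helper name and the
# example triple are illustrative only.
def triple_to_question(subject, predicate_path, obj):
    _type1, type2, predicate = predicate_path.split(".")
    question = (f"What is the {predicate.replace('_', ' ')} "
                f"of the {type2.replace('_', ' ')} {subject}?")
    return question, obj

q, a = triple_to_question("Barack Obama", "people.person.place_of_birth", "Honolulu")
# q == "What is the place of birth of the person Barack Obama?", a == "Honolulu"
```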
<br />
Table 2 shows some example sentences from each dataset category.<br />
<br />
[[File:table2.JPG | center]]<br />
<br />
==Embedding Questions and Answers==<br />
We wish to train our model such that representations of questions and their corresponding answers are close to each other in the joint embedding space. Let ''q'' denote a question and ''a'' denote an answer. Learning embeddings is achieved by learning a score function ''S''(''q'', ''a'') so that ''S'' generates a high score if ''a'' is the correct answer to ''q'', and a low score otherwise.<br />
<br />
:<math> S(q, a) = f(q)^\mathrm{T} g(a) \,</math><br />
<br />
<br />
Let <math>\mathbf{W}</math> be a matrix in <math>\mathbb{R}^{k \times N}</math>, where ''k'' is the dimension of the embedding space and ''N'' is the size of the dictionary of embeddings to be learned. The function <math>f(\cdot)</math>, which maps the questions into the embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>f(q) = \mathbf{W}\phi(q)</math>, where <math>\phi(q) \in \mathbb{N}^N</math> is a sparse vector indicating the number of times each word appears in the question ''q''. Likewise, the function <math>g(\cdot)</math>, which maps the answers into the same embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>g(a) = \mathbf{W}\psi(a)</math>, where <math>\psi(a) \in \mathbb{N}^N</math> is a sparse vector representation of the answer ''a''. Figure 1 below depicts the subgraph embedding model.<br />
<br />
[[File:embedding_model.JPG | center]]<br />
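A minimal numerical sketch of this setup, assuming toy dimensions and a random embedding matrix (all values below are illustrative, not the paper's):<br />

```python
import numpy as np

# Sketch of S(q, a) = f(q)^T g(a) with f(q) = W phi(q) and g(a) = W psi(a).
# N is the dictionary size, k the embedding dimension; both are toy choices.
N, k = 6, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(k, N))          # embedding matrix, one column per symbol

def score(phi_q, psi_a):
    # project both sparse vectors into R^k, then take the dot product
    return float((W @ phi_q) @ (W @ psi_a))

phi_q = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])   # word counts in the question
psi_a = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0])   # sparse answer representation
s = score(phi_q, psi_a)
```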
<br />
<br />
===Representing Candidate Answers===<br />
Let us now consider possible feature representations for a single candidate answer. We consider three different representations corresponding to different subgraphs of Freebase around it.<br />
<br />
:(i) Single Entity: The answer is represented as a single entity from Freebase. <math>\psi(a)</math> is a 1-of-<math>N_S</math> coded vector with 1 corresponding to the entity of the answer, and 0 elsewhere.<br />
:(ii) Path Representation: The answer is represented as a path from the entity in the question to the answer entity. Only 1- or 2-hop paths were considered in the experiments, resulting in a <math>\psi(a)</math> that is 3-of-<math>N_S</math> or 4-of-<math>N_S</math> coded.<br />
:(iii) Subgraph Representation: We encode both the path representation from (ii), and the entire subgraph of entities that connect to the answer entity.<br />
<br />
The hypothesis is that the more information about the answer we include in its representation, the better the results; hence, the subgraph approach was adopted.<br />
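The three encodings can be sketched as follows; the dictionary size and all entity/relation indices below are hypothetical:<br />

```python
import numpy as np

# Illustrative sketch of the three candidate-answer encodings psi(a).
# N_S is the dictionary size; the indices are made up for the example.
N_S = 10

def encode(indices, n=N_S):
    """m-of-N_S coded vector: 1 at each given dictionary index, 0 elsewhere."""
    v = np.zeros(n)
    v[list(indices)] = 1.0
    return v

answer_entity = 3
path_1hop = [0, 7, 3]            # question entity, relation, answer entity
subgraph = [2, 5, 8]             # entities/relations connected to the answer

psi_single = encode([answer_entity])         # (i)   1-of-N_S
psi_path = encode(path_1hop)                 # (ii)  3-of-N_S for a 1-hop path
psi_subgraph = encode(path_1hop + subgraph)  # (iii) path plus subgraph
```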
<br />
===Training and Loss Function===<br />
The model was trained using a margin-based ranking loss function. Let <math>D = \{(q_i, a_i) : i = 1, \ldots, |D|\}</math> be the training set of questions <math>q_i</math> paired with their correct answer <math>a_i</math>. The loss function we minimize is<br />
<br />
:<math>\sum_{i \mathop =1}^{|D|} \sum_{\overline{a} \in \overline{A} (a_i)} \max\{0,\, m - S(q_i, a_i) + S(q_i, \overline{a})\},</math><br />
<br />
where ''m'' is the margin (fixed to 0.1) and <math>\overline{A}(a_i)</math> is a set of incorrect answers for <math>q_i</math>. Minimizing the loss function learns the embedding matrix <math>\mathbf W</math> so that the score of a question paired with its correct answer is greater than its score with any incorrect answer <math>\overline{a}</math> by at least ''m''.<br />
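For a single question, the inner sum of the loss above can be computed as in this sketch (the scores are made-up values):<br />

```python
# Sketch of the margin-based ranking loss for one question, assuming the
# scores S(q, a) have already been computed; the numbers are illustrative.
def ranking_loss(s_correct, s_incorrect, m=0.1):
    # sum over incorrect answers of max(0, m - S(q, a) + S(q, a_bar))
    return sum(max(0.0, m - s_correct + s) for s in s_incorrect)

loss = ranking_loss(s_correct=2.0, s_incorrect=[1.5, 2.05, 0.3])
# only the incorrect answer scoring 2.05 violates the margin: loss = 0.15
```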
<br />
===Multitask Training of Embeddings===<br />
Since many of the questions in the training set were synthetically created, they do not adequately cover the range of syntax used in natural language. Hence, we also multi-task the training of our model with the task of paraphrase prediction. We do this by alternating the training of ''S'' with that of another scoring function <math>S_{prp}(q_1, q_2) = f(q_1)^\mathrm{T} f(q_2)</math>, which uses the same embedding matrix <math>\mathbf{W}</math> and pushes the embeddings of a pair of questions to be similar if they are paraphrases of each other and dissimilar otherwise.<br />
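A sketch of this paraphrase scoring function sharing the same embedding matrix (the matrix values and dimensions are toy numbers, not learned weights):<br />

```python
import numpy as np

# Sketch of S_prp(q1, q2) = f(q1)^T f(q2), where f reuses the same embedding
# matrix W as the question-answer score S. Values here are toy numbers.
N, k = 6, 4
W = np.arange(N * k, dtype=float).reshape(k, N)

def f(phi):
    return W @ phi

def s_prp(phi_q1, phi_q2):
    return float(f(phi_q1) @ f(phi_q2))

phi = np.eye(N)[0]            # a question containing only dictionary word 0
same = s_prp(phi, phi)        # a question scored against itself
```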
<br />
===Inference===<br />
Once <math>\mathbf{W}</math> is trained, at test time, for a given question ''q'' the model predicts the answer with:<br />
<br />
:<math>\hat{a} = \operatorname{argmax}_{a' \in A(q)} S(q, a')</math><br />
<br />
where <math>A(q)</math> is the set of candidate answers. For both speed and precision reasons, we create a candidate set <math>A(q)</math> for each question rather than scoring every entity in the KB.<br />
<br />
<math>A(q)</math> is first populated with all the entities appearing in Freebase triples that involve the entity identified in the question. This allows us to answer simple questions whose answer is directly connected to that entity. Let us denote this strategy as <math>C_1</math>.<br />
<br />
A system that answers only such questions would be limited, so we also consider 2-hop candidates; 1-hop candidates are weighted by 1.5. This strategy, denoted <math>C_2</math>, is used by default.<br />
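Putting the pieces together, inference over the candidate set could look like this sketch (the scoring function and the candidates are placeholders; the paper's exact weighting details may differ):<br />

```python
# Sketch of inference with the default C_2 strategy: score each candidate,
# weight 1-hop candidates by 1.5, and return the argmax. The toy scoring
# function and candidate list are illustrative placeholders.
def predict(score_fn, candidates):
    """candidates: list of (answer, n_hops) pairs."""
    def weighted(answer, hops):
        weight = 1.5 if hops == 1 else 1.0
        return weight * score_fn(answer)
    return max(candidates, key=lambda c: weighted(*c))[0]

# toy scoring function: longer answer strings score higher
best = predict(lambda a: float(len(a)), [("Honolulu", 1), ("Hawaii", 2)])
# "Honolulu": 1.5 * 8 = 12.0 beats "Hawaii": 1.0 * 6 = 6.0
```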
<br />
==Experiments==<br />
Table 3 below indicates that their approach outperformed [14], [1] and [5], and performs similarly to [2].<br />
<br />
[[File:table3.JPG | center]]<br />
<br />
==Conclusion==<br />
This paper presents an embedding model that learns to perform open QA from training data consisting of question-answer pairs, together with a KB that provides logical structure among the answers. The results show that the model achieves promising performance on the competitive WebQuestions benchmark.</div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=question_Answering_with_Subgraph_Embeddings&diff=26074question Answering with Subgraph Embeddings2015-11-10T05:29:22Z<p>Trttse: /* Introduction */</p>
<hr />
<div>==Introduction==<br />
Teaching machines to answer questions automatically in natural language has been a long-standing goal in AI. There has been a rise in large-scale structured knowledge bases (KBs), such as Freebase [3], to tackle the problem known as open-domain question answering (or open QA). However, the scale of these KBs and the difficulty machines have in interpreting natural language still make this problem challenging.<br />
<br />
Open QA techniques can be classified into two main categories:<br />
*Information retrieval based: retrieve a broad set of candidate answers by first querying the search API of the KBs, then narrow down the answers using heuristics [8,12,14].<br />
*Semantic parsing based: focus on the correct interpretation of the query. Querying the KB with the interpreted question should return the correct answer [1,9,2,7].<br />
<br />
Both of these approaches require non-negligible human intervention (hand-crafted lexicons, grammars and KB schemas) to be effective.<br />
<br />
[5] proposed a vectorial feature representation model for this problem. The goal of this paper is to provide an improved model over [5], specifically with the following contributions:<br />
*A more sophisticated inference procedure that is more efficient and can consider longer paths.<br />
*A richer representation of the answers which encodes the question-answer path and the surrounding subgraph of the KB.<br />
<br />
==Task Definition==<br />
The motivation is to provide a system for open QA that can be trained as long as it has access to:<br />
*A training set of questions paired with answers.<br />
*A KB providing a structure among answers.<br />
<br />
WebQuestions [1] was used as the evaluation benchmark. Since WebQuestions contains relatively few samples, it was not possible to train the system on this dataset alone. The following describes the data sources used for training.<br />
*WebQuestions: this dataset was built using Freebase as the KB and contains 5,810 question-answer pairs. It was created by crawling questions through the Google Suggest API and then obtaining answers using Amazon Mechanical Turk (Turkers were allowed to use only Freebase as the querying tool).<br />
*Freebase: a huge database of general facts organized as triples (<code>subject</code>, <code>type1.type2.predicate</code>, <code>object</code>). Since the form of the Freebase data does not correspond to structures found in natural language, the triples were converted into questions using the following template: "What is the <code>predicate</code> of the <code>type2 subject</code>?" Note that all questions generated from Freebase have this fixed format, which is not realistic in terms of natural language. <br />
*ClueWeb Extractions: The team also used ClueWeb extractions as per [1] and [10]. ClueWeb triples have the format (<code>subject</code>, "text string", <code>object</code>), and it was ensured that both the <code>subject</code> and <code>object</code> were linked to Freebase. These triples were also converted into questions using simple patterns and Freebase types.<br />
*Paraphrases: automatically generated questions have a rigid format and semi-automatic wording, which does not provide a satisfactory model of natural language. To overcome this, the team supplemented their data with paraphrases collected from WikiAnswers, where users can tag sentences as rephrasings of each other: [6] harvested 2M distinct questions from WikiAnswers, grouped into 350k paraphrase clusters.<br />
<br />
Table 2 shows some example sentences from each dataset category.<br />
<br />
[[File:table2.JPG | center]]<br />
<br />
==Embedding Questions and Answers==<br />
We wish to train our model such that representations of questions and their corresponding answers are close to each other in the joint embedding space. Let ''q'' denote a question and ''a'' denote an answer. Learning embeddings is achieved by learning a score function ''S''(''q'', ''a'') so that ''S'' generates a high score if ''a'' is the correct answer to ''q'', and a low score otherwise.<br />
<br />
:<math> S(q, a) = f(q)^\mathrm{T} g(a) \,</math><br />
<br />
<br />
Let <math>\mathbf{W}</math> be a matrix in <math>\mathbb{R}^{k \times N}</math>, where ''k'' is the dimension of the embedding space and ''N'' is the size of the dictionary of embeddings to be learned. The function <math>f(\cdot)</math>, which maps questions into the embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>f(q) = \mathbf{W}\phi(q)</math>, where <math>\phi(q) \in \mathbb{N}^N</math> is a sparse vector indicating the number of times each word appears in the question ''q''. Likewise, the function <math>g(\cdot)</math>, which maps answers to the same embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>g(a) = \mathbf{W}\psi(a)</math>, where <math>\psi(a) \in \mathbb{N}^N</math> is a sparse vector representation of the answer ''a''. Figure 1 below depicts the subgraph embedding model.<br />
<br />
[[File:embedding_model.JPG | center]]<br />
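As a minimal sketch of this scoring setup (dimensions, symbol ids, and the random initialization of <math>\mathbf{W}</math> are toy values, not the paper's settings):<br />

```python
import numpy as np

# Toy dimensions; the dictionary in the paper covers words, entities,
# and relation types (k = 8, N = 50 here are illustrative only).
k, N = 8, 50
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(k, N))  # shared embedding matrix

def bag_of_symbols(indices, n=N):
    """phi(q) / psi(a): sparse count vector over the dictionary."""
    v = np.zeros(n)
    for i in indices:
        v[i] += 1
    return v

def score(q_indices, a_indices):
    """S(q, a) = f(q)^T g(a), with f(q) = W phi(q) and g(a) = W psi(a)."""
    f_q = W @ bag_of_symbols(q_indices)
    g_a = W @ bag_of_symbols(a_indices)
    return float(f_q @ g_a)

s = score([3, 17, 42], [42, 7])  # symbol ids are made up
```

Because <math>f</math> and <math>g</math> are linear in the count vectors, repeating a word simply doubles its contribution to the embedding.<br />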
<br />
<br />
===Representing Candidate Answers===<br />
Let us now consider possible feature representations for a single candidate answer. We consider three representations corresponding to different subgraphs of Freebase around the candidate answer.<br />
<br />
:(i) Single Entity: The answer is represented as a single entity from Freebase. <math>\psi(a)</math> is a 1-of-<math>N_S</math> coded vector with a 1 at the position corresponding to the answer entity, and 0 elsewhere.<br />
:(ii) Path Representation: The answer is represented as a path from the entity in the question to the answer entity. Only 1- or 2-hop paths were considered in the experiments, resulting in a <math>\psi(a)</math> that is 3-of-<math>N_S</math> or 4-of-<math>N_S</math> coded.<br />
:(iii) Subgraph Representation: Both the path representation from (ii) and the entire subgraph of entities connected to the answer entity are encoded.<br />
<br />
The hypothesis is that the more information about the answer we include in its representation, the better the results; hence, the subgraph approach was adopted.<br />
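The three representations can be sketched as progressively richer multi-hot vectors. All ids and the dictionary size below are invented, and the sketch folds path and subgraph features into one vector for brevity (the paper keeps them distinguishable):<br />

```python
import numpy as np

N_S = 30  # size of the answer-side dictionary (toy value)

def psi_single(answer_id):
    """(i) 1-of-N_S: only the answer entity is marked."""
    v = np.zeros(N_S)
    v[answer_id] = 1.0
    return v

def psi_path(question_entity, relations, answer_id):
    """(ii) path: question entity, relation(s), and answer entity are
    all marked (3-of-N_S for 1 hop, 4-of-N_S for 2 hops)."""
    v = psi_single(answer_id)
    v[question_entity] = 1.0
    for r in relations:
        v[r] = 1.0
    return v

def psi_subgraph(question_entity, relations, answer_id, neighbours):
    """(iii) subgraph: the path plus every entity connected to the
    answer entity."""
    v = psi_path(question_entity, relations, answer_id)
    for e in neighbours:
        v[e] = 1.0
    return v

v = psi_path(question_entity=2, relations=[10], answer_id=5)  # a 1-hop path
```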
<br />
===Training and Loss Function===<br />
The model was trained using a margin-based ranking loss function. Let <math>D = \{(q_i, a_i) : i = 1, \ldots, |D|\}</math> be the training set of questions <math>q_i</math> paired with their correct answers <math>a_i</math>. The loss function we minimize is<br />
<br />
:<math>\sum_{i=1}^{|D|} \sum_{\overline{a} \in \overline{A}(a_i)} \max\{0, m - S(q_i, a_i) + S(q_i, \overline{a})\},</math><br />
<br />
where ''m'' is the margin (fixed to 0.1). Minimizing this loss learns an embedding matrix <math>\mathbf W</math> such that the score of a question paired with its correct answer is greater than its score with any incorrect answer <math>\overline{a}</math> by at least ''m''.<br />
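The loss for a single training pair can be written out directly; the scores below are made-up numbers chosen to exercise both branches of the hinge:<br />

```python
def ranking_loss(s_pos, s_negs, m=0.1):
    """Margin-based ranking loss for one (q, a) pair: sum over sampled
    negatives a_bar of max(0, m - S(q, a) + S(q, a_bar))."""
    return float(sum(max(0.0, m - s_pos + s_neg) for s_neg in s_negs))

# Correct answer beats every negative by more than m = 0.1 -> zero loss.
zero = ranking_loss(2.0, [0.5, 1.0])

# Here the second negative (1.2) outscores the correct answer (1.0) and
# contributes 0.1 - 1.0 + 1.2 = 0.3; the first (0.85) is outside the margin.
loss = ranking_loss(1.0, [0.85, 1.2])
```

In practice the gradient of this hinge only flows through margin-violating negatives, which is what drives correct answers above incorrect ones.<br />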
<br />
===Multitask Training of Embeddings===<br />
Since many of the training questions were synthetically created, they do not adequately cover the range of syntax used in natural language. Hence, we also multi-task the training of our model with the task of paraphrase prediction. We do this by alternating the training of ''S'' with that of another scoring function, <math>S_{prp}(q_1, q_2) = f(q_1)^\mathrm{T} f(q_2)</math>, which uses the same embedding matrix <math>\mathbf{W}</math> and pushes the embeddings of a pair of questions close together if they are paraphrases of each other, and apart otherwise.<br />
<br />
===Inference===<br />
Once <math>\mathbf{W}</math> is trained, at test time, for a given question ''q'' the model predicts the answer with:<br />
<br />
:<math>\hat{a} = \operatorname{argmax}_{a' \in A(q)} S(q, a')</math><br />
<br />
where <math>A(q)</math> is the candidate answer set. For reasons of both speed and precision, a candidate set <math>A(q)</math> is constructed for each question rather than scoring every entity in the KB.<br />
<br />
<math>A(q)</math> is first populated with all triples in Freebase involving the entity identified in the question. This allows us to answer simple questions whose answer is directly connected to the question entity. Let us denote this strategy as <math>C_1</math>.<br />
<br />
A system that answers only such questions would be limited, so we also consider 2-hop candidates; 1-hop candidates are weighted by 1.5. This strategy, denoted <math>C_2</math>, is used by default.<br />
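One way to read the <math>C_2</math> strategy is as a weighted argmax over the candidate set. The multiplicative weighting below is an interpretation (the summary does not spell out how the 1.5 factor is applied), and all scores are invented:<br />

```python
def predict(cand_scores, cand_hops, one_hop_weight=1.5):
    """Return the index of the best candidate in A(q) under C_2:
    1-hop candidates (hops == 1) are up-weighted relative to 2-hop
    candidates before taking the argmax."""
    best_i, best = None, float("-inf")
    for i, (s, h) in enumerate(zip(cand_scores, cand_hops)):
        ws = s * (one_hop_weight if h == 1 else 1.0)
        if ws > best:
            best_i, best = i, ws
    return best_i

# A 1-hop candidate with raw score 1.0 beats a 2-hop one at 1.4,
# because 1.5 * 1.0 = 1.5 > 1.4.
best = predict([1.0, 1.4], [1, 2])
```

The weighting encodes a prior that answers directly connected to the question entity are more likely correct than those two hops away.<br />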
<br />
==Experiments==<br />
Table 3 below indicates that this approach outperformed [14], [1] and [5], and performed similarly to [2].<br />
<br />
[[File:table3.JPG | center]]<br />
<br />
==Conclusion==</div>
<hr />
<div>==Introduction==<br />
Teaching machines are you answer questions automatically in a natural language has been a long standing goal in AI. There has been a rise in large scale structured knowledge bases (KBs), such as Freebase [3], to tackle the problem known as open-domain question answers (or open QA). However, the scale and difficulty for machines to interpret natural language still makes this problem challenging.<br />
<br />
open QA techniques can be classified into two main categories:<br />
*Information retrieval based: retrieve a broad set of answers be first query the API of the KBs then narrow down the answer using heuristics [8,12,14].<br />
*Semantic parsing based: focus on the correct interpretation of the query. Querying the interpreted question from the KB should return the correct answer [1,9,2,7].<br />
<br />
Both of these approaches require negligible interventions (hand-craft lexicons, grammars and KB schemas) to be effective.<br />
<br />
[5] proposed a vectorial feature representation model to this problem. The goal of this paper is to provide an improved model of [5] specifically with the contributions of:<br />
*A more sophisticated inference procedure that is more efficient and can consider longer paths.<br />
*A richer representation of of the answers which encodes the question-answer path and surround subgraph of the KB.<br />
<br />
==Task Definition==<br />
Motivation is to provide a system for open QA able to be trained as long as:<br />
*A training set of questions paired with answers.<br />
*A KB providing a structure among answers.<br />
<br />
WebQuestions [1] was used for evaluation benchmark. WebQuestions only contains a few samples, so it was not possible to train the system on only this dataset. The following describes the data sources used for training.<br />
*WebQuestions: the dataset built using Freebase as the KB and contains 5810 question-answer pairs. It was created by crawling questions through the Google Suggest API and then obtaining answers using Amazon Mechanical Turk (Turkers was allowed to only use Freebase as the querying tool).<br />
*Freebase: is a huge database of general facts that are organized in triplets (<code>subject</code>, <code>type1.type2.predicate</code>, <code>object</code>). The form of the data from Freebase does not correspond to a structure found in natural language and so the questions were converted using the following format: "What is the <code>predicate</code> of the <code>type2 subject</code>"? Note that all data from Freebase will have a fixed format and this is not realistic (in terms of a NL). <br />
*ClubWeb Extractions: The team also used ClueWeb extractions as per [1] and [10]. ClueWeb has the format (<code>subject</code>, "text string", <code>object</code>) and it was ensured that both the <code>subject</code> and <code>object</code> was linked to Freebase. These triples were also converted into questions using simple patters and Freebase types.<br />
*Paraphrases: automatically generated sentences have a rigid format and semi-automatic wording which does not provide a satisfactory modelling of natural language. To overcome this, the team made supplemented their data with paraphrases collected from WikiAnswers. Users on WikiAnswers can tag sentences as a rephrasing of each other: [6] harvest 2M distinct questions from WikiAnswers which were grouped into 350k paraphrase clusters.<br />
<br />
Table 2 shows some examples sentences from each dataset category.<br />
<br />
[[File:table2.JPG | center]]<br />
<br />
==Embedding Questions and Answers==<br />
We wish to train our model such that representations of questions and their corresponding answers are close to each other in the joint embedding space. Let ''q'' denote a question and ''a'' denote an answer. Learning embeddings is achieved by learning a score function ''S''(''q'', ''a'') so that ''S'' generates a high score if ''a'' is the correct answer to ''q'', and a low score otherwise.<br />
<br />
:<math> S(q, a) = f(q)^\mathrm{T} g(a) \,</math><br />
<br />
<br />
Let <math>\mathbf{W}</math> be a matrix of <math>\mathbb{R}^{k \times N}</math>, where ''k'' is the dimension of the embedding space and ''N'' is the dictionary of embedding to be learned. The function <math>f(\cdot)</math> which maps the questions into the embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>f(q) = \mathbf{W}\phi(q)</math>, where <math>\phi(q) \in \mathbb{N}^N</math>, is a sparse vector indicating the number of times each word appears in the question ''q''. Likewise, the function <math>g(\cdot)</math> which maps the answers to the same embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>g(a) = \mathbf{W}\psi(a)</math>, where <math>\psi(a) \in \mathbb{N}^N</math>, is a sparse vector representation of the answer ''a''. Figure 1 below depicts the subgraph embedding model.<br />
<br />
[[File:embedding_model.JPG | center]]<br />
<br />
<br />
===Representing Candidate Answers===<br />
Let us now consider possible feature representations for a single candidate answer. We consider three different representations corresponding to different subgraphs of Freebase around it.<br />
<br />
:(i) Single Entity: The answer is represented as a single entity from Freebase. <math>\psi(a)</math> is a 1-of-<math>N_S</math> coded vector with 1 corresponding to the entity of the answer, and 0 elsewhere.<br />
:(ii) Path Representation: The answer is represented as a path from the entity in the question to the answer entity. Only 1- or 2-hops paths were considered in the experiments which resulted in a <math>\psi(a)</math> which is 3-of-<math>N_S</math> or 4-of-<math>N_S</math>.<br />
:(iii) Subgraph Representation: We encode both the path representation from (ii), and the entire subgraph of entities that connect to the answer entity.<br />
<br />
The hypothesis is that the more information that we include about the answer in its representation space, the better the results, and hence, we adopted the subgraph approach.<br />
<br />
===Training and Loss Function===<br />
The model was trained using a margin-based ranking loss function. Let <math>D = {(q_i, a_i) : i = 1,..., |D|}</math> be the training set of questions <math>q_i</math> paired with their correct answer <math>a_i</math>. The loss function we minimize is<br />
<br />
:<math>\sum_{i \mathop =1}^{|D|} \sum_{\overline{a} \in \overline{A} (a_i)} max\{0,m - S(q_i, a_i) + S(q_i, \overline{a})\},</math><br />
<br />
where ''m'' is the margin (fixed to 0.1). Minimizing the loss function learns the embedding matrix <math>\mathbf W</math> so the score of a question paired with a correct answer is greater than any incorrect answer <math>\overline{a}</math> by at least ''m''.<br />
<br />
===Multitask Training of Embeddings===<br />
Since many of the questions in the training cases were synthetically created, they do not adequately cover the range of syntax used in natural language. Hence, we also multi-task the training of our model with task of phrase prediction. We do this by alternating the training of ''S'' with another scoring function defined as <math>S_{prp}(q_1, q_2) = f(q_1)^\mathrm{T} f(q_2)</math> which uses the same embedding matrix <math>\mathbf{W}</math> and makes the same embeddings of a pair of questions if they are similar to each other if they are paraphrases and make them different otherwise.<br />
<br />
===Inference===<br />
Once <math>\mathbf{W}</math> is trained, at test time, for a given question ''q'' the model predicts the answer with:<br />
<br />
:<math>\hat{a} = argmax_{a^' \in A(q)} S(q, a')</math><br />
<br />
where <math>A(q)</math> is the candidate answer set. For speed and precision issues, we create a candidate set <math>A(q)</math> for each question.<br />
<br />
<math>A(q)</math> is first populated with all triples involving this entity in Freebase. This allows us to answer simple questions which are directly related to the answer. Let us denote this strategy as <math>C_1</math>.<br />
<br />
A system that answer only such questions would be limited so we also consider 2-hops candidates. 1-hop candidates are weighted by 1.5. This strategy denoted <math>C_2</math> is used by default.<br />
<br />
==Experiments==<br />
Table 3 below indicates that their approach outperformed [14], [1] and [5], and performs similarly as [2].<br />
<br />
==Conclusion==</div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=question_Answering_with_Subgraph_Embeddings&diff=26070question Answering with Subgraph Embeddings2015-11-10T05:06:34Z<p>Trttse: /* Inference */</p>
<hr />
<div>==Introduction==<br />
Teaching machines are you answer questions automatically in a natural language has been a long standing goal in AI. There has been a rise in large scale structured knowledge bases (KBs), such as Freebase [3], to tackle the problem known as open-domain question answers (or open QA). However, the scale and difficulty for machines to interpret natural language still makes this problem challenging.<br />
<br />
open QA techniques can be classified into two main categories:<br />
*Information retrieval based: retrieve a broad set of answers be first query the API of the KBs then narrow down the answer using heuristics [8,12,14].<br />
*Semantic parsing based: focus on the correct interpretation of the query. Querying the interpreted question from the KB should return the correct answer [1,9,2,7].<br />
<br />
Both of these approaches require negligible interventions (hand-craft lexicons, grammars and KB schemas) to be effective.<br />
<br />
[5] proposed a vectorial feature representation model to this problem. The goal of this paper is to provide an improved model of [5] specifically with the contributions of:<br />
*A more sophisticated inference procedure that is more efficient and can consider longer paths.<br />
*A richer representation of of the answers which encodes the question-answer path and surround subgraph of the KB.<br />
<br />
==Task Definition==<br />
Motivation is to provide a system for open QA able to be trained as long as:<br />
*A training set of questions paired with answers.<br />
*A KB providing a structure among answers.<br />
<br />
WebQuestions [1] was used for evaluation benchmark. WebQuestions only contains a few samples, so it was not possible to train the system on only this dataset. The following describes the data sources used for training.<br />
*WebQuestions: the dataset built using Freebase as the KB and contains 5810 question-answer pairs. It was created by crawling questions through the Google Suggest API and then obtaining answers using Amazon Mechanical Turk (Turkers was allowed to only use Freebase as the querying tool).<br />
*Freebase: is a huge database of general facts that are organized in triplets (<code>subject</code>, <code>type1.type2.predicate</code>, <code>object</code>). The form of the data from Freebase does not correspond to a structure found in natural language and so the questions were converted using the following format: "What is the <code>predicate</code> of the <code>type2 subject</code>"? Note that all data from Freebase will have a fixed format and this is not realistic (in terms of a NL). <br />
*ClubWeb Extractions: The team also used ClueWeb extractions as per [1] and [10]. ClueWeb has the format (<code>subject</code>, "text string", <code>object</code>) and it was ensured that both the <code>subject</code> and <code>object</code> was linked to Freebase. These triples were also converted into questions using simple patters and Freebase types.<br />
*Paraphrases: automatically generated sentences have a rigid format and semi-automatic wording which does not provide a satisfactory modelling of natural language. To overcome this, the team made supplemented their data with paraphrases collected from WikiAnswers. Users on WikiAnswers can tag sentences as a rephrasing of each other: [6] harvest 2M distinct questions from WikiAnswers which were grouped into 350k paraphrase clusters.<br />
<br />
Table 2 shows some examples sentences from each dataset category.<br />
<br />
[[File:table2.JPG | center]]<br />
<br />
==Embedding Questions and Answers==<br />
We wish to train our model such that representations of questions and their corresponding answers are close to each other in the joint embedding space. Let ''q'' denote a question and ''a'' denote an answer. Learning embeddings is achieved by learning a score function ''S''(''q'', ''a'') so that ''S'' generates a high score if ''a'' is the correct answer to ''q'', and a low score otherwise.<br />
<br />
:<math> S(q, a) = f(q)^\mathrm{T} g(a) \,</math><br />
<br />
<br />
Let <math>\mathbf{W}</math> be a matrix of <math>\mathbb{R}^{k \times N}</math>, where ''k'' is the dimension of the embedding space and ''N'' is the dictionary of embedding to be learned. The function <math>f(\cdot)</math> which maps the questions into the embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>f(q) = \mathbf{W}\phi(q)</math>, where <math>\phi(q) \in \mathbb{N}^N</math>, is a sparse vector indicating the number of times each word appears in the question ''q''. Likewise, the function <math>g(\cdot)</math> which maps the answers to the same embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>g(a) = \mathbf{W}\psi(a)</math>, where <math>\psi(a) \in \mathbb{N}^N</math>, is a sparse vector representation of the answer ''a''. Figure 1 below depicts the subgraph embedding model.<br />
<br />
[[File:embedding_model.JPG | center]]<br />
<br />
<br />
===Representing Candidate Answers===<br />
Let us now consider possible feature representations for a single candidate answer. We consider three different representations corresponding to different subgraphs of Freebase around it.<br />
<br />
:(i) Single Entity: The answer is represented as a single entity from Freebase. <math>\psi(a)</math> is a 1-of-<math>N_S</math> coded vector with 1 corresponding to the entity of the answer, and 0 elsewhere.<br />
:(ii) Path Representation: The answer is represented as a path from the entity in the question to the answer entity. Only 1- or 2-hops paths were considered in the experiments which resulted in a <math>\psi(a)</math> which is 3-of-<math>N_S</math> or 4-of-<math>N_S</math>.<br />
:(iii) Subgraph Representation: We encode both the path representation from (ii), and the entire subgraph of entities that connect to the answer entity.<br />
<br />
The hypothesis is that the more information that we include about the answer in its representation space, the better the results, and hence, we adopted the subgraph approach.<br />
<br />
===Training and Loss Function===<br />
The model was trained using a margin-based ranking loss function. Let <math>D = {(q_i, a_i) : i = 1,..., |D|}</math> be the training set of questions <math>q_i</math> paired with their correct answer <math>a_i</math>. The loss function we minimize is<br />
<br />
:<math>\sum_{i \mathop =1}^{|D|} \sum_{\overline{a} \in \overline{A} (a_i)} max\{0,m - S(q_i, a_i) + S(q_i, \overline{a})\},</math><br />
<br />
where ''m'' is the margin (fixed to 0.1). Minimizing the loss function learns the embedding matrix <math>\mathbf W</math> so the score of a question paired with a correct answer is greater than any incorrect answer <math>\overline{a}</math> by at least ''m''.<br />
<br />
===Multitask Training of Embeddings===<br />
Since many of the questions in the training cases were synthetically created, they do not adequately cover the range of syntax used in natural language. Hence, we also multi-task the training of our model with task of phrase prediction. We do this by alternating the training of ''S'' with another scoring function defined as <math>S_{prp}(q_1, q_2) = f(q_1)^\mathrm{T} f(q_2)</math> which uses the same embedding matrix <math>\mathbf{W}</math> and makes the same embeddings of a pair of questions if they are similar to each other if they are paraphrases and make them different otherwise.<br />
<br />
===Inference===<br />
Once <math>\mathbf{W}</math> is trained, at test time, for a given question ''q'' the model predicts the answer with:<br />
<br />
:<math>\hat{a} = argmax_{a^' \in A(q)} S(q, a')</math><br />
<br />
where <math>A(q)</math> is the candidate answer set. For speed and precision issues, we create a candidate set <math>A(q)</math> for each question.<br />
<br />
<math>A(q)</math> is first populated with all triples involving this entity in Freebase. This allows us to answer simple questions which are directly related to the answer. Let us denote this strategy as <math>C_1</math>.<br />
<br />
A system that answer only such questions would be limited so we also consider 2-hops candidates. 1-hop candidates are weighted by 1.5. This strategy denoted <math>C_2</math> is used by default.<br />
<br />
==Experiments==<br />
<br />
==Conclusion==</div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=question_Answering_with_Subgraph_Embeddings&diff=26069question Answering with Subgraph Embeddings2015-11-10T03:22:33Z<p>Trttse: /* Inference */</p>
<hr />
<div>==Introduction==<br />
Teaching machines are you answer questions automatically in a natural language has been a long standing goal in AI. There has been a rise in large scale structured knowledge bases (KBs), such as Freebase [3], to tackle the problem known as open-domain question answers (or open QA). However, the scale and difficulty for machines to interpret natural language still makes this problem challenging.<br />
<br />
open QA techniques can be classified into two main categories:<br />
*Information retrieval based: retrieve a broad set of answers be first query the API of the KBs then narrow down the answer using heuristics [8,12,14].<br />
*Semantic parsing based: focus on the correct interpretation of the query. Querying the interpreted question from the KB should return the correct answer [1,9,2,7].<br />
<br />
Both of these approaches require negligible interventions (hand-craft lexicons, grammars and KB schemas) to be effective.<br />
<br />
[5] proposed a vectorial feature representation model to this problem. The goal of this paper is to provide an improved model of [5] specifically with the contributions of:<br />
*A more sophisticated inference procedure that is more efficient and can consider longer paths.<br />
*A richer representation of of the answers which encodes the question-answer path and surround subgraph of the KB.<br />
<br />
==Task Definition==<br />
Motivation is to provide a system for open QA able to be trained as long as:<br />
*A training set of questions paired with answers.<br />
*A KB providing a structure among answers.<br />
<br />
WebQuestions [1] was used for evaluation benchmark. WebQuestions only contains a few samples, so it was not possible to train the system on only this dataset. The following describes the data sources used for training.<br />
*WebQuestions: the dataset built using Freebase as the KB and contains 5810 question-answer pairs. It was created by crawling questions through the Google Suggest API and then obtaining answers using Amazon Mechanical Turk (Turkers was allowed to only use Freebase as the querying tool).<br />
*Freebase: is a huge database of general facts that are organized in triplets (<code>subject</code>, <code>type1.type2.predicate</code>, <code>object</code>). The form of the data from Freebase does not correspond to a structure found in natural language and so the questions were converted using the following format: "What is the <code>predicate</code> of the <code>type2 subject</code>"? Note that all data from Freebase will have a fixed format and this is not realistic (in terms of a NL). <br />
*ClubWeb Extractions: The team also used ClueWeb extractions as per [1] and [10]. ClueWeb has the format (<code>subject</code>, "text string", <code>object</code>) and it was ensured that both the <code>subject</code> and <code>object</code> was linked to Freebase. These triples were also converted into questions using simple patters and Freebase types.<br />
*Paraphrases: automatically generated sentences have a rigid format and semi-automatic wording which does not provide a satisfactory modelling of natural language. To overcome this, the team made supplemented their data with paraphrases collected from WikiAnswers. Users on WikiAnswers can tag sentences as a rephrasing of each other: [6] harvest 2M distinct questions from WikiAnswers which were grouped into 350k paraphrase clusters.<br />
<br />
Table 2 shows some examples sentences from each dataset category.<br />
<br />
[[File:table2.JPG | center]]<br />
<br />
==Embedding Questions and Answers==<br />
We wish to train our model such that representations of questions and their corresponding answers are close to each other in the joint embedding space. Let ''q'' denote a question and ''a'' denote an answer. Learning embeddings is achieved by learning a score function ''S''(''q'', ''a'') so that ''S'' generates a high score if ''a'' is the correct answer to ''q'', and a low score otherwise.<br />
<br />
:<math> S(q, a) = f(q)^\mathrm{T} g(a) \,</math><br />
<br />
<br />
Let <math>\mathbf{W}</math> be a matrix of <math>\mathbb{R}^{k \times N}</math>, where ''k'' is the dimension of the embedding space and ''N'' is the dictionary of embedding to be learned. The function <math>f(\cdot)</math> which maps the questions into the embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>f(q) = \mathbf{W}\phi(q)</math>, where <math>\phi(q) \in \mathbb{N}^N</math>, is a sparse vector indicating the number of times each word appears in the question ''q''. Likewise, the function <math>g(\cdot)</math> which maps the answers to the same embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>g(a) = \mathbf{W}\psi(a)</math>, where <math>\psi(a) \in \mathbb{N}^N</math>, is a sparse vector representation of the answer ''a''. Figure 1 below depicts the subgraph embedding model.<br />
<br />
[[File:embedding_model.JPG | center]]<br />
<br />
<br />
===Representing Candidate Answers===<br />
Let us now consider possible feature representations for a single candidate answer. We consider three different representations corresponding to different subgraphs of Freebase around it.<br />
<br />
:(i) Single Entity: The answer is represented as a single entity from Freebase. <math>\psi(a)</math> is a 1-of-<math>N_S</math> coded vector with 1 corresponding to the entity of the answer, and 0 elsewhere.<br />
:(ii) Path Representation: The answer is represented as a path from the entity in the question to the answer entity. Only 1- or 2-hops paths were considered in the experiments which resulted in a <math>\psi(a)</math> which is 3-of-<math>N_S</math> or 4-of-<math>N_S</math>.<br />
:(iii) Subgraph Representation: We encode both the path representation from (ii), and the entire subgraph of entities that connect to the answer entity.<br />
<br />
The hypothesis is that the more information that we include about the answer in its representation space, the better the results, and hence, we adopted the subgraph approach.<br />
<br />
===Training and Loss Function===<br />
The model was trained using a margin-based ranking loss function. Let <math>D = {(q_i, a_i) : i = 1,..., |D|}</math> be the training set of questions <math>q_i</math> paired with their correct answer <math>a_i</math>. The loss function we minimize is<br />
<br />
:<math>\sum_{i \mathop =1}^{|D|} \sum_{\overline{a} \in \overline{A} (a_i)} max\{0,m - S(q_i, a_i) + S(q_i, \overline{a})\},</math><br />
<br />
where ''m'' is the margin (fixed to 0.1). Minimizing the loss function learns the embedding matrix <math>\mathbf W</math> so the score of a question paired with a correct answer is greater than any incorrect answer <math>\overline{a}</math> by at least ''m''.<br />
<br />
===Multitask Training of Embeddings===<br />
Since many of the questions in the training cases were synthetically created, they do not adequately cover the range of syntax used in natural language. Hence, we also multi-task the training of our model with task of phrase prediction. We do this by alternating the training of ''S'' with another scoring function defined as <math>S_{prp}(q_1, q_2) = f(q_1)^\mathrm{T} f(q_2)</math> which uses the same embedding matrix <math>\mathbf{W}</math> and makes the same embeddings of a pair of questions if they are similar to each other if they are paraphrases and make them different otherwise.<br />
<br />
===Inference===<br />
Once <math>\mathbf{W}</math> is trained, at test time, for a given question ''q'' the model predicts the answer with:<br />
<br />
:<math>\hat{a} = argmax_{a^' \in A(q)} S(q, a')</math><br />
<br />
where <math>A(q)</math> is the candidate answer set. For speed and precision issues, we create a candidate set <math>A(q)</math> for each question.<br />
<br />
==Experiments==<br />
<br />
==Conclusion==</div>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=question_Answering_with_Subgraph_Embeddings&diff=26063question Answering with Subgraph Embeddings2015-11-10T03:13:11Z<p>Trttse: /* Inference */</p>
<hr />
<div>==Introduction==<br />
Teaching machines are you answer questions automatically in a natural language has been a long standing goal in AI. There has been a rise in large scale structured knowledge bases (KBs), such as Freebase [3], to tackle the problem known as open-domain question answers (or open QA). However, the scale and difficulty for machines to interpret natural language still makes this problem challenging.<br />
<br />
open QA techniques can be classified into two main categories:<br />
*Information retrieval based: retrieve a broad set of answers be first query the API of the KBs then narrow down the answer using heuristics [8,12,14].<br />
*Semantic parsing based: focus on the correct interpretation of the query. Querying the interpreted question from the KB should return the correct answer [1,9,2,7].<br />
<br />
Both of these approaches require negligible interventions (hand-craft lexicons, grammars and KB schemas) to be effective.<br />
<br />
[5] proposed a vectorial feature representation model to this problem. The goal of this paper is to provide an improved model of [5] specifically with the contributions of:<br />
*A more sophisticated inference procedure that is more efficient and can consider longer paths.<br />
*A richer representation of of the answers which encodes the question-answer path and surround subgraph of the KB.<br />
<br />
==Task Definition==<br />
Motivation is to provide a system for open QA able to be trained as long as:<br />
*A training set of questions paired with answers.<br />
*A KB providing a structure among answers.<br />
<br />
WebQuestions [1] was used as the evaluation benchmark. Since WebQuestions contains relatively few samples, it was not possible to train the system on this dataset alone. The following describes the data sources used for training.<br />
*WebQuestions: this dataset was built using Freebase as the KB and contains 5810 question-answer pairs. It was created by crawling questions through the Google Suggest API and then obtaining answers using Amazon Mechanical Turk (Turkers were allowed to use only Freebase as the querying tool).<br />
*Freebase: a huge database of general facts organized as triples (<code>subject</code>, <code>type1.type2.predicate</code>, <code>object</code>). The form of the data from Freebase does not correspond to a structure found in natural language, so the triples were converted into questions using the following format: "What is the <code>predicate</code> of the <code>type2</code> <code>subject</code>?" Note that all data from Freebase has this fixed format, which is not realistic in terms of natural language.<br />
*ClueWeb Extractions: The team also used ClueWeb extractions as per [1] and [10]. ClueWeb has the format (<code>subject</code>, "text string", <code>object</code>), and it was ensured that both the <code>subject</code> and <code>object</code> were linked to Freebase. These triples were also converted into questions using simple patterns and Freebase types.<br />
*Paraphrases: automatically generated sentences have a rigid format and semi-automatic wording, which does not provide a satisfactory modelling of natural language. To overcome this, the team supplemented their data with paraphrases collected from WikiAnswers. Users on WikiAnswers can tag sentences as rephrasings of each other: [6] harvested 2M distinct questions from WikiAnswers, which were grouped into 350k paraphrase clusters.<br />
<br />
Table 2 shows some examples sentences from each dataset category.<br />
<br />
[[File:table2.JPG | center]]<br />
<br />
==Embedding Questions and Answers==<br />
We wish to train our model such that representations of questions and their corresponding answers are close to each other in the joint embedding space. Let ''q'' denote a question and ''a'' denote an answer. Learning embeddings is achieved by learning a score function ''S''(''q'', ''a'') so that ''S'' generates a high score if ''a'' is the correct answer to ''q'', and a low score otherwise.<br />
<br />
:<math> S(q, a) = f(q)^\mathrm{T} g(a) \,</math><br />
<br />
<br />
Let <math>\mathbf{W}</math> be a matrix of <math>\mathbb{R}^{k \times N}</math>, where ''k'' is the dimension of the embedding space and ''N'' is the size of the dictionary of embeddings to be learned. The function <math>f(\cdot)</math>, which maps the questions into the embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>f(q) = \mathbf{W}\phi(q)</math>, where <math>\phi(q) \in \mathbb{N}^N</math> is a sparse vector indicating the number of times each word appears in the question ''q''. Likewise, the function <math>g(\cdot)</math>, which maps the answers to the same embedding space <math>\mathbb{R}^{k}</math>, is defined as <math>g(a) = \mathbf{W}\psi(a)</math>, where <math>\psi(a) \in \mathbb{N}^N</math> is a sparse vector representation of the answer ''a''. Figure 1 below depicts the subgraph embedding model.<br />
<br />
[[File:embedding_model.JPG | center]]<br />
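<br />
The scoring model above can be sketched with toy data. This is only an illustrative sketch, not the authors' implementation: the dimensions, vocabulary, and the helper <code>phi</code> below are all invented for illustration.<br />

```python
import numpy as np

# Toy dimensions: k = embedding size, N = dictionary size (both hypothetical).
k, N = 4, 6
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(k, N))  # shared embedding matrix

# Hypothetical dictionary covering both question words and KB symbols.
vocab = {"who": 0, "directed": 1, "film": 2,
         "ent:spielberg": 3, "rel:directed_by": 4, "ent:jaws": 5}

def counts(tokens, vocab):
    """Sparse count vector (phi(q) or psi(a)) over the dictionary."""
    v = np.zeros(len(vocab))
    for t in tokens:
        v[vocab[t]] += 1
    return v

def S(q_vec, a_vec):
    """S(q, a) = f(q)^T g(a) with f(q) = W phi(q) and g(a) = W psi(a)."""
    return float((W @ q_vec) @ (W @ a_vec))

q = counts(["who", "directed", "film"], vocab)       # phi(q)
a = counts(["ent:jaws", "rel:directed_by"], vocab)   # psi(a)
print(S(q, a))
```

A high value of <code>S(q, a)</code> indicates that <code>a</code> is likely the correct answer to <code>q</code>; during training, <code>W</code> is adjusted to make this hold.<br />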
<br />
<br />
===Representing Candidate Answers===<br />
Let us now consider possible feature representations for a single candidate answer. We consider three different representations corresponding to different subgraphs of Freebase around it.<br />
<br />
:(i) Single Entity: The answer is represented as a single entity from Freebase. <math>\psi(a)</math> is a 1-of-<math>N_S</math> coded vector with 1 corresponding to the entity of the answer, and 0 elsewhere.<br />
:(ii) Path Representation: The answer is represented as a path from the entity in the question to the answer entity. Only 1- or 2-hop paths were considered in the experiments, resulting in a <math>\psi(a)</math> that is 3-of-<math>N_S</math> or 4-of-<math>N_S</math> coded, respectively (one slot for the question entity, one per relation on the path, and one for the answer entity).<br />
:(iii) Subgraph Representation: We encode both the path representation from (ii), and the entire subgraph of entities that connect to the answer entity.<br />
<br />
The hypothesis is that the more information we include about the answer in its representation, the better the results; hence, the subgraph approach was adopted.<br />
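<br />
The three encodings can be sketched as progressively denser sparse vectors over the symbol dictionary. The symbols and triples below are made up for illustration.<br />

```python
import numpy as np

# Hypothetical Freebase symbol dictionary (entities and relations).
symbols = {"ent:spielberg": 0, "rel:directed_by": 1, "ent:jaws": 2,
           "rel:release_year": 3, "ent:1975": 4}
N_S = len(symbols)

def one_hot(names):
    """k-of-N_S coded vector: 1 at each named symbol, 0 elsewhere."""
    v = np.zeros(N_S)
    for n in names:
        v[symbols[n]] = 1
    return v

# (i) single entity: 1-of-N_S.
psi_entity = one_hot(["ent:jaws"])
# (ii) 1-hop path (question entity, relation, answer entity): 3-of-N_S.
psi_path = one_hot(["ent:spielberg", "rel:directed_by", "ent:jaws"])
# (iii) subgraph: the path plus symbols connected to the answer entity.
psi_subgraph = one_hot(["ent:spielberg", "rel:directed_by", "ent:jaws",
                        "rel:release_year", "ent:1975"])

print(psi_entity.sum(), psi_path.sum(), psi_subgraph.sum())
```
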
<br />
===Training and Loss Function===<br />
The model was trained using a margin-based ranking loss function. Let <math>D = \{(q_i, a_i) : i = 1, \ldots, |D|\}</math> be the training set of questions <math>q_i</math> paired with their correct answer <math>a_i</math>. The loss function we minimize is<br />
<br />
:<math>\sum_{i=1}^{|D|} \sum_{\overline{a} \in \overline{A}(a_i)} \max\{0, m - S(q_i, a_i) + S(q_i, \overline{a})\},</math><br />
<br />
where ''m'' is the margin (fixed to 0.1) and <math>\overline{A}(a_i)</math> is a set of incorrect answers sampled for question <math>q_i</math>. Minimizing the loss function learns the embedding matrix <math>\mathbf W</math> so that the score of a question paired with its correct answer is greater than that with any incorrect answer <math>\overline{a}</math> by at least ''m''.<br />
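<br />
A single SGD step on this ranking loss can be sketched as follows; the vectors, learning rate, and sampling are illustrative and not the paper's exact training procedure.<br />

```python
import numpy as np

k, N, m = 4, 6, 0.1              # toy dimensions; margin m = 0.1 as in the paper
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(k, N))

def score(W, q, a):
    """S(q, a) = (W q) . (W a) = q^T W^T W a."""
    return float((W @ q) @ (W @ a))

q = np.array([1., 1, 0, 0, 0, 0])      # phi(q) (hypothetical counts)
a_pos = np.array([0., 0, 1, 1, 0, 0])  # psi of the correct answer
a_neg = np.array([0., 0, 0, 0, 1, 1])  # psi of a sampled incorrect answer

loss = max(0.0, m - score(W, q, a_pos) + score(W, q, a_neg))
if loss > 0:
    # dS/dW = (W a) q^T + (W q) a^T, so the hinge gradient is:
    grad = (-(np.outer(W @ a_pos, q) + np.outer(W @ q, a_pos))
            + (np.outer(W @ a_neg, q) + np.outer(W @ q, a_neg)))
    W -= 0.01 * grad             # small gradient step

new_loss = max(0.0, m - score(W, q, a_pos) + score(W, q, a_neg))
print(loss, new_loss)
```

Each step pushes the correct answer's score up and the sampled incorrect answer's score down whenever the margin is violated.<br />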
<br />
===Multitask Training of Embeddings===<br />
Since many of the questions in the training set were created synthetically, they do not adequately cover the range of syntax used in natural language. Hence, we also multi-task the training of our model with the task of paraphrase prediction. We do this by alternating the training of ''S'' with that of another scoring function, <math>S_{prp}(q_1, q_2) = f(q_1)^\mathrm{T} f(q_2)</math>, which uses the same embedding matrix <math>\mathbf{W}</math> and pushes the embeddings of a pair of questions close together if they are paraphrases and apart otherwise.<br />
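<br />
The paraphrase score reuses the question mapping <math>f</math>; a minimal sketch, with a made-up vocabulary and count vectors:<br />

```python
import numpy as np

k, N = 4, 5                      # toy dimensions
rng = np.random.default_rng(2)
W = rng.normal(scale=0.1, size=(k, N))  # same shared embedding matrix

def f(q_counts):
    """f(q) = W phi(q), identical to the question mapping in S(q, a)."""
    return W @ q_counts

def S_prp(q1, q2):
    """S_prp(q1, q2) = f(q1)^T f(q2): trained high for paraphrase pairs."""
    return float(f(q1) @ f(q2))

q1 = np.array([1., 1, 0, 0, 0])  # e.g. "who wrote X" (hypothetical counts)
q2 = np.array([1., 0, 1, 0, 0])  # e.g. "who authored X" (a paraphrase)
print(S_prp(q1, q2))
```

Because both arguments go through the same <math>f</math>, the score is symmetric, and gradients from paraphrase pairs update the same <math>\mathbf{W}</math> used for question answering.<br />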
<br />
===Inference===<br />
Once <math>\mathbf{W}</math> is trained, at test time, for a given question ''q'' the model predicts the answer with:<br />
<br />
:<math>\hat{a} = \arg\max_{a' \in A(q)} S(q, a')</math><br />
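<br />
Test-time inference simply scores every candidate in <math>A(q)</math> and returns the argmax. The candidate set and vectors below are toy examples.<br />

```python
import numpy as np

k, N = 4, 6                      # toy dimensions
rng = np.random.default_rng(3)
W = rng.normal(scale=0.1, size=(k, N))  # assumed already trained

def S(q, a):
    """S(q, a) = f(q)^T g(a) with the shared embedding matrix W."""
    return float((W @ q) @ (W @ a))

q = np.array([1., 1, 0, 0, 0, 0])  # phi(q)
candidates = {                     # hypothetical candidate set A(q)
    "ent:jaws": np.array([0., 0, 1, 1, 0, 0]),
    "ent:et":   np.array([0., 0, 0, 1, 1, 0]),
    "ent:1975": np.array([0., 0, 0, 0, 0, 1]),
}

# a_hat = argmax over A(q) of S(q, a').
a_hat = max(candidates, key=lambda name: S(q, candidates[name]))
print(a_hat)
```
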
<br />
==Experiments==<br />
<br />
==Conclusion==</div>Trttse