http://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_using_very_large_target_vocabulary_for_neural_machine_translation&feed=atom&action=historyon using very large target vocabulary for neural machine translation - Revision history2024-03-29T08:55:07ZRevision history for this page on the wikiMediaWiki 1.41.0http://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_using_very_large_target_vocabulary_for_neural_machine_translation&diff=27748&oldid=prevConversion script: Conversion script moved page On using very large target vocabulary for neural machine translation to on using very large target vocabulary for neural machine translation: Converting page titles to lowercase2017-08-30T13:46:34Z<p>Conversion script moved page <a href="/statwiki/index.php?title=On_using_very_large_target_vocabulary_for_neural_machine_translation" class="mw-redirect" title="On using very large target vocabulary for neural machine translation">On using very large target vocabulary for neural machine translation</a> to <a href="/statwiki/index.php?title=on_using_very_large_target_vocabulary_for_neural_machine_translation" title="on using very large target vocabulary for neural machine translation">on using very large target vocabulary for neural machine translation</a>: Converting page titles to lowercase</p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<tr class="diff-title" lang="us">
<td colspan="1" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="1" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 09:46, 30 August 2017</td>
</tr><tr><td colspan="2" class="diff-notice" lang="us"><div class="mw-diff-empty">(No difference)</div>
</td></tr></table>Conversion scripthttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_using_very_large_target_vocabulary_for_neural_machine_translation&diff=27352&oldid=prevRtwang at 02:09, 18 December 20152015-12-18T02:09:13Z<p></p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="us">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 22:09, 17 December 2015</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l3">Line 3:</td>
<td colspan="2" class="diff-lineno">Line 3:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>This is a summary of the paper by S. Jean, K. Cho, R Memisevic, and Y. Bengio entitled "On Using Very Large Target Vocabulary for Neural Machine Translation"</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>This is a summary of the paper by S. Jean, K. Cho, R Memisevic, and Y. Bengio entitled "On Using Very Large Target Vocabulary for Neural Machine Translation"</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div><ref>S. Jean, K. Cho, R Memisevic, and Y. Bengio. [http://arxiv.org/pdf/1412.2007v2.pdf "On Using Very Large Target Vocabulary for Neural Machine Translation"], 2015.</ref></div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div><ref>S. Jean, K. Cho, R Memisevic, and Y. Bengio. [http://arxiv.org/pdf/1412.2007v2.pdf "On Using Very Large Target Vocabulary for Neural Machine Translation"], 2015.</ref></div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The paper presents the application of importance sampling for neural machine translation with a very large target vocabulary. Despite the advantages of neural networks in translation over <del style="font-weight: bold; text-decoration: none;">the </del>statistical machine translation systems<del style="font-weight: bold; text-decoration: none;">, </del>such as the phrase-based system, they suffer from some technical problems. Most importantly, they are limited to <del style="font-weight: bold; text-decoration: none;">work with </del>a small vocabulary because of complexity and <del style="font-weight: bold; text-decoration: none;">the </del>number of parameters that have to be trained. <del style="font-weight: bold; text-decoration: none;">To explain, the </del>output <del style="font-weight: bold; text-decoration: none;">layer </del>of <del style="font-weight: bold; text-decoration: none;">an </del>RNN used for machine translation will have as many <del style="font-weight: bold; text-decoration: none;">units </del>as there are <del style="font-weight: bold; text-decoration: none;">items </del>in the vocabulary. If the vocabulary <del style="font-weight: bold; text-decoration: none;">has </del>hundreds of thousand of <del style="font-weight: bold; text-decoration: none;">terms</del>, then the RNN must compute a very expensive softmax on the output <del style="font-weight: bold; text-decoration: none;">units </del>at each time step <del style="font-weight: bold; text-decoration: none;">when predicting an output </del>sequence. <del style="font-weight: bold; text-decoration: none;">Moreover</del>, the number of parameters in the RNN will also grow very large in such cases<del style="font-weight: bold; text-decoration: none;">, </del>given that number of weights between the hidden layer and output layer will be equal to the product of the number of units in each layer. For a non-<del style="font-weight: bold; text-decoration: none;">trivially </del>sized hidden layer, a large vocabulary could result in tens of millions of model parameters <del style="font-weight: bold; text-decoration: none;">just </del>associated with the hidden-to-output mapping <del style="font-weight: bold; text-decoration: none;">performed by the model</del>. In practice, researchers who apply RNNs to machine translation have avoided this problem by restricting the model vocabulary to only include some shortlist of words in the target language. Words not in this shortlist are treated as unknown by the model and assigned a special 'UNK' token. This technique understandably impairs translation performance when the target sentence includes a large number of words not present in the vocabulary. </div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The paper presents the application of importance sampling for neural machine translation with a very large target vocabulary. Despite the advantages of neural networks in <ins style="font-weight: bold; text-decoration: none;">machine </ins>translation over statistical machine translation systems such as the phrase-based system, they suffer from some technical problems. Most importantly, they are limited to a small vocabulary because of complexity and number of parameters that have to be trained <ins style="font-weight: bold; text-decoration: none;">as total vocabulary increases</ins>. <ins style="font-weight: bold; text-decoration: none;">The </ins>output of <ins style="font-weight: bold; text-decoration: none;">a </ins>RNN used for machine translation will have as many <ins style="font-weight: bold; text-decoration: none;">dimensions </ins>as there are <ins style="font-weight: bold; text-decoration: none;">words </ins>in the vocabulary. If the <ins style="font-weight: bold; text-decoration: none;">total </ins>vocabulary <ins style="font-weight: bold; text-decoration: none;">consists of </ins>hundreds of thousand of <ins style="font-weight: bold; text-decoration: none;">words</ins>, then the RNN must compute a very expensive softmax on the output <ins style="font-weight: bold; text-decoration: none;">vector </ins>at each time step <ins style="font-weight: bold; text-decoration: none;">and estimate the probability of each word as the next word in the </ins>sequence. <ins style="font-weight: bold; text-decoration: none;">Therefore</ins>, the number of parameters in the RNN will also grow very large in such cases given that number of weights between the hidden layer and output layer will be equal to the product of the number of units in each layer. For a non-<ins style="font-weight: bold; text-decoration: none;">trivial </ins>sized hidden layer, a large vocabulary could result in tens of millions of model parameters <ins style="font-weight: bold; text-decoration: none;">purely </ins>associated with the hidden-to-output mapping. In practice, researchers who apply RNNs to machine translation have avoided this problem by restricting the model vocabulary to only include some shortlist of words in the target language. Words not in this shortlist are treated as unknown by the model and assigned a special 'UNK' token. This technique understandably impairs translation performance when the target sentence includes a large number of words not present in the vocabulary <ins style="font-weight: bold; text-decoration: none;">such as names</ins>. </div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>In this paper Jean and his colleagues aim to solve this problem by proposing a training method based on importance sampling which uses a large target vocabulary without increasing training complexity. The proposed algorithm demonstrates better performance without losing efficiency in time or speed. The algorithm is tested on two machine translation tasks (English <math>\rightarrow</math> German, and English <math>\rightarrow</math> French), and it achieved the best performance <del style="font-weight: bold; text-decoration: none;">achieved by </del>any previous single neural machine translation (NMT) system on the English <math>\rightarrow</math> French translation task.</div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>In this paper Jean and his colleagues aim to solve this problem by proposing a training method based on importance sampling which uses a large target vocabulary without increasing training complexity. The proposed algorithm demonstrates better performance without losing efficiency in time or speed. The algorithm is tested on two machine translation tasks (English <math>\rightarrow</math> German, and English <math>\rightarrow</math> French), and it achieved the best performance <ins style="font-weight: bold; text-decoration: none;">out of </ins>any previous single neural machine translation (NMT) system on the English <math>\rightarrow</math> French translation task.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>==Methods==</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>==Methods==</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Recall that the classic neural machine <del style="font-weight: bold; text-decoration: none;">learning plays as </del>encoder-decoder network. The encoder reads the source sentence x and encode it into a sequence of hidden states of h where <math>h_t=f(x_t,h_{t-1})</math>. In the decoder step, another neural network generates the translation vector of y based on the encoded sequence of hidden states h: <math>p(y_t\,|\,y_{<t},x)\propto \exp\{q(y_{t-1}, z_t, c_t)\}</math> where <math>\, z_t=g(y_{t-1}, z_{t-1}, c_t)</math> and <math>\, c_t=r(z_{t-1}, h_1,..., H_T)</math></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Recall that the classic neural machine <ins style="font-weight: bold; text-decoration: none;">translation system works through an </ins>encoder-decoder network. The encoder reads the source sentence x and encode it into a sequence of hidden states of h where <math>h_t=f(x_t,h_{t-1})</math>. In the decoder step, another neural network generates the translation vector of y based on the encoded sequence of hidden states h: <math>p(y_t\,|\,y_{<t},x)\propto \exp\{q(y_{t-1}, z_t, c_t)\}</math> where <math>\, z_t=g(y_{t-1}, z_{t-1}, c_t)</math> and <math>\, c_t=r(z_{t-1}, h_1,..., H_T)</math></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The objective function which have to be maximized represented by </div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The objective function which have to be maximized represented by </div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l31">Line 31:</td>
<td colspan="2" class="diff-lineno">Line 31:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The dot product between the feature <math>\phi(y_{t-1}, z_t, c_t)</math> and <math>w_t</math> is required to be done for all words in target vocabulary <del style="font-weight: bold; text-decoration: none;">that </del>is computationally complex and time consuming. Furthermore, the memory requirements grow linearly with respect to the number of target word. This has been a major hurdle for neural machine translations. Recent approaches use a shortlist of 30,000 to 80,000 most frequent words. This makes training <del style="font-weight: bold; text-decoration: none;">for </del>feasible but also has problems of its own. For example, the model degrades heavily if the translation of the source sentence requires many words that are not included in the shortlist. The approach of this paper uses only a subset of sampled target words as <del style="font-weight: bold; text-decoration: none;">a </del>align vector to maximize Eq (6), instead of all the likely target words. The most naïve way to select a subset of target words is selection of K most frequent words. However, This skipping words from training processes is in contrast with using a large vocabulary, because practically we removed a bunch of words from target dictionary. Jean et al., proposed using an existing word alignment model to align the source and target words in the training corpus and build a dictionary. With the dictionary, for each source sentence, we construct a target word set consisting of the K-most frequent words (according to the estimated unigram probability) and, using the dictionary, at most <math>k\prime</math> likely target words for each source word. K and <math>k\prime</math> may be chosen either to meet the computational requirement or to maximize the translation performance on the development set. </div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The dot product between the feature <math>\phi(y_{t-1}, z_t, c_t)</math> and <math>w_t</math> is required to be done for all words in target vocabulary <ins style="font-weight: bold; text-decoration: none;">and </ins>is computationally complex and time consuming. Furthermore, the memory requirements grow linearly with respect to the number of target word. This has been a major hurdle for neural machine translations. Recent approaches use a shortlist of 30,000 to 80,000 most frequent words. This makes training <ins style="font-weight: bold; text-decoration: none;">more </ins>feasible but also has problems of its own. For example, the model degrades heavily if the translation of the source sentence requires many words that are not included in the shortlist. The approach of this paper uses only a subset of sampled target words as <ins style="font-weight: bold; text-decoration: none;">an </ins>align vector to maximize Eq (6), instead of all the likely target words. The most naïve way to select a subset of target words is selection of K most frequent words. However, This skipping <ins style="font-weight: bold; text-decoration: none;">of </ins>words from training processes is in contrast with using a large vocabulary, because practically we removed a bunch of words from target dictionary. Jean et al., proposed using an existing word alignment model to align the source and target words in the training corpus and build a dictionary. With the dictionary, for each source sentence, we construct a target word set consisting of the K-most frequent words (according to the estimated unigram probability) and, using the dictionary, at most <math>k\prime</math> likely target words for each source word. K and <math>k\prime</math> may be chosen either to meet the computational requirement or to maximize the translation performance on the development set. </div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In order to avoid the growing complexity of computing the normalization constant, the authors proposed to use only a small subset <math>v\prime</math> of the target vocabulary at each update<ref></div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In order to avoid the growing complexity of computing the normalization constant, the authors proposed to use only a small subset <math>v\prime</math> of the target vocabulary at each update<ref></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Bengio and Sen´ et al, [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4443871.pdf Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model ],IEEEXplor, 2008</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Bengio and Sen´ et al, [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4443871.pdf Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model ],IEEEXplor, 2008</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l54">Line 54:</td>
<td colspan="2" class="diff-lineno">Line 54:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>==Setting==</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>==Setting==</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>As a baseline for English→French translation, the authors used the RNNsearch model proposed by (Bahdanau et al., 2014), with 30,000 source and target words <del style="font-weight: bold; text-decoration: none;"> </del>and <del style="font-weight: bold; text-decoration: none;">also </del>another RNNsearch was trained for English→German translation with 50,000 source and target words. Using the proposed approach another set of RNNsearch models with much larger vocabularies of 500,000 source and target words was trained for each language pair. Different shortlist sizes used during training: 15,000 and 30,000 for English→French, and 15,000 and 50,000 for English→German. The best performance on the development set were evaluated and reported every twelve hours. For both language pairs, new models were trained with shortlist size of 15, 000 and 50, 000 by reshuffling the <del style="font-weight: bold; text-decoration: none;">dataset </del>at the beginning of each epoch. While this causes a non-negligible amount of overhead, such a change allows words to be contrasted with different sets of <del style="font-weight: bold; text-decoration: none;">other </del>words each epoch. The beam search was used to generate a translation given a source. The authors keep a set of 12 hypotheses and normalize probabilities by the length of the candidate sentences which was chosen to maximize the performance on the development set, for K ∈ {15k, 30k, 50k} and K0 ∈ {10, 20}. They <del style="font-weight: bold; text-decoration: none;">test using </del>a bilingual dictionary to accelerate decoding and to replace unknown words in translations.</div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>As a baseline for English→French translation, the authors used the RNNsearch model proposed by (Bahdanau et al., 2014), with 30,000 source and target words and another RNNsearch was trained for English→German translation with 50,000 source and target words. Using the proposed approach another set of RNNsearch models with much larger vocabularies of 500,000 source and target words was trained for each language pair. Different shortlist sizes used during training: 15,000 and 30,000 for English→French, and 15,000 and 50,000 for English→German. The best performance on the development set were evaluated and reported every twelve hours. For both language pairs, new models were trained with shortlist size of 15, 000 and 50, 000 by reshuffling the <ins style="font-weight: bold; text-decoration: none;">data set </ins>at the beginning of each epoch. While this causes a non-negligible amount of overhead, such a change allows words to be contrasted with different sets of words <ins style="font-weight: bold; text-decoration: none;">for </ins>each epoch. The beam search was used to generate a translation given a source. The authors keep a set of 12 hypotheses and normalize probabilities by the length of the candidate sentences which was chosen to maximize the performance on the development set, for K ∈ {15k, 30k, 50k} and K0 ∈ {10, 20}. They <ins style="font-weight: bold; text-decoration: none;">used </ins>a bilingual dictionary to accelerate decoding and to replace unknown words in translations.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>==Results==</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>==Results==</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l136">Line 136:</td>
<td colspan="2" class="diff-lineno">Line 136:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>|}</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>|}</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>It is clear that the RNNsearch-LV outperforms the baseline RNNsearch. In the case of the English→French task, RNNsearch-LV approached the performance level of the previous best single neural machine translation (NMT) model, even without any <del style="font-weight: bold; text-decoration: none;">translationspecific </del>techniques. With these, however, the RNNsearch-LV outperformed it. The performance of the RNNsearch-LV is also better than that of a standard phrase-based translation system. </div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>It is clear that the RNNsearch-LV outperforms the baseline RNNsearch. In the case of the English→French task, RNNsearch-LV approached the performance level of the previous best single neural machine translation (NMT) model, even without any <ins style="font-weight: bold; text-decoration: none;">translation specific </ins>techniques. With these, however, the RNNsearch-LV outperformed it. The performance of the RNNsearch-LV is also better than that of a standard phrase-based translation system. </div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>For English→German, the RNNsearch-LV outperformed the baseline before unknown word replacement, but after doing so, the two systems performed similarly. A higher large vocabulary single-model performance is achieved by reshuffling the <del style="font-weight: bold; text-decoration: none;">dataset</del>. In this case, we were able to surpass the previously reported best translation result on this task by building an ensemble of 8 models. With τ = 15, 000, the RNNsearch-LV performance worsened a little, with best BLEU scores, without reshuffling, of 33.76 and 18.59 respectively for English→French and English→German.</div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>For English→German, the RNNsearch-LV outperformed the baseline before unknown word replacement, but after doing so, the two systems performed similarly. A higher large vocabulary single-model performance is achieved by reshuffling the <ins style="font-weight: bold; text-decoration: none;">data set</ins>. In this case, we were able to surpass the previously reported best translation result on this task by building an ensemble of 8 models. With τ = 15, 000, the RNNsearch-LV performance worsened a little, with best BLEU scores, without reshuffling, of 33.76 and 18.59 respectively for English→French and English→German.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The timing information of decoding for different models were presented in Table below. While decoding from RNNsearch-LV with the full target vocabulary is slowest, the speed substantially improves if a candidate list for decoding each translation is used. </div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The timing information of decoding for different models were presented in Table below. While decoding from RNNsearch-LV with the full target vocabulary is slowest, the speed substantially improves if a candidate list for decoding each translation is used. </div></td></tr>
</table>Rtwanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_using_very_large_target_vocabulary_for_neural_machine_translation&diff=27342&oldid=prevTrttse: /* Methods */2015-12-16T21:43:42Z<p><span dir="auto"><span class="autocomment">Methods</span></span></p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="us">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 17:43, 16 December 2015</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l31">Line 31:</td>
<td colspan="2" class="diff-lineno">Line 31:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The dot product between the feature <math>\phi(y_{t-1}, z_t, c_t)</math> and <math>w_t</math> is required to be done for all words in target vocabulary that is computationally complex and time consuming. </div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The dot product between the feature <math>\phi(y_{t-1}, z_t, c_t)</math> and <math>w_t</math> is required to be done for all words in target vocabulary that is computationally complex and time consuming. <ins style="font-weight: bold; text-decoration: none;">Furthermore, the memory requirements grow linearly with respect to the number of target word. This has been a major hurdle for neural machine translations. Recent approaches use a shortlist of 30,000 to 80,000 most frequent words. This makes training for feasible but also has problems of its own. For example, the model degrades heavily if the translation of the source sentence requires many words that are not included in the shortlist. </ins>The approach of this paper uses only a subset of sampled target words as a align vector to maximize Eq (6), instead of all the likely target words. The most naïve way to select a subset of target words is selection of K most frequent words. However, This skipping words from training processes is in contrast with using a large vocabulary, because practically we removed a bunch of words from target dictionary. Jean et al., proposed using an existing word alignment model to align the source and target words in the training corpus and build a dictionary. With the dictionary, for each source sentence, we construct a target word set consisting of the K-most frequent words (according to the estimated unigram probability) and, using the dictionary, at most <math>k\prime</math> likely target words for each source word. K and <math>k\prime</math> may be chosen either to meet the computational requirement or to maximize the translation performance on the development set. </div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The approach of this paper uses only a subset of sampled target words as a align vector to maximize Eq (6), instead of all the likely target words. The most naïve way to select a subset of target words is selection of K most frequent words. However, This skipping words from training processes is in contrast with using a large vocabulary, because practically we removed a bunch of words from target dictionary. Jean et al., proposed using an existing word alignment model to align the source and target words in the training corpus and build a dictionary. With the dictionary, for each source sentence, we construct a target word set consisting of the K-most frequent words (according to the estimated unigram probability) and, using the dictionary, at most <math>k\prime</math> likely target words for each source word. K and <math>k\prime</math> may be chosen either to meet the computational requirement or to maximize the translation performance on the development set. </div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In order to avoid the growing complexity of computing the normalization constant, the authors proposed to use only a small subset <math>v\prime</math> of the target vocabulary at each update<ref></div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In order to avoid the growing complexity of computing the normalization constant, the authors proposed to use only a small subset <math>v\prime</math> of the target vocabulary at each update<ref></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Bengio and Sen´ et al, [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4443871.pdf Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model ],IEEEXplor, 2008</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Bengio and Sen´ et al, [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4443871.pdf Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model ],IEEEXplor, 2008</div></td></tr>
</table>Trttsehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_using_very_large_target_vocabulary_for_neural_machine_translation&diff=27335&oldid=prevArashwan: /* Overview */2015-12-16T19:08:37Z<p><span dir="auto"><span class="autocomment">Overview</span></span></p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="us">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 15:08, 16 December 2015</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l5">Line 5:</td>
<td colspan="2" class="diff-lineno">Line 5:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The paper presents the application of importance sampling for neural machine translation with a very large target vocabulary. Despite the advantages of neural networks in translation over the statistical machine translation systems, such as the phrase-based system, they suffer from some technical problems. Most importantly, they are limited to work with a small vocabulary because of complexity and the number of parameters that have to be trained. To explain, the output layer of an RNN used for machine translation will have as many units as there are items in the vocabulary. If the vocabulary has hundreds of thousand of terms, then the RNN must compute a very expensive softmax on the output units at each time step when predicting an output sequence. Moreover, the number of parameters in the RNN will also grow very large in such cases, given that number of weights between the hidden layer and output layer will be equal to the product of the number of units in each layer. For a non-trivially sized hidden layer, a large vocabulary could result in tens of millions of model parameters just associated with the hidden-to-output mapping performed by the model. In practice, researchers who apply RNNs to machine translation have avoided this problem by restricting the model vocabulary to only include some shortlist of words in the target language. Words not in this shortlist are treated as unknown by the model and assigned a special 'UNK' token. This technique understandably impairs translation performance when the target sentence includes a large number of words not present in the vocabulary. </div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The paper presents the application of importance sampling for neural machine translation with a very large target vocabulary. Despite the advantages of neural networks in translation over the statistical machine translation systems, such as the phrase-based system, they suffer from some technical problems. Most importantly, they are limited to work with a small vocabulary because of complexity and the number of parameters that have to be trained. To explain, the output layer of an RNN used for machine translation will have as many units as there are items in the vocabulary. If the vocabulary has hundreds of thousand of terms, then the RNN must compute a very expensive softmax on the output units at each time step when predicting an output sequence. Moreover, the number of parameters in the RNN will also grow very large in such cases, given that number of weights between the hidden layer and output layer will be equal to the product of the number of units in each layer. For a non-trivially sized hidden layer, a large vocabulary could result in tens of millions of model parameters just associated with the hidden-to-output mapping performed by the model. In practice, researchers who apply RNNs to machine translation have avoided this problem by restricting the model vocabulary to only include some shortlist of words in the target language. Words not in this shortlist are treated as unknown by the model and assigned a special 'UNK' token. This technique understandably impairs translation performance when the target sentence includes a large number of words not present in the vocabulary. </div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>In this paper Jean and his colleagues aim to solve this problem by proposing a training method based on importance sampling which uses a large target vocabulary without increasing training complexity. The proposed algorithm demonstrates better performance without losing efficiency in time or speed.</div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>In this paper Jean and his colleagues aim to solve this problem by proposing a training method based on importance sampling which uses a large target vocabulary without increasing training complexity. The proposed algorithm demonstrates better performance without losing efficiency in time or speed<ins style="font-weight: bold; text-decoration: none;">. The algorithm is tested on two machine translation tasks (English <math>\rightarrow</math> German, and English <math>\rightarrow</math> French), and it achieved the best performance achieved by any previous single neural machine translation (NMT) system on the English <math>\rightarrow</math> French translation task</ins>.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>==Methods==</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>==Methods==</div></td></tr>
</table>Arashwanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_using_very_large_target_vocabulary_for_neural_machine_translation&diff=27329&oldid=prevMhpanju: /* Methods */2015-12-15T21:52:06Z<p><span dir="auto"><span class="autocomment">Methods</span></span></p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="us">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 17:52, 15 December 2015</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l9">Line 9:</td>
<td colspan="2" class="diff-lineno">Line 9:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>==Methods==</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>==Methods==</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Recall that the classic neural machine learning plays as encoder-decoder network. The encoder reads the source sentence x and encode it into a sequence of hidden states of h where <math>h_t=f(x_t,h_{t-1})</math>. In the decoder step, another neural network generates the translation vector of y based on the encoded sequence of hidden states h: <math>p(y_t\,|\,y_{<t},x)\propto exp\{q(y_{t-1}, z_t, c_t)\}</math> where <math>z_t=g(y_{t-1}, z_{t-1}, c_t)</math> and <math>c_t=r(z_{t-1}, h_1,..., H_T)</math></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Recall that the classic neural machine learning plays as encoder-decoder network. The encoder reads the source sentence x and encode it into a sequence of hidden states of h where <math>h_t=f(x_t,h_{t-1})</math>. In the decoder step, another neural network generates the translation vector of y based on the encoded sequence of hidden states h: <math>p(y_t\,|\,y_{<t},x)\propto <ins style="font-weight: bold; text-decoration: none;">\</ins>exp\{q(y_{t-1}, z_t, c_t)\}</math> where <math><ins style="font-weight: bold; text-decoration: none;">\, </ins>z_t=g(y_{t-1}, z_{t-1}, c_t)</math> and <math><ins style="font-weight: bold; text-decoration: none;">\, </ins>c_t=r(z_{t-1}, h_1,..., H_T)</math></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The objective function which have to be maximized represented by </div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The objective function which have to be maximized represented by </div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><math>\theta=<del style="font-weight: bold; text-decoration: none;">argmax</del>\sum_{n=1}^{N}\sum_{t=1}^{T_n}<del style="font-weight: bold; text-decoration: none;">logp</del>(y_t^n\,|\,y_{<t}^n, x^n)</math></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><math>\theta=<ins style="font-weight: bold; text-decoration: none;">\arg\max</ins>\sum_{n=1}^{N}\sum_{t=1}^{T_n}<ins style="font-weight: bold; text-decoration: none;">\log p</ins>(y_t^n\,|\,y_{<t}^n, x^n)</math></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div> </div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div> </div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>where <math>(x^n, y^n)</math> is the n-th training pair of sentence, and <math>T_n</math> is the length of n-th target sentence <math>y^n</math>.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>where <math>(x^n, y^n)</math> is the n-th training pair of sentence, and <math>T_n</math> is the length of n-th target sentence <math>y^n</math>.</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l21">Line 21:</td>
<td colspan="2" class="diff-lineno">Line 21:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>vector <math>c_t</math> as a convex sum of the hidden states <math>(h_1,...,h_T)</math> with the coefficients <math>(\alpha_1,...,\alpha_T)</math> computed by</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>vector <math>c_t</math> as a convex sum of the hidden states <math>(h_1,...,h_T)</math> with the coefficients <math>(\alpha_1,...,\alpha_T)</math> computed by</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><math>\alpha_t=\frac{exp\{a(h_t, z_t)\}}{\sum_{k}exp\{a(h_t, z_t)\}}</math></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><math>\alpha_t=\frac{<ins style="font-weight: bold; text-decoration: none;">\</ins>exp\{a(h_t, z_t)\}}{\sum_{k}<ins style="font-weight: bold; text-decoration: none;">\</ins>exp\{a(h_t, z_t)\}}</math></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>where a is a feedforward neural network with a single hidden layer. </div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>where a is a feedforward neural network with a single hidden layer. </div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Then the probability of the next target word is </div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Then the probability of the next target word is </div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><math>p(y_t\ y_{<t}, x)=\frac{1}{Z} exp\{W_t^T\phi(y_{t-1}, z_t, c_t)+b_t\}</math>. In that <math>\phi</math> is an affine transformation followed by a nonlinear activation, <math>w_t</math> and <math>b_t</math> are the target word vector and the target word bias, respectively. Z is the normalization constant computed by</div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><math>p(y_t\ y_{<t}, x)=\frac{1}{Z} <ins style="font-weight: bold; text-decoration: none;">\</ins>exp\{W_t^T\phi(y_{t-1}, z_t, c_t)+b_t\}</math>. In that <math>\phi</math> is an affine transformation followed by a nonlinear activation, <math>w_t</math> and <math>b_t</math> are the target word vector and the target word bias, respectively. Z is the normalization constant computed by</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><math> Z=\sum_{k:y_k\in V}exp\<del style="font-weight: bold; text-decoration: none;">{</del>W_t^T\phi(y_{t-1}, z_t, c_t)+b_t</math> where V is set of all the target words. </div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><math> Z=\sum_{k:y_k\in V}<ins style="font-weight: bold; text-decoration: none;">\</ins>exp\<ins style="font-weight: bold; text-decoration: none;">left(</ins>W_t^T\phi(y_{t-1}, z_t, c_t)+b_t<ins style="font-weight: bold; text-decoration: none;">\right)</ins></math> where V is set of all the target words. </div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l39">Line 39:</td>
<td colspan="2" class="diff-lineno">Line 39:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><math>\bigtriangledown=<del style="font-weight: bold; text-decoration: none;">logp</del>(y_t|Y_{<t}, x_t)=\bigtriangledown \mathbf\varepsilon(y_t)-\sum_{k:y_k\in V} p(y_k|y_{<t}, x) \bigtriangledown \mathbf\varepsilon(y_t) </math></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><math>\bigtriangledown=<ins style="font-weight: bold; text-decoration: none;">\log p</ins>(y_t|Y_{<t}, x_t)=\bigtriangledown \mathbf\varepsilon(y_t)-\sum_{k:y_k\in V} p(y_k|y_{<t}, x) \bigtriangledown \mathbf\varepsilon(y_t) </math></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>where the energy <math>\mathbf\varepsilon</math> is defined as <math>\mathbf\varepsilon(y_i)=W_j^T\phi(y_{j-1}, Z_j, C_j)+b_j</math>. The second term of gradiant is in essence the expected gradiant of the energy as <math>\mathbb E_P[\bigtriangledown \epsilon(y)]</math> where P denotes <math>p(y|y_{<t}, x)</math>. </div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>where the energy <math>\mathbf\varepsilon</math> is defined as <math>\mathbf\varepsilon(y_i)=W_j^T\phi(y_{j-1}, Z_j, C_j)+b_j</math>. The second term of gradiant is in essence the expected gradiant of the energy as <math>\mathbb E_P[\bigtriangledown \epsilon(y)]</math> where P denotes <math>p(y|y_{<t}, x)</math>. </div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The idea of the proposed approach is to approximate this expectation of the gradient by importance sampling with a small number of samples. Given a predefined proposal distribution Q and a set <math>v\prime</math> of samples from Q, we approximate the expectation with </div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The idea of the proposed approach is to approximate this expectation of the gradient by importance sampling with a small number of samples. Given a predefined proposal distribution Q and a set <math>v\prime</math> of samples from Q, we approximate the expectation with </div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><math>\mathbb E_P[\bigtriangledown \epsilon(y)]</math> where P denotes <math>p(y|y_{<t}, x)\approx \sum_{k:y_k\in V\prime} \frac{w_k}{\sum_{k\prime:y_k\prime\in V\prime}w_k\prime}\epsilon(y_k)</math> where <math>w_k=exp{\epsilon(y_k)-log Q(y_k)}</math></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><math>\mathbb E_P[\bigtriangledown \epsilon(y)]</math> where P denotes <math>p(y|y_{<t}, x)\approx \sum_{k:y_k\in V\prime} \frac{w_k}{\sum_{k\prime:y_k\prime\in V\prime}w_k\prime}\epsilon(y_k)</math> where <math><ins style="font-weight: bold; text-decoration: none;">\,</ins>w_k=exp{\epsilon(y_k)-log Q(y_k)}</math></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In practice, the training corpus is partitioned and a subset <math>v\prime</math> of the target vocabulary is defined for each partition prior to training. Before training begins, each target sentence in the training corpus is sequentially examined and accumulate unique target words until the number of unique target</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In practice, the training corpus is partitioned and a subset <math>v\prime</math> of the target vocabulary is defined for each partition prior to training. Before training begins, each target sentence in the training corpus is sequentially examined and accumulate unique target words until the number of unique target</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>words reaches the predefined threshold τ . The accumulated vocabulary will be used for this partition of the corpus during training. This processes is repeated until the end of the training set is reached. </div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>words reaches the predefined threshold τ . The accumulated vocabulary will be used for this partition of the corpus during training. This processes is repeated until the end of the training set is reached. </div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">0</del></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div> </div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In this approach the alignments between the target words and source locations via the alignment model is obtained. This is useful when the model generated an Un token. Once a translation is generated given a source sentence, each Un may be replaced using a translation-specific technique based on the aligned source word. The authors in the experiment, replaced each ''Un'' token with the aligned source word or its most likely translation determined by another word alignment model.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In this approach the alignments between the target words and source locations via the alignment model is obtained. This is useful when the model generated an Un token. Once a translation is generated given a source sentence, each Un may be replaced using a translation-specific technique based on the aligned source word. The authors in the experiment, replaced each ''Un'' token with the aligned source word or its most likely translation determined by another word alignment model.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The proposed approach was evaluated in English->French and English-German translation. The neural machine translation model was trained by the bilingual, parallel corpora made available as part of WMT’14. The data sets were used for English to French were European v7, Common Crawl, UN, News Commentary, Gigaword. The data sets for English-German were Europarl v7, Common Crawl, News Commentary. </div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The proposed approach was evaluated in English->French and English-German translation. The neural machine translation model was trained by the bilingual, parallel corpora made available as part of WMT’14. The data sets were used for English to French were European v7, Common Crawl, UN, News Commentary, Gigaword. The data sets for English-German were Europarl v7, Common Crawl, News Commentary. </div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The models were evaluated on the WMT’14 test set (news-test 2014)3 , while the concatenation of news-test-2012 and news-test-2013 is used for model selection (development set). Table 1 presents data coverage w.r.t. the vocabulary size, on the target side.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The models were evaluated on the WMT’14 test set (news-test 2014)3 , while the concatenation of news-test-2012 and news-test-2013 is used for model selection (development set). Table 1 presents data coverage w.r.t. the vocabulary size, on the target side.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
</table>Mhpanjuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_using_very_large_target_vocabulary_for_neural_machine_translation&diff=27052&oldid=prevPblouw at 07:19, 3 December 20152015-12-03T07:19:53Z<p></p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="us">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 03:19, 3 December 2015</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l3">Line 3:</td>
<td colspan="2" class="diff-lineno">Line 3:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>This is a summary of the paper by S. Jean, K. Cho, R Memisevic, and Y. Bengio entitled "On Using Very Large Target Vocabulary for Neural Machine Translation"</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>This is a summary of the paper by S. Jean, K. Cho, R Memisevic, and Y. Bengio entitled "On Using Very Large Target Vocabulary for Neural Machine Translation"</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div><ref>S. Jean, K. Cho, R Memisevic, and Y. Bengio. [http://arxiv.org/pdf/1412.2007v2.pdf "On Using Very Large Target Vocabulary for Neural Machine Translation"], 2015.</ref></div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div><ref>S. Jean, K. Cho, R Memisevic, and Y. Bengio. [http://arxiv.org/pdf/1412.2007v2.pdf "On Using Very Large Target Vocabulary for Neural Machine Translation"], 2015.</ref></div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The paper presents the application of importance sampling for neural machine translation with very large target vocabulary. Despite the advantages of neural networks in translation over the statistical machine translation systems, such as the phrase-based system, they suffer from some technical problems. Most importantly they are limited to work with a small vocabulary <del style="font-weight: bold; text-decoration: none;">data set </del>because of complexity and number of parameters have to be trained. <del style="font-weight: bold; text-decoration: none;">The performance </del>of the <del style="font-weight: bold; text-decoration: none;">current neural nets are rapidly decrease if </del>the number of <del style="font-weight: bold; text-decoration: none;">unidentified </del>words in this target vocabulary <del style="font-weight: bold; text-decoration: none;">increase</del>. In this paper Jean and his colleagues <del style="font-weight: bold; text-decoration: none;">proposed </del>a method <del style="font-weight: bold; text-decoration: none;">of training </del>based on <del style="font-weight: bold; text-decoration: none;">the </del>importance sampling which <del style="font-weight: bold; text-decoration: none;">can </del>uses a large target vocabulary without increasing training complexity. The proposed algorithm <del style="font-weight: bold; text-decoration: none;">demonstrate </del>better performance without losing <del style="font-weight: bold; text-decoration: none;">the </del>efficiency in time or speed.</div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The paper presents the application of importance sampling for neural machine translation with <ins style="font-weight: bold; text-decoration: none;">a </ins>very large target vocabulary. Despite the advantages of neural networks in translation over the statistical machine translation systems, such as the phrase-based system, they suffer from some technical problems. Most importantly<ins style="font-weight: bold; text-decoration: none;">, </ins>they are limited to work with a small vocabulary because of complexity and <ins style="font-weight: bold; text-decoration: none;">the </ins>number of parameters <ins style="font-weight: bold; text-decoration: none;">that </ins>have to be trained. <ins style="font-weight: bold; text-decoration: none;">To explain, the output layer of an RNN used for machine translation will have as many units as there are items in the vocabulary. If the vocabulary has hundreds of thousand of terms, then the RNN must compute a very expensive softmax on the output units at each time step when predicting an output sequence. Moreover, the number of parameters in the RNN will also grow very large in such cases, given that number </ins>of <ins style="font-weight: bold; text-decoration: none;">weights between </ins>the <ins style="font-weight: bold; text-decoration: none;">hidden layer and output layer will be equal to the product of </ins>the number of <ins style="font-weight: bold; text-decoration: none;">units in each layer. For a non-trivially sized hidden layer, a large vocabulary could result in tens of millions of model parameters just associated with the hidden-to-output mapping performed by the model. In practice, researchers who apply RNNs to machine translation have avoided this problem by restricting the model vocabulary to only include some shortlist of </ins>words <ins style="font-weight: bold; text-decoration: none;">in the target language. Words not </ins>in this <ins style="font-weight: bold; text-decoration: none;">shortlist are treated as unknown by the model and assigned a special 'UNK' token. This technique understandably impairs translation performance when the </ins>target <ins style="font-weight: bold; text-decoration: none;">sentence includes a large number of words not present in the </ins>vocabulary. </div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div> </div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>In this paper Jean and his colleagues <ins style="font-weight: bold; text-decoration: none;">aim to solve this problem by proposing </ins>a <ins style="font-weight: bold; text-decoration: none;">training </ins>method based on importance sampling which uses a large target vocabulary without increasing training complexity. The proposed algorithm <ins style="font-weight: bold; text-decoration: none;">demonstrates </ins>better performance without losing efficiency in time or speed.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>==Methods==</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>==Methods==</div></td></tr>
</table>Pblouwhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_using_very_large_target_vocabulary_for_neural_machine_translation&diff=26710&oldid=prevMgohari2 at 15:33, 20 November 20152015-11-20T15:33:16Z<p></p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="us">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 11:33, 20 November 2015</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l16">Line 16:</td>
<td colspan="2" class="diff-lineno">Line 16:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Bahdanau et al.,[http://arxiv.org/pdf/1409.0473v6.pdf NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE], 2014</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Bahdanau et al.,[http://arxiv.org/pdf/1409.0473v6.pdf NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE], 2014</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div></ref>.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div></ref>.</div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>In that the encoder is implemented by a bi-directional recurrent neural network,<math>h_t=[<del style="font-weight: bold; text-decoration: none;">h_i</del>^\leftarrow; h_t^\rightarrow]</math>. The decoder, at each time, computes the context</div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>In that the encoder is implemented by a bi-directional recurrent neural network,<math>h_t=[<ins style="font-weight: bold; text-decoration: none;">h_t</ins>^\leftarrow; h_t^\rightarrow]</math>. The decoder, at each time, computes the context</div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>vector <del style="font-weight: bold; text-decoration: none;">ct </del>as a convex sum of the hidden states (<del style="font-weight: bold; text-decoration: none;">h1</del>, . . . , <del style="font-weight: bold; text-decoration: none;">hT </del>) with the coefficients (<del style="font-weight: bold; text-decoration: none;">α1</del>, . . . , <del style="font-weight: bold; text-decoration: none;">αT</del>) computed by</div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>vector <ins style="font-weight: bold; text-decoration: none;"><math>c_t</math> </ins>as a convex sum of the hidden states <ins style="font-weight: bold; text-decoration: none;"><math></ins>(<ins style="font-weight: bold; text-decoration: none;">h_1</ins>,...,<ins style="font-weight: bold; text-decoration: none;">h_T</ins>)<ins style="font-weight: bold; text-decoration: none;"></math> </ins>with the coefficients <ins style="font-weight: bold; text-decoration: none;"><math></ins>(<ins style="font-weight: bold; text-decoration: none;">\alpha_1</ins>,...,<ins style="font-weight: bold; text-decoration: none;">\alpha_T</ins>)<ins style="font-weight: bold; text-decoration: none;"></math> </ins>computed by</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div><math>\alpha_t=\frac{exp\{a(h_t, z_t)\}}{\sum_{k}exp\{a(h_t, z_t)\}}</math></div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div><math>\alpha_t=\frac{exp\{a(h_t, z_t)\}}{\sum_{k}exp\{a(h_t, z_t)\}}</math></div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l159">Line 159:</td>
<td colspan="2" class="diff-lineno">Line 159:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The influence of the target vocabulary when translating the test sentences by using the union of a fixed set of 30, 000 common words and (at most) K0 likely candidates for each source word was evaluated for English→French with size of 30, 000. The performance of the system is comparable to the baseline when Uns not replaced, but there is not as much improvement when doing so.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The influence of the target vocabulary when translating the test sentences by using the union of a fixed set of 30, 000 common words and (at most) K0 likely candidates for each source word was evaluated for English→French with size of 30, 000. The performance of the system is comparable to the baseline when Uns not replaced, but there is not as much improvement when doing so.</div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The authors found that K is inversely correlated with t. <del style="font-weight: bold; text-decoration: none;">The single-model test BLEU scores for English→French with respect to the number of dictionary entries <math>k\prime</math> allowed for each source word is presented in figure below</del></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The authors found that K is inversely correlated with t. </div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
</table>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_using_very_large_target_vocabulary_for_neural_machine_translation&diff=26709&oldid=prevMgohari2 at 15:23, 20 November 20152015-11-20T15:23:55Z<p></p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="us">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 11:23, 20 November 2015</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l137">Line 137:</td>
<td colspan="2" class="diff-lineno">Line 137:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>For English→German, the RNNsearch-LV outperformed the baseline before unknown word replacement, but after doing so, the two systems performed similarly. A higher large vocabulary single-model performance is achieved by reshuffling the dataset. In this case, we were able to surpass the previously reported best translation result on this task by building an ensemble of 8 models. With τ = 15, 000, the RNNsearch-LV performance worsened a little, with best BLEU scores, without reshuffling, of 33.76 and 18.59 respectively for English→French and English→German.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>For English→German, the RNNsearch-LV outperformed the baseline before unknown word replacement, but after doing so, the two systems performed similarly. A higher large vocabulary single-model performance is achieved by reshuffling the dataset. In this case, we were able to surpass the previously reported best translation result on this task by building an ensemble of 8 models. With τ = 15, 000, the RNNsearch-LV performance worsened a little, with best BLEU scores, without reshuffling, of 33.76 and 18.59 respectively for English→French and English→German.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The timing information of decoding for different models were presented in Table <del style="font-weight: bold; text-decoration: none;">3</del>. While decoding from RNNsearch-LV with the full target vocabulary is slowest, the speed substantially improves if a candidate list for decoding each translation is used. </div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The timing information of decoding for different models were presented in Table <ins style="font-weight: bold; text-decoration: none;">below</ins>. While decoding from RNNsearch-LV with the full target vocabulary is slowest, the speed substantially improves if a candidate list for decoding each translation is used. </div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">{| class="wikitable"</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">|-</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">! Method </ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">! CPU i7-4820k</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">! GPU GTX TITAN black</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">|-</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">| RNNsearch</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">| 0.09 s</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">| 0.02 s</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">|-</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">| RNNsearch-LV </ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">| 0.80 s</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">| 0.25 s</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">|-</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">| RNNsearch-LV</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">+Candidate list</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">| 0.12 s</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">| 0.0.05 s</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">|}</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div> </div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The influence of the target vocabulary when translating the test sentences by using the union of a fixed set of 30, 000 common words and (at most) K0 likely candidates for each source word was evaluated for English→French with size of 30, 000. The performance of the system is comparable to the baseline when Uns not replaced, but there is not as much improvement when doing so.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The influence of the target vocabulary when translating the test sentences by using the union of a fixed set of 30, 000 common words and (at most) K0 likely candidates for each source word was evaluated for English→French with size of 30, 000. The performance of the system is comparable to the baseline when Uns not replaced, but there is not as much improvement when doing so.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The authors found that K is inversely correlated with t. The single-model test BLEU scores for English→French with respect to the number of dictionary entries <math>k\prime</math> allowed for each source word is presented in figure below</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The authors found that K is inversely correlated with t. The single-model test BLEU scores for English→French with respect to the number of dictionary entries <math>k\prime</math> allowed for each source word is presented in figure below</div></td></tr>
</table>Mgohari2http://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_using_very_large_target_vocabulary_for_neural_machine_translation&diff=26680&oldid=prevMhpanju at 01:25, 20 November 20152015-11-20T01:25:19Z<p></p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="us">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 21:25, 19 November 2015</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l1">Line 1:</td>
<td colspan="2" class="diff-lineno">Line 1:</td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">'''</del>Overview<del style="font-weight: bold; text-decoration: none;">'''</del></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">==</ins>Overview<ins style="font-weight: bold; text-decoration: none;">==</ins></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>This is a summary of the paper by S. Jean, K. Cho, R Memisevic, and Y. Bengio entitled "On Using Very Large Target Vocabulary for Neural Machine Translation"</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>This is a summary of the paper by S. Jean, K. Cho, R Memisevic, and Y. Bengio entitled "On Using Very Large Target Vocabulary for Neural Machine Translation"</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l5">Line 5:</td>
<td colspan="2" class="diff-lineno">Line 5:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The paper presents the application of importance sampling for neural machine translation with very large target vocabulary. Despite the advantages of neural networks in translation over the statistical machine translation systems, such as the phrase-based system, they suffer from some technical problems. Most importantly they are limited to work with a small vocabulary data set because of complexity and number of parameters have to be trained. The performance of the current neural nets are rapidly decrease if the number of unidentified words in this target vocabulary increase. In this paper Jean and his colleagues proposed a method of training based on the importance sampling which can uses a large target vocabulary without increasing training complexity. The proposed algorithm demonstrate better performance without losing the efficiency in time or speed.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The paper presents the application of importance sampling for neural machine translation with very large target vocabulary. Despite the advantages of neural networks in translation over the statistical machine translation systems, such as the phrase-based system, they suffer from some technical problems. Most importantly they are limited to work with a small vocabulary data set because of complexity and number of parameters have to be trained. The performance of the current neural nets are rapidly decrease if the number of unidentified words in this target vocabulary increase. In this paper Jean and his colleagues proposed a method of training based on the importance sampling which can uses a large target vocabulary without increasing training complexity. The proposed algorithm demonstrate better performance without losing the efficiency in time or speed.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">'''</del>Methods<del style="font-weight: bold; text-decoration: none;">'''</del></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">==</ins>Methods<ins style="font-weight: bold; text-decoration: none;">==</ins></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Recall that the classic neural machine learning plays as encoder-decoder network. The encoder reads the source sentence x and encode it into a sequence of hidden states of h where <math>h_t=f(x_t,h_{t-1})</math>. In the decoder step, another neural network generates the translation vector of y based on the encoded sequence of hidden states h: <math>p(y_t\,|\,y_{<t},x)\propto exp\{q(y_{t-1}, z_t, c_t)\}</math> where <math>z_t=g(y_{t-1}, z_{t-1}, c_t)</math> and <math>c_t=r(z_{t-1}, h_1,..., H_T)</math></div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Recall that the classic neural machine learning plays as encoder-decoder network. The encoder reads the source sentence x and encode it into a sequence of hidden states of h where <math>h_t=f(x_t,h_{t-1})</math>. In the decoder step, another neural network generates the translation vector of y based on the encoded sequence of hidden states h: <math>p(y_t\,|\,y_{<t},x)\propto exp\{q(y_{t-1}, z_t, c_t)\}</math> where <math>z_t=g(y_{t-1}, z_{t-1}, c_t)</math> and <math>c_t=r(z_{t-1}, h_1,..., H_T)</math></div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l50">Line 50:</td>
<td colspan="2" class="diff-lineno">Line 50:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The models were evaluated on the WMT’14 test set (news-test 2014)3 , while the concatenation of news-test-2012 and news-test-2013 is used for model selection (development set). Table 1 presents data coverage w.r.t. the vocabulary size, on the target side.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The models were evaluated on the WMT’14 test set (news-test 2014)3 , while the concatenation of news-test-2012 and news-test-2013 is used for model selection (development set). Table 1 presents data coverage w.r.t. the vocabulary size, on the target side.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">'''</del>Setting<del style="font-weight: bold; text-decoration: none;">'''</del></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">==</ins>Setting<ins style="font-weight: bold; text-decoration: none;">==</ins></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>As a baseline for English→French translation, the authors used the RNNsearch model proposed by (Bahdanau et al., 2014), with 30,000 source and target words and also another RNNsearch was trained for English→German translation with 50,000 source and target words. Using the proposed approach another set of RNNsearch models with much larger vocabularies of 500,000 source and target words was trained for each language pair. Different shortlist sizes used during training: 15,000 and 30,000 for English→French, and 15,000 and 50,000 for English→German. The best performance on the development set were evaluated and reported every twelve hours. For both language pairs, new models were trained with shortlist size of 15, 000 and 50, 000 by reshuffling the dataset at the beginning of each epoch. While this causes a non-negligible amount of overhead, such a change allows words to be contrasted with different sets of other words each epoch. The beam search was used to generate a translation given a source. The authors keep a set of 12 hypotheses and normalize probabilities by the length of the candidate sentences which was chosen to maximize the performance on the development set, for K ∈ {15k, 30k, 50k} and K0 ∈ {10, 20}. They test using a bilingual dictionary to accelerate decoding and to replace unknown words in translations.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>As a baseline for English→French translation, the authors used the RNNsearch model proposed by (Bahdanau et al., 2014), with 30,000 source and target words and also another RNNsearch was trained for English→German translation with 50,000 source and target words. Using the proposed approach another set of RNNsearch models with much larger vocabularies of 500,000 source and target words was trained for each language pair. Different shortlist sizes used during training: 15,000 and 30,000 for English→French, and 15,000 and 50,000 for English→German. The best performance on the development set were evaluated and reported every twelve hours. For both language pairs, new models were trained with shortlist size of 15, 000 and 50, 000 by reshuffling the dataset at the beginning of each epoch. While this causes a non-negligible amount of overhead, such a change allows words to be contrasted with different sets of other words each epoch. The beam search was used to generate a translation given a source. The authors keep a set of 12 hypotheses and normalize probabilities by the length of the candidate sentences which was chosen to maximize the performance on the development set, for K ∈ {15k, 30k, 50k} and K0 ∈ {10, 20}. They test using a bilingual dictionary to accelerate decoding and to replace unknown words in translations.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">'''</del>Results<del style="font-weight: bold; text-decoration: none;">'''</del></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">==</ins>Results<ins style="font-weight: bold; text-decoration: none;">==</ins></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The results for English-> French translation obtained by the trained models with very large target vocabularies compared with results of previous models reported in Table below.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The results for English-> French translation obtained by the trained models with very large target vocabularies compared with results of previous models reported in Table below.</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l142">Line 142:</td>
<td colspan="2" class="diff-lineno">Line 142:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">'''</del>Conclusion<del style="font-weight: bold; text-decoration: none;">'''</del></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">==</ins>Conclusion<ins style="font-weight: bold; text-decoration: none;">==</ins></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Using the importance sampling an approach was proposed to be used in machine translation with a large target vocabulary without any substantial increase in computational complexity. The BLUE values for the proposed model showed translation performance comparable to the state-of-the-art translation systems on both the English→French task and English→German task.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Using the importance sampling an approach was proposed to be used in machine translation with a large target vocabulary without any substantial increase in computational complexity. The BLUE values for the proposed model showed translation performance comparable to the state-of-the-art translation systems on both the English→French task and English→German task.</div></td></tr>
</table>Mhpanjuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_using_very_large_target_vocabulary_for_neural_machine_translation&diff=26679&oldid=prevMgohari2 at 01:07, 20 November 20152015-11-20T01:07:36Z<p></p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="us">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 21:07, 19 November 2015</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l1">Line 1:</td>
<td colspan="2" class="diff-lineno">Line 1:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>'''Overview'''</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>'''Overview'''</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>This is a summary of the paper by S. Jean, K. Cho, R Memisevic, and Y. Bengio entitled "On Using Very Large Target Vocabulary for Neural Machine Translation"<ref></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>This is a summary of the paper by S. Jean, K. Cho, R Memisevic, and Y. Bengio entitled "On Using Very Large Target Vocabulary for Neural Machine Translation"</div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>S. Jean, K. Cho, R Memisevic, and Y. Bengio<del style="font-weight: bold; text-decoration: none;">, </del>"On Using Very Large Target Vocabulary for Neural Machine Translation", </div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ref>S. Jean, K. Cho, R Memisevic, and Y. Bengio<ins style="font-weight: bold; text-decoration: none;">. [http://arxiv.org/pdf/1412.2007v2.pdf </ins>"On Using Very Large Target Vocabulary for Neural Machine Translation"<ins style="font-weight: bold; text-decoration: none;">]</ins>, <ins style="font-weight: bold; text-decoration: none;">2015.</ins></ref></div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div></ref><del style="font-weight: bold; text-decoration: none;">. </del>The paper presents the application of importance sampling for neural machine translation with very large target vocabulary. Despite the advantages of neural networks in translation over the statistical machine translation systems, such as the phrase-based system, they suffer from some technical problems. Most importantly they are limited to work with a small vocabulary data set because of complexity and number of parameters have to be trained. The performance of the current neural nets are rapidly decrease if the number of unidentified words in this target vocabulary increase. In this paper Jean and his colleagues proposed a method of training based on the importance sampling which can uses a large target vocabulary without increasing training complexity. The proposed algorithm demonstrate better performance without losing the efficiency in time or speed.</div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The paper presents the application of importance sampling for neural machine translation with very large target vocabulary. Despite the advantages of neural networks in translation over the statistical machine translation systems, such as the phrase-based system, they suffer from some technical problems. Most importantly they are limited to work with a small vocabulary data set because of complexity and number of parameters have to be trained. The performance of the current neural nets are rapidly decrease if the number of unidentified words in this target vocabulary increase. In this paper Jean and his colleagues proposed a method of training based on the importance sampling which can uses a large target vocabulary without increasing training complexity. The proposed algorithm demonstrate better performance without losing the efficiency in time or speed.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>'''Methods'''</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>'''Methods'''</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l14">Line 14:</td>
<td colspan="2" class="diff-lineno">Line 14:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>where <math>(x^n, y^n)</math> is the n-th training pair of sentence, and <math>T_n</math> is the length of n-th target sentence <math>y^n</math>.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>where <math>(x^n, y^n)</math> is the n-th training pair of sentence, and <math>T_n</math> is the length of n-th target sentence <math>y^n</math>.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The proposed model is based on specific implementation of neural machine translation that uses an attention mechanism, as recently proposed in <ref></div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The proposed model is based on specific implementation of neural machine translation that uses an attention mechanism, as recently proposed in <ref></div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Bahdanau et</div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Bahdanau et al.<ins style="font-weight: bold; text-decoration: none;">,[http://arxiv.org/pdf/1409.0473v6.pdf NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE]</ins>, 2014</div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>al., 2014</div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div></ref>.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div></ref>.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In that the encoder is implemented by a bi-directional recurrent neural network,<math>h_t=[h_i^\leftarrow; h_t^\rightarrow]</math>. The decoder, at each time, computes the context</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In that the encoder is implemented by a bi-directional recurrent neural network,<math>h_t=[h_i^\leftarrow; h_t^\rightarrow]</math>. The decoder, at each time, computes the context</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l33">Line 33:</td>
<td colspan="2" class="diff-lineno">Line 32:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The approach of this paper uses only a subset of sampled target words as a align vector to maximize Eq (6), instead of all the likely target words. The most naïve way to select a subset of target words is selection of K most frequent words. However, This skipping words from training processes is in contrast with using a large vocabulary, because practically we removed a bunch of words from target dictionary. Jean et al., proposed using an existing word alignment model to align the source and target words in the training corpus and build a dictionary. With the dictionary, for each source sentence, we construct a target word set consisting of the K-most frequent words (according to the estimated unigram probability) and, using the dictionary, at most <math>k\prime</math> likely target words for each source word. K and <math>k\prime</math> may be chosen either to meet the computational requirement or to maximize the translation performance on the development set. </div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The approach of this paper uses only a subset of sampled target words as a align vector to maximize Eq (6), instead of all the likely target words. The most naïve way to select a subset of target words is selection of K most frequent words. However, This skipping words from training processes is in contrast with using a large vocabulary, because practically we removed a bunch of words from target dictionary. Jean et al., proposed using an existing word alignment model to align the source and target words in the training corpus and build a dictionary. With the dictionary, for each source sentence, we construct a target word set consisting of the K-most frequent words (according to the estimated unigram probability) and, using the dictionary, at most <math>k\prime</math> likely target words for each source word. K and <math>k\prime</math> may be chosen either to meet the computational requirement or to maximize the translation performance on the development set. </div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In order to avoid the growing complexity of computing the normalization constant, the authors proposed to use only a small subset <math>v\prime</math> of the target vocabulary at each update<ref></div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In order to avoid the growing complexity of computing the normalization constant, the authors proposed to use only a small subset <math>v\prime</math> of the target vocabulary at each update<ref></div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Bengio and Sen´ <del style="font-weight: bold; text-decoration: none;">ecal</del>, 2008</div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Bengio and Sen´ <ins style="font-weight: bold; text-decoration: none;">et al, [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4443871.pdf Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model ],IEEEXplor</ins>, 2008</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div></ref>. </div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div></ref>. </div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Let us consider the gradient of the log probability of the output in conditional probability of <math>y_t</math>. The gradient is composed of a positive and negative part:</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Let us consider the gradient of the log probability of the output in conditional probability of <math>y_t</math>. The gradient is composed of a positive and negative part:</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l147">Line 147:</td>
<td colspan="2" class="diff-lineno">Line 146:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Using the importance sampling an approach was proposed to be used in machine translation with a large target vocabulary without any substantial increase in computational complexity. The BLUE values for the proposed model showed translation performance comparable to the state-of-the-art translation systems on both the English→French task and English→German task.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Using the importance sampling an approach was proposed to be used in machine translation with a large target vocabulary without any substantial increase in computational complexity. The BLUE values for the proposed model showed translation performance comparable to the state-of-the-art translation systems on both the English→French task and English→German task.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>On English→French and English→German translation tasks, the neural machine translation models trained using the proposed method performed as well as, or better than, those using only limited sets of target words, even when replacing unknown words.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>On English→French and English→German translation tasks, the neural machine translation models trained using the proposed method performed as well as, or better than, those using only limited sets of target words, even when replacing unknown words.</div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">== Bibliography ==</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"><references /></ins></div></td></tr>
</table>Mgohari2