Difference between revisions of "stat946w18/Synthetic and natural noise both break neural machine translation"

From statwiki
(Introduction)
(Conclusion)
 
(43 intermediate revisions by 20 users not shown)
Line 1: Line 1:
 
== Introduction ==
 
== Introduction ==
* Humans have surprisingly robust language processing systems which can easily overcome typos, e.g.
+
Humans have surprisingly robust language processing systems that easily overcome disordered words. For example, a human reader can recognize the meaning of the following sentence without much difficulty:
 
    
 
    
Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae.
+
* "Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae."
 
    
 
    
* A person's ability to read this text comes as no surprise to the Psychology literature
+
A person's ability to read this text comes as no surprise to the Psychology literature
*# Saberi & Perrott (1999) found that this robustness extends to audio as well.
+
# Saberi & Perrott (1999) found that this robustness extends to audio as well.
*# Rayner et al. (2006) found that in noisier settings reading comprehension only slowed by 11 \%.
+
# Rayner et al. (2006) found that in noisier settings reading comprehension only slowed by 11%.
*# McCusker et al. (1981) found that the common case of swapping letters could often go unnoticed by the reader.
+
# McCusker et al. (1981) found that the common case of swapping letters could often go unnoticed by the reader.
*# Mayall et al (1997) shows that we rely on word shape.
+
# Mayall et al (1997) shows that we rely on word shape.
*# Reicher, 1969; Pelli et al., (2003) found that we can switch between whole word recognition but the first and last letter positions are required to stay constant for comprehension
+
# Reicher, 1969; Pelli et al., (2003) found that we can switch between whole word recognition but the first and last letter positions are required to stay constant for comprehension
  
However, NMT(neural machine translation) systems are brittle. i.e. The Arabic word
+
However, neural machine translation (NMT) systems are brittle. For example, the Arabic word
 
[[File:Good_morning.PNG]] means a blessing for good morning, however [[File:Hunt.PNG]] means hunt or slaughter.  
 
[[File:Good_morning.PNG]] means a blessing for good morning, however [[File:Hunt.PNG]] means hunt or slaughter.  
  
 
Facebook's MT system mistakenly confused two words that only differ by one character, a situation that is challenging for a character-based NMT system.
 
Facebook's MT system mistakenly confused two words that only differ by one character, a situation that is challenging for a character-based NMT system.
  
Figure 1 shows the performance translating German to English as a function of the percent of German words modified. Here we show two types of noise: (1) Random permutation of the word and (2) Swapping a pair of adjacent letters that does not include the first or last letter of the word. The important thing to note is that even small amounts of noise lead to substantial drops in performance.
+
The figure below shows the performance translating German to English as a function of the percent of German words modified. Here two types of noise are shown: (1) In blue, random permutation of the word and (2) In green, swapping a pair of adjacent letters that does not include the first or last letter of the word. The important thing to note is that even small amounts of noise lead to substantial drops in performance.
  
[[File:BLEU_plot.PNG]]  
+
[[File:BLEU_plot.PNG|center]]  
  
BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is". BLEU  is between 0 and 1.
+
BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is". BLEU is between 0 and 1. BLEU computes scores for individual translated segments and then averages them into an accuracy score for the whole corpus.
  
 
This paper explores two simple strategies for increasing model robustness:
 
This paper explores two simple strategies for increasing model robustness:
# using structure-invariant representations ( character CNN representation)
+
# using structure-invariant word representations (an average of character embeddings)
 
# robust training on noisy data, a form of adversarial training.
 
# robust training on noisy data, a form of adversarial training.
 +
 +
The goal of the paper is two-fold:
 +
# to initiate a conversation on robust training and modeling techniques in NMT
 +
# to  promote the creation of better and more linguistically accurate artificial noise to be applied to new languages and tasks
  
 
== Adversarial examples ==
 
== Adversarial examples ==
The growing literature on adversarial examples has demonstrated how dangerous it can be to have brittle machine learning systems being used so pervasively in the real world.
+
The growing literature on adversarial examples has demonstrated how dangerous it can be to have brittle machine learning systems being used so pervasively in the real world. Small changes to the input can lead to dramatic
 +
failures of deep learning models. This leads to a potential for malicious attacks using adversarial examples. An important distinction is often drawn between white-box attacks, where adversarial examples are generated with
 +
access to the model parameters, and black-box attacks, where examples are generated without such access.
  
 
The paper devises simple methods for generating adversarial examples for NMT. They do not assume any access to the NMT models' gradients, instead relying on cognitively-informed and naturally occurring language errors to generate noise.
 
The paper devises simple methods for generating adversarial examples for NMT. They do not assume any access to the NMT models' gradients, instead relying on cognitively-informed and naturally occurring language errors to generate noise.
  
 
== MT system ==
 
== MT system ==
We experiment with three different NMT systems with access to character information at different levels.
+
The authors experiment with three different NMT systems with access to character information at different levels.
 
# Use <code>char2char</code>, the fully character-level model of (Lee et al. 2017).  This model processes a sentence as a sequence of characters. The encoder works as follows: the characters are embedded as vectors, and then the sequence of vectors is fed to a convolutional layer.  The sequence output by the convolutional layer is then shortened by max pooling in the time dimension.  The output of the max-pooling layer is then fed to a four-layer highway network (Srivasta et al. 2015), and the output of the highway network is in turn fed to a bidirectional GRU, producing a sequence of hidden units. The sequence of hidden units is then processed by the decoder, a GRU with attention, to produce probabilities over sequences of output characters.
 
# Use <code>char2char</code>, the fully character-level model of (Lee et al. 2017).  This model processes a sentence as a sequence of characters. The encoder works as follows: the characters are embedded as vectors, and then the sequence of vectors is fed to a convolutional layer.  The sequence output by the convolutional layer is then shortened by max pooling in the time dimension.  The output of the max-pooling layer is then fed to a four-layer highway network (Srivasta et al. 2015), and the output of the highway network is in turn fed to a bidirectional GRU, producing a sequence of hidden units. The sequence of hidden units is then processed by the decoder, a GRU with attention, to produce probabilities over sequences of output characters.
 
#  Use <code>Nematus</code> (Sennrich et al., 2017), a popular NMT toolkit. It is another sequence-to-sequence model with several architecture modifications, especially operating on sub-word units using byte-pair encoding. Byte-pair encoding (Sennich et al. 2015, Gage 1994) is an algorithm according to which we begin with a list of characters as our symbols, and repeatedly fuse common combinations to create new symbols.  For example, if we begin with the letters a to z as our symbol list, and we find that  "th" is the most common two-letter combination in a corpus, then we would add "th" to our symbol list in the first iteration. After we have used this algorithm to create a symbol list of the desired size, we apply a standard encoder-decoder with attention.
 
#  Use <code>Nematus</code> (Sennrich et al., 2017), a popular NMT toolkit. It is another sequence-to-sequence model with several architecture modifications, especially operating on sub-word units using byte-pair encoding. Byte-pair encoding (Sennich et al. 2015, Gage 1994) is an algorithm according to which we begin with a list of characters as our symbols, and repeatedly fuse common combinations to create new symbols.  For example, if we begin with the letters a to z as our symbol list, and we find that  "th" is the most common two-letter combination in a corpus, then we would add "th" to our symbol list in the first iteration. After we have used this algorithm to create a symbol list of the desired size, we apply a standard encoder-decoder with attention.
 
# Use an attentional sequence-to-sequence model with a word representation based on a character convolutional neural network (<code>charCNN</code>). The <code>charCNN</code> model  is similar to <code>char2char</code>, but uses a shallower highway network and, although it reads the input sentence as characters, it produces as output a probability distribution over words, not characters.
 
# Use an attentional sequence-to-sequence model with a word representation based on a character convolutional neural network (<code>charCNN</code>). The <code>charCNN</code> model  is similar to <code>char2char</code>, but uses a shallower highway network and, although it reads the input sentence as characters, it produces as output a probability distribution over words, not characters.
  
== DATA ==
+
== Data ==
=== MY DATA ===
+
=== MT Data ===
We use the TED talks parallel corpus prepared for IWSLT 2016 (Cettolo et al., 2012) for testing all of the NMT systems.
+
The authors use the TED talks parallel corpus prepared for IWSLT 2016 (Cettolo et al., 2012) for testing all of the NMT systems.
[[File:Table1x.PNG]]
+
 
 +
[[File:Table1x.PNG|center]]
 +
 
 +
=== Natural and Artificial Noise ===
 +
==== Natural Noise ====
 +
The three languages, French, German, and Czech, each have their own frequent natural errors. The corpora of edits used for these languages are:
 +
 
 +
# French : Wikipedia Correction and Paraphrase Corpus (WiCoPaCo)
 +
# German : RWSE Wikipedia Correction Dataset and The MERLIN corpus
 +
# Czech : CzeSL Grammatical Error Correction Dataset (CzeSL-GEC) which is a manually annotated dataset of essays written by both non-native learners of Czech and Czech pupils
  
=== NATURAL AND ARTIFICIAL NOISE ===
+
The authors harvested naturally occurring errors (typos, misspellings, etc.) corresponding to these three languages from available corpora of edits to build a look-up table of possible lexical replacements.
==== NATURAL NOISE ====
 
To three different languages French, German and Czech, they have their own frequent natural errors.  
 
  
The author harvest naturally occurring errors (typos, misspellings, etc.) corresponding to these three languages from available corpora of edits to build a look-up table of possible lexical replacements.
+
They insert these errors into the source side of the parallel data by replacing every word in the corpus with an error if one exists in the dataset. When there is more than one possible replacement, one is sampled uniformly; words for which there is no error are kept as is.
  
 
==== Synthetic Noise ====
 
==== Synthetic Noise ====
In addition to naturally collected sources of error, we also experiment with four types of synthetic noise: Swap, Middle Random, Fully Random, and Key Typo.
+
In addition to naturally collected sources of error, the authors also experiment with four types of synthetic noise: Swap, Middle Random, Fully Random, and Key Typo.
# <code>Swap</code>: The first and simplest source of noise is swapping two letters (do not alter the first or last letters).
+
# <code>Swap</code>: The first and simplest source of noise is swapping two letters (do not alter the first or last letters, only apply to words of length >=4).
# <code>Middle Random</code>: Randomize the order of all the letters in a word except for the first and last.
+
# <code>Middle Random</code>: Randomize the order of all the letters in a word except for the first and last (only apply to words of length >=4).
 
# <code>Fully Random</code> Completely randomized words.
 
# <code>Fully Random</code> Completely randomized words.
 
# <code>Keyboard Typo</code> Randomly replace one letter in each word with an adjacent key
 
# <code>Keyboard Typo</code> Randomly replace one letter in each word with an adjacent key
  
[[File:Table3x.PNG]]
+
[[File:Table3x.PNG|center]]
  
 
Table 3 shows BLEU scores of models trained on clean (Vanilla) texts and tested on clean and noisy
 
Table 3 shows BLEU scores of models trained on clean (Vanilla) texts and tested on clean and noisy
Line 61: Line 74:
 
for both natural noise and all kinds of synthetic noise. The more noise in the text, the worse the
 
for both natural noise and all kinds of synthetic noise. The more noise in the text, the worse the
 
translation quality, with random scrambling producing the lowest BLEU scores.
 
translation quality, with random scrambling producing the lowest BLEU scores.
 +
 +
In contrast to the poor performance of these methods in the presence of noise, humans can perform very well, as mentioned in the introduction. The table below shows translations of the scrambled meme by a native German speaker who was not familiar with the meme and by three machine translation methods. Clearly, the machine translation methods failed.
 +
 +
[[File:paper16_tab4.png|center]]
 +
 +
The authors also examined whether a simple spell checker helps. They corrected errors with Google's spell checker by simply accepting the first suggestion for each detected mistake. There was a small improvement in French and German translations, and a small drop in accuracy for the Czech translation due to more complex grammar. The authors concluded that existing spell checkers would not bring accuracy up to the level achieved on vanilla text. The results are shown in the table below.
 +
 +
 +
[[File:paper16_tab5.png|center]]
  
 
== Dealing with noise ==
 
== Dealing with noise ==
=== STRUCTURE INVARIANT REPRESENTATIONS ===
+
=== Structure Invariant Representations ===
The three NMT models are all sensitive to word structure. The <code>char2char</code> and <code>charCNN</code> models both have convolutional layers on character sequences, designed to capture character n-grams. The model in <code>Nematus</code> is based on sub-word units obtained with BPE. It thus relies on character order.
+
The three NMT models are all sensitive to word structure. The <code>char2char</code> and <code>charCNN</code> models both have convolutional layers on character sequences, designed to capture character n-grams (which are sequences of characters or words, of length n). The model in <code>Nematus</code> is based on sub-word units obtained with byte pair encoding (where common consecutive characters are replaced with a unique byte that does not occur in the data). It thus relies on character order.
  
The simplest to improve such model is to take the average character embeddings as a word representation. This model, referred to as <code>meanChar</code>, first generates a word representation by averaging character embeddings, and then proceeds with a word-level encoder similar to the <code>charCNN</code> model.
+
The simplest way to improve such a model is to take the average character embeddings as a word representation. This model, referred to as <code>meanChar</code>, first generates a word representation by averaging character embeddings, and then proceeds with a word-level encoder similar to the <code>charCNN</code> model.
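
The order invariance of the averaged representation can be illustrated with a minimal sketch. The random embeddings below are stand-ins for learned ones, and the names <code>make_embeddings</code> and <code>mean_char</code> are hypothetical, not taken from the paper's code.

```python
import math
import random

def make_embeddings(alphabet, dim, seed=0):
    """Random stand-ins for learned character embeddings (assumption)."""
    rng = random.Random(seed)
    return {c: [rng.uniform(-1, 1) for _ in range(dim)] for c in alphabet}

def mean_char(word, embeddings):
    """Word representation = element-wise average of its character embeddings."""
    vectors = [embeddings[c] for c in word]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def same_vector(u, v):
    """Compare two vectors up to floating-point rounding."""
    return all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(u, v))

emb = make_embeddings("abcdefghijklmnopqrstuvwxyz", dim=8)
clean = mean_char("letters", emb)
scrambled = mean_char("ltteers", emb)   # same letters, different order
typo = mean_char("lerters", emb)        # one letter replaced

print(same_vector(clean, scrambled))    # scrambling leaves the average unchanged
print(same_vector(clean, typo))         # a replaced letter changes it
```

Because the average ignores character order, any scrambling maps to the same vector, while a replaced character changes the vector. This is consistent with the reported behaviour of the model on scrambling versus keyboard and natural errors.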
  
[[File:Table5x.PNG]]
+
[[File:Table5x.PNG|center]]
  
<code>meanChar</code> is good with the other three scrambling errors (Swap, Middle Random and Fully Random), but bad with Keyboard error and Natural errors.
+
<code>meanChar</code> handles the three scrambling errors (Swap, Middle Random and Fully Random) well, but performs poorly on Keyboard errors and Natural errors.
  
=== BLACK-BOX ADVERSARIAL TRAINING ===
+
=== Black-Box Adversarial Training ===
  
 
<code>charCNN</code> Performance
 
<code>charCNN</code> Performance
[[File:Table6x.PNG]]
+
[[File:Table6x.PNG|center]]
 +
 
 +
Here is the result of the translation of the scrambled meme:
 +
“According to a study of Cambridge University, it doesn’t matter which technology in a word is going to get the letters in a word that is the only important thing for the first and last letter.”
  
 
== Analysis ==
 
== Analysis ==
=== LEARNING MULTIPLE KINDS OF NOISE IN <code>charCNN</code> ===
+
=== Learning Multiple Kinds of Noise in <code>charCNN</code> ===
They analyze the weights learned by <code>charCNN</code> models trained on two kinds of input: completely scrambled words (Rand) without other kinds of noise, and a mix of Rand+Key+Nat kinds of noise.
+
 
 +
As Table 6 above shows, <code>charCNN</code> models performed quite well across different noise types on the test set when they are trained on a mix of noise types, which led the authors to speculate that filters from different convolutional layers learned to be robust to different types of noises. To test this hypothesis, they analyzed the weights learned by <code>charCNN</code> models trained on two kinds of input: completely scrambled words (Rand) without other kinds of noise, and a mix of Rand+Key+Nat kinds of noise. For each model, they computed the variance across the filter dimension for each one of the 1000 filters and for each one of the 25 character embedding dimensions, which were then averaged across the filters to yield 25 variances.
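
The variance computation described above might look like the following sketch. It assumes the convolution weights are stored as filters × embedding dimensions × filter width and that "the filter dimension" means the width axis; both the layout and the function name are assumptions for illustration.

```python
import random
from statistics import mean, pvariance

def per_dimension_variance(filters):
    """filters[f][d] is the list of weights of filter f for embedding dimension d.

    Returns one value per embedding dimension: the variance across the
    filter width, averaged over all filters.
    """
    num_dims = len(filters[0])
    return [
        mean(pvariance(f[d]) for f in filters)
        for d in range(num_dims)
    ]

# Toy weights: 1000 filters, 25 embedding dimensions, width 5 (width is assumed).
rng = random.Random(0)
random_filters = [
    [[rng.gauss(0, 1) for _ in range(5)] for _ in range(25)]
    for _ in range(1000)
]
uniform_filters = [[[0.5] * 5 for _ in range(25)] for _ in range(1000)]

print(len(per_dimension_variance(random_filters)))   # 25 variances
print(max(per_dimension_variance(uniform_filters)))  # uniform weights give zero variance
```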
 +
 
 +
As Figure 2 below shows, the variances for the model trained on the mix of noise types are higher and more varied, which indicates that the filters learned different patterns and the model differentiated between different character embedding dimensions. Under the random scrambling scheme, there should be no patterns for the model to learn, so it makes sense for the filter weights to stay close to uniform, hence the consistently lower variance measures.
 +
 
 +
[[File:Table7x.PNG|center]]
 +
 
 +
=== Richness of Natural Noise ===
 +
[[File:SNNoise_NatNoiseExp.png|750px|right]]
 +
The synthetic noise used in this paper appears to be very different from natural noise. This is evident because none of the models trained only on synthetic noise performed well on natural noise. The authors therefore say that the noise models used in this paper are not representative of real noise, and that a more sophisticated model using explicit phonemic and linguistic knowledge is required if an error-free corpus is to be augmented with errors for training. The synthetic noise analysed lacks two common types of typos: inserting a character that is adjacent (on the keyboard) to a letter, and omitting letters.
  
For each model, they compute the variance across the filter dimension for each one of the 1000 filters and for each one out of 25 character embedding dimensions. The we average the variances across the 1000 filters.  
+
During a manual analysis of a small subset of the German dataset, the natural noise was found to be comprised of:
 +
* 34% Phonetic error
 +
* 32% Character omissions
 +
* 34% Other: Morphological, Key swap, etc.
  
[[File:Table7x.PNG]]
+
Examples of these types of errors can be seen in Table 8.
  
 
== Conclusion ==
 
== Conclusion ==
In this work, they have shown that character-based NMT models are extremely brittle and tend to break when presented with both natural and synthetic kinds of noise. After models comparison, they found that a character-based CNN can learn to
+
In this work, the authors have shown that character-based NMT models are extremely brittle and tend to break when presented with both natural and synthetic kinds of noise. After a comparison of the models, they found that a character-based CNN can learn to
address multiple types of errors that are seen in training.
+
address multiple types of errors that are seen in training. For future work, the authors suggest generating more realistic synthetic noise by using phonetic and syntactic structure. They believe that more work is necessary to immunize NMT models against natural noise. As corpora with natural noise are limited, another direction is to design better NMT architectures that are robust to noise without seeing it in the training data. New psychology results on how humans cope with natural noise might point to possible solutions to this problem.
 +
 
 +
== Criticism ==
 +
According to the [https://openreview.net/forum?id=BJ8vJebC- OpenReview thread], a major critique of this paper is that the solutions presented do not adequately solve the problem. The response to the <code>meanChar</code> architecture has been mostly negative, and the method of noise injection has been seen as only a simple start. The authors have acknowledged these critiques, stating that their solution is just a starting point. They argue that this paper has opened the discussion on dealing with noise in machine translation, which has been mostly left untouched. These models also do not solve the problem of natural noise, as models trained on synthetic noise do not generalize well to natural noise. A minor issue is that Table 4 does not include the machine translations of the noise-free text for comparison.
 +
 
 +
== References ==
 +
# Yonatan Belinkov and Yonatan Bisk. Synthetic and Natural Noise Both Break Neural Machine Translation. In ''International Conference on Learning Representations (ICLR)'', 2018.
 +
# Mauro Cettolo, Christian Girardi, and Marcello Federico. WIT: Web Inventory of Transcribed and Translated Talks. In ''Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT)'', pp. 261–268, Trento, Italy, May 2012.
 +
# Jason Lee, Kyunghyun Cho, and Thomas Hofmann. Fully Character-Level Neural Machine Translation without Explicit Segmentation. ''Transactions of the Association for Computational Linguistics (TACL)'', 2017.
 +
# Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Laubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. Nematus: a Toolkit for Neural Machine Translation. In ''Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics'', pp. 65–68, Valencia, Spain, April 2017. Association for Computational Linguistics. URL http://aclweb.org/anthology/E17-3017.
 +
# Aurélien Max and Guillaume Wisniewski. Mining Naturally-occurring Corrections and Paraphrases from Wikipedia's Revision History. In ''Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10)'', Valletta, Malta, May 2010. European Language Resources Association (ELRA). ISBN 2-9517408-6-7. URL https://wicopaco.limsi.fr.
 +
# Katrin Wisniewski, Karin Schöne, Lionel Nicolas, Chiara Vettori, Adriane Boyd, Detmar Meurers, Andrea Abel, and Jirka Hana. MERLIN: An online trilingual learner corpus empirically grounding the European Reference Levels in authentic learner data, October 2013. URL https://www.ukp.tu-darmstadt.de/data/spelling-correction/rwse-datasets.
 +
# Torsten Zesch. Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 529–538, Avignon, France, April 2012. Association for Computational Linguistics.
 +
# Suranjana Samanta and Sameep Mehta. Towards Crafting Text Adversarial Samples. arXiv preprint arXiv:1707.02812, 2017.
 +
# Karel Šebesta, Zuzanna Bedřichová, Kateřina Šormová, Barbora Štindlová, Milan Hrdlička, Tereza Hrdličková, Jiří Hana, Vladimír Petkevič, Tomáš Jelínek, Svatava Škodová, Petr Janeš, Kateřina Lundáková, Hana Skoumalová, Šimon Sládek, Piotr Pierscieniak, Dagmar Toufarová, Milan Straka, Alexandr Rosen, Jakub Náplava, and Marie Poláčková. CzeSL grammatical error correction dataset (CzeSL-GEC). Technical report, LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, 2017. URL https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2143.

Latest revision as of 23:15, 20 April 2018

Introduction

Humans have surprisingly robust language processing systems that easily overcome disordered words. For example, a human reader can recognize the meaning of the following sentence without much difficulty:

  • "Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae."

A person's ability to read this text comes as no surprise to the psychology literature:

  1. Saberi & Perrott (1999) found that this robustness extends to audio as well.
  2. Rayner et al. (2006) found that in noisier settings reading comprehension only slowed by 11%.
  3. McCusker et al. (1981) found that the common case of swapping letters could often go unnoticed by the reader.
  4. Mayall et al. (1997) showed that we rely on word shape.
  5. Reicher (1969) and Pelli et al. (2003) found that we can switch between whole word recognition but the first and last letter positions are required to stay constant for comprehension.

However, neural machine translation (NMT) systems are brittle. For example, the Arabic word Good morning.PNG means a blessing for good morning, whereas Hunt.PNG means hunt or slaughter.

Facebook's MT system mistakenly confused these two words, which differ by only one character, a situation that is challenging for a character-based NMT system.

The figure below shows the performance of German-to-English translation as a function of the percentage of German words modified. Two types of noise are shown: (1) in blue, random permutation of the word, and (2) in green, swapping a pair of adjacent letters that does not include the first or last letter of the word. The important thing to note is that even small amounts of noise lead to substantial drops in performance.

BLEU plot.PNG

BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is". BLEU is between 0 and 1. BLEU computes scores for individual translated segments and then averages them into an accuracy score for the whole corpus.
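
To make the metric concrete, here is a simplified single-segment sketch. It uses only unigrams and bigrams and one reference translation, whereas standard BLEU typically uses up to 4-grams and aggregates over a corpus; the function names are hypothetical and this is not the scoring script used in the paper.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precision += math.log(overlap / total) / max_n
    # Penalize candidates that are shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision)

print(sentence_bleu("the cat sat on the mat", "the cat is on the mat"))
```

A perfect match scores 1, and a candidate sharing no words with the reference scores 0.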

This paper explores two simple strategies for increasing model robustness:

  1. using structure-invariant word representations (an average of character embeddings)
  2. robust training on noisy data, a form of adversarial training.

The goal of the paper is two-fold:

  1. to initiate a conversation on robust training and modeling techniques in NMT
  2. to promote the creation of better and more linguistically accurate artificial noise to be applied to new languages and tasks

Adversarial examples

The growing literature on adversarial examples has demonstrated how dangerous it can be to have brittle machine learning systems being used so pervasively in the real world. Small changes to the input can lead to dramatic failures of deep learning models. This leads to a potential for malicious attacks using adversarial examples. An important distinction is often drawn between white-box attacks, where adversarial examples are generated with access to the model parameters, and black-box attacks, where examples are generated without such access.

The paper devises simple methods for generating adversarial examples for NMT. The authors do not assume any access to the NMT models' gradients, instead relying on cognitively-informed and naturally occurring language errors to generate noise.

MT system

The authors experiment with three different NMT systems with access to character information at different levels.

  1. char2char, the fully character-level model of Lee et al. (2017). This model processes a sentence as a sequence of characters. The encoder works as follows: the characters are embedded as vectors, and then the sequence of vectors is fed to a convolutional layer. The sequence output by the convolutional layer is then shortened by max pooling in the time dimension. The output of the max-pooling layer is then fed to a four-layer highway network (Srivastava et al. 2015), and the output of the highway network is in turn fed to a bidirectional GRU, producing a sequence of hidden units. The sequence of hidden units is then processed by the decoder, a GRU with attention, to produce probabilities over sequences of output characters.
  2. Nematus (Sennrich et al., 2017), a popular NMT toolkit. It is another sequence-to-sequence model with several architecture modifications, most notably that it operates on sub-word units obtained with byte-pair encoding (BPE). BPE (Sennrich et al. 2015, Gage 1994) is an algorithm in which we begin with a list of characters as our symbols and repeatedly fuse common combinations to create new symbols. For example, if we begin with the letters a to z as our symbol list, and we find that "th" is the most common two-letter combination in a corpus, then we would add "th" to our symbol list in the first iteration. After we have used this algorithm to create a symbol list of the desired size, we apply a standard encoder-decoder with attention.
  3. An attentional sequence-to-sequence model with a word representation based on a character convolutional neural network (charCNN). The charCNN model is similar to char2char, but uses a shallower highway network and, although it reads the input sentence as characters, it produces as output a probability distribution over words, not characters.
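
The byte-pair encoding procedure described in item 2 can be sketched on a toy corpus. This is a minimal illustration with made-up word counts and a hypothetical function name; it omits details such as end-of-word markers and is not the Nematus implementation.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Repeatedly fuse the most frequent adjacent symbol pair into a new symbol."""
    # Each word starts as a tuple of single-character symbols.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])  # fuse the pair
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab

# Toy word frequencies (made up for illustration).
toy_counts = {"the": 5, "this": 3, "that": 2, "cat": 1}
merges, vocab = learn_bpe(toy_counts, num_merges=2)
print(merges)  # the first merge is ('t', 'h'), as in the "th" example above
```

Because the learned symbols depend on characters appearing in a fixed order, scrambling the letters of a word changes how it is segmented.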

Data

MT Data

The authors use the TED talks parallel corpus prepared for IWSLT 2016 (Cettolo et al., 2012) for testing all of the NMT systems.

Table1x.PNG

Natural and Artificial Noise

Natural Noise

The three languages, French, German, and Czech, each have their own frequent natural errors. The corpora of edits used for these languages are:

  1. French : Wikipedia Correction and Paraphrase Corpus (WiCoPaCo)
  2. German : RWSE Wikipedia Correction Dataset and The MERLIN corpus
  3. Czech : CzeSL Grammatical Error Correction Dataset (CzeSL-GEC) which is a manually annotated dataset of essays written by both non-native learners of Czech and Czech pupils

The authors harvested naturally occurring errors (typos, misspellings, etc.) corresponding to these three languages from available corpora of edits to build a look-up table of possible lexical replacements.

They insert these errors into the source side of the parallel data by replacing every word in the corpus with an error if one exists in their dataset. When there is more than one possible replacement for a word, one is sampled uniformly. Words for which there is no recorded error are kept as is.
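
A minimal sketch of this replacement procedure is given below. The function name `inject_natural_noise`, the whitespace tokenization, and the English entries in `ERROR_TABLE` are assumptions made for illustration; the actual look-up tables are harvested from the French, German, and Czech edit corpora listed above.

```python
import random

def inject_natural_noise(sentence, error_table, rng=random):
    """Replace each word that has known errors with one sampled uniformly."""
    noisy = []
    for word in sentence.split():  # assumption: simple whitespace tokenization
        candidates = error_table.get(word)
        if candidates:
            noisy.append(rng.choice(candidates))  # uniform over replacements
        else:
            noisy.append(word)  # no recorded error: keep the word as is
    return " ".join(noisy)

# Hypothetical look-up table, for illustration only.
ERROR_TABLE = {
    "because": ["becuase", "becasue"],
    "their": ["thier"],
}

noisy_sentence = inject_natural_noise("because their cat", ERROR_TABLE,
                                      random.Random(0))
```

Passing a seeded `random.Random` instance makes the sampled replacements reproducible across runs.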

Synthetic Noise

In addition to naturally collected sources of error, the authors also experiment with four types of synthetic noise: Swap, Middle Random, Fully Random, and Key Typo.

  1. Swap: The first and simplest source of noise is swapping two letters (do not alter the first or last letters, only apply to words of length >=4).
  2. Middle Random: Randomize the order of all the letters in a word except for the first and last (only apply to words of length >=4).
  3. Fully Random: Completely randomized words.
  4. Keyboard Typo: Randomly replace one letter in each word with an adjacent key.
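
The four noise types listed above can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the `NEIGHBORS` map is a small, partial QWERTY adjacency table made up for the example, and the choice to swap two adjacent inner letters is an assumption.

```python
import random

# Hypothetical, partial QWERTY adjacency map, for illustration only.
NEIGHBORS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "n": "bhjm",
    "o": "iklp", "s": "awedxz", "t": "rfgy",
}

def swap(word, rng=random):
    """Swap two adjacent inner letters; words shorter than 4 are unchanged."""
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)  # positions i and i+1 are both inner
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def middle_random(word, rng=random):
    """Shuffle all letters except the first and last (length >= 4 only)."""
    if len(word) < 4:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def fully_random(word, rng=random):
    """Shuffle every letter of the word."""
    chars = list(word)
    rng.shuffle(chars)
    return "".join(chars)

def key_typo(word, rng=random):
    """Replace one letter with a keyboard neighbour, if the map covers any."""
    positions = [i for i, c in enumerate(word) if c in NEIGHBORS]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + rng.choice(NEIGHBORS[word[i]]) + word[i + 1:]

rng = random.Random(1)
examples = {f.__name__: f("translation", rng)
            for f in (swap, middle_random, fully_random, key_typo)}
```

The first three functions only reorder letters, so the multiset of characters in a word is preserved; `key_typo` is the only one that introduces a character that was not in the original word.
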
Table3x.PNG

Table 3 shows BLEU scores of models trained on clean (Vanilla) texts and tested on clean and noisy texts. All models suffer a significant drop in BLEU when evaluated on noisy texts. This is true for both natural noise and all kinds of synthetic noise. The more noise in the text, the worse the translation quality, with random scrambling producing the lowest BLEU scores.

In contrast to the poor performance of these methods in the presence of noise, humans can perform very well, as mentioned in the introduction. The table below shows translations of the scrambled meme produced by a native German speaker who was not familiar with the meme, and by three machine translation methods. Clearly, the machine translation methods failed.

paper16 tab4.png

The authors also examined whether a simple spell checker could help. They corrected errors with Google's spell checker by simply accepting the first suggestion for each detected mistake. There was a small improvement in the French and German translations, and a small drop in accuracy for the Czech translation due to its more complex grammar. The authors concluded that existing spell checkers would not bring accuracy close to that obtained on vanilla text. The results are shown in the table below.


paper16 tab5.png

Dealing with noise

Structure Invariant Representations

The three NMT models are all sensitive to word structure. The char2char and charCNN models both have convolutional layers on character sequences, designed to capture character n-grams (which are sequences of characters or words, of length n). The model in Nematus is based on sub-word units obtained with byte pair encoding (where common consecutive characters are replaced with a unique byte that does not occur in the data). It thus relies on character order.

The simplest way to make such a model insensitive to character order is to take the average of the character embeddings as the word representation. This model, referred to as meanChar, first generates a word representation by averaging character embeddings, and then proceeds with a word-level encoder similar to the charCNN model.
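
The order-invariance of averaging can be checked with a small sketch. The embedding table `CHAR_EMB` below is random and hypothetical (a trained model would learn these vectors), and `mean_char` is an illustrative name, not taken from the paper's code.

```python
import math
import random

EMB_DIM = 4  # tiny dimension, for illustration only
_rng = random.Random(0)
# Hypothetical random character embeddings; a real model would learn these.
CHAR_EMB = {c: [_rng.uniform(-1, 1) for _ in range(EMB_DIM)]
            for c in "abcdefghijklmnopqrstuvwxyz"}

def mean_char(word):
    """Average the embeddings of a word's characters (order is ignored)."""
    vectors = [CHAR_EMB[c] for c in word]
    # math.fsum is correctly rounded, so the result does not depend on order.
    return [math.fsum(dim) / len(vectors) for dim in zip(*vectors)]

same = mean_char("noise") == mean_char("nosie")       # scrambled letters
different = mean_char("noise") != mean_char("noide")  # one letter replaced
```

Any permutation of a word's letters maps to the same vector, whereas replacing a letter, as a keyboard typo does, changes the average.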

Table5x.PNG

meanChar performs well on the three scrambling noise types (Swap, Middle Random and Fully Random), since averaging ignores character order, but performs poorly on Keyboard errors and Natural errors.

Black-Box Adversarial Training

charCNN Performance

Table6x.PNG

Here is the result of the translation of the scrambled meme: “According to a study of Cambridge University, it doesn’t matter which technology in a word is going to get the letters in a word that is the only important thing for the first and last letter.”

Analysis

Learning Multiple Kinds of Noise in charCNN

As Table 6 above shows, charCNN models performed quite well across different noise types on the test set when they were trained on a mix of noise types, which led the authors to speculate that different convolutional filters learned to be robust to different types of noise. To test this hypothesis, they analyzed the weights learned by charCNN models trained on two kinds of input: completely scrambled words (Rand) without other kinds of noise, and a mix of Rand+Key+Nat kinds of noise. For each model, they computed the variance of the weights across the filter width for each one of the 1000 filters and for each one of the 25 character embedding dimensions. These variances were then averaged across the filters to yield 25 averaged variances per model.
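
The variance computation can be sketched as below, under stated assumptions: the weights are taken to be indexed as `weights[f][d][w]` (filter, embedding dimension, position within the filter), the variance is taken over the positions `w`, and the population variance is used. None of these details are confirmed by the summary above, and the toy weights are made up.

```python
from statistics import mean, pvariance

def averaged_filter_variances(weights):
    """weights[f][d][w]: filter f, embedding dimension d, width position w.

    Returns one value per embedding dimension: the variance across width
    positions, averaged over all filters.
    """
    num_dims = len(weights[0])
    return [
        mean(pvariance(filt[d]) for filt in weights)
        for d in range(num_dims)
    ]

# Toy weights: 2 filters, 2 embedding dimensions, width 3 (hypothetical).
flat = [[[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]],
        [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]]]
ordered = [[[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]],
           [[2.0, 0.0, 1.0], [0.0, 0.0, 0.0]]]

flat_variances = averaged_filter_variances(flat)
ordered_variances = averaged_filter_variances(ordered)
```

A filter whose weights are constant across positions contributes zero variance, which corresponds to treating every character position alike; a filter with position-dependent weights contributes a positive variance.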

As Figure 2 below shows, the variances for the ensemble model (trained on Rand+Key+Nat) are higher and more varied, which indicates that the filters learned different patterns and the model differentiated between different character embedding dimensions. Under the random scrambling scheme, there should be no patterns for the model to learn, so it makes sense for the filter weights to stay close to uniform weights, hence the consistently lower variance measures.

Table7x.PNG

Richness of Natural Noise

SNNoise NatNoiseExp.png

The synthetic noise used in this paper appears to be very different from natural noise. This is evident because none of the models trained only on synthetic noise demonstrated good performance on natural noise. Therefore, the authors say that the noise models used in this paper are not representative of real noise and that a more sophisticated model using explicit phonemic and linguistic knowledge is required if an error-free corpus is to be augmented with errors for training. The synthetic noise analysed is lacking two common types of typos: inserting a character that is adjacent (on the keyboard) to a letter, and omitting letters.

During a manual analysis of a small subset of the German dataset, the natural noise was found to be comprised of:

  • 34% Phonetic error
  • 32% Character omissions
  • 34% Other: Morphological, Key swap, etc.

Examples of these types of errors can be seen in Table 8.

Conclusion

In this work, the authors have shown that character-based NMT models are extremely brittle and tend to break when presented with both natural and synthetic kinds of noise. After a comparison of the models, they found that a character-based CNN can learn to address multiple types of errors that are seen in training. For future work, the authors suggest generating more realistic synthetic noise by using phonetic and syntactic structure. They believe that more work is necessary in order to immunize NMT models against natural noise. As corpora with natural noise are limited, they also suggest designing better NMT architectures that would be robust to noise without seeing it in the training data. New psychology results on how humans cope with natural noise might point to possible solutions to this problem.

Criticism

According to the OpenReview thread, a major critique of this paper is that the solutions presented do not adequately solve the problem. The response to the meanChar architecture has been mostly negative, and the method of noise injection has been seen as a simple start. However, the authors have acknowledged these critiques, stating that they realize their solution is just a starting point. They argue that this paper has opened the discussion on dealing with noise in machine translation, which has been mostly left untouched. In addition, these models still do not tackle the problem of natural noise, as models trained on synthetic noise do not generalize well to natural noise. A minor issue is that Table 4 does not include the machine translations of the noise-free text as a comparison.

References

  1. Yonatan Belinkov and Yonatan Bisk. Synthetic and Natural Noise Both Break Neural Machine Translation. In International Conference on Learning Representations (ICLR), 2018.
  2. Mauro Cettolo, Christian Girardi, and Marcello Federico. WIT: Web Inventory of Transcribed and Translated Talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pp. 261–268, Trento, Italy, May 2012.
  3. Jason Lee, Kyunghyun Cho, and Thomas Hofmann. Fully Character-Level Neural Machine Translation without Explicit Segmentation. Transactions of the Association for Computational Linguistics (TACL), 2017.
  4. Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Laubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. Nematus: a Toolkit for Neural Machine Translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pp. 65–68, Valencia, Spain, April 2017. Association for Computational Linguistics. URL http://aclweb.org/anthology/E17-3017.
  5. Aurélien Max and Guillaume Wisniewski. Mining Naturally-occurring Corrections and Paraphrases from Wikipedia's Revision History. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May 2010. European Language Resources Association (ELRA). ISBN 2-9517408-6-7. URL https://wicopaco.limsi.fr.
  6. Katrin Wisniewski, Karin Schöne, Lionel Nicolas, Chiara Vettori, Adriane Boyd, Detmar Meurers, Andrea Abel, and Jirka Hana. MERLIN: An online trilingual learner corpus empirically grounding the European Reference Levels in authentic learner data, October 2013. URL https://www.ukp.tu-darmstadt.de/data/spelling-correction/rwse-datasets.
  7. Torsten Zesch. Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 529–538, Avignon, France, April 2012. Association for Computational Linguistics.
  8. Suranjana Samanta and Sameep Mehta. Towards Crafting Text Adversarial Samples. arXiv preprint arXiv:1707.02812, 2017.
  9. Karel Šebesta, Zuzanna Bedřichová, Kateřina Šormová, Barbora Štindlová, Milan Hrdlička, Tereza Hrdličková, Jiří Hana, Vladimír Petkevič, Tomáš Jelínek, Svatava Škodová, Petr Janeš, Kateřina Lundáková, Hana Skoumalová, Šimon Sládek, Piotr Pierscieniak, Dagmar Toufarová, Milan Straka, Alexandr Rosen, Jakub Náplava, and Marie Poláčková. CzeSL grammatical error correction dataset (CzeSL-GEC). Technical report, LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, 2017. URL https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2143.