http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=X46yan&feedformat=atomstatwiki - User contributions [US]2022-01-19T12:01:38ZUser contributionsMediaWiki 1.28.3http://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT&diff=48641BERTScore: Evaluating Text Generation with BERT2020-12-01T05:08:46Z<p>X46yan: </p>
<hr />
<div>== Presented by == <br />
Gursimran Singh<br />
<br />
== Introduction == <br />
In recent times, various machine learning approaches for text generation have gained popularity. This paper aims to develop an automatic metric that will judge the quality of the generated text. Commonly used state of the art metrics either use n-gram approaches or word embeddings for calculating the similarity between the reference and the candidate sentence. BertScore, on the other hand, calculates the similarity using contextual embeddings. BertScore basically addresses two common pitfalls in n-gram-based metrics. Firstly, the n-gram models fail to robustly match paraphrases which leads to performance underestimation when semantically-correct phrases are penalized because of their difference from the surface form of the reference. On the other hand in BertScore, the similarity is computed using contextualized token embeddings, which have been shown to be effective for paraphrase detection. Secondly, n-gram models fail to capture distant dependencies and penalize semantically-critical ordering changes. In contrast, contextualized embeddings capture distant dependencies and ordering effectively. The authors of the paper have carried out various experiments in Machine Translation and Image Captioning to show why BertScore is more reliable and robust than the previous approaches.<br />
<br />
''' Word versus Context Embeddings '''<br />
<br />
Both models aim to reduce the sparseness invoked by a bag of words (BoW) representation of text due to the high dimensional vocabularies. Both methods create embeddings of a dimensionality much lower than sparse BoW and aim to capture semantics and context. Word embeddings differ in that they will be deterministic as when given a word embedding model will always produce the same embedding, regardless of the surrounding words. However, contextual embeddings will create different embeddings for a word depending on the surrounding words in the given text.<br />
<br />
== Previous Work ==<br />
Previous Approaches for evaluating text generation can be broadly divided into various categories. The commonly used techniques for text evaluation are based on n-gram matching. The main objective here is to compare the n-grams in reference and candidate sentences and thus analyze the ordering of words in the sentences. <br />
The most popular n-Gram Matching metric is BLEU. It follows the underlying principle of n-Gram matching and its uniqueness comes from three main factors. <br><br />
• Each n-Gram is matched at most once. <br><br />
• The total of exact-matches is accumulated for all reference candidate pairs and divided by the total number of <math>n</math>-grams in all candidate sentences. <br><br />
• Very short candidates are restricted. <br><br />
<br />
Further BLEU is generally calculated for multiple <math>n</math>-grams and averaged geometrically.<br />
n-Gram approaches also include METEOR, NIST, ΔBLEU, etc.<br />
<br />
Most of these methods utilize or slightly modify the exact match precision (Exact-<math>P_n</math>) and recall (Exact-<math>R_n</math>) scores. These scores can be formalized as follows:<br />
<br />
Exact- <math> P_n = \frac{\sum_{w \ in S^{n}_{ \hat{x} }} \mathbb{I}[w \in S^{n}_{x}]}{S^{n}_{\hat{x}}} </math> <br />
<br />
Exact- <math> R_n = \frac{\sum_{w \ in S^{n}_{x}} \mathbb{I}[w \in S^{n}_{\hat{x}}]}{S^{n}_{x}} </math> <br />
<br />
Here <math>S^{n}_{x}</math> and <math>S^{n}_{\hat{x}}</math> are lists of token <math>n</math>-grams in the reference <math>x</math> and candidate <math>\hat{x}</math> sentences respectively.<br />
<br />
Other categories include Edit-distance-based Metrics, Embedding-based metrics, and Learned Metrics. Most of these techniques do not capture the context of a word in the sentence. Moreover, Learned Metric approaches also require costly human judgments as supervision for each dataset.<br />
<br />
== Motivation ==<br />
The <math>n</math>-gram approaches like BLEU do not capture the positioning and the context of the word and simply rely on exact matching for evaluation. Consider the following example that shows how BLEU can result in incorrect judgment. <br><br />
Reference: people like foreign cars <br><br />
Candidate 1: people like visiting places abroad <br><br />
Candidate 2: consumers prefer imported cars<br />
<br />
BLEU gives a higher score to Candidate 1 as compared to Candidate 2. This undermines the performance of text generation models since contextually correct sentences are penalized. In contrast, some semantically different phrases are scored higher just because they are closer to the surface form of the reference sentence. <br />
<br />
On the other hand, BERTScore computes similarity using contextual token embeddings. It helps in detecting semantically correct paraphrased sentences. It also captures the cause and effect relationship (A gives B in place of B gives A) that the BLEU score isn't detected.<br />
<br />
== BERTScore Architecture ==<br />
Fig 1 summarizes the steps for calculating the BERTScore. Next, we will see details about each step. Here, the reference sentence is given by <math> x = ⟨x1, . . . , xk⟩ </math> and candidate sentence <math> \hat{x} = ⟨\hat{x1}, . . . , \hat{xl}⟩. </math> <br><br />
<br />
<div align="center"> [[File:Architecture_BERTScore.PNG|Illustration of the computation of BERTScore.]] </div><br />
<div align="center">'''Fig 1'''</div><br />
<br />
=== Token Representation ===<br />
Reference and candidate sentences are represented using contextual embeddings. Word embedding techniques inspire this but in contrast to word embeddings, the contextual embedding of a word depends upon the surrounding words in the sentence. These contextual embeddings are calculated using BERT and other similar models which utilize self-attention and nonlinear transformations.<br />
<br />
<div align="center"> [[File:Pearsson_corr_contextual_emb.PNG|Pearson Correlation for Contextual Embedding]] </div><br />
<div align="center">'''Fig 2'''</div><br />
<br />
=== Cosine Similarity ===<br />
Pairwise cosine similarity is calculated between each token <math> x_{i} </math> in reference sentence and <math> \hat{x}_{j} </math> in candidate sentence. Prenormalized vectors are used, therefore the pairwise similarity is given by <math> x_{i}^T \hat{x_{i}}. </math><br />
<br />
=== BERTScore ===<br />
<br />
Each token in x is matched to the most similar token in <math> \hat{x} </math> and vice-versa for calculating Recall and Precision respectively. The matching is greedy and isolated. Precision and Recall are combined for calculating the F1 score. The equations for calculating Precision, Recall, and F1 Score are as follows<br />
<br />
<div align="center"> [[File:Equations.PNG|Equations for the calculation of BERTScore.]] </div><br />
<br />
<br />
=== Importance Weighting (optional) ===<br />
In some cases, rare words can be highly indicative of sentence similarity. Therefore, Inverse Document Frequency (idf) can be used with the above equations of the BERTScore. This is optional and depending on the domain of the text and the available data it may or may not benefit the final results. Thus understanding more about Importance Weighing is an open area of research.<br />
<br />
=== Baseline Rescaling ===<br />
Rescaling is done only to increase the human readability of the score. In theory, cosine similarity values are between -1 and 1 but practically they are confined in a much smaller range. A value b computed using Common Crawl monolingual datasets is used to linearly rescale the BERTScore. The rescaled recall <math> \hat{R}_{BERT} </math> is given by<br />
<div align="center"> [[File:Equation2.PNG|Equation for the rescaled BERTScore.]] </div><br />
Similarly, <math> P_{BERT} </math> and <math> F_{BERT} </math> are rescaled as well.<br />
<br />
== Experiment & Results ==<br />
The authors have experimented with different pre-trained contextual embedding models like BERT, RoBERTa, etc, and reported the best performing model results. The evaluation has been done on Machine Translation and Image Captioning tasks. <br />
<br />
=== Machine Translation ===<br />
The metric evaluation dataset consists of 149 translation systems, gold references, and two types of human judgments, namely, Segment-level human judgments and System-level human judgments. The former assigns a score to each reference candidate pair and the latter associates a single score for the whole system. Segment-level outputs for BERTScore are calculated as explained in the previous section on architecture and the System-level outputs are calculated by taking an average of BERTScore for every reference-candidate pair. Absolute Pearson Correlation <math> \lvert \rho \rvert </math> and Kendall rank correlation <math> \tau </math> are used for calculating metric quality, Williams test <sup> [1] </sup> for significance of <math> \lvert \rho \rvert </math> and Graham & Baldwin <sup> [2] </sup> methods for calculating the bootstrap resampling of <math> \tau </math>. The authors have also created hybrid systems by randomly sampling one candidate sentence for each reference sentence from one of the systems. This increases the volume of systems for System-level experiments. Further, the authors have also randomly selected 100 systems out of 10k hybrid systems for ranking them using automatic metrics. They have repeated this process multiple times and generated Hits@1, which contains the percentage of the metric ranking agreeing with human ranking on the best system. <br />
<br />
<div align="center"> '''The following 4 tables show the result of the experiments mentioned above.''' </div> <br><br />
<br />
<div align="center"> [[File:Table1_BERTScore.PNG|700px| Table1 Machine Translation]] [[File:Table2_BERTScore.PNG|700px| Table2 Machine Translation]] </div><br />
<div align="center"> [[File:Table3_BERTScore.PNG|700px| Table3 Machine Translation]] [[File:Table4_BERTScore.PNG|700px| Table4 Machine Translation]] </div><br />
<br />
In all 4 tables, we can see that BERTScore is consistently a top performer. It also gives a large improvement over the current state-of-the-art BLEU score. In to-English translation, RUSE shows competitive results but it is a learned metric technique and requires costly human judgments as supervision.<br />
<br />
=== Image Captioning ===<br />
For Image Captioning, human judgment for 12 submission entries from the COCO 2015 Captioning Challenge is used. As per Cui et al. (2018) <sup> [3] </sup>, Pearson Correlation with two System-Level metrics is calculated. The metrics are the percentage of captions better or equal to human captions (M1) and the percentage of captions that are indistinguishable from human captions (M2). There are approximately 5 reference captions and the BERTScore is taken to be the maximum of all the BERTScores individually with each reference caption. BERTScore is compared with 8 task-agnostic metrics and 2 task-specific metrics. <br />
<br />
<div align="center"> [[File:Table5_BERTScore.PNG|450px| Table5 Image Captioning]] </div><br />
<br />
<div align="center"> '''Table 5: Pearson correlation on the 2015 COCO Captioning Challenge.''' </div><br />
<br />
BERTScore is again a top performer and n-gram metrics like BLEU show a weak correlation with human judgments. For this task, importance weighting shows significant improvement depicting the importance of content words. <br />
<br />
'''Speed:''' The time taken for calculating BERTScore is not significantly higher than BLEU. For example, with the same hardware, the Machine Translation test on BERTScore takes 15.6 secs compared to 5.4 secs for BLEU. The time range is essentially small and thus the difference is marginal.<br />
<br />
== Robustness Analysis ==<br />
The authors tested BERTScore's robustness using two adversarial paraphrase classification datasets, QQP and PAWS. The table below summarized the result. Most metrics have good performance on QQP, but their performance drop significantly on PAWS. Conversely, BERTScore performs competitively on PAWS, which suggests BERTScore is better at distinguishing harder adversarial examples.<br />
<br />
[[File: bertscore.png | 500px]]<br />
== Source Code == <br />
The code for this paper is available at [https://github.com/Tiiiger/bert_score BERTScore].<br />
<br />
== Critique & Future Prospects==<br />
A text evaluation metric BERTScore is proposed which outperforms the previous approaches because of its capacity to use contextual embeddings for evaluation. It is simple and easy to use. BERTScore is also more robust than previous approaches. This is shown by the experiments carried on the datasets consisting of paraphrased sentences. There are variants of BERTScore depending upon the contextual embedding model, use of importance weighting, and the evaluation metric (Precision, Recall, or F1 score). <br />
<br />
The main reason behind the success of BERTScore is the use of contextual embeddings. The remaining architecture is straightforward in itself. There are some word embedding models that use complex metrics for calculating similarity. If we try to use those models along with contextual embeddings instead of word embeddings, they might result in more reliable performance than the BERTScore.<br />
<br />
<br />
The paper was quite interesting, but it is obvious that they lack technical novelty in their proposed approach. Their method is a natural application of BERT along with traditional cosine similarity measures and precision, recall, F1-based computations, and simple IDF-based importance weighting.<br />
<br />
== References ==<br />
<br />
[1] Evan James Williams. Regression analysis. wiley, 1959.<br />
<br />
[2] Yvette Graham and Timothy Baldwin. Testing for significance of increased correlation with human judgment. In EMNLP, 2014.<br />
<br />
[3] Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge J. Belongie. Learning to evaluate image captioning. In CVPR, 2018.<br />
<br />
[4] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.<br />
<br />
[5] Qingsong Ma, Ondrej Bojar, and Yvette Graham. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In WMT, 2018.<br />
<br />
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.<br />
<br />
[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. arXiv, abs/1907.11692, 2019b.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT&diff=48639BERTScore: Evaluating Text Generation with BERT2020-12-01T05:08:19Z<p>X46yan: </p>
<hr />
<div>== Presented by == <br />
Gursimran Singh<br />
<br />
== Introduction == <br />
In recent times, various machine learning approaches for text generation have gained popularity. This paper aims to develop an automatic metric that will judge the quality of the generated text. Commonly used state of the art metrics either use n-gram approaches or word embeddings for calculating the similarity between the reference and the candidate sentence. BertScore, on the other hand, calculates the similarity using contextual embeddings. BertScore basically addresses two common pitfalls in n-gram-based metrics. Firstly, the n-gram models fail to robustly match paraphrases which leads to performance underestimation when semantically-correct phrases are penalized because of their difference from the surface form of the reference. On the other hand in BertScore, the similarity is computed using contextualized token embeddings, which have been shown to be effective for paraphrase detection. Secondly, n-gram models fail to capture distant dependencies and penalize semantically-critical ordering changes. In contrast, contextualized embeddings capture distant dependencies and ordering effectively. The authors of the paper have carried out various experiments in Machine Translation and Image Captioning to show why BertScore is more reliable and robust than the previous approaches.<br />
<br />
''' Word versus Context Embeddings '''<br />
<br />
Both models aim to reduce the sparseness invoked by a bag of words (BoW) representation of text due to the high dimensional vocabularies. Both methods create embeddings of a dimensionality much lower than sparse BoW and aim to capture semantics and context. Word embeddings differ in that they will be deterministic as when given a word embedding model will always produce the same embedding, regardless of the surrounding words. However, contextual embeddings will create different embeddings for a word depending on the surrounding words in the given text.<br />
<br />
== Previous Work ==<br />
Previous Approaches for evaluating text generation can be broadly divided into various categories. The commonly used techniques for text evaluation are based on n-gram matching. The main objective here is to compare the n-grams in reference and candidate sentences and thus analyze the ordering of words in the sentences. <br />
The most popular n-Gram Matching metric is BLEU. It follows the underlying principle of n-Gram matching and its uniqueness comes from three main factors. <br><br />
• Each n-Gram is matched at most once. <br><br />
• The total of exact-matches is accumulated for all reference candidate pairs and divided by the total number of <math>n</math>-grams in all candidate sentences. <br><br />
• Very short candidates are restricted. <br><br />
<br />
Further BLEU is generally calculated for multiple <math>n</math>-grams and averaged geometrically.<br />
n-Gram approaches also include METEOR, NIST, ΔBLEU, etc.<br />
<br />
Most of these methods utilize or slightly modify the exact match precision (Exact-<math>P_n</math>) and recall (Exact-<math>R_n</math>) scores. These scores can be formalized as follows:<br />
<br />
Exact- <math> P_n = \frac{\sum_{w \ in S^{n}_{ \hat{x} }} \mathbb{I}[w \in S^{n}_{x}]}{S^{n}_{\hat{x}}} </math> <br />
<br />
Exact- <math> R_n = \frac{\sum_{w \ in S^{n}_{x}} \mathbb{I}[w \in S^{n}_{\hat{x}}]}{S^{n}_{x}} </math> <br />
<br />
Here <math>S^{n}_{x}</math> and <math>S^{n}_{\hat{x}}</math> are lists of token <math>n</math>-grams in the reference <math>x</math> and candidate <math>\hat{x}</math> sentences respectively.<br />
<br />
Other categories include Edit-distance-based Metrics, Embedding-based metrics, and Learned Metrics. Most of these techniques do not capture the context of a word in the sentence. Moreover, Learned Metric approaches also require costly human judgments as supervision for each dataset.<br />
<br />
== Motivation ==<br />
The <math>n</math>-gram approaches like BLEU do not capture the positioning and the context of the word and simply rely on exact matching for evaluation. Consider the following example that shows how BLEU can result in incorrect judgment. <br><br />
Reference: people like foreign cars <br><br />
Candidate 1: people like visiting places abroad <br><br />
Candidate 2: consumers prefer imported cars<br />
<br />
BLEU gives a higher score to Candidate 1 as compared to Candidate 2. This undermines the performance of text generation models since contextually correct sentences are penalized. In contrast, some semantically different phrases are scored higher just because they are closer to the surface form of the reference sentence. <br />
<br />
On the other hand, BERTScore computes similarity using contextual token embeddings. It helps in detecting semantically correct paraphrased sentences. It also captures the cause and effect relationship (A gives B in place of B gives A) that the BLEU score isn't detected.<br />
<br />
== BERTScore Architecture ==<br />
Fig 1 summarizes the steps for calculating the BERTScore. Next, we will see details about each step. Here, the reference sentence is given by <math> x = ⟨x1, . . . , xk⟩ </math> and candidate sentence <math> \hat{x} = ⟨\hat{x1}, . . . , \hat{xl}⟩. </math> <br><br />
<br />
<div align="center"> [[File:Architecture_BERTScore.PNG|Illustration of the computation of BERTScore.]] </div><br />
<div align="center">'''Fig 1'''</div><br />
<br />
=== Token Representation ===<br />
Reference and candidate sentences are represented using contextual embeddings. Word embedding techniques inspire this but in contrast to word embeddings, the contextual embedding of a word depends upon the surrounding words in the sentence. These contextual embeddings are calculated using BERT and other similar models which utilize self-attention and nonlinear transformations.<br />
<br />
<div align="center"> [[File:Pearsson_corr_contextual_emb.PNG|Pearson Correlation for Contextual Embedding]] </div><br />
<div align="center">'''Fig 2'''</div><br />
<br />
=== Cosine Similarity ===<br />
Pairwise cosine similarity is calculated between each token <math> x_{i} </math> in reference sentence and <math> \hat{x}_{j} </math> in candidate sentence. Prenormalized vectors are used, therefore the pairwise similarity is given by <math> x_{i}^T \hat{x_{i}}. </math><br />
<br />
=== BERTScore ===<br />
<br />
Each token in x is matched to the most similar token in <math> \hat{x} </math> and vice-versa for calculating Recall and Precision respectively. The matching is greedy and isolated. Precision and Recall are combined for calculating the F1 score. The equations for calculating Precision, Recall, and F1 Score are as follows<br />
<br />
<div align="center"> [[File:Equations.PNG|Equations for the calculation of BERTScore.]] </div><br />
<br />
<br />
=== Importance Weighting (optional) ===<br />
In some cases, rare words can be highly indicative of sentence similarity. Therefore, Inverse Document Frequency (idf) can be used with the above equations of the BERTScore. This is optional and depending on the domain of the text and the available data it may or may not benefit the final results. Thus understanding more about Importance Weighing is an open area of research.<br />
<br />
=== Baseline Rescaling ===<br />
Rescaling is done only to increase the human readability of the score. In theory, cosine similarity values are between -1 and 1 but practically they are confined in a much smaller range. A value b computed using Common Crawl monolingual datasets is used to linearly rescale the BERTScore. The rescaled recall <math> \hat{R}_{BERT} </math> is given by<br />
<div align="center"> [[File:Equation2.PNG|Equation for the rescaled BERTScore.]] </div><br />
Similarly, <math> P_{BERT} </math> and <math> F_{BERT} </math> are rescaled as well.<br />
<br />
== Experiment & Results ==<br />
The authors have experimented with different pre-trained contextual embedding models like BERT, RoBERTa, etc, and reported the best performing model results. The evaluation has been done on Machine Translation and Image Captioning tasks. <br />
<br />
=== Machine Translation ===<br />
The metric evaluation dataset consists of 149 translation systems, gold references, and two types of human judgments, namely, Segment-level human judgments and System-level human judgments. The former assigns a score to each reference candidate pair and the latter associates a single score for the whole system. Segment-level outputs for BERTScore are calculated as explained in the previous section on architecture and the System-level outputs are calculated by taking an average of BERTScore for every reference-candidate pair. Absolute Pearson Correlation <math> \lvert \rho \rvert </math> and Kendall rank correlation <math> \tau </math> are used for calculating metric quality, Williams test <sup> [1] </sup> for significance of <math> \lvert \rho \rvert </math> and Graham & Baldwin <sup> [2] </sup> methods for calculating the bootstrap resampling of <math> \tau </math>. The authors have also created hybrid systems by randomly sampling one candidate sentence for each reference sentence from one of the systems. This increases the volume of systems for System-level experiments. Further, the authors have also randomly selected 100 systems out of 10k hybrid systems for ranking them using automatic metrics. They have repeated this process multiple times and generated Hits@1, which contains the percentage of the metric ranking agreeing with human ranking on the best system. <br />
<br />
<div align="center"> '''The following 4 tables show the result of the experiments mentioned above.''' </div> <br><br />
<br />
<div align="center"> [[File:Table1_BERTScore.PNG|700px| Table1 Machine Translation]] [[File:Table2_BERTScore.PNG|700px| Table2 Machine Translation]] </div><br />
<div align="center"> [[File:Table3_BERTScore.PNG|700px| Table3 Machine Translation]] [[File:Table4_BERTScore.PNG|700px| Table4 Machine Translation]] </div><br />
<br />
In all 4 tables, we can see that BERTScore is consistently a top performer. It also gives a large improvement over the current state-of-the-art BLEU score. In to-English translation, RUSE shows competitive results but it is a learned metric technique and requires costly human judgments as supervision.<br />
<br />
=== Image Captioning ===<br />
For Image Captioning, human judgment for 12 submission entries from the COCO 2015 Captioning Challenge is used. As per Cui et al. (2018) <sup> [3] </sup>, Pearson Correlation with two System-Level metrics is calculated. The metrics are the percentage of captions better or equal to human captions (M1) and the percentage of captions that are indistinguishable from human captions (M2). There are approximately 5 reference captions and the BERTScore is taken to be the maximum of all the BERTScores individually with each reference caption. BERTScore is compared with 8 task-agnostic metrics and 2 task-specific metrics. <br />
<br />
<div align="center"> [[File:Table5_BERTScore.PNG|450px| Table5 Image Captioning]] </div><br />
<br />
<div align="center"> '''Table 5: Pearson correlation on the 2015 COCO Captioning Challenge.''' </div><br />
<br />
BERTScore is again a top performer and n-gram metrics like BLEU show a weak correlation with human judgments. For this task, importance weighting shows significant improvement depicting the importance of content words. <br />
<br />
'''Speed:''' The time taken for calculating BERTScore is not significantly higher than BLEU. For example, with the same hardware, the Machine Translation test on BERTScore takes 15.6 secs compared to 5.4 secs for BLEU. The time range is essentially small and thus the difference is marginal.<br />
<br />
== Robustness Analysis ==<br />
The authors tested BERTScore's robustness using two adversarial paraphrase classification datasets, QQP and PAWS. The table below summarized the result. Most metrics have good performance on QQP, but their performance drop significantly on PAWS. Conversely, BERTScore performs competitively on PAWS, which suggests BERTScore is better at distinguishing harder adversarial examples.<br />
<br />
[File: bertscore.png | 500px]<br />
== Source Code == <br />
The code for this paper is available at [https://github.com/Tiiiger/bert_score BERTScore].<br />
<br />
== Critique & Future Prospects==<br />
A text evaluation metric BERTScore is proposed which outperforms the previous approaches because of its capacity to use contextual embeddings for evaluation. It is simple and easy to use. BERTScore is also more robust than previous approaches. This is shown by the experiments carried on the datasets consisting of paraphrased sentences. There are variants of BERTScore depending upon the contextual embedding model, use of importance weighting, and the evaluation metric (Precision, Recall, or F1 score). <br />
<br />
The main reason behind the success of BERTScore is the use of contextual embeddings. The remaining architecture is straightforward in itself. There are some word embedding models that use complex metrics for calculating similarity. If we try to use those models along with contextual embeddings instead of word embeddings, they might result in more reliable performance than the BERTScore.<br />
<br />
<br />
The paper was quite interesting, but it is obvious that they lack technical novelty in their proposed approach. Their method is a natural application of BERT along with traditional cosine similarity measures and precision, recall, F1-based computations, and simple IDF-based importance weighting.<br />
<br />
== References ==<br />
<br />
[1] Evan James Williams. Regression analysis. wiley, 1959.<br />
<br />
[2] Yvette Graham and Timothy Baldwin. Testing for significance of increased correlation with human judgment. In EMNLP, 2014.<br />
<br />
[3] Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge J. Belongie. Learning to evaluate image captioning. In CVPR, 2018.<br />
<br />
[4] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.<br />
<br />
[5] Qingsong Ma, Ondrej Bojar, and Yvette Graham. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In WMT, 2018.<br />
<br />
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.<br />
<br />
[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. arXiv, abs/1907.11692, 2019b.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:bertscore.png&diff=48636File:bertscore.png2020-12-01T04:55:45Z<p>X46yan: </p>
<hr />
<div></div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT&diff=48632BERTScore: Evaluating Text Generation with BERT2020-12-01T04:37:26Z<p>X46yan: </p>
<hr />
<div>== Presented by == <br />
Gursimran Singh<br />
<br />
== Introduction == <br />
In recent times, various machine learning approaches for text generation have gained popularity. This paper aims to develop an automatic metric that will judge the quality of the generated text. Commonly used state of the art metrics either use n-gram approaches or word embeddings for calculating the similarity between the reference and the candidate sentence. BertScore, on the other hand, calculates the similarity using contextual embeddings. BertScore basically addresses two common pitfalls in n-gram-based metrics. Firstly, the n-gram models fail to robustly match paraphrases which leads to performance underestimation when semantically-correct phrases are penalized because of their difference from the surface form of the reference. On the other hand in BertScore, the similarity is computed using contextualized token embeddings, which have been shown to be effective for paraphrase detection. Secondly, n-gram models fail to capture distant dependencies and penalize semantically-critical ordering changes. In contrast, contextualized embeddings capture distant dependencies and ordering effectively. The authors of the paper have carried out various experiments in Machine Translation and Image Captioning to show why BertScore is more reliable and robust than the previous approaches.<br />
<br />
''' Word versus Context Embeddings '''<br />
<br />
Both models aim to reduce the sparseness invoked by a bag of words (BoW) representation of text due to the high dimensional vocabularies. Both methods create embeddings of a dimensionality much lower than sparse BoW and aim to capture semantics and context. Word embeddings differ in that they will be deterministic as when given a word embedding model will always produce the same embedding, regardless of the surrounding words. However, contextual embeddings will create different embeddings for a word depending on the surrounding words in the given text.<br />
<br />
== Previous Work ==<br />
Previous Approaches for evaluating text generation can be broadly divided into various categories. The commonly used techniques for text evaluation are based on n-gram matching. The main objective here is to compare the n-grams in reference and candidate sentences and thus analyze the ordering of words in the sentences. <br />
The most popular n-Gram Matching metric is BLEU. It follows the underlying principle of n-Gram matching and its uniqueness comes from three main factors. <br><br />
• Each n-Gram is matched at most once. <br><br />
• The total of exact-matches is accumulated for all reference candidate pairs and divided by the total number of <math>n</math>-grams in all candidate sentences. <br><br />
• Very short candidates are restricted. <br><br />
<br />
Further BLEU is generally calculated for multiple <math>n</math>-grams and averaged geometrically.<br />
n-Gram approaches also include METEOR, NIST, ΔBLEU, etc.<br />
<br />
Most of these methods utilize or slightly modify the exact match precision (Exact-<math>P_n</math>) and recall (Exact-<math>R_n</math>) scores. These scores can be formalized as follows:<br />
<br />
Exact- <math> P_n = \frac{\sum_{w \ in S^{n}_{ \hat{x} }} \mathbb{I}[w \in S^{n}_{x}]}{S^{n}_{\hat{x}}} </math> <br />
<br />
Exact- <math> R_n = \frac{\sum_{w \ in S^{n}_{x}} \mathbb{I}[w \in S^{n}_{\hat{x}}]}{S^{n}_{x}} </math> <br />
<br />
Here <math>S^{n}_{x}</math> and <math>S^{n}_{\hat{x}}</math> are lists of token <math>n</math>-grams in the reference <math>x</math> and candidate <math>\hat{x}</math> sentences respectively.<br />
<br />
Other categories include Edit-distance-based Metrics, Embedding-based metrics, and Learned Metrics. Most of these techniques do not capture the context of a word in the sentence. Moreover, Learned Metric approaches also require costly human judgments as supervision for each dataset.<br />
<br />
== Motivation ==<br />
The <math>n</math>-gram approaches like BLEU do not capture the positioning and the context of the word and simply rely on exact matching for evaluation. Consider the following example that shows how BLEU can result in incorrect judgment. <br><br />
Reference: people like foreign cars <br><br />
Candidate 1: people like visiting places abroad <br><br />
Candidate 2: consumers prefer imported cars<br />
<br />
BLEU gives a higher score to Candidate 1 as compared to Candidate 2. This undermines the performance of text generation models since contextually correct sentences are penalized. In contrast, some semantically different phrases are scored higher just because they are closer to the surface form of the reference sentence. <br />
<br />
On the other hand, BERTScore computes similarity using contextual token embeddings. It helps in detecting semantically correct paraphrased sentences. It also captures the cause and effect relationship (A gives B in place of B gives A) that the BLEU score isn't detected.<br />
<br />
== BERTScore Architecture ==<br />
Fig 1 summarizes the steps for calculating the BERTScore. Next, we will see details about each step. Here, the reference sentence is given by <math> x = ⟨x1, . . . , xk⟩ </math> and candidate sentence <math> \hat{x} = ⟨\hat{x1}, . . . , \hat{xl}⟩. </math> <br><br />
<br />
<div align="center"> [[File:Architecture_BERTScore.PNG|Illustration of the computation of BERTScore.]] </div><br />
<div align="center">'''Fig 1'''</div><br />
<br />
=== Token Representation ===<br />
Reference and candidate sentences are represented using contextual embeddings. Word embedding techniques inspire this but in contrast to word embeddings, the contextual embedding of a word depends upon the surrounding words in the sentence. These contextual embeddings are calculated using BERT and other similar models which utilize self-attention and nonlinear transformations.<br />
<br />
<div align="center"> [[File:Pearsson_corr_contextual_emb.PNG|Pearson Correlation for Contextual Embedding]] </div><br />
<div align="center">'''Fig 2'''</div><br />
<br />
=== Cosine Similarity ===<br />
Pairwise cosine similarity is calculated between each token <math> x_{i} </math> in reference sentence and <math> \hat{x}_{j} </math> in candidate sentence. Prenormalized vectors are used, therefore the pairwise similarity is given by <math> x_{i}^T \hat{x_{i}}. </math><br />
<br />
=== BERTScore ===<br />
<br />
Each token in x is matched to the most similar token in <math> \hat{x} </math> and vice-versa for calculating Recall and Precision respectively. The matching is greedy and isolated. Precision and Recall are combined for calculating the F1 score. The equations for calculating Precision, Recall, and F1 Score are as follows<br />
<br />
<div align="center"> [[File:Equations.PNG|Equations for the calculation of BERTScore.]] </div><br />
<br />
<br />
=== Importance Weighting (optional) ===<br />
In some cases, rare words can be highly indicative of sentence similarity. Therefore, Inverse Document Frequency (idf) can be used with the above equations of the BERTScore. This is optional and depending on the domain of the text and the available data it may or may not benefit the final results. Thus understanding more about Importance Weighing is an open area of research.<br />
<br />
=== Baseline Rescaling ===<br />
Rescaling is done only to increase the human readability of the score. In theory, cosine similarity values are between -1 and 1 but practically they are confined in a much smaller range. A value b computed using Common Crawl monolingual datasets is used to linearly rescale the BERTScore. The rescaled recall <math> \hat{R}_{BERT} </math> is given by<br />
<div align="center"> [[File:Equation2.PNG|Equation for the rescaled BERTScore.]] </div><br />
Similarly, <math> P_{BERT} </math> and <math> F_{BERT} </math> are rescaled as well.<br />
<br />
== Experiment & Results ==<br />
The authors have experimented with different pre-trained contextual embedding models like BERT, RoBERTa, etc, and reported the best performing model results. The evaluation has been done on Machine Translation and Image Captioning tasks. <br />
<br />
=== Machine Translation ===<br />
The metric evaluation dataset consists of 149 translation systems, gold references, and two types of human judgments, namely, Segment-level human judgments and System-level human judgments. The former assigns a score to each reference candidate pair and the latter associates a single score for the whole system. Segment-level outputs for BERTScore are calculated as explained in the previous section on architecture and the System-level outputs are calculated by taking an average of BERTScore for every reference-candidate pair. Absolute Pearson Correlation <math> \lvert \rho \rvert </math> and Kendall rank correlation <math> \tau </math> are used for calculating metric quality, Williams test <sup> [1] </sup> for significance of <math> \lvert \rho \rvert </math> and Graham & Baldwin <sup> [2] </sup> methods for calculating the bootstrap resampling of <math> \tau </math>. The authors have also created hybrid systems by randomly sampling one candidate sentence for each reference sentence from one of the systems. This increases the volume of systems for System-level experiments. Further, the authors have also randomly selected 100 systems out of 10k hybrid systems for ranking them using automatic metrics. They have repeated this process multiple times and generated Hits@1, which contains the percentage of the metric ranking agreeing with human ranking on the best system. <br />
<br />
<div align="center"> '''The following 4 tables show the result of the experiments mentioned above.''' </div> <br><br />
<br />
<div align="center"> [[File:Table1_BERTScore.PNG|700px| Table1 Machine Translation]] [[File:Table2_BERTScore.PNG|700px| Table2 Machine Translation]] </div><br />
<div align="center"> [[File:Table3_BERTScore.PNG|700px| Table3 Machine Translation]] [[File:Table4_BERTScore.PNG|700px| Table4 Machine Translation]] </div><br />
<br />
In all 4 tables, we can see that BERTScore is consistently a top performer. It also gives a large improvement over the current state-of-the-art BLEU score. In to-English translation, RUSE shows competitive results but it is a learned metric technique and requires costly human judgments as supervision.<br />
<br />
=== Image Captioning ===<br />
For Image Captioning, human judgment for 12 submission entries from the COCO 2015 Captioning Challenge is used. As per Cui et al. (2018) <sup> [3] </sup>, Pearson Correlation with two System-Level metrics is calculated. The metrics are the percentage of captions better or equal to human captions (M1) and the percentage of captions that are indistinguishable from human captions (M2). There are approximately 5 reference captions and the BERTScore is taken to be the maximum of all the BERTScores individually with each reference caption. BERTScore is compared with 8 task-agnostic metrics and 2 task-specific metrics. <br />
<br />
<div align="center"> [[File:Table5_BERTScore.PNG|450px| Table5 Image Captioning]] </div><br />
<br />
<div align="center"> '''Table 5: Pearson correlation on the 2015 COCO Captioning Challenge.''' </div><br />
<br />
BERTScore is again a top performer and n-gram metrics like BLEU show a weak correlation with human judgments. For this task, importance weighting shows significant improvement depicting the importance of content words. <br />
<br />
'''Speed:''' The time taken for calculating BERTScore is not significantly higher than BLEU. For example, with the same hardware, the Machine Translation test on BERTScore takes 15.6 secs compared to 5.4 secs for BLEU. The time range is essentially small and thus the difference is marginal.<br />
<br />
== Source Code == <br />
The code for this paper is available at [https://github.com/Tiiiger/bert_score BERTScore].<br />
<br />
== Critique & Future Prospects==<br />
A text evaluation metric BERTScore is proposed which outperforms the previous approaches because of its capacity to use contextual embeddings for evaluation. It is simple and easy to use. BERTScore is also more robust than previous approaches. This is shown by the experiments carried on the datasets consisting of paraphrased sentences. There are variants of BERTScore depending upon the contextual embedding model, use of importance weighting, and the evaluation metric (Precision, Recall, or F1 score). <br />
<br />
The main reason behind the success of BERTScore is the use of contextual embeddings. The remaining architecture is straightforward in itself. There are some word embedding models that use complex metrics for calculating similarity. If we try to use those models along with contextual embeddings instead of word embeddings, they might result in more reliable performance than the BERTScore.<br />
<br />
<br />
The paper was quite interesting, but it is obvious that they lack technical novelty in their proposed approach. Their method is a natural application of BERT along with traditional cosine similarity measures and precision, recall, F1-based computations, and simple IDF-based importance weighting.<br />
<br />
== References ==<br />
<br />
[1] Evan James Williams. Regression analysis. wiley, 1959.<br />
<br />
[2] Yvette Graham and Timothy Baldwin. Testing for significance of increased correlation with human judgment. In EMNLP, 2014.<br />
<br />
[3] Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge J. Belongie. Learning to evaluate image captioning. In CVPR, 2018.<br />
<br />
[4] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.<br />
<br />
[5] Qingsong Ma, Ondrej Bojar, and Yvette Graham. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In WMT, 2018.<br />
<br />
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.<br />
<br />
[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. arXiv, abs/1907.11692, 2019b.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=48608Roberta2020-12-01T03:09:32Z<p>X46yan: </p>
<hr />
<div>= RoBERTa: A Robustly Optimized BERT Pretraining Approach =<br />
== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but it remains challenging to determine which parts the methods have the most contribution. This paper proposed Roberta, which replicates BERT pretraining, and investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized as follows. (1) they modified some BERT design choices and training schemes. (2) they used a new set of datasets. These 2 modification categories help them improve performance on downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modeling (MLM) and next sentence prediction (NSP) as their objectives. The MLM objective randomly selects some of the tokens in the input sequence and replaces them with the special token [MASK]. Then they try to predict these tokens based on the surrounding information. NSP employs a binary classification loss for the prediction of whether the two sentences are adjacent to each other or not. It is noteworthy that, while selecting positive next sentences is trivial, generating negative ones is often much more difficult. Originally, they used a random next sentence as a negative. They used the Adam optimizer with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD, RACE, and showed their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modeling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps. To extend this single static mask, the authors duplicated training data 10 times so that each sequence was masked in 10 different ways. The model was trained on those data for 40 epochs, with each sequence with the same mask being used 4 times.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objective. They tried different settings to show it would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences) from either the original document or some different document chosen at random with a probability of 0.5, and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: This is the same as the segment-pair representation but with pairs of sentences instead of segments. However, the total length of sequences here would be a lot less than 512. Hence, a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, with the difference that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. However, the authors chose to use FULL-SENTENCES for convenience sake, since the DOC-SENTENCES results in variable batch sizes.<br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes. The result suggested the original BERT batch size was too small. The authors used 8k batch size in the remainder of their experiments. <br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE. BPE is a hybrid between character and word level modeling based on sub-word units. The authors also used a vocabulary size of 50k rather than 30k at the original BERT implementation did, this increases the total parameters by approximately 15M to 20M for the BERT based and BERT large respectively. This change actually results in slight degradation of end-task performance in some cases, however, the authors preferred having the ability to universally encode text without introducing the "unknown" token.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all the aforementioned modifications to the BERT and pre-trained the model on a larger dataset, they can achieve higher performance on downstream tasks. They used different types of datasets for their pre-training; you can see a list of them below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
In conjunction with the modifications of dynamic masking, full-sentences whiteout an NLP loss, large mini-batches, and a larger byte-level BPE.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|600px|center]]<br />
RoBERTa has outperformed state-of-the-art algorithms in almost all GLUE tasks, including ensemble models. More than that, they compare the performance of the RoBERTa with other methods on the RACE and SQuAD evaluation and show their results in the below table.<br />
[[File:squad.JPG|400px|center]]<br />
[[File:race.JPG|400px|center]]<br />
<br />
== Conclusion ==<br />
In conclusion, they basically said the reasons why they make gains may be questionable, and if you decently pre-train BERT, you will achieve the same performances as RoBERTa.<br />
<br />
== The comparison at a glance ==<br />
<br />
[[File:comparison_roberta.png|500px|center|thumb|from [8]]]<br />
<br />
== Critique ==<br />
While the results are outstanding and appreciable (reasonably due to using more data and resources), the technical novelty contribution of the paper is marginally incremental as the architecture is largely unchanged from BERT.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/pytorch/fairseq/tree/master/examples/roberta RoBERTa]. The original repository for RoBERTa is in PyTorch. In case you are a TensorFlow user, you would be interested in [https://github.com/Yaozeng/roberta-tf PyTorch to TensorFlow] repository as well.<br />
<br />
==Refrences == <br />
[1] Matthew Peters, Mark Neumann, Mohit Iyyer, MattGardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations.In North American Association for Computational Linguistics (NAACL).<br />
<br />
[2] Alec Radford, Karthik Narasimhan, Time Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.<br />
<br />
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).<br />
<br />
[4] Guillaume Lample and Alexis Conneau. 2019. Cross lingual language model pretraining. arXiv preprint arXiv:1901.07291.<br />
<br />
[5] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.<br />
<br />
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems.<br />
<br />
[7] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).<br />
<br />
[8] BERT, RoBERTa, DistilBERT, XLNet - which one to use?<br />
Suleiman Khan<br />
[https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8/ link]</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pre-Training_Tasks_For_Embedding-Based_Large-Scale_Retrieval&diff=47091Pre-Training Tasks For Embedding-Based Large-Scale Retrieval2020-11-27T22:26:51Z<p>X46yan: </p>
<hr />
<div><br />
==Contributors==<br />
Pierre McWhannel wrote this summary with editorial and technical contributions from STAT 946 fall 2020 classmates. The summary is based on the paper "Pre-Training Tasks for Embedding-Based Large-Scale Retrieval" which was presented at ICLR 2020. The authors of this paper are Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, Sanjiv Kumar [1].<br />
<br />
==Introduction==<br />
The problem domain which the paper is positioned within is large-scale query-document retrieval. This problem is: given a query/question to collect relevant documents which contain the answer(s) and identify the span of words which contain the answer(s). This problem is often separated into two steps 1) a retrieval phase which reduces a large corpus to a much smaller subset and 2) a reader which is applied to read the shortened list of documents and score them based on their ability to answer the query and identify the span containing the answer, these spans can be ranked according to some scoring function. In the setting of open-domain question answering the query, <math>q</math> is a question and the documents <math>d</math> represent a passage which may contain the answer(s). The focus of this paper is on the retriever question answering (QA), in this setting a scoring function <math>f: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R} </math> is utilized to map <math>(q,d) \in \mathcal{X} \times \mathcal{Y}</math> to a real value which represents the score of how relevant the passage is to the query. We desire high scores for relevant passages and and low scores otherwise and this is the job of the retriever. <br />
<br />
<br />
Certain characteristics desired of the retriever are high recall since the reader will only be applied to a subset meaning there is an an opportunity to identify false positive but not false negatives. The other characteristic desired of a retriever is low latency meaning it is computationally efficient. The popular approach to meet these heuristics has been the BM-25 [2] which relies on token-based matching between two high-dimensional and sparse vectors representing <math>q</math> and <math>d</math>. This approach performs well and retrieval can be performed in time sublinear to the number of passages with inverter indexing. The downfall of this type of algorithm is its inability to be optimized for a specific task. An alternate option is BERT or transformer based models with cross attention between query and passage pairs which can be optimized for a specific task. However, these models suffer from latency as the retriever needs to process all the pairwise combinations of a query with all passages. The next type of algorithm is an embedding-based model that jointly embeds the query and passage in the same embedding space and then uses measures such as cosine distance, inner product, or even the Euclidean distance in this space to get a score. The authors suggest the two-tower models can capture deeper semantic relationships in addition to being able to be optimized for specific tasks. This model is referred to as the "two-tower retrieval model", since it has two transformer based models where one embeds the query <math>\phi{(\cdot)}</math> and the other the passage <math>\psi{(\cdot)}</math>, then the embeddings can be scored by <math>f(q,d) = \langle \phi{(q)},\psi{(d)} \rangle \in \mathbb{R} </math>. This model can avoid the latency problems of a cross-layer model by pre-computing <math>\psi{(d)}</math> before hand and then by utilizing efficient approximate nearest neighbor search algorithms in the embedding space to find the nearest documents.<br />
<br />
<br />
These two-tower retrieval models often use BERT with pretrained weights, in BERT the pre-training tasks are masked-LM (MLM) and next sentence prediction (NSP). The authors note that the research of pre-training tasks that improve the performance of two-tower retrieval models is an unsolved research problem. This is exactly what the authors have done in this paper. That is, to develop pre-training tasks for the two-tower model based on heuristics and then validated by experimental results. The pre-training tasks suggested are Inverse Cloze Task (ICT), Body First Search (BFS), Wiki Link Prediction (WLP), and combining all three. The authors used the Retrieval Question-Answering ('''ReQA''') benchmark and used two datasets '''SQuAD''' and '''Natural Questions''' for training and evaluating their models.<br />
<br />
<br />
<br />
===Contributions of this paper===<br />
*The two-tower transformer model with proper pre-training can significantly outperform the widely used BM-25 Algorithm.<br />
*Paragraph-level pre-training tasks such as ICT, BFS, and WLP hugely improve the retrieval quality, whereas the most widely used pre-training task (the token-level masked-LM) gives only marginal gains.<br />
*The two-tower models with deep transformer encoders benefit more from paragraph-level pre-training compared to its shallow bag-of-word counterpart (BoW-MLP)<br />
<br />
==Background on Two-Tower Retrieval Models==<br />
This section is to present the two-tower model in more detail. As seen previously <math>q \in \mathcal{X}</math>, <math> d \in \mathcal{y}</math> are the query and document or passage respectively. The two-tower retrieval model consists of two encoders: <math>\phi: \mathcal{X} \rightarrow \mathbb{R}^k</math> and <math>\psi: \mathcal{Y} \rightarrow \mathbb{R}^k</math>. The tokens in <math>\mathcal{X},\mathcal{Y}</math> are a sequence of tokens representing the query and document respectively. The scoring function is <math> f(q,d) = \langle \phi{(q)},\psi{(d)} \rangle \in \mathbb{R} </math>. The cross-attention BERT-style model would have <math> f_{\theta,w}(q,d) = \psi_{\theta}{(q \oplus d)^{T}}w </math> as a scoring function and <math> \oplus </math> signifies the concatenation of the sequences of <math>q</math> and <math>d</math>, <math> \theta </math> are the parameters of the cross-attention model. The architectures of these two models can be seen below in figure 1.<br />
<br />
<br/><br />
[[File:two-tower-vs-cross-layer.PNG|center|Figure 1: end-to-end model]]<br />
<br/><br />
<br />
===Comparisons===<br />
<br />
====Inference====<br />
In terms of inference the two-tower architecture has a distinct advantage as being easily seen from the architecture. Since the query and document are embedded independently, <math> \psi{(d)}</math> can be pre computed. This is an important advantage as it makes the two-tower method feasible. As an example the authors state the inner product between 100 query embeddings and 1 million document embeddings only takes hundreds of milliseconds on CPUs, however, for the equivalent calculation for a cross-attention-model it takes hours on GPUs. In addition they remark the retrieval of the relevant documents can be done in sublinear time with respect <math> |\mathcal{Y}| </math> using a maximum inner product (MIPS) algorithm with little loss in recall.<br />
<br />
====Learning====<br />
In terms of learning the two-tower model compared to BM-25 has a unique advantage of being possible to be fine-tuned for downstream tasks. This problem is within the paradigm of metric learning as the scoring function can be passed as input to an objective function of the log likelihood <math> \max_{\theta} \space log(p_{\theta}(d|q)) </math>. The conditional probability is represented by a SoftMax and then can be rewritten as:<br />
<br />
<br />
<math> p_{\theta}(d|q) = \frac{exp(f_{\theta}(q,d))}{\sum_{d' \in D}} exp(f_{\theta}(q,d'))</math><br />
<br />
<br />
Here <math> \mathcal{D} </math> is the set of all possible documents, the denominator is a summation over <math>\mathcal{D} </math>. As is customary they utilized a sampled SoftMax technique to approximate the full-SoftMax. The technique they have employed is by using a small subset of documents in the current training batch, while also using a proper correcting term to ensure the unbiasedness of the partition function.<br />
<br />
====Why do we need pre-training then?====<br />
Although the two-tower model can be fine-tuned for a downstream task getting sufficiently labelled data is not always possible. This is an opportunity to look for better pre-training tasks as the authors have done. The authors suggest getting a set of pre-training task <math> \mathcal{T} </math> with labelled positive pairs of queries of documents/passages and then afterward fine-tuning on the limited amount of data for the downstream task.<br />
<br />
==Pre-Training Tasks==<br />
The authors suggest three pre-training tasks for the encoders of the two-tower retriever, these are ICT, BFS, and WLP. Keep in mind that ICT is not a newly proposed task. These pre-training tasks are developed in accordance to two heuristics they state and our paragraph-level pre-training tasks in contrast to more common sentence-level or word-level tasks:<br />
<br />
# The tasks should capture a different resolutions of semantics between query and document. The semantics can be the local context within a paragraph, global consistency within a document, and even semantic relation between two documents.<br />
# Collecting the data for the pre-training tasks should require as little human intervention as possible and be cost-efficient.<br />
<br />
All three of the following tasks can be constructed in a cost-efficient and with minimal human intervention by utilizing Wikipedia articles. In figure 2 below the dashed line surrounding the block of text is to be regarded as the document <math> d </math> and the rest of the circled portions of text make up the three queries <math>q_1, q_2, q_3 </math> selected for each task.<br />
<br />
<br/><br />
[[File:pre-train-tasks.PNG|center|Figure 1: end-to-end model]]<br />
<br/><br />
<br />
===Inverse Cloze Task (ICT)===<br />
This task is given a passage <math> p </math> consisting of <math> n </math> sentences <math>s_{i}, i=1,..,n </math>. <math> s_{i} </math> is randomly sampled and used as the query <math> q </math> and then the document is constructed to be based on the remaining <math> n-1 </math> sentences. This is utilized to meet the local context within a paragraph of the first heuristic. In figure 2 <math> q_1 </math> is an example of a potential query for ICT based on the selected document <math> d </math>.<br />
<br />
===Body First Search (BFS)===<br />
BFS is <math> (q,d) </math> pairs are created by randomly selecting a sentence <math> q_2 </math> and setting it as the query from the first section of a Wikipedia page which contains the selected document <math> d </math>. This is utilized to meet the global context consistency within a document as the first section in Wikipedia generally is an overview of the whole page and they anticipate it to contain information central to the topic. In figure 2 <math> q_2 </math> is an example of a potential query for BFS based on the selected document <math> d </math>.<br />
<br />
===Wiki Link Prediction (WLP)===<br />
The third task is selecting a hyperlink that is within the selected document <math> d </math>, as can be observed in figure 2 "machine learning" is a valid option. The query <math> q_3 </math> will then be a random sentence on the first section of the page where the hyperlinks redirects to. This is utilized to provide the semantic relation between documents.<br />
<br />
<br />
Additionally Masked LM (MLM) is considered during the experimentations which again is one of the primary pre-training tasks used in BERT.<br />
<br />
==Experiments and Results==<br />
<br />
===Training Specifications===<br />
<br />
*Each tower is constructed from the 12 layers BERT-base model.<br />
*Embedding is achieved by applying a linear layer on the [CLS] token output to get an embedding dimension of 512.<br />
*Sequence lengths of 64 and 288 respectively for the query and document encoders.<br />
*Pre-trained on 32 TPU v3 chips for 100k step with Adam optimizer learning rate of <math> 1 \times 10^{-4} </math> with 0.1 warm-up ratio, followed by linear learning rate decay and batch size of 8192. (~2.5 days to complete)<br />
*Fine tuning on downstream task <math> 5 \times 10^{-5} </math> with 2000 training steps and batch size 512.<br />
*Tokenizer is WordPiece.<br />
<br />
===Pre-Training Tasks Setup===<br />
<br />
*MLM, ICT, BFS, and WLP are used.<br />
*Various combinations of the aforementioned tasks are tested.<br />
*Tasks define <math> (q,d) </math> pairings.<br />
*ICT's document <math> d </math> has the article title and passage separated by a [SEP] symbol as input to the document encoder.<br />
<br />
Below table 1 shows some statistics for the datasets constructed for the ICT, BFS, and WLP tasks. Token counts considered after the tokenizer is applied.<br />
<br />
[[File: wiki-pre-train-statistics.PNG | center]]<br />
<br />
===Downstream Tasks Setup===<br />
Retrieval Question-Answering ('''ReQA''') benchmark used.<br />
<br />
*Datasets: '''SQuAD''' and '''Natural Questions'''.<br />
*Each entry of data from datasets is <math> (q,a,p) </math>, where each element is, query <math> q </math>, answer <math> a </math>, and passage <math> p </math> containing the answer, respectively.<br />
*Authors split the passage to sentences <math>s_i</math> where <math> i=1,...n </math> and <math> n </math> is the number of sentences in the passage.<br />
*The problem is then recast for a given query to retrieve the correct sentence-passage <math> (s,p) </math> from all candidates for all passages split in this fashion.<br />
*They remark their reformulation makes it a more difficult problem.<br />
<br />
Note: Experiments done on '''ReQA''' benchmarks are not entirely open-domain QA retrieval as the candidate <math> (s,p) </math> only cover the training set of QA dataset instead of entire Wikipedia articles. There is some experiments they did with augmented the dataset as will be seen in table 6.<br />
<br />
Below table 2 shows statistics of the datasets used for the downstream QA task.<br />
<br />
<br />
[[File:data-statistics-30.PNG | center]]<br />
<br />
===Evaluation===<br />
<br />
*3 train/test splits: 1%/99%, 5%/95%, and 80%/20%.<br />
*10% of training set is used as a validation set for hyper-parameter tuning.<br />
*Training never has seen the test queries in the test pairs of <math>(q,d).</math><br />
*Evaluation metric is recall@k since the goal is to capture positives in the top-k results.<br />
<br />
<br />
===Results===<br />
<br />
Table 3 shows the results on the '''SQuAD''' dataset, as observed the performance using the combination of all 3 pre-training tasks has the best performance, with the exception of R@1 for 1%/99% train/test split. BoW-MLP is used to justify the use of a transformer as the encoder since it is has more complexity, BoW-MLP here looks up uni-grams from an embedding table, aggregates the embeddings with average pooling, and passes them though a shallow two-layer ML network with <math> tanh </math> activation to generate the final 512-dimensional query/document embeddings.<br />
<br />
[[File:table3-30.PNG | center]]<br />
<br />
Table 4 shows the results on the '''Natural Questions''' dataset, as observed the performance using the combination of all 3 pre-training tasks has the best performance in all testing scenarios.<br />
<br />
[[File:table4-30.PNG | center]]<br />
<br />
====Ablation Study====<br />
The embedding-dimensions, # of layers in BERT and pre-training tasks are altered for this ablation study as seen in table 5.<br />
<br />
Interestingly ICT+BFS+WLP is only an absolute of 1.5% better than just ICT in the low-data regime. The authors infer this could be representative of when there is insufficient downstream data for training that more global pre-training tasks become increasingly beneficial, since BFS and WLP offer this.<br />
<br />
[[File:table5-30.PNG | center]]<br />
<br />
====Evaluation of Open-Domain Retrieval====<br />
Augment ReQA benchmark with large-scale (sentence, evidence passage) pairs extracted from general Wikipedia. This is added as one million external candidate pairs into the existing retrieval candidate set of the ReQA benchmark. Results here are fairly consistent with previous results.<br />
<br />
<br />
[[File:table6-30.PNG | center]]<br />
<br />
==Conclusion==<br />
<br />
The authors performed a comprehensive study and have concluded that properly designing paragraph-level pre-training tasks including ICT, BFS, and WLP for a two-tower transformer model can significantly outperform a BM-25 algorithm, which is an unsupervised method using weighted token-matching as a scoring function. The findings of this paper suggest that a two-tower transformer model with proper pre-training tasks can replace the BM25 algorithm used in the retrieval stage to further improve end-to-end system performance. The authors plan to test the pre-training tasks with a larger variety of encoder architectures and trying other corpora than Wikipedia and additionally trying pre-training in contrast to different regularization methods.<br />
<br />
==Critique==<br />
<br />
The authors of this paper suggest from their experiments that the combination of the 3 pretraining tasks ICT, BFS, and WLP yields the best results in general. However, they do acknowledge why two of the three pre-training tasks may not produce equally significant results as the three. From the ablation study and the small margin (1.5%) in which the three tasks achieved outperformed only using ICT in combination with observing ICT outperforming the three as seen in table 6 in certain scenarios. It seems plausible that a subset of two tasks could also potentially outperform the three or reach a similar performance. I believe this should be addressed in order to conclude three pre-training tasks are needed over two, though they never address this comparison I think it would make the paper more complete.<br />
<br />
The two-tower transformer model has two encoders, one for query encoding, and one for passage encoding. But in theory, queries and passages are all textual information. Do we really need two different encoders? Maybe using one encoder to encode both queries and passages can achieve competitive results. I'd like to see if any paper has ever covered this.<br />
<br />
[1] Wei-Cheng Chang, Felix X Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. Pre-training tasks for embedding-based large-scale retrieval. arXiv preprint arXiv:2002.03932, 2020.<br />
<br />
[2] Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pre-Training_Tasks_For_Embedding-Based_Large-Scale_Retrieval&diff=47090Pre-Training Tasks For Embedding-Based Large-Scale Retrieval2020-11-27T22:26:23Z<p>X46yan: </p>
<hr />
<div><br />
==Contributors==<br />
Pierre McWhannel wrote this summary with editorial and technical contributions from STAT 946 fall 2020 classmates. The summary is based on the paper "Pre-Training Tasks for Embedding-Based Large-Scale Retrieval" which was presented at ICLR 2020. The authors of this paper are Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, Sanjiv Kumar [1].<br />
<br />
==Introduction==<br />
The problem domain which the paper is positioned within is large-scale query-document retrieval. This problem is: given a query/question to collect relevant documents which contain the answer(s) and identify the span of words which contain the answer(s). This problem is often separated into two steps 1) a retrieval phase which reduces a large corpus to a much smaller subset and 2) a reader which is applied to read the shortened list of documents and score them based on their ability to answer the query and identify the span containing the answer, these spans can be ranked according to some scoring function. In the setting of open-domain question answering the query, <math>q</math> is a question and the documents <math>d</math> represent a passage which may contain the answer(s). The focus of this paper is on the retriever question answering (QA), in this setting a scoring function <math>f: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R} </math> is utilized to map <math>(q,d) \in \mathcal{X} \times \mathcal{Y}</math> to a real value which represents the score of how relevant the passage is to the query. We desire high scores for relevant passages and and low scores otherwise and this is the job of the retriever. <br />
<br />
<br />
Certain characteristics desired of the retriever are high recall since the reader will only be applied to a subset meaning there is an an opportunity to identify false positive but not false negatives. The other characteristic desired of a retriever is low latency meaning it is computationally efficient. The popular approach to meet these heuristics has been the BM-25 [2] which relies on token-based matching between two high-dimensional and sparse vectors representing <math>q</math> and <math>d</math>. This approach performs well and retrieval can be performed in time sublinear to the number of passages with inverter indexing. The downfall of this type of algorithm is its inability to be optimized for a specific task. An alternate option is BERT or transformer based models with cross attention between query and passage pairs which can be optimized for a specific task. However, these models suffer from latency as the retriever needs to process all the pairwise combinations of a query with all passages. The next type of algorithm is an embedding-based model that jointly embeds the query and passage in the same embedding space and then uses measures such as cosine distance, inner product, or even the Euclidean distance in this space to get a score. The authors suggest the two-tower models can capture deeper semantic relationships in addition to being able to be optimized for specific tasks. This model is referred to as the "two-tower retrieval model", since it has two transformer based models where one embeds the query <math>\phi{(\cdot)}</math> and the other the passage <math>\psi{(\cdot)}</math>, then the embeddings can be scored by <math>f(q,d) = \langle \phi{(q)},\psi{(d)} \rangle \in \mathbb{R} </math>. This model can avoid the latency problems of a cross-layer model by pre-computing <math>\psi{(d)}</math> before hand and then by utilizing efficient approximate nearest neighbor search algorithms in the embedding space to find the nearest documents.<br />
<br />
<br />
These two-tower retrieval models often use BERT with pretrained weights, in BERT the pre-training tasks are masked-LM (MLM) and next sentence prediction (NSP). The authors note that the research of pre-training tasks that improve the performance of two-tower retrieval models is an unsolved research problem. This is exactly what the authors have done in this paper. That is, to develop pre-training tasks for the two-tower model based on heuristics and then validated by experimental results. The pre-training tasks suggested are Inverse Cloze Task (ICT), Body First Search (BFS), Wiki Link Prediction (WLP), and combining all three. The authors used the Retrieval Question-Answering ('''ReQA''') benchmark and used two datasets '''SQuAD''' and '''Natural Questions''' for training and evaluating their models.<br />
<br />
<br />
<br />
===Contributions of this paper===<br />
*The two-tower transformer model with proper pre-training can significantly outperform the widely used BM-25 Algorithm.<br />
*Paragraph-level pre-training tasks such as ICT, BFS, and WLP hugely improve the retrieval quality, whereas the most widely used pre-training task (the token-level masked-LM) gives only marginal gains.<br />
*The two-tower models with deep transformer encoders benefit more from paragraph-level pre-training compared to its shallow bag-of-word counterpart (BoW-MLP)<br />
<br />
==Background on Two-Tower Retrieval Models==<br />
This section is to present the two-tower model in more detail. As seen previously <math>q \in \mathcal{X}</math>, <math> d \in \mathcal{y}</math> are the query and document or passage respectively. The two-tower retrieval model consists of two encoders: <math>\phi: \mathcal{X} \rightarrow \mathbb{R}^k</math> and <math>\psi: \mathcal{Y} \rightarrow \mathbb{R}^k</math>. The tokens in <math>\mathcal{X},\mathcal{Y}</math> are a sequence of tokens representing the query and document respectively. The scoring function is <math> f(q,d) = \langle \phi{(q)},\psi{(d)} \rangle \in \mathbb{R} </math>. The cross-attention BERT-style model would have <math> f_{\theta,w}(q,d) = \psi_{\theta}{(q \oplus d)^{T}}w </math> as a scoring function and <math> \oplus </math> signifies the concatenation of the sequences of <math>q</math> and <math>d</math>, <math> \theta </math> are the parameters of the cross-attention model. The architectures of these two models can be seen below in figure 1.<br />
<br />
<br/><br />
[[File:two-tower-vs-cross-layer.PNG|center|Figure 1: end-to-end model]]<br />
<br/><br />
<br />
===Comparisons===<br />
<br />
====Inference====<br />
In terms of inference the two-tower architecture has a distinct advantage as being easily seen from the architecture. Since the query and document are embedded independently, <math> \psi{(d)}</math> can be pre computed. This is an important advantage as it makes the two-tower method feasible. As an example the authors state the inner product between 100 query embeddings and 1 million document embeddings only takes hundreds of milliseconds on CPUs, however, for the equivalent calculation for a cross-attention-model it takes hours on GPUs. In addition they remark the retrieval of the relevant documents can be done in sublinear time with respect <math> |\mathcal{Y}| </math> using a maximum inner product (MIPS) algorithm with little loss in recall.<br />
<br />
====Learning====<br />
In terms of learning the two-tower model compared to BM-25 has a unique advantage of being possible to be fine-tuned for downstream tasks. This problem is within the paradigm of metric learning as the scoring function can be passed as input to an objective function of the log likelihood <math> \max_{\theta} \space log(p_{\theta}(d|q)) </math>. The conditional probability is represented by a SoftMax and then can be rewritten as:<br />
<br />
<br />
<math> p_{\theta}(d|q) = \frac{exp(f_{\theta}(q,d))}{\sum_{d' \in D}} exp(f_{\theta}(q,d'))</math><br />
<br />
<br />
Here <math> \mathcal{D} </math> is the set of all possible documents, the denominator is a summation over <math>\mathcal{D} </math>. As is customary they utilized a sampled SoftMax technique to approximate the full-SoftMax. The technique they have employed is by using a small subset of documents in the current training batch, while also using a proper correcting term to ensure the unbiasedness of the partition function.<br />
<br />
====Why do we need pre-training then?====<br />
Although the two-tower model can be fine-tuned for a downstream task getting sufficiently labelled data is not always possible. This is an opportunity to look for better pre-training tasks as the authors have done. The authors suggest getting a set of pre-training task <math> \mathcal{T} </math> with labelled positive pairs of queries of documents/passages and then afterward fine-tuning on the limited amount of data for the downstream task.<br />
<br />
==Pre-Training Tasks==<br />
The authors suggest three pre-training tasks for the encoders of the two-tower retriever, these are ICT, BFS, and WLP. Keep in mind that ICT is not a newly proposed task. These pre-training tasks are developed in accordance to two heuristics they state and our paragraph-level pre-training tasks in contrast to more common sentence-level or word-level tasks:<br />
<br />
# The tasks should capture a different resolutions of semantics between query and document. The semantics can be the local context within a paragraph, global consistency within a document, and even semantic relation between two documents.<br />
# Collecting the data for the pre-training tasks should require as little human intervention as possible and be cost-efficient.<br />
<br />
All three of the following tasks can be constructed in a cost-efficient and with minimal human intervention by utilizing Wikipedia articles. In figure 2 below the dashed line surrounding the block of text is to be regarded as the document <math> d </math> and the rest of the circled portions of text make up the three queries <math>q_1, q_2, q_3 </math> selected for each task.<br />
<br />
<br/><br />
[[File:pre-train-tasks.PNG|center|Figure 1: end-to-end model]]<br />
<br/><br />
<br />
===Inverse Cloze Task (ICT)===<br />
This task is given a passage <math> p </math> consisting of <math> n </math> sentences <math>s_{i}, i=1,..,n </math>. <math> s_{i} </math> is randomly sampled and used as the query <math> q </math> and then the document is constructed to be based on the remaining <math> n-1 </math> sentences. This is utilized to meet the local context within a paragraph of the first heuristic. In figure 2 <math> q_1 </math> is an example of a potential query for ICT based on the selected document <math> d </math>.<br />
<br />
===Body First Search (BFS)===<br />
BFS is <math> (q,d) </math> pairs are created by randomly selecting a sentence <math> q_2 </math> and setting it as the query from the first section of a Wikipedia page which contains the selected document <math> d </math>. This is utilized to meet the global context consistency within a document as the first section in Wikipedia generally is an overview of the whole page and they anticipate it to contain information central to the topic. In figure 2 <math> q_2 </math> is an example of a potential query for BFS based on the selected document <math> d </math>.<br />
<br />
===Wiki Link Prediction (WLP)===<br />
The third task is selecting a hyperlink that is within the selected document <math> d </math>, as can be observed in figure 2 "machine learning" is a valid option. The query <math> q_3 </math> will then be a random sentence on the first section of the page where the hyperlinks redirects to. This is utilized to provide the semantic relation between documents.<br />
<br />
<br />
Additionally Masked LM (MLM) is considered during the experimentations which again is one of the primary pre-training tasks used in BERT.<br />
<br />
==Experiments and Results==<br />
<br />
===Training Specifications===<br />
<br />
*Each tower is constructed from the 12 layers BERT-base model.<br />
*Embedding is achieved by applying a linear layer on the [CLS] token output to get an embedding dimension of 512.<br />
*Sequence lengths of 64 and 288 respectively for the query and document encoders.<br />
*Pre-trained on 32 TPU v3 chips for 100k step with Adam optimizer learning rate of <math> 1 \times 10^{-4} </math> with 0.1 warm-up ratio, followed by linear learning rate decay and batch size of 8192. (~2.5 days to complete)<br />
*Fine tuning on downstream task <math> 5 \times 10^{-5} </math> with 2000 training steps and batch size 512.<br />
*Tokenizer is WordPiece.<br />
<br />
===Pre-Training Tasks Setup===<br />
<br />
*MLM, ICT, BFS, and WLP are used.<br />
*Various combinations of the aforementioned tasks are tested.<br />
*Tasks define <math> (q,d) </math> pairings.<br />
*ICT's document <math> d </math> has the article title and passage separated by a [SEP] symbol as input to the document encoder.<br />
<br />
Below table 1 shows some statistics for the datasets constructed for the ICT, BFS, and WLP tasks. Token counts considered after the tokenizer is applied.<br />
<br />
[[File: wiki-pre-train-statistics.PNG | center]]<br />
<br />
===Downstream Tasks Setup===<br />
Retrieval Question-Answering ('''ReQA''') benchmark used.<br />
<br />
*Datasets: '''SQuAD''' and '''Natural Questions'''.<br />
*Each entry of data from datasets is <math> (q,a,p) </math>, where each element is, query <math> q </math>, answer <math> a </math>, and passage <math> p </math> containing the answer, respectively.<br />
*Authors split the passage to sentences <math>s_i</math> where <math> i=1,...n </math> and <math> n </math> is the number of sentences in the passage.<br />
*The problem is then recast for a given query to retrieve the correct sentence-passage <math> (s,p) </math> from all candidates for all passages split in this fashion.<br />
*They remark their reformulation makes it a more difficult problem.<br />
<br />
Note: Experiments done on '''ReQA''' benchmarks are not entirely open-domain QA retrieval as the candidate <math> (s,p) </math> only cover the training set of QA dataset instead of entire Wikipedia articles. There is some experiments they did with augmented the dataset as will be seen in table 6.<br />
<br />
Below table 2 shows statistics of the datasets used for the downstream QA task.<br />
<br />
<br />
[[File:data-statistics-30.PNG | center]]<br />
<br />
===Evaluation===<br />
<br />
*3 train/test splits: 1%/99%, 5%/95%, and 80%/20%.<br />
*10% of training set is used as a validation set for hyper-parameter tuning.<br />
*Training never has seen the test queries in the test pairs of <math>(q,d).</math><br />
*Evaluation metric is recall@k since the goal is to capture positives in the top-k results.<br />
<br />
<br />
===Results===<br />
<br />
Table 3 shows the results on the '''SQuAD''' dataset, as observed the performance using the combination of all 3 pre-training tasks has the best performance, with the exception of R@1 for 1%/99% train/test split. BoW-MLP is used to justify the use of a transformer as the encoder since it is has more complexity, BoW-MLP here looks up uni-grams from an embedding table, aggregates the embeddings with average pooling, and passes them though a shallow two-layer ML network with <math> tanh </math> activation to generate the final 512-dimensional query/document embeddings.<br />
<br />
[[File:table3-30.PNG | center]]<br />
<br />
Table 4 shows the results on the '''Natural Questions''' dataset, as observed the performance using the combination of all 3 pre-training tasks has the best performance in all testing scenarios.<br />
<br />
[[File:table4-30.PNG | center]]<br />
<br />
====Ablation Study====<br />
The embedding-dimensions, # of layers in BERT and pre-training tasks are altered for this ablation study as seen in table 5.<br />
<br />
Interestingly ICT+BFS+WLP is only an absolute of 1.5% better than just ICT in the low-data regime. The authors infer this could be representative of when there is insufficient downstream data for training that more global pre-training tasks become increasingly beneficial, since BFS and WLP offer this.<br />
<br />
[[File:table5-30.PNG | center]]<br />
<br />
====Evaluation of Open-Domain Retrieval====<br />
Augment ReQA benchmark with large-scale (sentence, evidence passage) pairs extracted from general Wikipedia. This is added as one million external candidate pairs into the existing retrieval candidate set of the ReQA benchmark. Results here are fairly consistent with previous results.<br />
<br />
<br />
[[File:table6-30.PNG | center]]<br />
<br />
==Conclusion==<br />
<br />
The authors performed a comprehensive study and have concluded that properly designing paragraph-level pre-training tasks including ICT, BFS, and WLP for a two-tower transformer model can significantly outperform a BM-25 algorithm, which is an unsupervised method using weighted token-matching as a scoring function. The findings of this paper suggest that a two-tower transformer model with proper pre-training tasks can replace the BM25 algorithm used in the retrieval stage to further improve end-to-end system performance. The authors plan to test the pre-training tasks with a larger variety of encoder architectures and trying other corpora than Wikipedia and additionally trying pre-training in contrast to different regularization methods.<br />
<br />
==Critique==<br />
<br />
The authors of this paper suggest from their experiments that the combination of the 3 pretraining tasks ICT, BFS, and WLP yields the best results in general. However, they do acknowledge why two of the three pre-training tasks may not produce equally significant results as the three. From the ablation study and the small margin (1.5%) in which the three tasks achieved outperformed only using ICT in combination with observing ICT outperforming the three as seen in table 6 in certain scenarios. It seems plausible that a subset of two tasks could also potentially outperform the three or reach a similar performance. I believe this should be addressed in order to conclude three pre-training tasks are needed over two, though they never address this comparison I think it would make the paper more complete.<br />
<br />
The two-tower transformer model has two encoders, one for query encoding, and one for passage encoding. But in theory, queries and passages are all textual information. Do we really need two different encoders? Maybe using one encoder to encode both queries and passages can achieve competitive results. I'd like to see if any papers have ever covered this.<br />
<br />
[1] Wei-Cheng Chang, Felix X Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. Pre-training tasks for embedding-based large-scale retrieval. arXiv preprint arXiv:2002.03932, 2020.<br />
<br />
[2] Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pre-Training_Tasks_For_Embedding-Based_Large-Scale_Retrieval&diff=47084Pre-Training Tasks For Embedding-Based Large-Scale Retrieval2020-11-27T22:17:27Z<p>X46yan: </p>
<hr />
<div><br />
==Contributors==<br />
Pierre McWhannel wrote this summary with editorial and technical contributions from STAT 946 fall 2020 classmates. The summary is based on the paper "Pre-Training Tasks for Embedding-Based Large-Scale Retrieval" which was presented at ICLR 2020. The authors of this paper are Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, Sanjiv Kumar [1].<br />
<br />
==Introduction==<br />
The problem domain which the paper is positioned within is large-scale query-document retrieval. This problem is: given a query/question to collect relevant documents which contain the answer(s) and identify the span of words which contain the answer(s). This problem is often separated into two steps 1) a retrieval phase which reduces a large corpus to a much smaller subset and 2) a reader which is applied to read the shortened list of documents and score them based on their ability to answer the query and identify the span containing the answer, these spans can be ranked according to some scoring function. In the setting of open-domain question answering the query, <math>q</math> is a question and the documents <math>d</math> represent a passage which may contain the answer(s). The focus of this paper is on the retriever question answering (QA), in this setting a scoring function <math>f: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R} </math> is utilized to map <math>(q,d) \in \mathcal{X} \times \mathcal{Y}</math> to a real value which represents the score of how relevant the passage is to the query. We desire high scores for relevant passages and and low scores otherwise and this is the job of the retriever. <br />
<br />
<br />
Certain characteristics desired of the retriever are high recall since the reader will only be applied to a subset meaning there is an an opportunity to identify false positive but not false negatives. The other characteristic desired of a retriever is low latency meaning it is computationally efficient. The popular approach to meet these heuristics has been the BM-25 [2] which relies on token-based matching between two high-dimensional and sparse vectors representing <math>q</math> and <math>d</math>. This approach performs well and retrieval can be performed in time sublinear to the number of passages with inverter indexing. The downfall of this type of algorithm is its inability to be optimized for a specific task. An alternate option is BERT or transformer based models with cross attention between query and passage pairs which can be optimized for a specific task. However, these models suffer from latency as the retriever needs to process all the pairwise combinations of a query with all passages. The next type of algorithm is an embedding-based model that jointly embeds the query and passage in the same embedding space and then uses measures such as cosine distance, inner product, or even the Euclidean distance in this space to get a score. The authors suggest the two-tower models can capture deeper semantic relationships in addition to being able to be optimized for specific tasks. This model is referred to as the "two-tower retrieval model", since it has two transformer based models where one embeds the query <math>\phi{(\cdot)}</math> and the other the passage <math>\psi{(\cdot)}</math>, then the embeddings can be scored by <math>f(q,d) = \langle \phi{(q)},\psi{(d)} \rangle \in \mathbb{R} </math>. This model can avoid the latency problems of a cross-layer model by pre-computing <math>\psi{(d)}</math> before hand and then by utilizing efficient approximate nearest neighbor search algorithms in the embedding space to find the nearest documents.<br />
<br />
<br />
These two-tower retrieval models often use BERT with pretrained weights, in BERT the pre-training tasks are masked-LM (MLM) and next sentence prediction (NSP). The authors note that the research of pre-training tasks that improve the performance of two-tower retrieval models is an unsolved research problem. This is exactly what the authors have done in this paper. That is, to develop pre-training tasks for the two-tower model based on heuristics and then validated by experimental results. The pre-training tasks suggested are Inverse Cloze Task (ICT), Body First Search (BFS), Wiki Link Prediction (WLP), and combining all three. The authors used the Retrieval Question-Answering ('''ReQA''') benchmark and used two datasets '''SQuAD''' and '''Natural Questions''' for training and evaluating their models.<br />
<br />
<br />
<br />
===Contributions of this paper===<br />
*The two-tower transformer model with proper pre-training can significantly outperform the widely used BM-25 Algorithm.<br />
*Paragraph-level pre-training tasks such as ICT, BFS, and WLP hugely improve the retrieval quality, whereas the most widely used pre-training task (the token-level masked-LM) gives only marginal gains.<br />
*The two-tower models with deep transformer encoders benefit more from paragraph-level pre-training compared to its shallow bag-of-word counterpart (BoW-MLP)<br />
<br />
==Background on Two-Tower Retrieval Models==<br />
This section is to present the two-tower model in more detail. As seen previously <math>q \in \mathcal{X}</math>, <math> d \in \mathcal{y}</math> are the query and document or passage respectively. The two-tower retrieval model consists of two encoders: <math>\phi: \mathcal{X} \rightarrow \mathbb{R}^k</math> and <math>\psi: \mathcal{Y} \rightarrow \mathbb{R}^k</math>. The tokens in <math>\mathcal{X},\mathcal{Y}</math> are a sequence of tokens representing the query and document respectively. The scoring function is <math> f(q,d) = \langle \phi{(q)},\psi{(d)} \rangle \in \mathbb{R} </math>. The cross-attention BERT-style model would have <math> f_{\theta,w}(q,d) = \psi_{\theta}{(q \oplus d)^{T}}w </math> as a scoring function and <math> \oplus </math> signifies the concatenation of the sequences of <math>q</math> and <math>d</math>, <math> \theta </math> are the parameters of the cross-attention model. The architectures of these two models can be seen below in figure 1.<br />
<br />
<br/><br />
[[File:two-tower-vs-cross-layer.PNG|center|Figure 1: end-to-end model]]<br />
<br/><br />
<br />
===Comparisons===<br />
<br />
====Inference====<br />
In terms of inference the two-tower architecture has a distinct advantage as being easily seen from the architecture. Since the query and document are embedded independently, <math> \psi{(d)}</math> can be pre computed. This is an important advantage as it makes the two-tower method feasible. As an example the authors state the inner product between 100 query embeddings and 1 million document embeddings only takes hundreds of milliseconds on CPUs, however, for the equivalent calculation for a cross-attention-model it takes hours on GPUs. In addition they remark the retrieval of the relevant documents can be done in sublinear time with respect <math> |\mathcal{Y}| </math> using a maximum inner product (MIPS) algorithm with little loss in recall.<br />
<br />
====Learning====<br />
In terms of learning the two-tower model compared to BM-25 has a unique advantage of being possible to be fine-tuned for downstream tasks. This problem is within the paradigm of metric learning as the scoring function can be passed as input to an objective function of the log likelihood <math> \max_{\theta} \space log(p_{\theta}(d|q)) </math>. The conditional probability is represented by a SoftMax and then can be rewritten as:<br />
<br />
<br />
<math> p_{\theta}(d|q) = \frac{exp(f_{\theta}(q,d))}{\sum_{d' \in D}} exp(f_{\theta}(q,d'))</math><br />
<br />
<br />
Here <math> \mathcal{D} </math> is the set of all possible documents, the denominator is a summation over <math>\mathcal{D} </math>. As is customary they utilized a sampled SoftMax technique to approximate the full-SoftMax. The technique they have employed is by using a small subset of documents in the current training batch, while also using a proper correcting term to ensure the unbiasedness of the partition function.<br />
<br />
====Why do we need pre-training then?====<br />
Although the two-tower model can be fine-tuned for a downstream task getting sufficiently labelled data is not always possible. This is an opportunity to look for better pre-training tasks as the authors have done. The authors suggest getting a set of pre-training task <math> \mathcal{T} </math> with labelled positive pairs of queries of documents/passages and then afterward fine-tuning on the limited amount of data for the downstream task.<br />
<br />
==Pre-Training Tasks==<br />
The authors suggest three pre-training tasks for the encoders of the two-tower retriever, these are ICT, BFS, and WLP. Keep in mind that ICT is not a newly proposed task. These pre-training tasks are developed in accordance to two heuristics they state and our paragraph-level pre-training tasks in contrast to more common sentence-level or word-level tasks:<br />
<br />
# The tasks should capture a different resolutions of semantics between query and document. The semantics can be the local context within a paragraph, global consistency within a document, and even semantic relation between two documents.<br />
# Collecting the data for the pre-training tasks should require as little human intervention as possible and be cost-efficient.<br />
<br />
All three of the following tasks can be constructed in a cost-efficient and with minimal human intervention by utilizing Wikipedia articles. In figure 2 below the dashed line surrounding the block of text is to be regarded as the document <math> d </math> and the rest of the circled portions of text make up the three queries <math>q_1, q_2, q_3 </math> selected for each task.<br />
<br />
<br/><br />
[[File:pre-train-tasks.PNG|center|Figure 1: end-to-end model]]<br />
<br/><br />
<br />
===Inverse Cloze Task (ICT)===<br />
This task is given a passage <math> p </math> consisting of <math> n </math> sentences <math>s_{i}, i=1,..,n </math>. <math> s_{i} </math> is randomly sampled and used as the query <math> q </math> and then the document is constructed to be based on the remaining <math> n-1 </math> sentences. This is utilized to meet the local context within a paragraph of the first heuristic. In figure 2 <math> q_1 </math> is an example of a potential query for ICT based on the selected document <math> d </math>.<br />
<br />
===Body First Search (BFS)===<br />
BFS is <math> (q,d) </math> pairs are created by randomly selecting a sentence <math> q_2 </math> and setting it as the query from the first section of a Wikipedia page which contains the selected document <math> d </math>. This is utilized to meet the global context consistency within a document as the first section in Wikipedia generally is an overview of the whole page and they anticipate it to contain information central to the topic. In figure 2 <math> q_2 </math> is an example of a potential query for BFS based on the selected document <math> d </math>.<br />
<br />
===Wiki Link Prediction (WLP)===<br />
The third task is selecting a hyperlink that is within the selected document <math> d </math>, as can be observed in figure 2 "machine learning" is a valid option. The query <math> q_3 </math> will then be a random sentence on the first section of the page where the hyperlinks redirects to. This is utilized to provide the semantic relation between documents.<br />
<br />
<br />
Additionally Masked LM (MLM) is considered during the experimentations which again is one of the primary pre-training tasks used in BERT.<br />
<br />
==Experiments and Results==<br />
<br />
===Training Specifications===<br />
<br />
*Each tower is constructed from the 12 layers BERT-base model.<br />
*Embedding is achieved by applying a linear layer on the [CLS] token output to get an embedding dimension of 512.<br />
*Sequence lengths of 64 and 288 respectively for the query and document encoders.<br />
*Pre-trained on 32 TPU v3 chips for 100k step with Adam optimizer learning rate of <math> 1 \times 10^{-4} </math> with 0.1 warm-up ratio, followed by linear learning rate decay and batch size of 8192. (~2.5 days to complete)<br />
*Fine tuning on downstream task <math> 5 \times 10^{-5} </math> with 2000 training steps and batch size 512.<br />
*Tokenizer is WordPiece.<br />
<br />
===Pre-Training Tasks Setup===<br />
<br />
*MLM, ICT, BFS, and WLP are used.<br />
*Various combinations of the aforementioned tasks are tested.<br />
*Tasks define <math> (q,d) </math> pairings.<br />
*ICT's document <math> d </math> has the article title and passage separated by a [SEP] symbol as input to the document encoder.<br />
<br />
Below table 1 shows some statistics for the datasets constructed for the ICT, BFS, and WLP tasks. Token counts considered after the tokenizer is applied.<br />
<br />
[[File: wiki-pre-train-statistics.PNG | center]]<br />
<br />
===Downstream Tasks Setup===<br />
Retrieval Question-Answering ('''ReQA''') benchmark used.<br />
<br />
*Datasets: '''SQuAD''' and '''Natural Questions'''.<br />
*Each entry of data from datasets is <math> (q,a,p) </math>, where each element is, query <math> q </math>, answer <math> a </math>, and passage <math> p </math> containing the answer, respectively.<br />
*Authors split the passage to sentences <math>s_i</math> where <math> i=1,...n </math> and <math> n </math> is the number of sentences in the passage.<br />
*The problem is then recast for a given query to retrieve the correct sentence-passage <math> (s,p) </math> from all candidates for all passages split in this fashion.<br />
*They remark their reformulation makes it a more difficult problem.<br />
<br />
Note: Experiments done on '''ReQA''' benchmarks are not entirely open-domain QA retrieval as the candidate <math> (s,p) </math> only cover the training set of QA dataset instead of entire Wikipedia articles. There is some experiments they did with augmented the dataset as will be seen in table 6.<br />
<br />
Below table 2 shows statistics of the datasets used for the downstream QA task.<br />
<br />
<br />
[[File:data-statistics-30.PNG | center]]<br />
<br />
===Evaluation===<br />
<br />
*3 train/test splits: 1%/99%, 5%/95%, and 80%/20%.<br />
*10% of training set is used as a validation set for hyper-parameter tuning.<br />
*Training never has seen the test queries in the test pairs of <math>(q,d).</math><br />
*Evaluation metric is recall@k since the goal is to capture positives in the top-k results.<br />
<br />
<br />
===Results===<br />
<br />
Table 3 shows the results on the '''SQuAD''' dataset, as observed the performance using the combination of all 3 pre-training tasks has the best performance, with the exception of R@1 for 1%/99% train/test split. BoW-MLP is used to justify the use of a transformer as the encoder since it is has more complexity, BoW-MLP here looks up uni-grams from an embedding table, aggregates the embeddings with average pooling, and passes them though a shallow two-layer ML network with <math> tanh </math> activation to generate the final 512-dimensional query/document embeddings.<br />
<br />
[[File:table3-30.PNG | center]]<br />
<br />
Table 4 shows the results on the '''Natural Questions''' dataset, as observed the performance using the combination of all 3 pre-training tasks has the best performance in all testing scenarios.<br />
<br />
[[File:table4-30.PNG | center]]<br />
<br />
====Ablation Study====<br />
The embedding-dimensions, # of layers in BERT and pre-training tasks are altered for this ablation study as seen in table 5.<br />
<br />
Interestingly ICT+BFS+WLP is only an absolute of 1.5% better than just ICT in the low-data regime. The authors infer this could be representative of when there is insufficient downstream data for training that more global pre-training tasks become increasingly beneficial, since BFS and WLP offer this.<br />
<br />
[[File:table5-30.PNG | center]]<br />
<br />
====Evaluation of Open-Domain Retrieval====<br />
Augment ReQA benchmark with large-scale (sentence, evidence passage) pairs extracted from general Wikipedia. This is added as one million external candidate pairs into the existing retrieval candidate set of the ReQA benchmark. Results here are fairly consistent with previous results.<br />
<br />
<br />
[[File:table6-30.PNG | center]]<br />
<br />
==Conclusion==<br />
<br />
The authors performed a comprehensive study and have concluded that properly designing paragraph-level pre-training tasks including ICT, BFS, and WLP for a two-tower transformer model can significantly outperform a BM-25 algorithm, which is an unsupervised method using weighted token-matching as a scoring function. The findings of this paper suggest that a two-tower transformer model with proper pre-training tasks can replace the BM25 algorithm used in the retrieval stage to further improve end-to-end system performance. The authors plan to test the pre-training tasks with a larger variety of encoder architectures and trying other corpora than Wikipedia and additionally trying pre-training in contrast to different regularization methods.<br />
<br />
==Critique==<br />
<br />
The authors of this paper suggest from their experiments that the combination of the 3 pretraining tasks ICT, BFS, and WLP yields the best results in general. However, they do acknowledge why two of the three pre-training tasks may not produce equally significant results as the three. From the ablation study and the small margin (1.5%) in which the three tasks achieved outperformed only using ICT in combination with observing ICT outperforming the three as seen in table 6 in certain scenarios. It seems plausible that a subset of two tasks could also potentially outperform the three or reach a similar performance. I believe this should be addressed in order to conclude three pre-training tasks are needed over two, though they never address this comparison I think it would make the paper more complete.<br />
<br />
==References==<br />
[1] Wei-Cheng Chang, Felix X Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. Pre-training tasks for embedding-based large-scale retrieval. arXiv preprint arXiv:2002.03932, 2020.<br />
<br />
[2] Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Extreme_Multi-label_Text_Classification&diff=47063Extreme Multi-label Text Classification2020-11-27T20:38:38Z<p>X46yan: </p>
<hr />
<div>== Presented By ==<br />
Mohan Wu<br />
<br />
== Introduction ==<br />
In this paper, the authors are interested a field of problems called extreme classification. These problems involve training a classifier to give the most relevant tags for any given text; the difficulties arise from the fact that the label set is so large so that most models would fail. The authors propose a new model called APLC-XLNet which fine-tunes the generalized autoregressive pre-trained model (XLNet) by using Adaptive Probabilistic Label Clusters (APLC) to calculate cross-entropy loss. This method takes advantage of unbalanced label distributions by forming clusters to reduce training time. The authors experimented on five different datasets and showed that their method far outweighs the existing state-of-the-art models.<br />
<br />
== Motivation ==<br />
Extreme multi-label text classification (XMTC) has applications in many recent problems such as providing word representations of a large vocabulary [1], tagging Wikipedia with relevant labels [2], and giving product descriptions for search advertisements [3]. The authors are motivated by the shortcomings of traditional methods in the creation of XMTC. For example, one such method of classifying text is the bag-of-words (BOW) approach where a vector represents the frequency of a word in a corpus. However, BOW does not consider the location of the words so it cannot determine context and semantics. Motivated by the success of transfer learning in a wide range of natural language processing (NLP) problems, the authors propose to adapt XLNet [4] to the XMTC problem. The final challenge is the nature of the labeling distribution can be very sparse for some labels. The authors solve this problem by combining the Probabilistic Label Tree [5] method and the Adaptive Softmax [6] to propose APLC.<br />
<br />
== Related Work ==<br />
Two approaches have been proposed to solve the XMTC problem: traditional BOW techniques, and modern deep learning models.<br />
<br />
=== BOW Approaches ===<br />
Intuitively, researchers can apply the one-vs-all approach in which they can fit a classifier for each label and thus XMTC reduces to a binary classification problem. This technique has been shown to achieve high accuracy; however, due to a large number of labels, this approach is very computationally expensive. There have been some techniques to reduce the complexity by pruning weights (PDSparse) to induce sparsity but still, this method is quite expensive. PPDSparse was introduced to parallelize PDSparse in a large-scaled distributed setting. Additionally, DiSMEC was introduced as another extension to PDSparse that allows for a distributed and parallel training mechanism which can be done efficiently by a novel negative sampling technique in addition to the pruning of spurious weights. Another approach is to simply apply a dimensional reduction technique on the label space. However, doing so has shown to have serious negative effects on the prediction accuracy due to loss of information in the compression phase. Finally, another approach is to use a tree to partition labels into groups based on similarity. This approach has shown to be quite fast but unfortunately, due to the problems with BOW methods, the accuracy is poor.<br />
<br />
=== Deep Learning Approaches ===<br />
Unlike BOW approaches, deep learning can learn dense representations of the corpus using context and semantics. One such example is X-BERT[7] which divides XTMC problems into 3 steps. First, it partitions the label set into clusters based on similarity. Next, it fits a BERT model on the label clusters for the given corpus. Finally, it trains simple linear classifiers to rank the labels in each cluster. Since there are elements of traditional techniques in X-BERT, namely, the clustering step and the linear classifier step, the authors propose to improve upon this approach.<br />
<br />
== APLC-XLNet ==<br />
APLC-XLNet consists of three parts: the pre-trained XLNet-Base as the base, the APLC output layer, and a fully connected hidden layer connecting the pooling layer of XLNet as the output layer, as it can be seen in figure 1. To recap, XLNet [8] is a generalized autoregressive pretraining language model based upon capturing bidirectional contexts through maximizing the expected likelihood over all permutations of the factorization<br />
the order which can be expressed as follows:<br />
<br />
\begin{equation}<br />
\underset{\theta}{\max}{ E_{z \sim Z_T}[\sum_{t=1}^{T} \log {p_{\theta}(x_{z_{t}}|x_{z<t})}] }<br />
\end{equation}<br />
<br />
where <math> \theta </math> denotes the parameters of the model, <math> Z_T </math> is the set of all possible permutations of the sequence with length T,<br />
<math> z_t </math> is the <math> t </math>-th element and <math> z < t </math> deontes the preceding <math> t-1 </math> elements in a permutation <math> z</math>.<br />
<br />
One major challenge in XTMC problems is that most data fall into a small group of labels. To tackle this challenge, the authors propose partitioning the label set into one head cluster, <math> V_h </math>, and many tail clusters, <math> V_1 \cup \ldots \cup V_K </math>. The head cluster contains the most popular labels while the tail clusters contain the rest of the labels. The clusters are then inserted in a 2-level tree where the root node is the head cluster and the leaves are the tail clusters. Using this architecture improves computation times significantly since most of the time the data stops at the root node.<br />
<br />
[[File:Capture1111.JPG |center|600px]]<br />
<br />
<div align="center">Figure 1: Architecture of the proposed APLC-XLNet model. V denotes the label cluster in APLC.</div><br />
<br />
The authors define the probability of each label as follows:<br />
<br />
\begin{equation}<br />
p(y_{ij} | x) = <br />
\begin{cases} <br />
p(y_{ij}|x) & \text{if } y_{ij} \in V_h \\<br />
p(V_t|x)p(y_{ij}|V_t,x) & \text{if } y_{ij} \in V_t<br />
\end{cases}<br />
\end{equation}<br />
<br />
where <math> x </math> is the feature of a given sample, <math> y_{ij} </math> is the j-th label in the i-th sample, and <math> V_t </math> is the t-th tail cluster. Let <math> Y_i </math> be the set of labels for the i-th sample, and define <math> L_i = |Y_i| </math>. The authors propose an intuitive objective loss function for multi-label classification:<br />
<br />
\begin{equation}<br />
J(\theta) = -\frac{1}{\sum_{i=1}^N L_i} \sum_{i=1}^N \sum_{j \in Y_i} (y_{ij} logp(y_{ij}) + (1 - y_{ij}) log(1- p(y_{ij}))<br />
\end{equation}<br />
where <math>N</math> is the number of samples, <math>p(y_{ij})</math> is defined above and <math> y_{ij} \in \{0, 1\} </math>.<br />
<br />
The number of parameters in this model is given by:<br />
\begin{equation}<br />
N_{par} = d(l_h + K) + \sum_{i=1}^K \frac{d}{q^i}(d+l_i)<br />
\end{equation}<br />
where <math> d </math> is the dimension of the hidden state of <math> V_h </math>, <math> q </math> is a decay variable, <math> l_h = |V_h| </math> and <math> l_i = |V_i| </math>. Furthermore, the computational cost can be expressed as follows:<br />
\begin{align}<br />
C &= C_h + \sum_{i=1}^K C_i \\<br />
&= O(N_b d(l_h + K)) + O(\sum_{i=1}^K p_i N_b \frac{d}{q^i}(l_i + d))<br />
\end{align}<br />
where <math> N_b </math> is the batch size. <br />
<br />
Note that in practical settings where <math>K\ll l_h</math> and <math>d\ll l_i</math> the computational cost can be rewritten as <math>O(N_b d(l_h + \sum_{i=1}^Kp_i\frac{l_i}{q_i}))</math>. Furthermore, the authors pointed out that, assuming all of <math>l_i</math> are fixed, the computational complexity can be reduced by configuring the model so that <math>p_i</math> is small, which corresponds to a low probability of a batch entering the tail cluster <math>i</math>, as the computational burden is dominated by the cost of entering tail clusters.<br />
<br />
=== Training APLC-XLNet ===<br />
Training APLC-XLNet essentially boils down to training its three parts. The authors suggest using discriminative fine-tuning method[9] to train the model entirely while assigning different learning rates to each part. Since XLNet is pretrained, the learning rate, <math> \eta_x </math>, should be small, while the output layer is specific to this type of problem so its learning rate, <math> \eta_a </math>, should be large. For the connecting hidden layer, the authors chose a learning rate, <math> \eta_h </math>, such that <math> \eta_x < \eta_h < \eta_a </math>. For each of the learning rates, the authors suggest a slanted triangular learning schedule[9] defined as:<br />
\begin{equation}<br />
\eta =<br />
\begin{cases} <br />
\eta_0 \frac{t}{t_w} & \text{if } t \leq t_w \\<br />
\eta_0 \frac{t_a - t}{t_a - t_w} & \text{if } t > t_w<br />
\end{cases}<br />
\end{equation}<br />
where <math> \eta_0 </math> is the starting learning rate, <math> t </math> is the current step, <math> t_w </math> is the chosen warm-up threshold and <math> t_a </math> is the total number of steps. The objective here is to motivate the model to converge quickly to the suitable space at the beginning and then refine the parameters. Learning rates are first increased linearly, and then decayed gradually according to the strategy.<br />
<br />
== Results ==<br />
The authors tested the APLC-XLNet model in several benchmark datasets against current state-of-the-art baselines, including two embedding based models, SLEEC and AnnexML, the 1-vs-all model DisMEC, three tree-based approaches, PfastreXML, Parabel and Bonsai, and two deep learning approaches, XML-CNN and AttentionXML. The evaluation metric, P@k is defined as:<br />
\begin{equation}<br />
P@k = \frac{1}{k} \sum_{i \in rank_k(\hat{y})} y_i<br />
\end{equation}<br />
where <math> rank_k(\hat{y}) </math> is the top k ranked probability in the prediction vector, <math> \hat{y} </math>.<br />
<br />
As shown in the table below, APLC-XLNet outperforms the two embedding based models, SLEEC and AnnexML, across all five datasets. It outperforms DisMEC on three datasets except for Wiki-500k and Amazon-670k. Among the three tree based methods, Bonsai performs the best. And APLC-XLNet outperforms Bonsai on three datasets. At last, APLC-XLNet also outperforms the deep learning methods on four datasets. <br />
[[File:Paper.PNG|1000px|]]<br />
<br />
To help with tuning the model in the number of clusters and different partitions, the authors experimented on two different datasets: EURLex and Wiki10.<br />
<br />
[[File:XTMC2.PNG|1000px|]]<br />
<br />
The three different partitions in the second graph the authors used were (0.7, 0.2, 0.1), (0.33, 0.33, 0.34), and (0.1, 0.2, 0.7) where 3 clusters were fixed.<br />
<br />
== Conclusion ==<br />
The authors have proposed a new deep learning approach to solve the XMTC problem based on XLNet, namely, APLC-XLNet. APLC-XLNet consists of three parts: the pretrained XLNet that takes the input of the text, a connecting hidden layer, and finally an APLC output layer to give the rankings of relevant labels. Their experiments show that APLC-XLNet has better results in several benchmark datasets over the current state-of-the-art models.<br />
<br />
== Critques ==<br />
The authors chose to use the same architecture for every dataset. The model does not achieve state-of-the-art performance on the larger datasets. Perhaps, a more complex model in the second part of the model could help achieve better results. The authors also put a lot of effort in explaining the model complexity for APLC-XLNet but does not compare it other state-of-the-art models. A table of model parameters and complexity for each model could be helpful in explaining why their techniques are efficient. <br />
<br />
In their implementation, the label set was partitioned into clusters in a heuristic manner by experimenting different number of clusters and partition proportion. There might be a better model-based method to choose the partition structure.<br />
<br />
== References ==<br />
[1] Mikolov, T., Kombrink, S., Burget, L., Cernock ˇ y, J., and `<br />
Khudanpur, S. Extensions of recurrent neural network<br />
language model. In 2011 IEEE International Conference<br />
on Acoustics, Speech and Signal Processing (ICASSP),<br />
pp. 5528–5531. IEEE, 2011. <br />
<br />
[2] Dekel, O. and Shamir, O. Multiclass-multilabel classification with more classes than examples. In Proceedings<br />
of the Thirteenth International Conference on Artificial<br />
Intelligence and Statistics, pp. 137–144, 2010. <br />
<br />
[3] Jain, H., Prabhu, Y., and Varma, M. Extreme multi-label loss<br />
functions for recommendation, tagging, ranking & other<br />
missing label applications. In Proceedings of the 22nd<br />
ACM SIGKDD International Conference on Knowledge<br />
Discovery and Data Mining, pp. 935–944. ACM, 2016. <br />
<br />
[4] Yang, W., Xie, Y., Lin, A., Li, X., Tan, L., Xiong, K., Li,<br />
M., and Lin, J. End-to-end open-domain question answering with BERTserini. In NAACL-HLT (Demonstrations),<br />
2019a. <br />
<br />
[5] Jasinska, K., Dembczynski, K., Busa-Fekete, R.,<br />
Pfannschmidt, K., Klerx, T., and Hullermeier, E. Extreme f-measure maximization using sparse probability<br />
estimates. In International Conference on Machine Learning, pp. 1435–1444, 2016. <br />
<br />
[6] Grave, E., Joulin, A., Cisse, M., J ´ egou, H., et al. Effi- ´<br />
cient softmax approximation for gpus. In Proceedings of<br />
the 34th International Conference on Machine LearningVolume 70, pp. 1302–1310. JMLR.org, 2017 <br />
<br />
[7] Wei-Cheng, C., Hsiang-Fu, Y., Kai, Z., Yiming, Y., and<br />
Inderjit, D. X-BERT: eXtreme Multi-label Text Classification using Bidirectional Encoder Representations from<br />
Transformers. In NeurIPS Science Meets Engineering of<br />
Deep Learning Workshop, 2019.<br />
<br />
[8] Yang, Zhilin, et al. "Xlnet: Generalized autoregressive pretraining for language understanding." <br />
Advances in neural information processing systems. 2019.<br />
<br />
[9] Howard, J. and Ruder, S. Universal language model finetuning for text classification. In Proceedings of the 56th<br />
Annual Meeting of the Association for Computational<br />
Linguistics (Volume 1: Long Papers), pp. 328–339, 2018</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=45351stat940F212020-11-19T03:48:21Z<p>X46yan: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] <br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Procession method to Improve Robustness And Uncertainity || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] || [[https://youtu.be/epBzlXHFNlY Presentation ]]<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||<br />
|-<br />
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] <br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || [https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration Summary] ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning_For_Domain_Generalization Summary]|| [https://youtu.be/b9MU5cc3-m0 Presentation Video]<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIRCOMPARISON OFGRAPHNEURALNETWORKSFORGRAPHCLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification Summary] || [https://drive.google.com/file/d/1Dx6mFL_zBAJcfPQdOWAuPn0_HkvTL_0z/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| BREAKING CERTIFIED DEFENSES: SEMANTIC ADVERSARIAL EXAMPLES WITH SPOOFED ROBUSTNESS CERTIFICATES || [https://openreview.net/pdf?id=HJxdTxHYvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates Summary] || [[https://drive.google.com/file/d/1amkWrR8ZQKnnInjedRZ7jbXTqCA8Hy1r/view?usp=sharing Presentation ]] ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=SuperGLUE Summary] || [[https://youtu.be/5h-365TPQqE Presentation ]]<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations Summary] || Learn<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Fisher_Vectors_for_Unsupervised_Representation_Learning Summary] || [https://www.youtube.com/watch?v=WKUj30tgHfs&feature=youtu.be video]<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Genralization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Model_Agnostic_Learning_of_Semantic_Features Summary]|| [https://youtu.be/djrJG6pJaL0 video] also available on Learn<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION Summary] ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Dense Passage Retrieval for Open-Domain Question Answering || [https://arxiv.org/abs/2004.04906 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering Summary] || Learn<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| AdaCompress: Adaptive Compression for Online Computer Vision Services || [https://arxiv.org/pdf/1909.08148.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adacompress:_Adaptive_compression_for_online_computer_vision_services Summary] ||<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || ||<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || https://arxiv.org/pdf/1904.04232.pdf || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pre-Training_Tasks_For_Embedding-Based_Large-Scale_Retrieval Do Not Review Yet]|| Learn<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION&diff=45258DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION2020-11-18T02:44:16Z<p>X46yan: </p>
<hr />
<div>== Presented by == <br />
Bowen You<br />
<br />
== Introduction == <br />
<br />
Reinforcement learning refers to training a neural network to make a series of decisions dependent on a complex, evolving environment. Typically, this is accomplished by 'rewarding' or 'penalizing' the network based on its behaviors over time. Intelligent agents are able to accomplish tasks which may not have been seen in prior experiences. For recent reviews of reinforcement learning, see [3,4]. One way to achieve this is to represent the world based on past experiences. In this paper, the authors propose an agent that learns long-horizon behaviors purely by latent imagination and outperforms previous agents in terms of data efficiency, computation time, and final performance.<br />
<br />
=== Preliminaries ===<br />
<br />
This section aims to define a few key concepts in reinforcement learning. In the typical reinforcement problem, an <b>agent</b> interacts with the <b>environment</b>. The environment is typically defined by a <b>model</b> that may or may not be known. The environment may be characterized by its <b>state</b> <math display="inline"> s \in \mathcal{S}</math>. The agent may choose to take <b>actions</b> <math display="inline"> a \in \mathcal{A}</math> to interact with the environment. Once an action is taken, the environment returns a <b>reward</b> <math display="inline"> r \in \mathcal{R}</math>as feedback.<br />
<br />
The actions an agent decides to take is defined by a <b>policy</b> function <math display="inline"> \pi : \mathcal{S} \to \mathcal{A}</math>. <br />
Additionally we define functions <math display="inline"> V_{\pi} : \mathcal{S} \to \mathbb{R} \in \mathcal{S}</math> and <math display="inline"> Q_{\pi} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}</math> to represent the value function and action-value functions of a given policy <math display="inline">\pi</math> respectively.<br />
<br />
Thus the goal is to find an optimal policy <math display="inline">\pi_{*}</math> such that <br />
\[<br />
\pi_{*} = \arg\max_{\pi} V_{\pi}(s) = \arg\max_{\pi} Q_{\pi}(s, a)<br />
\]<br />
<br />
=== Feedback Loop ===<br />
<br />
Given this framework, agents are able to interact with the environment in a sequential fashion, namely a sequence of actions, states, and rewards. Let <math display="inline"> S_t, A_t, R_t</math> denote the state, action, and reward obtained at time <math display="inline"> t = 1, 2, \ldots, T</math>. We call the tuple <math display="inline">(S_t, A_t, R_t)</math> one <b>episode</b>. This can be thought of as a feedback loop or a sequence<br />
\[<br />
S_1, A_1, R_1, S_2, A_2, R_2, \ldots, S_T<br />
\]<br />
<br />
== Motivation ==<br />
<br />
In many problems, the amount of actions an agent is able to take is limited. Then it is difficult to interact with the environment to learn an accurate representation of the world. The proposed method in this paper aims to solve this problem by "imagining" the state and reward that the action will provide. That is, given a state <math display="inline">S_t</math>, the proposed method generates <br />
\[<br />
\hat{A}_t, \hat{R}_t, \hat{S}_{t+1}, \ldots<br />
\]<br />
<br />
By doing this, an agent is able to plan-ahead and perceive a representation of the environment without interacting with it. Once an action is made, the agent is able to update their representation of the world by the actual observation. This is particularly useful in applications where experience is not easily obtained. <br />
<br />
== Dreamer == <br />
<br />
The authors of the paper call their method Dreamer. At high-level, Dreamer first learns latent dynamics from past experience, then it learns actions and states from imagined trajectories to maximize future action rewards. Finally, it predicts the next action and executes it. This whole process is illustrated below. <br />
<br />
[[File: dreamer_overview.png | 800px]]<br />
<br />
<br />
Let's look at Dreamer in detail. It consists of:<br />
* Representation <math display="inline">p_{\theta}(s_t | s_{t-1}, a_{t-1}, o_{t}) </math><br />
* Transition <math display="inline">q_{\theta}(s_t | s_{t-1}, a_{t-1}) </math><br />
* Reward <math display="inline"> q_{\theta}(r_t | s_t)</math><br />
* Action <math display="inline"> q_{\phi}(a_t | s_t)</math><br />
* Value <math display="inline"> v_{\psi}(s_t)</math><br />
<br />
where <math display="inline"> \theta, \phi, \psi</math> are learned neural network parameters.<br />
<br />
There are three main components to the proposed algorithm:<br />
* Dynamics Learning: Using past experience data, the agent learns to encode observations and actions into latent states and predicts environment rewards. One way to do this is via representation learning.<br />
* Behavior Learning: In the latent space, the agent predicts state values and actions that maximize the future rewards through back-propagation.<br />
* Environment Interaction: The agent encodes the episode to compute the current model state and predict the next action to interact with the environment.<br />
<br />
The proposed algorithm is described below.<br />
<br />
[[File:dreamer.png|frameless|500px|Dreamer algorithm]]<br />
<br />
Notice that there are three neural networks that are trained simultaneously. <br />
The neural networks with parameters <math display="inline"> \theta, \phi, \psi </math> correspond to models of the environment, action and values respectively.<br />
<br />
== Results ==<br />
<br />
The figure below summarizes the performance of Dreamer compared to other state-of-the-art reinforcement learning agents for continuous control tasks. Overall, it achieves the most consistent performance among them. Additionally, while other agents heavily rely on prior experience, Dreamer is able to learn behaviors with minimal interactions with the environment.<br />
<br />
[[File:scores.png|frameless|500px|Comparison of RL-agents against several continuous control tasks]]<br />
<br />
== Conclusion ==<br />
<br />
This paper presented a new algorithm for training reinforcement learning agents with minimal interactions with the environment. The algorithm outperforms many previous algorithms in terms of computation time and overall performance. This has many practical applications as many agents rely on prior experience which may be hard to obtain in the real-world. Although it may be an extreme example, consider a reinforcement learning agent that learns how to perform rare surgeries may not have enough data samples. This paper shows that it is possible to train agents without requiring many prior interactions with the environment.<br />
<br />
== References ==<br />
<br />
[1] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations (ICLR), 2020.<br />
<br />
[2] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.<br />
<br />
[3] Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6), 26–38.<br />
<br />
[4] Nian, R., Liu, J., & Huang, B. (2020). A review On reinforcement learning: Introduction and applications in industrial process control. Computers and Chemical Engineering, 139, 106886.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION&diff=45257DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION2020-11-18T02:43:39Z<p>X46yan: </p>
<hr />
<div>== Presented by == <br />
Bowen You<br />
<br />
== Introduction == <br />
<br />
Reinforcement learning refers to training a neural network to make a series of decisions dependent on a complex, evolving environment. Typically, this is accomplished by 'rewarding' or 'penalizing' the network based on its behaviors over time. Intelligent agents are able to accomplish tasks which may not have been seen in prior experiences. For recent reviews of reinforcement learning, see [3,4]. One way to achieve this is to represent the world based on past experiences. In this paper, the authors propose an agent that learns long-horizon behaviors purely by latent imagination and outperforms previous agents in terms of data efficiency, computation time, and final performance.<br />
<br />
=== Preliminaries ===<br />
<br />
This section aims to define a few key concepts in reinforcement learning. In the typical reinforcement problem, an <b>agent</b> interacts with the <b>environment</b>. The environment is typically defined by a <b>model</b> that may or may not be known. The environment may be characterized by its <b>state</b> <math display="inline"> s \in \mathcal{S}</math>. The agent may choose to take <b>actions</b> <math display="inline"> a \in \mathcal{A}</math> to interact with the environment. Once an action is taken, the environment returns a <b>reward</b> <math display="inline"> r \in \mathcal{R}</math>as feedback.<br />
<br />
The actions an agent decides to take is defined by a <b>policy</b> function <math display="inline"> \pi : \mathcal{S} \to \mathcal{A}</math>. <br />
Additionally we define functions <math display="inline"> V_{\pi} : \mathcal{S} \to \mathbb{R} \in \mathcal{S}</math> and <math display="inline"> Q_{\pi} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}</math> to represent the value function and action-value functions of a given policy <math display="inline">\pi</math> respectively.<br />
<br />
Thus the goal is to find an optimal policy <math display="inline">\pi_{*}</math> such that <br />
\[<br />
\pi_{*} = \arg\max_{\pi} V_{\pi}(s) = \arg\max_{\pi} Q_{\pi}(s, a)<br />
\]<br />
<br />
=== Feedback Loop ===<br />
<br />
Given this framework, agents are able to interact with the environment in a sequential fashion, namely a sequence of actions, states, and rewards. Let <math display="inline"> S_t, A_t, R_t</math> denote the state, action, and reward obtained at time <math display="inline"> t = 1, 2, \ldots, T</math>. We call the tuple <math display="inline">(S_t, A_t, R_t)</math> one <b>episode</b>. This can be thought of as a feedback loop or a sequence<br />
\[<br />
S_1, A_1, R_1, S_2, A_2, R_2, \ldots, S_T<br />
\]<br />
<br />
== Motivation ==<br />
<br />
In many problems, the amount of actions an agent is able to take is limited. Then it is difficult to interact with the environment to learn an accurate representation of the world. The proposed method in this paper aims to solve this problem by "imagining" the state and reward that the action will provide. That is, given a state <math display="inline">S_t</math>, the proposed method generates <br />
\[<br />
\hat{A}_t, \hat{R}_t, \hat{S}_{t+1}, \ldots<br />
\]<br />
<br />
By doing this, an agent is able to plan-ahead and perceive a representation of the environment without interacting with it. Once an action is made, the agent is able to update their representation of the world by the actual observation. This is particularly useful in applications where experience is not easily obtained. <br />
<br />
== Dreamer == <br />
<br />
The authors of the paper call their method Dreamer. At high-level, Dreamer first learns latent dynamics from past experience, then it learns actions and states from imagined trajectories to maximize future action rewards. Finally, it predicts the next action and executes it. This whole process is illustrated below. <br />
<br />
[[File: dreamer_overview.png | 600px]]<br />
<br />
<br />
It consists of:<br />
* Representation <math display="inline">p_{\theta}(s_t | s_{t-1}, a_{t-1}, o_{t}) </math><br />
* Transition <math display="inline">q_{\theta}(s_t | s_{t-1}, a_{t-1}) </math><br />
* Reward <math display="inline"> q_{\theta}(r_t | s_t)</math><br />
* Action <math display="inline"> q_{\phi}(a_t | s_t)</math><br />
* Value <math display="inline"> v_{\psi}(s_t)</math><br />
<br />
where <math display="inline"> \theta, \phi, \psi</math> are learned neural network parameters.<br />
<br />
There are three main components to the proposed algorithm:<br />
* Dynamics Learning: Using past experience data, the agent learns to encode observations and actions into latent states and predicts environment rewards. One way to do this is via representation learning.<br />
* Behavior Learning: In the latent space, the agent predicts state values and actions that maximize the future rewards through back-propagation.<br />
* Environment Interaction: The agent encodes the episode to compute the current model state and predict the next action to interact with the environment.<br />
<br />
The proposed algorithm is described below.<br />
<br />
[[File:dreamer.png|frameless|500px|Dreamer algorithm]]<br />
<br />
Notice that there are three neural networks that are trained simultaneously. <br />
The neural networks with parameters <math display="inline"> \theta, \phi, \psi </math> correspond to models of the environment, action and values respectively.<br />
<br />
== Results ==<br />
<br />
The figure below summarizes the performance of Dreamer compared to other state-of-the-art reinforcement learning agents for continuous control tasks. Overall, it achieves the most consistent performance among them. Additionally, while other agents heavily rely on prior experience, Dreamer is able to learn behaviors with minimal interactions with the environment.<br />
<br />
[[File:scores.png|frameless|500px|Comparison of RL-agents against several continuous control tasks]]<br />
<br />
== Conclusion ==<br />
<br />
This paper presented a new algorithm for training reinforcement learning agents with minimal interactions with the environment. The algorithm outperforms many previous algorithms in terms of computation time and overall performance. This has many practical applications as many agents rely on prior experience which may be hard to obtain in the real-world. Although it may be an extreme example, consider a reinforcement learning agent that learns how to perform rare surgeries may not have enough data samples. This paper shows that it is possible to train agents without requiring many prior interactions with the environment.<br />
<br />
== References ==<br />
<br />
[1] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations (ICLR), 2020.<br />
<br />
[2] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.<br />
<br />
[3] Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6), 26–38.<br />
<br />
[4] Nian, R., Liu, J., & Huang, B. (2020). A review On reinforcement learning: Introduction and applications in industrial process control. Computers and Chemical Engineering, 139, 106886.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:dreamer_overview.png&diff=45254File:dreamer overview.png2020-11-18T02:22:16Z<p>X46yan: </p>
<hr />
<div></div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Model_Agnostic_Learning_of_Semantic_Features&diff=45246Model Agnostic Learning of Semantic Features2020-11-18T01:08:36Z<p>X46yan: </p>
<hr />
<div>== Presented by ==<br />
Milad Sikaroudi<br />
<br />
== Introduction ==<br />
Transfer learning is a line of research in machine learning which focuses on storing knowledge from one domain (source domain) to solve a similar problem in another domain (target domain). In addition to regular transfer learning, one can use "transfer metric learning" in which through utilizing similarity relationship between samples [1], [2] a more robust and discriminative data representation is formed. However, both of these kinds of techniques work insofar as the domain shift, between source and target domains, is negligible. Domain shift is defined as the deviation in the distribution of the source domain and the target domain and it would cause the DNN model to completely fail. The multi-domain learning is the solution when the assumption of "source domain and target domain comes from an almost same distribution" may not hold. There are two variants of MDL in the literature that can be confused, i.e. domain generalization, and domain adaptation; however in domain adaptation, we have access to the target domain data somehow, while that is not the case in domain generalization. This paper introduces a technique for domain generalization based on two complementary losses that regularize the semantic structure of the feature space through an episodic training scheme originally inspired by the model-agnostic meta-learning.<br />
<br />
== Previous Work ==<br />
<br />
Originated from model-agnostic meta-learning (MAML), episodic training has been vastly leveraged for addressing domain generalization [3, 4, 5, 7, 8, 6, 9, 10, 11]. The method of MLDG [4] closely follows MAML in terms of back-propagating the gradients from an ordinary task loss on meta-test data, but it has its own limitation as the use of the task objective might be sub-optimal since it only uses class probabilities. Most of the works [3,7] in the literature lack notable guidance from the semantics of feature space, which contains crucial domain-independent ‘general knowledge’ that can be useful for domain generalization. The authors claim that their method is orthogonal to previous works.<br />
<br />
<br />
=== Model Agnostic Meta Learning ===<br />
a.k.a learning to learn is a learning paradigm in which optimal initial weights are found incrementally (episodic training) by minimizing a loss function over some similar tasks (meta-train, meta-test sets). Imagine a 4-shot 2-class image classification task as below:<br />
[[File:p5.png|800px|center]]<br />
Each of the training tasks provides an optimal initial weight for the next round of the training. By considering all of these sets of updates and meta-test set, the updated weights are calculated using the below algorithm.<br />
[[File:algo1.PNG|500px|center]]<br />
<br />
== Method ==<br />
In domain generalization, we assume that there are some domain-invariant patterns in the inputs (e.g. semantic features). These features can be extracted to learn a predictor that performs well across seen and unseen domains. This paper assumes that there are inter-class relationships across domains. In total, the MASF is composed of a '''task loss''', '''global class alignment''' term and a '''local sample clustering''' term.<br />
<br />
=== Task loss ===<br />
<math> F_{\psi}: X \rightarrow Z</math> where <math> Z </math> is a feature space<br />
<math> T_{\theta}: X \rightarrow \mathbf {R}^{C}</math> where <math> C </math> is the number of classes in <math> Y </math><br />
Assume that <math>\hat{y}= softmax(T_{\theta}(F_{\psi}(x))) </math>. The parameters <math> (\psi, \theta) </math> are optimized with minimizing a cross-entropy loss namely <math> \mathbf{L}_{task} </math> formulated as:<br />
<br />
<div style="text-align: center;"><br />
<math> l_{task}(y, \hat{y} = - \sum_{c}1[y=C]log(\hat{y}_{c})) </math><br />
</div><br />
<br />
Although the task loss is a decent predictor nothing prevents the model from overfitting to the source domains and suffering from degradation on unseen test domains. So the other loss terms are responsible for this aim.<br />
<br />
=== Global class alignment ===<br />
In semantic space, we assume there are relationships between class concepts. And those relationships are invariant to changes in observation domains. Capturing and preserving such class relationships can help models generalize well on unseen data. To achieve this, a global layout of extracted features are imposed such that the relative locations of extracted features reflect their semantic similarity. Since <math> L_{task} </math> focuses only on the dominant hard label prediction, the inter-class alignment across domains is disregarded. Hence, minimising symmetrized Kullback–Leibler (KL) divergence across domains, averaged over all <math> C </math> classes has been used:<br />
<div style="text-align: center;"> <br />
<math> l_{global}(D_{i}, D{j}; \psi^{'}, \theta^{'}) = 1/C \sum_{c=1}^{C} 1/2[D_{KL}(s_{c}^{(i)}||s_{c}^{(j)}) + D_{KL}(s_{c}^{(j)}||s_{c}^{(i)})], </math><br />
</div><br />
The authors stated that symmetric divergences such as Jensen–Shannon (JS) showed no significant difference with KL over symm.<br />
<br />
=== Local cluster sampling ===<br />
<math> L_{global} </math> captures inter-class relationships, we also want to make semantic features close to each other locally. Explicit metric learning, i.e. contrastive or triplet losses, have been used to ensure that the semantic features, locally cluster according to only class labels, regardless of the domain. Contrastive loss takes two samples as input and makes samples of the same class closer, while pushing away samples of different classes.<br />
[[File: contrastive.png | 400px]]<br />
<br />
Conversely, triplet loss takes three samples as input: one anchor, one positive, and one negative. Triplet loss tries to make relevant samples closer than irrelevant ones.<br />
<div style="text-align: center;"><br />
<math><br />
l_{triplet}^{a,p,n} = \sum_{i=1}^{b} \sum_{k=1}^{c-1} \sum_{\ell=1}^{c-1}\! [m\!+\!\|x_{i}\!- \!x_{k}\|_2^2 \!-\! \|x_{i}\!-\!x_{\ell}\|_2^2 ]_+,<br />
</math><br />
</div><br />
<br />
== Model agnostic learning of semantic features ==<br />
These losses are used in an episodic training scheme showed in the below figure:<br />
[[File:algo2.PNG|700px|center]]<br />
<br />
== Experiments ==<br />
The usefulness of the proposed method has been demonstrated using two common benchmark datasets for domain generalization, i.e. VLCS and PACS, alongside a real-world MRI medical imaging segmentation task. In all of their experiments, the AlexNet with ImageNet pre-trained weights has been utilized. <br />
<br />
=== VLCS ===<br />
VLCS[12] is an aggregation of images from four other datasets: PASCAL VOC2007 (V) [13], LabelMe (L) [14], Caltech (C) [15], and SUN09 (S) [16] <br />
leave-one-domain-out validation with randomly dividing each domain into 70% training and 30% test.<br />
<br />
<gallery><br />
File:p6.PNG|VLCS dataset<br />
</gallery><br />
<br />
Notably, MASF outperforms MLDG[4], in the table below on this dataset, indicating that semantic properties would provide superior performance with respect to purely highly-abstracted task loss on meta-test. "DeepAll" in the table is the case in which there is no domain generalization. In DeepAll case the class labels have been used only, regardless of the domain each sample would lie in. <br />
<br />
[[File:table1_masf.PNG|600px|center]]<br />
<br />
=== PACS ===<br />
The more challenging domain generalization benchmark with a significant domain shift is the PACS dataset [17]. It contains art painting, cartoon, photo, sketch domains with objects from seven classes: dog, elephant, giraffe, guitar, house, horse, person.<br />
<gallery><br />
File:p7_masf.jpg|PACS dataset sample<br />
</gallery> <br />
<br />
As you can see in the table below, MASF outperforms state of the art JiGen[18], MLDG[4], MetaReg[3], significantly. In addition, the best improvement has achieved (6.20%) when the unseen domain is "sketch", which requires more general knowledge about semantic concepts since it is different from other domains significantly.<br />
<br />
[[File:table2_masf.PNG|600px|center]]<br />
<br />
=== Ablation study over PACS===<br />
The ablation study over the PACS dataset shows the effectiveness of each loss term. <br />
[[File:table3_masf.PNG|600px|center]]<br />
<br />
=== Deeper Architectures ===<br />
For stronger baseline results, authors have performed additional experiments using advanced deep residual architectures like ResNet-18 and ResNet-50. The below table shows strong and consistent improvements of MASF over the DeepAll baseline in all PACS splits for both network architectures. This suggests that the proposed algorithm is also beneficial for domain generalization with deeper feature extractors.<br />
[[File:Paper18_PacResults.PNG|600px|center]]<br />
<br />
=== Multi-site Brain MRI image segmentation === <br />
<br />
The effectiveness of the MASF has been also demonstrated using a segmentation task of MRI images gathering from four different clinical centers denoted as (Set-A, Set-B, Set-C, and Set-D). The domain shift, in this case, would occur due to differences in hardware, acquisition protocols, and many other factors, hindering translating learning-based methods to real clinical practice. <br />
<br />
<gallery><br />
File:p8_masf.PNG|MRI dataset<br />
</gallery> <br />
<br />
<br />
The results showed the effectiveness of the MASF in comparison to not use domain generalization.<br />
[[File:table5_masf.PNG|300px|center]]<br />
<br />
== Conclusion ==<br />
<br />
A new domain generalization technique by taking the advantage of incorporating global and local constraints for learning semantic feature spaces presented which outperforms the state-of-the-art. The effectiveness of this method has been demonstrated using two domain generalization benchmarks, and a real clinical dataset (MRI image segmentation). The code if freely available at [19]. As a future work, it would be interesting to integrate the proposed loss functions with other methods as they are orthogonal to each other and evaluate the benefit of doing so. Also, investigating the usage of the current learning procedure in the context of generative models would be an interesting research direction.<br />
<br />
== References ==<br />
<br />
[1]: Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural networks for one-shot image recognition." ICML deep learning workshop. Vol. 2. 2015.<br />
<br />
[2]: Hoffer, Elad, and Nir Ailon. "Deep metric learning using triplet network." International Workshop on Similarity-Based Pattern Recognition. Springer, Cham, 2015.<br />
<br />
[3]: Balaji, Yogesh, Swami Sankaranarayanan, and Rama Chellappa. "Metareg: Towards domain generalization using meta-regularization." Advances in Neural Information Processing Systems. 2018.<br />
<br />
[4]: Li, Da, et al. "Learning to generalize: Meta-learning for domain generalization." arXiv preprint arXiv:1710.03463 (2017).<br />
<br />
[5]: Li, Da, et al. "Episodic training for domain generalization." Proceedings of the IEEE International Conference on Computer Vision. 2019.<br />
<br />
[6]: Li, Haoliang, et al. "Domain generalization with adversarial feature learning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.<br />
<br />
[7]: Li, Yiying, et al. "Feature-critic networks for heterogeneous domain generalization." arXiv preprint arXiv:1901.11448 (2019).<br />
<br />
[8]: Ghifary, Muhammad, et al. "Domain generalization for object recognition with multi-task autoencoders." Proceedings of the IEEE international conference on computer vision. 2015.<br />
<br />
[9]: Li, Ya, et al. "Deep domain generalization via conditional invariant adversarial networks." Proceedings of the European Conference on Computer Vision (ECCV). 2018<br />
<br />
[10]: Motiian, Saeid, et al. "Unified deep supervised domain adaptation and generalization." Proceedings of the IEEE International Conference on Computer Vision. 2017.<br />
<br />
[11]: Muandet, Krikamol, David Balduzzi, and Bernhard Schölkopf. "Domain generalization via invariant feature representation." International Conference on Machine Learning. 2013.<br />
<br />
[12]: Fang, Chen, Ye Xu, and Daniel N. Rockmore. "Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias." Proceedings of the IEEE International Conference on Computer Vision. 2013.<br />
<br />
[13]: Everingham, Mark, et al. "The pascal visual object classes (voc) challenge." International journal of computer vision 88.2 (2010): 303-338.<br />
<br />
[14]: Russell, Bryan C., et al. "LabelMe: a database and web-based tool for image annotation." International journal of computer vision 77.1-3 (2008): 157-173.<br />
<br />
[15]: Fei-Fei, Li. "Learning generative visual models from few training examples." Workshop on Generative-Model Based Vision, IEEE Proc. CVPR, 2004. 2004.<br />
<br />
[16]: Chopra, Sumit, Raia Hadsell, and Yann LeCun. "Learning a similarity metric discriminatively, with application to face verification." 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1. IEEE, 2005.<br />
<br />
[17]: Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. "Deeper, broader and artier domain generalization". IEEE International Conference on Computer Vision (ICCV), 2017. <br />
<br />
[18]: Carlucci, Fabio M., et al. "Domain generalization by solving jigsaw puzzles." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.<br />
<br />
[19]: https://github.com/biomedia-mira/masf</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Model_Agnostic_Learning_of_Semantic_Features&diff=45245Model Agnostic Learning of Semantic Features2020-11-18T01:08:10Z<p>X46yan: </p>
<hr />
<div>== Presented by ==<br />
Milad Sikaroudi<br />
<br />
== Introduction ==<br />
Transfer learning is a line of research in machine learning which focuses on storing knowledge from one domain (source domain) to solve a similar problem in another domain (target domain). In addition to regular transfer learning, one can use "transfer metric learning" in which through utilizing similarity relationship between samples [1], [2] a more robust and discriminative data representation is formed. However, both of these kinds of techniques work insofar as the domain shift, between source and target domains, is negligible. Domain shift is defined as the deviation in the distribution of the source domain and the target domain and it would cause the DNN model to completely fail. The multi-domain learning is the solution when the assumption of "source domain and target domain comes from an almost same distribution" may not hold. There are two variants of MDL in the literature that can be confused, i.e. domain generalization, and domain adaptation; however in domain adaptation, we have access to the target domain data somehow, while that is not the case in domain generalization. This paper introduces a technique for domain generalization based on two complementary losses that regularize the semantic structure of the feature space through an episodic training scheme originally inspired by the model-agnostic meta-learning.<br />
<br />
== Previous Work ==<br />
<br />
Originated from model-agnostic meta-learning (MAML), episodic training has been vastly leveraged for addressing domain generalization [3, 4, 5, 7, 8, 6, 9, 10, 11]. The method of MLDG [4] closely follows MAML in terms of back-propagating the gradients from an ordinary task loss on meta-test data, but it has its own limitation as the use of the task objective might be sub-optimal since it only uses class probabilities. Most of the works [3,7] in the literature lack notable guidance from the semantics of feature space, which contains crucial domain-independent ‘general knowledge’ that can be useful for domain generalization. The authors claim that their method is orthogonal to previous works.<br />
<br />
<br />
=== Model Agnostic Meta Learning ===<br />
a.k.a learning to learn is a learning paradigm in which optimal initial weights are found incrementally (episodic training) by minimizing a loss function over some similar tasks (meta-train, meta-test sets). Imagine a 4-shot 2-class image classification task as below:<br />
[[File:p5.png|800px|center]]<br />
Each of the training tasks provides an optimal initial weight for the next round of the training. By considering all of these sets of updates and meta-test set, the updated weights are calculated using the below algorithm.<br />
[[File:algo1.PNG|500px|center]]<br />
<br />
== Method ==<br />
In domain generalization, we assume that there are some domain-invariant patterns in the inputs (e.g. semantic features). These features can be extracted to learn a predictor that performs well across seen and unseen domains. This paper assumes that there are inter-class relationships across domains. In total, the MASF is composed of a '''task loss''', '''global class alignment''' term and a '''local sample clustering''' term.<br />
<br />
=== Task loss ===<br />
<math> F_{\psi}: X \rightarrow Z</math> where <math> Z </math> is a feature space<br />
<math> T_{\theta}: X \rightarrow \mathbf {R}^{C}</math> where <math> C </math> is the number of classes in <math> Y </math><br />
Assume that <math>\hat{y}= softmax(T_{\theta}(F_{\psi}(x))) </math>. The parameters <math> (\psi, \theta) </math> are optimized with minimizing a cross-entropy loss namely <math> \mathbf{L}_{task} </math> formulated as:<br />
<br />
<div style="text-align: center;"><br />
<math> l_{task}(y, \hat{y} = - \sum_{c}1[y=C]log(\hat{y}_{c})) </math><br />
</div><br />
<br />
Although the task loss is a decent predictor nothing prevents the model from overfitting to the source domains and suffering from degradation on unseen test domains. So the other loss terms are responsible for this aim.<br />
<br />
=== Global class alignment ===<br />
In semantic space, we assume there are relationships between class concepts. And those relationships are invariant to changes in observation domains. Capturing and preserving such class relationships can help models generalize well on unseen data. To achieve this, a global layout of extracted features are imposed such that the relative locations of extracted features reflect their semantic similarity. Since <math> L_{task} </math> focuses only on the dominant hard label prediction, the inter-class alignment across domains is disregarded. Hence, minimising symmetrized Kullback–Leibler (KL) divergence across domains, averaged over all <math> C </math> classes has been used:<br />
<div style="text-align: center;"> <br />
<math> l_{global}(D_{i}, D{j}; \psi^{'}, \theta^{'}) = 1/C \sum_{c=1}^{C} 1/2[D_{KL}(s_{c}^{(i)}||s_{c}^{(j)}) + D_{KL}(s_{c}^{(j)}||s_{c}^{(i)})], </math><br />
</div><br />
The authors stated that symmetric divergences such as Jensen–Shannon (JS) showed no significant difference with KL over symm.<br />
<br />
=== Local cluster sampling ===<br />
<math> L_{global} </math> captures inter-class relationships, we also want to make semantic features close to each other locally. Explicit metric learning, i.e. contrastive or triplet losses, have been used to ensure that the semantic features, locally cluster according to only class labels, regardless of the domain. Contrastive loss takes two samples as input and makes samples of the same class closer, while pushing away samples of different classes.<br />
[[File: contrastive.png | 200px]]<br />
<br />
Conversely, triplet loss takes three samples as input: one anchor, one positive, and one negative. Triplet loss tries to make relevant samples closer than irrelevant ones.<br />
<div style="text-align: center;"><br />
<math><br />
l_{triplet}^{a,p,n} = \sum_{i=1}^{b} \sum_{k=1}^{c-1} \sum_{\ell=1}^{c-1}\! [m\!+\!\|x_{i}\!- \!x_{k}\|_2^2 \!-\! \|x_{i}\!-\!x_{\ell}\|_2^2 ]_+,<br />
</math><br />
</div><br />
<br />
== Model agnostic learning of semantic features ==<br />
These losses are used in an episodic training scheme showed in the below figure:<br />
[[File:algo2.PNG|700px|center]]<br />
<br />
== Experiments ==<br />
The usefulness of the proposed method has been demonstrated using two common benchmark datasets for domain generalization, i.e. VLCS and PACS, alongside a real-world MRI medical imaging segmentation task. In all of their experiments, the AlexNet with ImageNet pre-trained weights has been utilized. <br />
<br />
=== VLCS ===<br />
VLCS[12] is an aggregation of images from four other datasets: PASCAL VOC2007 (V) [13], LabelMe (L) [14], Caltech (C) [15], and SUN09 (S) [16] <br />
leave-one-domain-out validation with randomly dividing each domain into 70% training and 30% test.<br />
<br />
<gallery><br />
File:p6.PNG|VLCS dataset<br />
</gallery><br />
<br />
Notably, MASF outperforms MLDG[4], in the table below on this dataset, indicating that semantic properties would provide superior performance with respect to purely highly-abstracted task loss on meta-test. "DeepAll" in the table is the case in which there is no domain generalization. In DeepAll case the class labels have been used only, regardless of the domain each sample would lie in. <br />
<br />
[[File:table1_masf.PNG|600px|center]]<br />
<br />
=== PACS ===<br />
The more challenging domain generalization benchmark with a significant domain shift is the PACS dataset [17]. It contains art painting, cartoon, photo, sketch domains with objects from seven classes: dog, elephant, giraffe, guitar, house, horse, person.<br />
<gallery><br />
File:p7_masf.jpg|PACS dataset sample<br />
</gallery> <br />
<br />
As you can see in the table below, MASF outperforms state of the art JiGen[18], MLDG[4], MetaReg[3], significantly. In addition, the best improvement has achieved (6.20%) when the unseen domain is "sketch", which requires more general knowledge about semantic concepts since it is different from other domains significantly.<br />
<br />
[[File:table2_masf.PNG|600px|center]]<br />
<br />
=== Ablation study over PACS===<br />
The ablation study over the PACS dataset shows the effectiveness of each loss term. <br />
[[File:table3_masf.PNG|600px|center]]<br />
<br />
=== Deeper Architectures ===<br />
For stronger baseline results, authors have performed additional experiments using advanced deep residual architectures like ResNet-18 and ResNet-50. The below table shows strong and consistent improvements of MASF over the DeepAll baseline in all PACS splits for both network architectures. This suggests that the proposed algorithm is also beneficial for domain generalization with deeper feature extractors.<br />
[[File:Paper18_PacResults.PNG|600px|center]]<br />
<br />
=== Multi-site Brain MRI image segmentation === <br />
<br />
The effectiveness of the MASF has been also demonstrated using a segmentation task of MRI images gathering from four different clinical centers denoted as (Set-A, Set-B, Set-C, and Set-D). The domain shift, in this case, would occur due to differences in hardware, acquisition protocols, and many other factors, hindering translating learning-based methods to real clinical practice. <br />
<br />
<gallery><br />
File:p8_masf.PNG|MRI dataset<br />
</gallery> <br />
<br />
<br />
The results showed the effectiveness of the MASF in comparison to not use domain generalization.<br />
[[File:table5_masf.PNG|300px|center]]<br />
<br />
== Conclusion ==<br />
<br />
A new domain generalization technique by taking the advantage of incorporating global and local constraints for learning semantic feature spaces presented which outperforms the state-of-the-art. The effectiveness of this method has been demonstrated using two domain generalization benchmarks, and a real clinical dataset (MRI image segmentation). The code if freely available at [19]. As a future work, it would be interesting to integrate the proposed loss functions with other methods as they are orthogonal to each other and evaluate the benefit of doing so. Also, investigating the usage of the current learning procedure in the context of generative models would be an interesting research direction.<br />
<br />
== References ==<br />
<br />
[1]: Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural networks for one-shot image recognition." ICML deep learning workshop. Vol. 2. 2015.<br />
<br />
[2]: Hoffer, Elad, and Nir Ailon. "Deep metric learning using triplet network." International Workshop on Similarity-Based Pattern Recognition. Springer, Cham, 2015.<br />
<br />
[3]: Balaji, Yogesh, Swami Sankaranarayanan, and Rama Chellappa. "Metareg: Towards domain generalization using meta-regularization." Advances in Neural Information Processing Systems. 2018.<br />
<br />
[4]: Li, Da, et al. "Learning to generalize: Meta-learning for domain generalization." arXiv preprint arXiv:1710.03463 (2017).<br />
<br />
[5]: Li, Da, et al. "Episodic training for domain generalization." Proceedings of the IEEE International Conference on Computer Vision. 2019.<br />
<br />
[6]: Li, Haoliang, et al. "Domain generalization with adversarial feature learning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.<br />
<br />
[7]: Li, Yiying, et al. "Feature-critic networks for heterogeneous domain generalization." arXiv preprint arXiv:1901.11448 (2019).<br />
<br />
[8]: Ghifary, Muhammad, et al. "Domain generalization for object recognition with multi-task autoencoders." Proceedings of the IEEE international conference on computer vision. 2015.<br />
<br />
[9]: Li, Ya, et al. "Deep domain generalization via conditional invariant adversarial networks." Proceedings of the European Conference on Computer Vision (ECCV). 2018<br />
<br />
[10]: Motiian, Saeid, et al. "Unified deep supervised domain adaptation and generalization." Proceedings of the IEEE International Conference on Computer Vision. 2017.<br />
<br />
[11]: Muandet, Krikamol, David Balduzzi, and Bernhard Schölkopf. "Domain generalization via invariant feature representation." International Conference on Machine Learning. 2013.<br />
<br />
[12]: Fang, Chen, Ye Xu, and Daniel N. Rockmore. "Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias." Proceedings of the IEEE International Conference on Computer Vision. 2013.<br />
<br />
[13]: Everingham, Mark, et al. "The pascal visual object classes (voc) challenge." International journal of computer vision 88.2 (2010): 303-338.<br />
<br />
[14]: Russell, Bryan C., et al. "LabelMe: a database and web-based tool for image annotation." International journal of computer vision 77.1-3 (2008): 157-173.<br />
<br />
[15]: Fei-Fei, Li. "Learning generative visual models from few training examples." Workshop on Generative-Model Based Vision, IEEE Proc. CVPR, 2004. 2004.<br />
<br />
[16]: Chopra, Sumit, Raia Hadsell, and Yann LeCun. "Learning a similarity metric discriminatively, with application to face verification." 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1. IEEE, 2005.<br />
<br />
[17]: Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. "Deeper, broader and artier domain generalization". IEEE International Conference on Computer Vision (ICCV), 2017. <br />
<br />
[18]: Carlucci, Fabio M., et al. "Domain generalization by solving jigsaw puzzles." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.<br />
<br />
[19]: https://github.com/biomedia-mira/masf</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:contrastive.png&diff=45244File:contrastive.png2020-11-18T01:07:41Z<p>X46yan: </p>
<hr />
<div></div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Model_Agnostic_Learning_of_Semantic_Features&diff=45243Model Agnostic Learning of Semantic Features2020-11-18T00:57:18Z<p>X46yan: </p>
<hr />
<div>== Presented by ==<br />
Milad Sikaroudi<br />
<br />
== Introduction ==<br />
Transfer learning is a line of research in machine learning which focuses on storing knowledge from one domain (source domain) to solve a similar problem in another domain (target domain). In addition to regular transfer learning, one can use "transfer metric learning" in which through utilizing similarity relationship between samples [1], [2] a more robust and discriminative data representation is formed. However, both of these kinds of techniques work insofar as the domain shift, between source and target domains, is negligible. Domain shift is defined as the deviation in the distribution of the source domain and the target domain and it would cause the DNN model to completely fail. The multi-domain learning is the solution when the assumption of "source domain and target domain comes from an almost same distribution" may not hold. There are two variants of MDL in the literature that can be confused, i.e. domain generalization, and domain adaptation; however in domain adaptation, we have access to the target domain data somehow, while that is not the case in domain generalization. This paper introduces a technique for domain generalization based on two complementary losses that regularize the semantic structure of the feature space through an episodic training scheme originally inspired by the model-agnostic meta-learning.<br />
<br />
== Previous Work ==<br />
<br />
Originated from model-agnostic meta-learning (MAML), episodic training has been vastly leveraged for addressing domain generalization [3, 4, 5, 7, 8, 6, 9, 10, 11]. The method of MLDG [4] closely follows MAML in terms of back-propagating the gradients from an ordinary task loss on meta-test data, but it has its own limitation as the use of the task objective might be sub-optimal since it only uses class probabilities. Most of the works [3,7] in the literature lack notable guidance from the semantics of feature space, which contains crucial domain-independent ‘general knowledge’ that can be useful for domain generalization. The authors claim that their method is orthogonal to previous works.<br />
<br />
<br />
=== Model Agnostic Meta Learning ===<br />
a.k.a learning to learn is a learning paradigm in which optimal initial weights are found incrementally (episodic training) by minimizing a loss function over some similar tasks (meta-train, meta-test sets). Imagine a 4-shot 2-class image classification task as below:<br />
[[File:p5.png|800px|center]]<br />
Each of the training tasks provides an optimal initial weight for the next round of the training. By considering all of these sets of updates and meta-test set, the updated weights are calculated using the below algorithm.<br />
[[File:algo1.PNG|500px|center]]<br />
<br />
== Method ==<br />
In domain generalization, we assume that there are some domain-invariant patterns in the inputs (e.g. semantic features). These features can be extracted to learn a predictor that performs well across seen and unseen domains. This paper assumes that there are inter-class relationships across domains. In total, the MASF is composed of a '''task loss''', '''global class alignment''' term and a '''local sample clustering''' term.<br />
<br />
=== Task loss ===<br />
<math> F_{\psi}: X \rightarrow Z</math> where <math> Z </math> is a feature space<br />
<math> T_{\theta}: X \rightarrow \mathbf {R}^{C}</math> where <math> C </math> is the number of classes in <math> Y </math><br />
Assume that <math>\hat{y}= softmax(T_{\theta}(F_{\psi}(x))) </math>. The parameters <math> (\psi, \theta) </math> are optimized with minimizing a cross-entropy loss namely <math> \mathbf{L}_{task} </math> formulated as:<br />
<br />
<div style="text-align: center;"><br />
<math> l_{task}(y, \hat{y} = - \sum_{c}1[y=C]log(\hat{y}_{c})) </math><br />
</div><br />
<br />
Although the task loss is a decent predictor nothing prevents the model from overfitting to the source domains and suffering from degradation on unseen test domains. So the other loss terms are responsible for this aim.<br />
<br />
=== Global class alignment ===<br />
In semantic space, we assume there are relationships between class concepts. And those relationships are invariant to changes in observation domains. Capturing and preserving such class relationships can help models generalize well on unseen data. To achieve this, a global layout of extracted features are imposed such that the relative locations of extracted features reflect their semantic similarity. Since <math> L_{task} </math> focuses only on the dominant hard label prediction, the inter-class alignment across domains is disregarded. Hence, minimising symmetrized Kullback–Leibler (KL) divergence across domains, averaged over all <math> C </math> classes has been used:<br />
<div style="text-align: center;"> <br />
<math> l_{global}(D_{i}, D{j}; \psi^{'}, \theta^{'}) = 1/C \sum_{c=1}^{C} 1/2[D_{KL}(s_{c}^{(i)}||s_{c}^{(j)}) + D_{KL}(s_{c}^{(j)}||s_{c}^{(i)})], </math><br />
</div><br />
The authors stated that symmetric divergences such as Jensen–Shannon (JS) showed no significant difference with KL over symm.<br />
<br />
=== Local cluster sampling ===<br />
<math> L_{global} </math> captures inter-class relationships, we also want to make semantic features close to each other locally. Explicit metric learning, i.e. contrastive or triplet losses, have been used to ensure that the semantic features, locally cluster according to only class labels, regardless of the domain.<br />
<div style="text-align: center;"><br />
<math><br />
l_{triplet}^{a,p,n} = \sum_{i=1}^{b} \sum_{k=1}^{c-1} \sum_{\ell=1}^{c-1}\! [m\!+\!\|x_{i}\!- \!x_{k}\|_2^2 \!-\! \|x_{i}\!-\!x_{\ell}\|_2^2 ]_+,<br />
</math><br />
</div><br />
<br />
== Model agnostic learning of semantic features ==<br />
These losses are used in an episodic training scheme showed in the below figure:<br />
[[File:algo2.PNG|700px|center]]<br />
<br />
== Experiments ==<br />
The usefulness of the proposed method has been demonstrated using two common benchmark datasets for domain generalization, i.e. VLCS and PACS, alongside a real-world MRI medical imaging segmentation task. In all of their experiments, the AlexNet with ImageNet pre-trained weights has been utilized. <br />
<br />
=== VLCS ===<br />
VLCS[12] is an aggregation of images from four other datasets: PASCAL VOC2007 (V) [13], LabelMe (L) [14], Caltech (C) [15], and SUN09 (S) [16] <br />
leave-one-domain-out validation with randomly dividing each domain into 70% training and 30% test.<br />
<br />
<gallery><br />
File:p6.PNG|VLCS dataset<br />
</gallery><br />
<br />
Notably, MASF outperforms MLDG[4], in the table below on this dataset, indicating that semantic properties would provide superior performance with respect to purely highly-abstracted task loss on meta-test. "DeepAll" in the table is the case in which there is no domain generalization. In DeepAll case the class labels have been used only, regardless of the domain each sample would lie in. <br />
<br />
[[File:table1_masf.PNG|600px|center]]<br />
<br />
=== PACS ===<br />
The more challenging domain generalization benchmark with a significant domain shift is the PACS dataset [17]. It contains art painting, cartoon, photo, sketch domains with objects from seven classes: dog, elephant, giraffe, guitar, house, horse, person.<br />
<gallery><br />
File:p7_masf.jpg|PACS dataset sample<br />
</gallery> <br />
<br />
As you can see in the table below, MASF outperforms state of the art JiGen[18], MLDG[4], MetaReg[3], significantly. In addition, the best improvement has achieved (6.20%) when the unseen domain is "sketch", which requires more general knowledge about semantic concepts since it is different from other domains significantly.<br />
<br />
[[File:table2_masf.PNG|600px|center]]<br />
<br />
=== Ablation study over PACS===<br />
The ablation study over the PACS dataset shows the effectiveness of each loss term. <br />
[[File:table3_masf.PNG|600px|center]]<br />
<br />
=== Deeper Architectures ===<br />
For stronger baseline results, authors have performed additional experiments using advanced deep residual architectures like ResNet-18 and ResNet-50. The below table shows strong and consistent improvements of MASF over the DeepAll baseline in all PACS splits for both network architectures. This suggests that the proposed algorithm is also beneficial for domain generalization with deeper feature extractors.<br />
[[File:Paper18_PacResults.PNG|600px|center]]<br />
<br />
=== Multi-site Brain MRI image segmentation === <br />
<br />
The effectiveness of the MASF has been also demonstrated using a segmentation task of MRI images gathering from four different clinical centers denoted as (Set-A, Set-B, Set-C, and Set-D). The domain shift, in this case, would occur due to differences in hardware, acquisition protocols, and many other factors, hindering translating learning-based methods to real clinical practice. <br />
<br />
<gallery><br />
File:p8_masf.PNG|MRI dataset<br />
</gallery> <br />
<br />
<br />
The results showed the effectiveness of the MASF in comparison to not use domain generalization.<br />
[[File:table5_masf.PNG|300px|center]]<br />
<br />
== Conclusion ==<br />
<br />
A new domain generalization technique by taking the advantage of incorporating global and local constraints for learning semantic feature spaces presented which outperforms the state-of-the-art. The effectiveness of this method has been demonstrated using two domain generalization benchmarks, and a real clinical dataset (MRI image segmentation). The code if freely available at [19]. As a future work, it would be interesting to integrate the proposed loss functions with other methods as they are orthogonal to each other and evaluate the benefit of doing so. Also, investigating the usage of the current learning procedure in the context of generative models would be an interesting research direction.<br />
<br />
== References ==<br />
<br />
[1]: Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural networks for one-shot image recognition." ICML deep learning workshop. Vol. 2. 2015.<br />
<br />
[2]: Hoffer, Elad, and Nir Ailon. "Deep metric learning using triplet network." International Workshop on Similarity-Based Pattern Recognition. Springer, Cham, 2015.<br />
<br />
[3]: Balaji, Yogesh, Swami Sankaranarayanan, and Rama Chellappa. "Metareg: Towards domain generalization using meta-regularization." Advances in Neural Information Processing Systems. 2018.<br />
<br />
[4]: Li, Da, et al. "Learning to generalize: Meta-learning for domain generalization." arXiv preprint arXiv:1710.03463 (2017).<br />
<br />
[5]: Li, Da, et al. "Episodic training for domain generalization." Proceedings of the IEEE International Conference on Computer Vision. 2019.<br />
<br />
[6]: Li, Haoliang, et al. "Domain generalization with adversarial feature learning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.<br />
<br />
[7]: Li, Yiying, et al. "Feature-critic networks for heterogeneous domain generalization." arXiv preprint arXiv:1901.11448 (2019).<br />
<br />
[8]: Ghifary, Muhammad, et al. "Domain generalization for object recognition with multi-task autoencoders." Proceedings of the IEEE international conference on computer vision. 2015.<br />
<br />
[9]: Li, Ya, et al. "Deep domain generalization via conditional invariant adversarial networks." Proceedings of the European Conference on Computer Vision (ECCV). 2018<br />
<br />
[10]: Motiian, Saeid, et al. "Unified deep supervised domain adaptation and generalization." Proceedings of the IEEE International Conference on Computer Vision. 2017.<br />
<br />
[11]: Muandet, Krikamol, David Balduzzi, and Bernhard Schölkopf. "Domain generalization via invariant feature representation." International Conference on Machine Learning. 2013.<br />
<br />
[12]: Fang, Chen, Ye Xu, and Daniel N. Rockmore. "Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias." Proceedings of the IEEE International Conference on Computer Vision. 2013.<br />
<br />
[13]: Everingham, Mark, et al. "The pascal visual object classes (voc) challenge." International journal of computer vision 88.2 (2010): 303-338.<br />
<br />
[14]: Russell, Bryan C., et al. "LabelMe: a database and web-based tool for image annotation." International journal of computer vision 77.1-3 (2008): 157-173.<br />
<br />
[15]: Fei-Fei, Li. "Learning generative visual models from few training examples." Workshop on Generative-Model Based Vision, IEEE Proc. CVPR, 2004. 2004.<br />
<br />
[16]: Chopra, Sumit, Raia Hadsell, and Yann LeCun. "Learning a similarity metric discriminatively, with application to face verification." 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1. IEEE, 2005.<br />
<br />
[17]: Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. "Deeper, broader and artier domain generalization". IEEE International Conference on Computer Vision (ICCV), 2017. <br />
<br />
[18]: Carlucci, Fabio M., et al. "Domain generalization by solving jigsaw puzzles." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.<br />
<br />
[19]: https://github.com/biomedia-mira/masf</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Model_Agnostic_Learning_of_Semantic_Features&diff=45242Model Agnostic Learning of Semantic Features2020-11-18T00:56:30Z<p>X46yan: </p>
<hr />
<div>== Presented by ==<br />
Milad Sikaroudi<br />
<br />
== Introduction ==<br />
Transfer learning is a line of research in machine learning which focuses on storing knowledge from one domain (source domain) to solve a similar problem in another domain (target domain). In addition to regular transfer learning, one can use "transfer metric learning" in which through utilizing similarity relationship between samples [1], [2] a more robust and discriminative data representation is formed. However, both of these kinds of techniques work insofar as the domain shift, between source and target domains, is negligible. Domain shift is defined as the deviation in the distribution of the source domain and the target domain and it would cause the DNN model to completely fail. The multi-domain learning is the solution when the assumption of "source domain and target domain comes from an almost same distribution" may not hold. There are two variants of MDL in the literature that can be confused, i.e. domain generalization, and domain adaptation; however in domain adaptation, we have access to the target domain data somehow, while that is not the case in domain generalization. This paper introduces a technique for domain generalization based on two complementary losses that regularize the semantic structure of the feature space through an episodic training scheme originally inspired by the model-agnostic meta-learning.<br />
<br />
== Previous Work ==<br />
<br />
Originated from model-agnostic meta-learning (MAML), episodic training has been vastly leveraged for addressing domain generalization [3, 4, 5, 7, 8, 6, 9, 10, 11]. The method of MLDG [4] closely follows MAML in terms of back-propagating the gradients from an ordinary task loss on meta-test data, but it has its own limitation as the use of the task objective might be sub-optimal since it only uses class probabilities. Most of the works [3,7] in the literature lack notable guidance from the semantics of feature space, which contains crucial domain-independent ‘general knowledge’ that can be useful for domain generalization. The authors claim that their method is orthogonal to previous works.<br />
<br />
<br />
=== Model Agnostic Meta Learning ===<br />
a.k.a learning to learn is a learning paradigm in which optimal initial weights are found incrementally (episodic training) by minimizing a loss function over some similar tasks (meta-train, meta-test sets). Imagine a 4-shot 2-class image classification task as below:<br />
[[File:p5.png|800px|center]]<br />
Each of the training tasks provides an optimal initial weight for the next round of the training. By considering all of these sets of updates and meta-test set, the updated weights are calculated using the below algorithm.<br />
[[File:algo1.PNG|500px|center]]<br />
<br />
== Method ==<br />
In domain generalization, we assume that there are some domain-invariant patterns in the inputs (e.g. semantic features). These features can be extracted to learn a predictor that performs well across seen and unseen domains. This paper assumes that there are inter-class relationships across domains. In total, the MASF is composed of a '''task loss''', '''global class alignment''' term and a '''local sample clustering''' term.<br />
<br />
=== Task loss ===<br />
<math> F_{\psi}: X \rightarrow Z</math> where <math> Z </math> is a feature space<br />
<math> T_{\theta}: X \rightarrow \mathbf {R}^{C}</math> where <math> C </math> is the number of classes in <math> Y </math><br />
Assume that <math>\hat{y}= softmax(T_{\theta}(F_{\psi}(x))) </math>. The parameters <math> (\psi, \theta) </math> are optimized with minimizing a cross-entropy loss namely <math> \mathbf{L}_{task} </math> formulated as:<br />
<br />
<div style="text-align: center;"><br />
<math> l_{task}(y, \hat{y} = - \sum_{c}1[y=C]log(\hat{y}_{c})) </math><br />
</div><br />
<br />
Although the task loss is a decent predictor nothing prevents the model from overfitting to the source domains and suffering from degradation on unseen test domains. So the other loss terms are responsible for this aim.<br />
<br />
=== Global class alignment ===<br />
In semantic space, we assume there are relationships between class concepts. And those relationships are invariant to changes in observation domains. Capturing and preserving such class relationships can help models generalize well on unseen data. To achieve this, a global layout of extracted features are imposed such that the relative locations of extracted features reflect their semantic similarity. Since <math> L_{task} </math> focuses only on the dominant hard label prediction, the inter-class alignment across domains is disregarded. Hence, minimising symmetrized Kullback–Leibler (KL) divergence across domains, averaged over all <math> C </math> classes has been used:<br />
<div style="text-align: center;"> <br />
<math> l_{global}(D_{i}, D{j}; \psi^{'}, \theta^{'}) = 1/C \sum_{c=1}^{C} 1/2[D_{KL}(s_{c}^{(i)}||s_{c}^{(j)}) + D_{KL}(s_{c}^{(j)}||s_{c}^{(i)})], </math><br />
</div><br />
The authors stated that symmetric divergences such as Jensen–Shannon (JS) showed no significant difference with KL over symm.<br />
<br />
=== Local cluster sampling ===<br />
<math> L_{global} </math> captures inter-class relationships, we also want to make semantic features close to each other locally. Explicit metric learning, i.e. contrastive or triplet losses, have been used to ensure that the semantic features, locally cluster according to only class labels, regardless of the domain.<br />
<div style="text-align: center;"><br />
<math><br />
l_{triplet}^{a,p,n} = \sum_{i=1}^{b} \sum_{k=1}^{c-1} \sum_{\ell=1}^{c-1}\! [m\!+\!\|x_{i}\!- \!x_{k}\|_2^2 \!-\! \|x_{i}\!-\!x_{\ell}\|_2^2 ]_+,<br />
</math><br />
</div><br />
<br />
== Model agnostic learning of semantic features ==<br />
These losses are used in an episodic training scheme showed in the below figure:<br />
[[File:algo2.PNG|700px|center]]<br />
<br />
== Experiments ==<br />
The usefulness of the proposed method has been demonstrated using two common benchmark datasets for domain generalization, i.e. VLCS and PACS, alongside a real-world MRI medical imaging segmentation task. In all of their experiments, the AlexNet with ImageNet pre-trained weights has been utilized. <br />
<br />
=== VLCS ===<br />
VLCS[12] is an aggregation of images from four other datasets: PASCAL VOC2007 (V) [13], LabelMe (L) [14], Caltech (C) [15], and SUN09 (S) [16] <br />
leave-one-domain-out validation with randomly dividing each domain into 70% training and 30% test.<br />
<br />
<gallery><br />
File:p6.PNG|VLCS dataset<br />
</gallery><br />
<br />
Notably, MASF outperforms MLDG[4], in the table below on this dataset, indicating that semantic properties would provide superior performance with respect to purely highly-abstracted task loss on meta-test. "DeepAll" in the table is the case in which there is no domain generalization. In DeepAll case the class labels have been used only, regardless of the domain each sample would lie in. <br />
<br />
[[File:table1_masf.PNG|600px|center]]<br />
<br />
=== PACS ===<br />
The more challenging domain generalization benchmark with a significant domain shift is the PACS dataset [17]. It contains art painting, cartoon, photo, sketch domains with objects from seven classes: dog, elephant, giraffe, guitar, house, horse, person.<br />
<gallery><br />
File:p7_masf.jpg|PACS dataset sample<br />
</gallery> <br />
<br />
As you can see in the table below, MASF outperforms state of the art JiGen[18], MLDG[4], MetaReg[3], significantly. In addition, the best improvement has achieved (6.20%) when the unseen domain is "sketch", which requires more general knowledge about semantic concepts since it is different from other domains significantly.<br />
<br />
[[File:table2_masf.PNG|600px|center]]<br />
<br />
=== Ablation study over PACS===<br />
The ablation study over the PACS dataset shows the effectiveness of each loss term. <br />
[[File:table3_masf.PNG|600px|center]]<br />
<br />
=== Deeper Architectures ===<br />
For stronger baseline results, authors have performed additional experiments using advanced deep residual architectures like ResNet-18 and ResNet-50. The below table shows strong and consistent improvements of MASF over the DeepAll baseline in all PACS splits for both network architectures. This suggests that the proposed algorithm is also beneficial for domain generalization with deeper feature extractors.<br />
[[File:Paper18_PacResults.PNG|600px|center]]<br />
<br />
=== Multi-site Brain MRI image segmentation === <br />
<br />
The effectiveness of the MASF has been also demonstrated using a segmentation task of MRI images gathering from four different clinical centers denoted as (Set-A, Set-B, Set-C, and Set-D). The domain shift, in this case, would occur due to differences in hardware, acquisition protocols, and many other factors, hindering translating learning-based methods to real clinical practice. <br />
<br />
<gallery><br />
File:p8_masf.PNG|MRI dataset<br />
</gallery> <br />
<br />
<br />
The results showed the effectiveness of the MASF in comparison to not use domain generalization.<br />
[[File:table5_masf.PNG|300px|center]]<br />
<br />
== Conclusion ==<br />
<br />
A new domain generalization technique by taking the advantage of incorporating global and local constraints for learning semantic feature spaces presented which outperforms the state-of-the-art. The effectiveness of this method has been demonstrated using two domain generalization benchmarks, and a real clinical dataset (MRI image segmentation). The code if freely available at [19]. As a future work, it would be interesting to integrate the proposed loss functions with other methods as they are orthogonal to each other and evaluate the benefit of doing so. Also, investigating the usage of the current learning procedure in the context of generative models would be an interesting research direction.<br />
<br />
== References ==<br />
<br />
[1]: Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural networks for one-shot image recognition." ICML deep learning workshop. Vol. 2. 2015.<br />
<br />
[2]: Hoffer, Elad, and Nir Ailon. "Deep metric learning using triplet network." International Workshop on Similarity-Based Pattern Recognition. Springer, Cham, 2015.<br />
<br />
[3]: Balaji, Yogesh, Swami Sankaranarayanan, and Rama Chellappa. "Metareg: Towards domain generalization using meta-regularization." Advances in Neural Information Processing Systems. 2018.<br />
<br />
[4]: Li, Da, et al. "Learning to generalize: Meta-learning for domain generalization." arXiv preprint arXiv:1710.03463 (2017).<br />
<br />
[5]: Li, Da, et al. "Episodic training for domain generalization." Proceedings of the IEEE International Conference on Computer Vision. 2019.<br />
<br />
[6]: Li, Haoliang, et al. "Domain generalization with adversarial feature learning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.<br />
<br />
[7]: Li, Yiying, et al. "Feature-critic networks for heterogeneous domain generalization." arXiv preprint arXiv:1901.11448 (2019).<br />
<br />
[8]: Ghifary, Muhammad, et al. "Domain generalization for object recognition with multi-task autoencoders." Proceedings of the IEEE international conference on computer vision. 2015.<br />
<br />
[9]: Li, Ya, et al. "Deep domain generalization via conditional invariant adversarial networks." Proceedings of the European Conference on Computer Vision (ECCV). 2018<br />
<br />
[10]: Motiian, Saeid, et al. "Unified deep supervised domain adaptation and generalization." Proceedings of the IEEE International Conference on Computer Vision. 2017.<br />
<br />
[11]: Muandet, Krikamol, David Balduzzi, and Bernhard Schölkopf. "Domain generalization via invariant feature representation." International Conference on Machine Learning. 2013.<br />
<br />
[12]: Fang, Chen, Ye Xu, and Daniel N. Rockmore. "Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias." Proceedings of the IEEE International Conference on Computer Vision. 2013.<br />
<br />
[13]: Everingham, Mark, et al. "The pascal visual object classes (voc) challenge." International journal of computer vision 88.2 (2010): 303-338.<br />
<br />
[14]: Russell, Bryan C., et al. "LabelMe: a database and web-based tool for image annotation." International journal of computer vision 77.1-3 (2008): 157-173.<br />
<br />
[15]: Fei-Fei, Li. "Learning generative visual models from few training examples." Workshop on Generative-Model Based Vision, IEEE Proc. CVPR, 2004. 2004.<br />
<br />
[16]: Chopra, Sumit, Raia Hadsell, and Yann LeCun. "Learning a similarity metric discriminatively, with application to face verification." 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1. IEEE, 2005.<br />
<br />
[17]: Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. "Deeper, broader and artier domain generalization". IEEE International Conference on Computer Vision (ICCV), 2017. <br />
<br />
[18]: Carlucci, Fabio M., et al. "Domain generalization by solving jigsaw puzzles." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.<br />
<br />
[19]: https://github.com/biomedia-mira/masf</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=SuperGLUE&diff=45192SuperGLUE2020-11-17T21:46:34Z<p>X46yan: </p>
<hr />
<div><br />
== Presented by ==<br />
Shikhar Sakhuja<br />
<br />
== Introduction == <br />
Natural Language Processing (NLP) has seen immense improvements over the past two years. The improvements offered by RNN-based model such as ELMo [2], and Transformer [1] based models such as OpenAI GPT [3], BERT[4], etc. have revolutionized the field. These models render GLUE [5], the standard benchmark for NLP tasks, ineffective. The GLUE benchmark was released over a year ago and assessed NLP models using a single-number metric that summarized performance over some diverse tasks. However, the transformer-based models outperform the non-expert humans in several tasks. With transformer-based models achieving near-perfect scores on almost all tasks in GLUE and outperforming humans in some, there is a need for a new benchmark that involves harder and even more diverse language tasks. The authors release SuperGLUE as a new benchmark that has a more rigorous set of language understanding tasks. <br />
<br />
<br />
== Related Work == <br />
There have been several benchmarks attempting to standardize the field of language understanding tasks. SentEval [6] evaluated fixed-size sentence embeddings for tasks. DecaNLP [7] converts tasks into a general question-answering format. GLUE offers a much more flexible and extensible benchmark since it imposes no restrictions on model architectures or parameter sharing. <br />
<br />
GLUE has been the gold standard for language understanding tests since it’s release. In fact, the benchmark has promoted growth in language models with all the transformer-based models started with attempting to achieve high scores on GLUE. Original GPT and BERT models scored 72.8 and 80.2 on GLUE. The latest GPT and BERT models, however, far outperform these benchmarks and strike a need for a more robust and difficult benchmark. <br />
<br />
<br />
== Motivation ==<br />
Transformer based NLP models allow NLP models to train using transfer learning which was previously only seen in Computer Vision tasks and was notoriously difficult for language because of the discrete nature of words. Transfer Learning in NLP allows models to be trained over terabytes of language data in a self-supervised fashion. These models can then be finetuned for downstream tasks such as sentiment classification, fake news detection, etc. The fine-tuned models beat many of the human labellers who weren’t experts in the domain. Thus, creating a need for a newer, more robust baseline that can stay relevant with the rapid improvements in the field of NLP. <br />
<br />
[[File:loser glue.png]]<br />
<br />
Figure 1: Transformer-based models outperforming humans in GLUE tasks.<br />
<br />
== SuperGLUE Tasks ==<br />
<br />
SuperGLUE has 8 language understanding tasks. They test a model’s understanding of texts in English. The tasks are built to be equivalent to the capabilities of most college-educated English speakers and are beyond the capabilities of most state-of-the-art systems today. <br />
<br />
'''BoolQ''' (Boolean Questions [9]): QA task consisting of short passage and related questions to the passage as either a yes or a no answer. <br />
<br />
'''CB''' (CommitmentBank [10]): Corpus of text where sentences have embedded clauses and sentences are written with the goal of keeping the clause accurate. <br />
<br />
'''COPA''' (Choice of plausible Alternatives [11]): Reasoning tasks in which given a sentence the system must be able to choose the cause or effect of the sentence from two potential choices. <br />
<br />
'''MultiRC''' (Multi-Sentence Reading Comprehension [12]): QA task in which given a passage and potential answers, the model should label the answers as true or false. <br />
<br />
'''ReCoRD''' (Reading Comprehension with Commonsense Reasoning Dataset [13]): A multiple-choice, question answering task, where given a passage with a masked entity, the model should be able to predict the masked out entity from the choices.<br />
<br />
'''RTE''' (Recognizing Textual Entailment [14]): Classifying whether a text that can be plausibly inferred from a given passage. <br />
<br />
'''WiC''' (Word in Context [15]): Identifying whether a polysemous word used in multiple sentences is being used with the same sense across sentences or not. <br />
<br />
'''WSC''' (Winograd Schema Challenge, [16]): A conference resolution task where sentences include pronouns and noun phrases from the sentence. The goal is to identify the correct reference to a noun phrase corresponding to the pronoun.<br />
<br />
== Model Analysis ==<br />
SuperGLUE includes two tasks for analyzing linguistic knowledge and gender bias in models. To analyze linguistic and world knowledge, submissions to SuperBLUE are required to include predictions of sentence pair relation (entailment, not_entailment) on the resulting set for RTE task. As for gender bias, SuperGLUE includes a diagnostic dataset Winogender, which measures gender bias in co-reference resolution systems. A poor bias score indicates gender bias, however, a good score does not necessarily mean a model is unbiased. This is one limitation of the dataset. <br />
<br />
<br />
== Results ==<br />
<br />
Table 1 offers a summary of the results from SuperGLUE across different models. CBOW baselines are generally close to roughly chance performance. BERT, on the other hand, increased the SuperGLUE score by 25 points and had the highest improvement on most tasks, especially MultiRCC, ReCoRD, and RTE. WSC is trickier for BERT, potentially owing to the small dataset size. <br />
<br />
BERT++[8] increases BERT’s performance even further. However, achieving the goal of the benchmark, the best model/score still lags behind compared to human performance. The large gaps should be relatively tricky for models to close in on. The biggest margin is for WSC with 35 points and CV, RTE, BoolQ, WiC all have 10 point margins.<br />
<br />
<br />
[[File: 800px-SuperGLUE result.png]]<br />
<br />
Table 1: Baseline performance on SuperGLUE tasks.<br />
<br />
== Conclusion ==<br />
SuperGLUE fills the gap that GLUE has created owing to its inability to keep up with the SOTA in NLP. The new language tasks that the benchmark offers are built to be more robust and difficult to solve for NLP models. With the difference in model accuracy being around 10-35 points across all tasks, SuperGLUE is definitely going to be around for some time before the models catch up to it, as well. Overall, this is a significant contribution to improve general-purpose natural language understanding. <br />
<br />
== Critique == <br />
This is quite a fascinating read where the authors of the gold-standard benchmark have essentially conceded to the progress in NLP. Bowman’s team resorting to creating a new benchmark altogether to keep up with the rapid pace of increase in NLP makes me wonder if these benchmarks are inherently flawed. Applying the idea of Wittgenstein’s Ruler, are we measuring the performance of models using the benchmark, or the quality of benchmarks using the models? <br />
<br />
I’m curious how long SuperGLUE would stay relevant owing to advances in NLP. GPT-3, released in June 2020, has outperformed GPT-2 and BERT by a huge margin, given the 100x increase in parameters (175B Parameters over ~600GB for GPT-3, compared to 1.5B parameters over 40GB for GPT-2). In October 2020, a new deep learning technique (Pattern Exploiting Training) managed to train a Transformer NLP model with 223M parameters (roughly 0.01% parameters of GPT-3) and outperformed GPT-3 by 3 points on SuperGLUE. With the field improving so rapidly, I think superGLUE is nothing but a bandaid for the benchmarking tasks that will turn obsolete in no time.<br />
<br />
== References ==<br />
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.<br />
<br />
[2] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-1202. URL https://www.aclweb.org/anthology/N18-1202<br />
<br />
[3] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018. Unpublished ms. available through a link at https://blog.openai.com/language-unsupervised/.<br />
<br />
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https: //arxiv.org/abs/1810.04805.<br />
<br />
[5] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=rJ4km2R5t7.<br />
<br />
[6] Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the 11th Language Resources and Evaluation Conference. European Language Resource Association, 2018. URL https://www.aclweb.org/anthology/L18-1269.<br />
<br />
[7] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information processing Systems (NeurIPS). Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.pdf.<br />
<br />
[8] Jason Phang, Thibault Févry, and Samuel R Bowman. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint 1811.01088, 2018. URL https://arxiv.org/abs/1811.01088.<br />
<br />
[9] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936,2019a.<br />
<br />
[10] Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. The CommitmentBank: Investigating projection in naturally occurring discourse. 2019. To appear in Proceedings of Sinn und Bedeutung 23. Data can be found at https://github.com/mcdm/CommitmentBank/.<br />
<br />
[11] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011.<br />
<br />
[12] Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language technologies (NAACL-HLT). Association for Computational Linguistics, 2018. URL https://www.aclweb.org/anthology/papers/N/N18/N18-1023/.<br />
<br />
[13] Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint 1810.12885, 2018.<br />
<br />
[14] Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment. Springer, 2006. URL https://link.springer.com/chapter/10.1007/11736790_9.<br />
<br />
[15] Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https://arxiv.org/abs/1808.09121.<br />
<br />
[16] Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2012. URL http://dl.acm.org/citation.cfm?id=3031843.3031909.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=SuperGLUE&diff=45132SuperGLUE2020-11-17T15:56:59Z<p>X46yan: </p>
<hr />
<div><br />
== Presented by ==<br />
Shikhar Sakhuja<br />
<br />
== Introduction == <br />
Natural Language Processing (NLP) has seen immense improvements over the past two years. The improvements offered by RNN-based model such as ELMo [2], and Transformer [1] based models such as OpenAI GPT [3], BERT[4], etc. have revolutionized the field. These models render GLUE [5], the standard benchmark for NLP tasks, ineffective. The GLUE benchmark was released over a year ago and assessed NLP models using a single-number metric that summarized performance over some diverse tasks. However, the transformer-based models outperform the non-expert humans in several tasks. With transformer-based models achieving near-perfect scores on almost all tasks in GLUE and outperforming humans in some, there is a need for a new benchmark that involves harder and even more diverse language tasks. The authors release SuperGLUE as a new benchmark that has a more rigorous set of language understanding tasks. <br />
<br />
<br />
== Related Work == <br />
There have been several benchmarks attempting to standardize the field of language understanding tasks. SentEval [6] evaluated fixed-size sentence embeddings for tasks. DecaNLP [7] converts tasks into a general question-answering format. GLUE offers a much more flexible and extensible benchmark since it imposes no restrictions on model architectures or parameter sharing. <br />
<br />
GLUE has been the gold standard for language understanding tests since it’s release. In fact, the benchmark has promoted growth in language models with all the transformer-based models started with attempting to achieve high scores on GLUE. Original GPT and BERT models scored 72.8 and 80.2 on GLUE. The latest GPT and BERT models, however, far outperform these benchmarks and strike a need for a more robust and difficult benchmark. <br />
<br />
<br />
== Motivation ==<br />
Transformer based NLP models allow NLP models to train using transfer learning which was previously only seen in Computer Vision tasks and was notoriously difficult for language because of the discrete nature of words. Transfer Learning in NLP allows models to be trained over terabytes of language data in a self-supervised fashion. These models can then be finetuned for downstream tasks such as sentiment classification, fake news detection, etc. The fine-tuned models beat many of the human labellers who weren’t experts in the domain. Thus, creating a need for a newer, more robust baseline that can stay relevant with the rapid improvements in the field of NLP. <br />
<br />
[[File:loser glue.png]]<br />
<br />
Figure 1: Transformer-based models outperforming humans in GLUE tasks.<br />
<br />
== SuperGLUE Tasks ==<br />
<br />
SuperGLUE has 8 language understanding tasks. They test a model’s understanding of texts in English. The tasks are built to be equivalent to the capabilities of most college-educated English speakers and are beyond the capabilities of most state-of-the-art systems today. <br />
<br />
'''BoolQ''' (Boolean Questions [9]): QA task consisting of short passage and related questions to the passage as either a yes or a no answer. <br />
<br />
'''CB''' (CommitmentBank [10]): Corpus of text where sentences have embedded clauses and sentences are written with the goal of keeping the clause accurate. <br />
<br />
'''COPA''' (Choice of plausible Alternatives [11]): Reasoning tasks in which given a sentence the system must be able to choose the cause or effect of the sentence from two potential choices. <br />
<br />
'''MultiRC''' (Multi-Sentence Reading Comprehension [12]): QA task in which given a passage and potential answers, the model should label the answers as true or false. <br />
<br />
'''ReCoRD''' (Reading Comprehension with Commonsense Reasoning Dataset [13]): A multiple-choice, question answering task, where given a passage with a masked entity, the model should be able to predict the masked out entity from the choices.<br />
<br />
'''RTE''' (Recognizing Textual Entailment [14]): Classifying whether a text that can be plausibly inferred from a given passage. <br />
<br />
'''WiC''' (Word in Context [15]): Identifying whether a polysemous word used in multiple sentences is being used with the same sense across sentences or not. <br />
<br />
'''WSC''' (Winograd Schema Challenge, [16]): A conference resolution task where sentences include pronouns and noun phrases from the sentence. The goal is to identify the correct reference to a noun phrase corresponding to the pronoun.<br />
<br />
== Results ==<br />
<br />
Table 1 offers a summary of the results from SuperGLUE across different models. CBOW baselines are generally close to roughly chance performance. BERT, on the other hand, increased the SuperGLUE score by 25 points and had the highest improvement on most tasks, especially MultiRCC, ReCoRD, and RTE. WSC is trickier for BERT, potentially owing to the small dataset size. <br />
<br />
BERT++[8] increases BERT’s performance even further. However, achieving the goal of the benchmark, the best model/score still lags behind compared to human performance. The large gaps should be relatively tricky for models to close in on. The biggest margin is for WSC with 35 points and CV, RTE, BoolQ, WiC all have 10 point margins.<br />
<br />
<br />
[[File: 800px-SuperGLUE result.png]]<br />
<br />
Table 1: Baseline performance on SuperGLUE tasks.<br />
<br />
== Conclusion ==<br />
SuperGLUE fills the gap that GLUE has created owing to its inability to keep up with the SOTA in NLP. The new language tasks that the benchmark offers are built to be more robust and difficult to solve for NLP models. With the difference in model accuracy being around 10-35 points across all tasks, SuperGLUE is definitely going to be around for some time before the models catch up to it, as well. Overall, this is a significant contribution to improve general-purpose natural language understanding. <br />
<br />
== Critique == <br />
This is quite a fascinating read where the authors of the gold-standard benchmark have essentially conceded to the progress in NLP. Bowman’s team resorting to creating a new benchmark altogether to keep up with the rapid pace of increase in NLP makes me wonder if these benchmarks are inherently flawed. Applying the idea of Wittgenstein’s Ruler, are we measuring the performance of models using the benchmark, or the quality of benchmarks using the models? <br />
<br />
I’m curious how long SuperGLUE would stay relevant owing to advances in NLP. GPT-3, released in June 2020, has outperformed GPT-2 and BERT by a huge margin, given the 100x increase in parameters (175B Parameters over ~600GB for GPT-3, compared to 1.5B parameters over 40GB for GPT-2). In October 2020, a new deep learning technique (Pattern Exploiting Training) managed to train a Transformer NLP model with 223M parameters (roughly 0.01% parameters of GPT-3) and outperformed GPT-3 by 3 points on SuperGLUE. With the field improving so rapidly, I think superGLUE is nothing but a bandaid for the benchmarking tasks that will turn obsolete in no time.<br />
<br />
== References ==<br />
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.<br />
<br />
[2] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-1202. URL https://www.aclweb.org/anthology/N18-1202<br />
<br />
[3] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018. Unpublished ms. available through a link at https://blog.openai.com/language-unsupervised/.<br />
<br />
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https: //arxiv.org/abs/1810.04805.<br />
<br />
[5] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=rJ4km2R5t7.<br />
<br />
[6] Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the 11th Language Resources and Evaluation Conference. European Language Resource Association, 2018. URL https://www.aclweb.org/anthology/L18-1269.<br />
<br />
[7] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information processing Systems (NeurIPS). Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.pdf.<br />
<br />
[8] Jason Phang, Thibault Févry, and Samuel R Bowman. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint 1811.01088, 2018. URL https://arxiv.org/abs/1811.01088.<br />
<br />
[9] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936,2019a.<br />
<br />
[10] Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. The CommitmentBank: Investigating projection in naturally occurring discourse. 2019. To appear in Proceedings of Sinn und Bedeutung 23. Data can be found at https://github.com/mcdm/CommitmentBank/.<br />
<br />
[11] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011.<br />
<br />
[12] Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language technologies (NAACL-HLT). Association for Computational Linguistics, 2018. URL https://www.aclweb.org/anthology/papers/N/N18/N18-1023/.<br />
<br />
[13] Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint 1810.12885, 2018.<br />
<br />
[14] Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment. Springer, 2006. URL https://link.springer.com/chapter/10.1007/11736790_9.<br />
<br />
[15] Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https://arxiv.org/abs/1808.09121.<br />
<br />
[16] Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2012. URL http://dl.acm.org/citation.cfm?id=3031843.3031909.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=44348Dense Passage Retrieval for Open-Domain Question Answering2020-11-15T03:28:59Z<p>X46yan: </p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= Introduction =<br />
Open domain question answering is a task that finds question answers from a large collection of documents. Nowadays open domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one (1) is usually done through bag-of-words models, which count overlapping words and their frequencies in documents. Each document is represented by a high-dimensional, sparse vector. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear, and then in stage two, a reader would read the subset and locate the answer spans. Stage two is usually done through neural models, like Bert. While stage two benefits a lot from the recent advancement of neural language models, stage one still relies on traditional term-based models. This paper tries to improve stage one by using dense retrieval methods that generate dense, latent semantic document embedding, and demonstrates that dense retrieval methods can not only outperform BM25, but also improve the end-to-end QA accuracies. <br />
<br />
= Background =<br />
The following example clearly shows what problems open domain QA systems tackle. Given a question: "What is Uranus?", a system should find the answer spans from a large corpus. The corpus size can be billions of documents. In stage one, a retriever would select a small set of potentially relevant documents, which then would be fed to a neural reader in stage two for the answer spans extraction. Only a filtered subset of documents is processed by a neural reader since neural reading comprehension is expensive. It's impractical to process billions of documents using a neural reader.<br />
<br />
= Dense Passage Retriever =<br />
This paper focuses on improving the retrieval component and proposed a framework called Dense Passage Retriever (DPR) which aims to efficiently retrieve the top K most relevant passages from a large passage collection. The key component of DPR is a dual-BERT model which encodes queries and passages in a vector space where relevant pairs of queries and passages are closer than irrelevant ones. <br />
<br />
== Model Architecture Overview ==<br />
DPR has two independent BERT encoders: a query encoder Eq, and a passage encoder Ep. They map each input sentence to a d dimensional real-valued vector, and the similarity between a query and a passage is defined as the dot product of their vectors. DPR uses the [CLS] token output as embedding vectors, so d = 768.<br />
<br />
== Training ==<br />
The training data can be viewed as m instances of (query, positive passage, negative passages) pairs. The loss function is defined as the negative log likelihood of the positive passages. [[File: dpr_loss_fn.png | 400px]]<br />
<br />
While positive passage selection is simple, where the passage contains the answer is selected, negative passage selection is less explicit. The authors experimented with three types of negative passages: (1) Random passage from the corpus; (2) false positive passages returned by BM25; (3) Gold positive passages from the training set — i.e., a positive passage for one query is considered as a negative passage for another query. The authors got the best model by using gold positive passages from the same batch as negatives. This trick is called in-batch negatives. Assume there are B pairs of (query q_i, positive passage p_i) in a mini-batch, then the negative passages for query q_i are the passages p_j when j is not equal to i. <br />
<br />
= Experimental Setup =<br />
The authors pre-processed Wikipedia documents and split each document into passages of length 100 words. These passages form a candidate pool. Five QA datasets are used: Natural Questions (NQ), TriviaQA, WebQuestions (WQ), CuratedTREC (TREC), and SQuAD v1.1. To build the training data, the authors match each question in the five datasets with a passage that contains the correct answer. The dataset statistics are summarized below. <br />
<br />
[[File: DPR_datasets.png | 600px]]<br />
<br />
= Retrieval Performance Evaluation =<br />
The authors trained DPR on five datasets separately, and on the combined dataset. They compared DPR performance with the performance of the term-frequency based model BM25, and BM25+DPR. The DPR performance is evaluated in terms of the retrieval accuracy, ablation study, qualitative analysis, and run-time efficiency.<br />
<br />
== Main Results ==<br />
The table below shows the retrieval accuracy on test sets. DPR generally performs better on datasets excluding SQuAD. The authors speculated that the high lexical overlap between queries and passages in SQuAD results in BM25 outperforming DPR.<br />
<br />
[[ File: retrieval_accuracy.png | 800px]]<br />
<br />
<br />
== Ablation Study on Model Training ==<br />
The authors further analyzed how different training options would influence the model performance. The five training options they studied are (1) Sample efficiency, (2) In-batch negative training, (3) Impact of gold passages, (4) Similarity and loss, and (5) Cross-dataset generalization. <br />
<br />
(1) '''Sample efficiency''' <br />
<br />
The authors examined how many training examples were needed to achieve good performance. The study showed that DPR trained with 1k examples already outperformed BM25. With more training data, DPR performs better.<br />
<br />
[[File: sample_efficiency.png | 500px ]]<br />
<br />
(2) '''In-batch negative training'''<br />
<br />
Three training schemes are evaluated on the development dataset. The first scheme, which is the standard 1-to-N training setting, is to pair each query with one positive passage and n negative passages. As mentioned before, there are three ways to select negative passages: random, BM25, and gold. The results showed that in this setting, the choices of negative passages did not have strong impact on the model performance. The top block of the table below shows the retrieval accuracy in the 1-to-N setting. The second scheme, which is called in-batch negative setting, is to use positive passages for other queries in the same batch as negatives. The middle block shows the in-batch negative training results. The performance is significantly improved compared to the first setting. The last scheme is to augment in-batch negative with addition hard negative passages that are high ranked by BM25, but do not contain the correct answer. The bottom block shows the result for this setting, and the authors found that adding one addition BM25 hard negatives works the best.<br />
<br />
[[File: training_scheme.png | 500px]]<br />
<br />
(3) '''Similarity and loss'''<br />
In this paper, the similarity between a query and a passage is measured by the dot product of the two vectors, and the loss function is defined as the negative log likelihood of the positive passages. The authors experimented with other similarity functions such as L2-norm and cosine distance, and other loss functions such as triplet loss. The results showed these options didn't improve the model performance much.<br />
<br />
(4) '''Cross-dataset generalization'''<br />
Cross-dataset generalization studies if a model trained on one dataset can perform well on other unseen datasets. The authors trained DPR on Natural Questions dataset and tested it on WebQuestions and CuratedTREC. The result showed DPR generalized well, with only 3~5 points loss. <br />
<br />
<br />
== Qualitative Analysis ==<br />
Since BM25 is a bag-of-words method which ranks passages based on term-frequency, it's good at exact keywords and phrase matching, while DPR can capture lexical variants and semantic relationships. Generally DPR outperforms BM25 on the test sets. <br />
<br />
<br />
== Run-time Efficiency ==<br />
The time required for generating dense embeddings and indexing passages is long. It took the authors 8.8 hours to encode 21-million passages on 8 GPUs, and 8.5 hours to index 21-million passages on a singe server. Conversely, building inverted index for 21-million passages only takes about 30 minutes. However, after the pre-processing is done, DPR can process 995 queries per second, while BM25 processes 23.7 queries per second.<br />
<br />
<br />
= Experiments: Question Answering =<br />
The authors evaluated DPR on end-to-end QA systems (i.e., retriever + neural reader). The results showed that higher retriever accuracy typically leads to better final QA results, and the passages retrieved by DPR are more likely to contain the correct answers. As shown in the table below, QA systems using DPR generally perform better, except for SQuAD.<br />
<br />
[[File: QA.png | 600px]]<br />
<br />
<br />
= Conclusion =<br />
In conclusion, this paper proposed a dense retrieval method which can generally outperform traditional bag-of-words methods in open domain question answering tasks. Dense retrieval methods make use of pre-training language model Bert and achieved state-of-art performance on many QA datasets. Dense retrieval methods are good at capturing word variants and semantic relationships, but are relatively weak at capturing exact keyword match. This paper also covers a few training techniques that are required to successfully train a dense retriever. Overall, learning dense representations for first stage retrieval can potentially improve QA system performance, and has received more attention. <br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=44347Dense Passage Retrieval for Open-Domain Question Answering2020-11-15T03:27:49Z<p>X46yan: </p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= 1. Introduction =<br />
Open domain question answering is a task that finds question answers from a large collection of documents. Nowadays open domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one (1) is usually done through bag-of-words models, which count overlapping words and their frequencies in documents. Each document is represented by a high-dimensional, sparse vector. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear, and then in stage two, a reader would read the subset and locate the answer spans. Stage two is usually done through neural models, like Bert. While stage two benefits a lot from the recent advancement of neural language models, stage one still relies on traditional term-based models. This paper tries to improve stage one by using dense retrieval methods that generate dense, latent semantic document embedding, and demonstrates that dense retrieval methods can not only outperform BM25, but also improve the end-to-end QA accuracies. <br />
<br />
= 2. Background =<br />
The following example clearly shows what problems open domain QA systems tackle. Given a question: "What is Uranus?", a system should find the answer spans from a large corpus. The corpus size can be billions of documents. In stage one, a retriever would select a small set of potentially relevant documents, which then would be fed to a neural reader in stage two for the answer spans extraction. Only a filtered subset of documents is processed by a neural reader since neural reading comprehension is expensive. It's impractical to process billions of documents using a neural reader.<br />
<br />
= 3. Dense Passage Retriever =<br />
This paper focuses on improving the retrieval component and proposed a framework called Dense Passage Retriever (DPR) which aims to efficiently retrieve the top K most relevant passages from a large passage collection. The key component of DPR is a dual-BERT model which encodes queries and passages in a vector space where relevant pairs of queries and passages are closer than irrelevant ones. <br />
<br />
== 3.1 Model Architecture Overview ==<br />
DPR has two independent BERT encoders: a query encoder Eq, and a passage encoder Ep. They map each input sentence to a d dimensional real-valued vector, and the similarity between a query and a passage is defined as the dot product of their vectors. DPR uses the [CLS] token output as embedding vectors, so d = 768.<br />
<br />
== 3.2 Training ==<br />
The training data can be viewed as m instances of (query, positive passage, negative passages) pairs. The loss function is defined as the negative log likelihood of the positive passages. [[File: dpr_loss_fn.png | 400px]]<br />
<br />
While positive passage selection is simple, where the passage contains the answer is selected, negative passage selection is less explicit. The authors experimented with three types of negative passages: (1) Random passage from the corpus; (2) false positive passages returned by BM25; (3) Gold positive passages from the training set — i.e., a positive passage for one query is considered as a negative passage for another query. The authors got the best model by using gold positive passages from the same batch as negatives. This trick is called in-batch negatives. Assume there are B pairs of (query q_i, positive passage p_i) in a mini-batch, then the negative passages for query q_i are the passages p_j when j is not equal to i. <br />
<br />
= 4. Experimental Setup =<br />
The authors pre-processed Wikipedia documents and split each document into passages of length 100 words. These passages form a candidate pool. Five QA datasets are used: Natural Questions (NQ), TriviaQA, WebQuestions (WQ), CuratedTREC (TREC), and SQuAD v1.1. To build the training data, the authors match each question in the five datasets with a passage that contains the correct answer. The dataset statistics are summarized below. <br />
<br />
[[File: DPR_datasets.png | 600px]]<br />
<br />
= 5. Retrieval Performance Evaluation =<br />
The authors trained DPR on five datasets separately, and on the combined dataset. They compared DPR performance with the performance of the term-frequency based model BM25, and BM25+DPR. The DPR performance is evaluated in terms of the retrieval accuracy, ablation study, qualitative analysis, and run-time efficiency.<br />
<br />
== 5.1 Main Results ==<br />
The table below shows the retrieval accuracy on test sets. DPR generally performs better on datasets excluding SQuAD. The authors speculated that the high lexical overlap between queries and passages in SQuAD results in BM25 outperforming DPR.<br />
<br />
[[ File: retrieval_accuracy.png | 800px]]<br />
<br />
<br />
== 5.2 Ablation Study on Model Training ==<br />
The authors further analyzed how different training options would influence the model performance. The five training options they studied are (1) Sample efficiency, (2) In-batch negative training, (3) Impact of gold passages, (4) Similarity and loss, and (5) Cross-dataset generalization. <br />
<br />
(1) '''Sample efficiency''' <br />
<br />
The authors examined how many training examples were needed to achieve good performance. The study showed that DPR trained with 1k examples already outperformed BM25. With more training data, DPR performs better.<br />
<br />
[[File: sample_efficiency.png | 500px ]]<br />
<br />
(2) '''In-batch negative training'''<br />
<br />
Three training schemes are evaluated on the development dataset. The first scheme, which is the standard 1-to-N training setting, is to pair each query with one positive passage and n negative passages. As mentioned before, there are three ways to select negative passages: random, BM25, and gold. The results showed that in this setting, the choices of negative passages did not have strong impact on the model performance. The top block of the table below shows the retrieval accuracy in the 1-to-N setting. The second scheme, which is called in-batch negative setting, is to use positive passages for other queries in the same batch as negatives. The middle block shows the in-batch negative training results. The performance is significantly improved compared to the first setting. The last scheme is to augment in-batch negative with addition hard negative passages that are high ranked by BM25, but do not contain the correct answer. The bottom block shows the result for this setting, and the authors found that adding one addition BM25 hard negatives works the best.<br />
<br />
[[File: training_scheme.png | 500px]]<br />
<br />
(3) '''Similarity and loss'''<br />
In this paper, the similarity between a query and a passage is measured by the dot product of the two vectors, and the loss function is defined as the negative log likelihood of the positive passages. The authors experimented with other similarity functions such as L2-norm and cosine distance, and other loss functions such as triplet loss. The results showed these options didn't improve the model performance much.<br />
<br />
(4) '''Cross-dataset generalization'''<br />
Cross-dataset generalization studies if a model trained on one dataset can perform well on other unseen datasets. The authors trained DPR on Natural Questions dataset and tested it on WebQuestions and CuratedTREC. The result showed DPR generalized well, with only 3~5 points loss. <br />
<br />
<br />
== 5.3 Qualitative Analysis ==<br />
Since BM25 is a bag-of-words method which ranks passages based on term-frequency, it's good at exact keywords and phrase matching, while DPR can capture lexical variants and semantic relationships. Generally DPR outperforms BM25 on the test sets. <br />
<br />
<br />
== 5.4 Run-time Efficiency ==<br />
The time required for generating dense embeddings and indexing passages is long. It took the authors 8.8 hours to encode 21-million passages on 8 GPUs, and 8.5 hours to index 21-million passages on a singe server. Conversely, building inverted index for 21-million passages only takes about 30 minutes. However, after the pre-processing is done, DPR can process 995 queries per second, while BM25 processes 23.7 queries per second.<br />
<br />
<br />
= 6. Experiments: Question Answering =<br />
The authors evaluated DPR on end-to-end QA systems (i.e., retriever + neural reader). The results showed that higher retriever accuracy typically leads to better final QA results, and the passages retrieved by DPR are more likely to contain the correct answers. As shown in the table below, QA systems using DPR generally perform better, except for SQuAD.<br />
<br />
[[File: QA.png | 600px]]<br />
<br />
<br />
= 7. Conclusion =<br />
In conclusion, this paper proposed a dense retrieval method which can generally outperform traditional bag-of-words methods in open domain question answering tasks. Dense retrieval methods make use of pre-training language model Bert and achieved state-of-art performance on many QA datasets. Dense retrieval methods are good at capturing word variants and semantic relationships, but are relatively weak at capturing exact keyword match. This paper also covers a few training techniques that are required to successfully train a dense retriever. Overall, learning dense representations for first stage retrieval can potentially improve QA system performance, and has received more attention. <br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=44344Dense Passage Retrieval for Open-Domain Question Answering2020-11-15T03:21:37Z<p>X46yan: </p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= 1. Introduction =<br />
Open domain question answering is a task that finds question answers from a large collection of documents. Nowadays open domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one (1) is usually done through bag-of-words models, which count overlapping words and their frequencies in documents. Each document is represented by a high-dimensional, sparse vector. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear, and then in stage two, a reader would read the subset and locate the answer spans. Stage two is usually done through neural models, like Bert. While stage two benefits a lot from the recent advancement of neural language models, stage one still relies on traditional term-based models. This paper tries to improve stage one by using dense retrieval methods that generate dense, latent semantic document embedding, and demonstrates that dense retrieval methods can not only outperform BM25, but also improve the end-to-end QA accuracies. <br />
<br />
= 2. Background =<br />
The following example clearly shows what problems open domain QA systems tackle. Given a question: "What is Uranus?", a system should find the answer spans from a large corpus. The corpus size can be billions of documents. In stage one, a retriever would select a small set of potentially relevant documents, which then would be fed to a neural reader in stage two for the answer spans extraction. Only a filtered subset of documents is processed by a neural reader since neural reading comprehension is expensive. It's impractical to process billions of documents using a neural reader.<br />
<br />
= 3. Dense Passage Retriever =<br />
This paper focuses on improving the retrieval component and proposed a framework called Dense Passage Retriever (DPR) which aims to efficiently retrieve the top K most relevant passages from a large passage collection. The key component of DPR is a dual-BERT model which encodes queries and passages in a vector space where relevant pairs of queries and passages are closer than irrelevant ones. <br />
<br />
== 3.1 Model Architecture Overview ==<br />
DPR has two independent BERT encoders: a query encoder Eq, and a passage encoder Ep. They map each input sentence to a d dimensional real-valued vector, and the similarity between a query and a passage is defined as the dot product of their vectors. DPR uses the [CLS] token output as embedding vectors, so d = 768.<br />
<br />
== 3.2 Training ==<br />
The training data can be viewed as m instances of (query, positive passage, negative passages) pairs. The loss function is defined as the negative log likelihood of the positive passages. [[File: dpr_loss_fn.png | 400px]]<br />
<br />
While positive passage selection is simple, where the passage contains the answer is selected, negative passage selection is less explicit. The authors experimented with three types of negative passages: (1) Random passage from the corpus; (2) false positive passages returned by BM25; (3) Gold positive passages from the training set — i.e., a positive passage for one query is considered as a negative passage for another query. The authors got the best model by using gold positive passages from the same batch as negatives. This trick is called in-batch negatives. Assume there are B pairs of (query q_i, positive passage p_i) in a mini-batch, then the negative passages for query q_i are the passages p_j when j is not equal to i. <br />
<br />
= 4. Experimental Setup =<br />
The authors pre-processed Wikipedia documents and split each document into passages of length 100 words. These passages form a candidate pool. Five QA datasets are used: Natural Questions (NQ), TriviaQA, WebQuestions (WQ), CuratedTREC (TREC), and SQuAD v1.1. To build the training data, the authors match each question in the five datasets with a passage that contains the correct answer. The dataset statistics are summarized below. <br />
<br />
[[File: DPR_datasets.png | 600px]]<br />
<br />
= 5. Retrieval Performance Evaluation =<br />
The authors trained DPR on five datasets separately, and on the combined dataset. They compared DPR performance with the performance of the term-frequency based model BM25, and BM25+DPR. The DPR performance is evaluated in terms of the retrieval accuracy, ablation study, qualitative analysis, and run-time efficiency.<br />
<br />
== 5.1 Main Results ==<br />
The table below shows the retrieval accuracy on test sets. DPR generally performs better on datasets excluding SQuAD. The authors speculated that the high lexical overlap between queries and passages in SQuAD results in BM25 outperforming DPR.<br />
<br />
[[ File: retrieval_accuracy.png | 800px]]<br />
<br />
<br />
== 5.2 Ablation Study on Model Training ==<br />
The authors further analyzed how different training options would influence the model performance. The five training options they studied are (1) Sample efficiency, (2) In-batch negative training, (3) Impact of gold passages, (4) Similarity and loss, and (5) Cross-dataset generalization. <br />
<br />
(1) '''Sample efficiency''' <br />
<br />
The authors examined how many training examples were needed to achieve good performance. The study showed that DPR trained with 1k examples already outperformed BM25. With more training data, DPR performs better.<br />
<br />
[[File: sample_efficiency.png | 500px ]]<br />
<br />
(2) '''In-batch negative training'''<br />
<br />
Three training schemes are evaluated on the development dataset. The first scheme, which is the standard 1-to-N training setting, is to pair each query with one positive passage and n negative passages. As mentioned before, there are three ways to select negative passages: random, BM25, and gold. The results showed that in this setting, the choices of negative passages did not have strong impact on the model performance. The top block of the table below shows the retrieval accuracy in the 1-to-N setting. The second scheme, which is called in-batch negative setting, is to use positive passages for other queries in the same batch as negatives. The middle block shows the in-batch negative training results. The performance is significantly improved compared to the first setting. The last scheme is to augment in-batch negative with addition hard negative passages that are high ranked by BM25, but do not contain the correct answer. The bottom block shows the result for this setting, and the authors found that adding one addition BM25 hard negatives works the best.<br />
<br />
[[File: training_scheme.png | 500px]]<br />
<br />
(3) '''Similarity and loss'''<br />
In this paper, the similarity between a query and a passage is measured by the dot product of the two vectors, and the loss function is defined as the negative log likelihood of the positive passages. The authors experimented with other similarity functions such as L2-norm and cosine distance, and other loss functions such as triplet loss. The results showed these options didn't improve the model performance much.<br />
<br />
(4) '''Cross-dataset generalization'''<br />
Cross-dataset generalization studies if a model trained on one dataset can perform well on other unseen datasets. The authors trained DPR on Natural Questions dataset and tested it on WebQuestions and CuratedTREC. The result showed DPR generalized well, with only 3~5 points loss. <br />
<br />
<br />
== 5.3 Qualitative Analysis ==<br />
Since BM25 is a bag-of-words method which ranks passages based on term-frequency, it's good at exact keywords and phrase matching, while DPR can capture lexical variants and semantic relationships. Generally DPR outperforms BM25 on the test sets. <br />
<br />
<br />
== 5.4 Run-time Efficiency ==<br />
The time required for generating dense embeddings and indexing passages is long. It took the authors 8.8 hours to encode 21-million passages on 8 GPUs, and 8.5 hours to index 21-million passages on a singe server. Conversely, building inverted index for 21-million passages only takes about 30 minutes. However, after the pre-processing is done, DPR can process 995 queries per second, while BM25 processes 23.7 queries per second.<br />
<br />
<br />
= 6. Experiments: Question Answering =<br />
The authors evaluated DPR on end-to-end QA systems (i.e., retriever + neural reader). The results showed that higher retriever accuracy typically leads to better final QA results, and the passages retrieved by DPR are more likely to contain the correct answers. As shown in the table below, QA systems using DPR generally perform better, except for SQuAD.<br />
<br />
[[File: QA.png | 600px]]<br />
<br />
<br />
= 7. Conclusion =<br />
In conclusion, this paper proposed a dense retrieval method which can generally outperform traditional bag-of-words methods in open domain question answering tasks. Dense retrieval methods make use of pre-training language model Bert and achieved state-of-art performance on many QA datasets. Dense retrieval methods are good at capturing word variants and semantic relationships, but are relatively weak at capturing exact keyword match. This paper also covers a few training techniques that are required to successfully train a dense retriever. <br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:QA.png&diff=44342File:QA.png2020-11-15T03:08:13Z<p>X46yan: </p>
<hr />
<div></div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=44341Dense Passage Retrieval for Open-Domain Question Answering2020-11-15T02:57:50Z<p>X46yan: </p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= 1. Introduction =<br />
Open domain question answering is a task that finds question answers from a large collection of documents. Nowadays open domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one (1) is usually done through bag-of-words models, which count overlapping words and their frequencies in documents. Each document is represented by a high-dimensional, sparse vector. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear, and then in stage two, a reader would read the subset and locate the answer spans. Stage two is usually done through neural models, like Bert. While stage two benefits a lot from the recent advancement of neural language models, stage one still relies on traditional term-based models. This paper tries to improve stage one by using dense retrieval methods that generate dense, latent semantic document embedding, and demonstrates that dense retrieval methods can not only outperform BM25, but also improve the end-to-end QA accuracies. <br />
<br />
= 2. Background =<br />
The following example clearly shows what problems open domain QA systems tackle. Given a question: "What is Uranus?", a system should find the answer spans from a large corpus. The corpus size can be billions of documents. In stage one, a retriever would select a small set of potentially relevant documents, which then would be fed to a neural reader in stage two for the answer spans extraction. Only a filtered subset of documents is processed by a neural reader since neural reading comprehension is expensive. It's impractical to process billions of documents using a neural reader.<br />
<br />
= 3. Dense Passage Retriever =<br />
This paper focuses on improving the retrieval component and proposed a framework called Dense Passage Retriever (DPR) which aims to efficiently retrieve the top K most relevant passages from a large passage collection. The key component of DPR is a dual-BERT model which encodes queries and passages in a vector space where relevant pairs of queries and passages are closer than irrelevant ones. <br />
<br />
== 3.1 Model Architecture Overview ==<br />
DPR has two independent BERT encoders: a query encoder Eq, and a passage encoder Ep. They map each input sentence to a d dimensional real-valued vector, and the similarity between a query and a passage is defined as the dot product of their vectors. DPR uses the [CLS] token output as embedding vectors, so d = 768.<br />
<br />
== 3.2 Training ==<br />
The training data can be viewed as m instances of (query, positive passage, negative passages) pairs. The loss function is defined as the negative log likelihood of the positive passages. [[File: dpr_loss_fn.png | 400px]]<br />
<br />
While positive passage selection is simple, where the passage contains the answer is selected, negative passage selection is less explicit. The authors experimented with three types of negative passages: (1) Random passage from the corpus; (2) false positive passages returned by BM25; (3) Gold positive passages from the training set — i.e., a positive passage for one query is considered as a negative passage for another query. The authors got the best model by using gold positive passages from the same batch as negatives. This trick is called in-batch negatives. Assume there are B pairs of (query q_i, positive passage p_i) in a mini-batch, then the negative passages for query q_i are the passages p_j when j is not equal to i. <br />
<br />
= 4. Experimental Setup =<br />
The authors pre-processed Wikipedia documents and split each document into passages of length 100 words. These passages form a candidate pool. Five QA datasets are used: Natural Questions (NQ), TriviaQA, WebQuestions (WQ), CuratedTREC (TREC), and SQuAD v1.1. To build the training data, the authors match each question in the five datasets with a passage that contains the correct answer. The dataset statistics are summarized below. <br />
<br />
[[File: DPR_datasets.png | 600px]]<br />
<br />
= 5. Retrieval Performance Evaluation =<br />
The authors trained DPR on five datasets separately, and on the combined dataset. They compared DPR performance with the performance of the term-frequency based model BM25, and BM25+DPR. The DPR performance is evaluated in terms of the retrieval accuracy, ablation study, qualitative analysis, and run-time efficiency.<br />
<br />
== 5.1 Main Results ==<br />
The table below shows the retrieval accuracy on test sets. DPR generally performs better on datasets excluding SQuAD. The authors speculated that the high lexical overlap between queries and passages in SQuAD results in BM25 outperforming DPR.<br />
<br />
[[ File: retrieval_accuracy.png | 800px]]<br />
<br />
<br />
== 5.2 Ablation Study on Model Training ==<br />
The authors further analyzed how different training options would influence the model performance. The five training options they studied are (1) Sample efficiency, (2) In-batch negative training, (3) Impact of gold passages, (4) Similarity and loss, and (5) Cross-dataset generalization. <br />
<br />
(1) '''Sample efficiency''' <br />
<br />
The authors examined how many training examples were needed to achieve good performance. The study showed that DPR trained with 1k examples already outperformed BM25. With more training data, DPR performs better.<br />
<br />
[[File: sample_efficiency.png | 500px ]]<br />
<br />
(2) '''In-batch negative training'''<br />
<br />
Three training schemes are evaluated on the development dataset. The first scheme, which is the standard 1-to-N training setting, is to pair each query with one positive passage and n negative passages. As mentioned before, there are three ways to select negative passages: random, BM25, and gold. The results showed that in this setting, the choices of negative passages did not have strong impact on the model performance. The top block of the table below shows the retrieval accuracy in the 1-to-N setting. The second scheme, which is called in-batch negative setting, is to use positive passages for other queries in the same batch as negatives. The middle block shows the in-batch negative training results. The performance is significantly improved compared to the first setting. The last scheme is to augment in-batch negative with addition hard negative passages that are high ranked by BM25, but do not contain the correct answer. The bottom block shows the result for this setting, and the authors found that adding one addition BM25 hard negatives works the best.<br />
<br />
[[File: training_scheme.png | 500px]]<br />
<br />
(3) '''Similarity and loss'''<br />
In this paper, the similarity between a query and a passage is measured by the dot product of the two vectors, and the loss function is defined as the negative log likelihood of the positive passages. The authors experimented with other similarity functions such as L2-norm and cosine distance, and other loss functions such as triplet loss. The results showed these options didn't improve the model performance much.<br />
<br />
(4) '''Cross-dataset generalization'''<br />
Cross-dataset generalization studies if a model trained on one dataset can perform well on other unseen datasets. The authors trained DPR on Natural Questions dataset and tested it on WebQuestions and CuratedTREC. The result showed DPR generalized well, with only 3~5 points loss. <br />
<br />
<br />
== 5.3 Qualitative Analysis ==<br />
Since BM25 is a bag-of-words method which ranks passages based on term-frequency, it's good at exact keywords and phrase matching, while DPR can capture lexical variants and semantic relationships. Generally DPR outperforms BM25 on the test sets. <br />
<br />
<br />
== 5.4 Run-time Efficiency ==<br />
The time required for generating dense embeddings and indexing passages is long. It took the authors 8.8 hours to encode 21-million passages on 8 GPUs, and 8.5 hours to index 21-million passages on a singe server. Conversely, building inverted index for 21-million passages only takes about 30 minutes. However, after the pre-processing is done, DPR can process 995 queries per second, while BM25 processes 23.7 queries per second.<br />
<br />
<br />
= 6. Experiments: Question Answering =<br />
<br />
= 7. Related Work =<br />
<br />
= 8. Conclusion =<br />
<br />
= Critiques =<br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:training_scheme.png&diff=44323File:training scheme.png2020-11-15T02:14:17Z<p>X46yan: </p>
<hr />
<div></div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=44299Dense Passage Retrieval for Open-Domain Question Answering2020-11-15T01:16:49Z<p>X46yan: </p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= 1. Introduction =<br />
Open domain question answering is a task that finds question answers from a large collection of documents. Nowadays open domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one (1) is usually done through bag-of-words models, which count overlapping words and their frequencies in documents. Each document is represented by a high-dimensional, sparse vector. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear, and then in stage two, a reader would read the subset and locate the answer spans. Stage two is usually done through neural models, like Bert. While stage two benefits a lot from the recent advancement of neural language models, stage one still relies on traditional term-based models. This paper tries to improve stage one by using dense retrieval methods that generate dense, latent semantic document embedding, and demonstrates that dense retrieval methods can not only outperform BM25, but also improve the end-to-end QA accuracies. <br />
<br />
= 2. Background =<br />
The following example clearly shows what problems open domain QA systems tackle. Given a question: "What is Uranus?", a system should find the answer spans from a large corpus. The corpus size can be billions of documents. In stage one, a retriever would select a small set of potentially relevant documents, which then would be fed to a neural reader in stage two for the answer spans extraction. Only a filtered subset of documents is processed by a neural reader since neural reading comprehension is expensive. It's impractical to process billions of documents using a neural reader.<br />
<br />
= 3. Dense Passage Retriever =<br />
This paper focuses on improving the retrieval component and proposed a framework called Dense Passage Retriever (DPR) which aims to efficiently retrieve the top K most relevant passages from a large passage collection. The key component of DPR is a dual-BERT model which encodes queries and passages in a vector space where relevant pairs of queries and passages are closer than irrelevant ones. <br />
<br />
== 3.1 Model Architecture Overview ==<br />
DPR has two independent BERT encoders: a query encoder Eq, and a passage encoder Ep. They map each input sentence to a d dimensional real-valued vector, and the similarity between a query and a passage is defined as the dot product of their vectors. DPR uses the [CLS] token output as embedding vectors, so d = 768.<br />
<br />
== 3.2 Training ==<br />
The training data can be viewed as m instances of (query, positive passage, negative passages) pairs. The loss function is defined as the negative log likelihood of the positive passages. [[File: dpr_loss_fn.png | 400px]]<br />
<br />
While positive passage selection is simple, where the passage contains the answer is selected, negative passage selection is less explicit. The authors experimented with three types of negative passages: (1) Random passage from the corpus; (2) false positive passages returned by BM25; (3) Gold positive passages from the training set — i.e., a positive passage for one query is considered as a negative passage for another query. The authors got the best model by using gold positive passages from the same batch as negatives. This trick is called in-batch negatives. Assume there are B pairs of (query q_i, positive passage p_i) in a mini-batch, then the negative passages for query q_i are the passages p_j when j is not equal to i. <br />
<br />
= 4. Experimental Setup =<br />
The authors pre-processed Wikipedia documents and split each document into passages of length 100 words. These passages form a candidate pool. Five QA datasets are used: Natural Questions (NQ), TriviaQA, WebQuestions (WQ), CuratedTREC (TREC), and SQuAD v1.1. To build the training data, the authors match each question in the five datasets with a passage that contains the correct answer. The dataset statistics are summarized below. <br />
<br />
[[File: DPR_datasets.png | 600px]]<br />
<br />
= 5. Retrieval Performance Evaluation =<br />
The authors trained DPR on five datasets separately, and on the combined dataset. They compared DPR performance with the performance of the term-frequency based model BM25, and BM25+DPR. The DPR performance is evaluated in terms of the retrieval accuracy, ablation study, qualitative analysis, and run-time efficiency.<br />
<br />
== 5.1 Main Results ==<br />
The table below shows the retrieval accuracy on test sets. DPR generally performs better on datasets excluding SQuAD. The authors speculated that the high lexical overlap between queries and passages in SQuAD results in BM25 outperforming DPR.<br />
<br />
[[ File: retrieval_accuracy.png | 800px]]<br />
<br />
<br />
== 5.2 Ablation Study on Model Training ==<br />
The authors further analyzed how different training options would influence the model performance. The five training options they studied are (1) Sample efficiency, (2) In-batch negative training, (3) Impact of gold passages, (4) Similarity and loss, and (5) Cross-dataset generalization. <br />
<br />
(1) '''Sample efficiency''' <br />
<br />
The authors examined how many training examples were needed to achieve good performance. The study showed that DPR trained with 1k examples already outperformed BM25. With more training data, DPR performs better.<br />
<br />
[[File: sample_efficiency.png | 500px ]]<br />
<br />
(2) '''In-batch negative training'''<br />
<br />
<br />
<br />
== 5.3 Qualitative Analysis ==<br />
<br />
== 5.4 Run-time Efficiency ==<br />
<br />
= 6. Experiments: Question Answering =<br />
<br />
= 7. Related Work =<br />
<br />
= 8. Conclusion =<br />
<br />
= Critiques =<br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=44295Dense Passage Retrieval for Open-Domain Question Answering2020-11-15T01:06:55Z<p>X46yan: </p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= 1. Introduction =<br />
Open domain question answering is a task that finds question answers from a large collection of documents. Nowadays open domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one (1) is usually done through bag-of-words models, which count overlapping words and their frequencies in documents. Each document is represented by a high-dimensional, sparse vector. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear, and then in stage two, a reader would read the subset and locate the answer spans. Stage two is usually done through neural models, like Bert. While stage two benefits a lot from the recent advancement of neural language models, stage one still relies on traditional term-based models. This paper tries to improve stage one by using dense retrieval methods that generate dense, latent semantic document embedding, and demonstrates that dense retrieval methods can not only outperform BM25, but also improve the end-to-end QA accuracies. <br />
<br />
= 2. Background =<br />
The following example clearly shows what problems open domain QA systems tackle. Given a question: "What is Uranus?", a system should find the answer spans from a large corpus. The corpus size can be billions of documents. In stage one, a retriever would select a small set of potentially relevant documents, which then would be fed to a neural reader in stage two for the answer spans extraction. Only a filtered subset of documents is processed by a neural reader since neural reading comprehension is expensive. It's impractical to process billions of documents using a neural reader.<br />
<br />
= 3. Dense Passage Retriever =<br />
This paper focuses on improving the retrieval component and proposed a framework called Dense Passage Retriever (DPR) which aims to efficiently retrieve the top K most relevant passages from a large passage collection. The key component of DPR is a dual-BERT model which encodes queries and passages in a vector space where relevant pairs of queries and passages are closer than irrelevant ones. <br />
<br />
== 3.1 Model Architecture Overview ==<br />
DPR has two independent BERT encoders: a query encoder Eq, and a passage encoder Ep. They map each input sentence to a d dimensional real-valued vector, and the similarity between a query and a passage is defined as the dot product of their vectors. DPR uses the [CLS] token output as embedding vectors, so d = 768.<br />
<br />
== 3.2 Training ==<br />
The training data can be viewed as m instances of (query, positive passage, negative passages) pairs. The loss function is defined as the negative log likelihood of the positive passages. [[File: dpr_loss_fn.png | 400px]]<br />
<br />
While positive passage selection is simple, where the passage contains the answer is selected, negative passage selection is less explicit. The authors experimented with three types of negative passages: (1) Random passage from the corpus; (2) false positive passages returned by BM25; (3) Gold positive passages from the training set — i.e., a positive passage for one query is considered as a negative passage for another query. The authors got the best model by using gold positive passages from the same batch as negatives. This trick is called in-batch negatives. Assume there are B pairs of (query q_i, positive passage p_i) in a mini-batch, then the negative passages for query q_i are the passages p_j when j is not equal to i. <br />
<br />
= 4. Experimental Setup =<br />
The authors pre-processed Wikipedia documents and split each document into passages of length 100 words. These passages form a candidate pool. Five QA datasets are used: Natural Questions (NQ), TriviaQA, WebQuestions (WQ), CuratedTREC (TREC), and SQuAD v1.1. To build the training data, the authors match each question in the five datasets with a passage that contains the correct answer. The dataset statistics are summarized below. <br />
<br />
[[File: DPR_datasets.png | 600px]]<br />
<br />
= 5. Retrieval Performance Evaluation =<br />
The authors trained DPR on five datasets separately, and on the combined dataset. They compared DPR performance with the performance of the term-frequency based model BM25, and BM25+DPR. <br />
<br />
== 5.1 Main Results ==<br />
The table below shows the retrieval accuracy on test sets. DPR generally performs better on datasets excluding SQuAD. The authors speculated that the high lexical overlap between queries and passages in SQuAD results in BM25 outperforming DPR.<br />
<br />
[[ File: retrieval_accuracy.png | 800px]]<br />
<br />
<br />
== 5.2 Ablation Study on Model Training ==<br />
The authors further analyzed how different training options would influence the model performance. The five training options they studied are (1) Sample efficiency, (2) In-batch negative training, (3) Impact of gold passages, (4) Similarity and loss, and (5) Cross-dataset generalization. <br />
<br />
(1) '''Sample efficiency''' <br />
<br />
The authors examined how many training examples were needed to achieve good performance. The study showed that DPR trained with 1k examples already outperformed BM25. <br />
<br />
[[File: sample_efficiency.png | 500px ]]<br />
<br />
== 5.3 Qualitative Analysis ==<br />
<br />
== 5.4 Run-time Efficiency ==<br />
<br />
= 6. Experiments: Question Answering =<br />
<br />
= 7. Related Work =<br />
<br />
= 8. Conclusion =<br />
<br />
= Critiques =<br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:sample_efficiency.png&diff=44294File:sample efficiency.png2020-11-15T01:06:35Z<p>X46yan: </p>
<hr />
<div></div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=44281Dense Passage Retrieval for Open-Domain Question Answering2020-11-15T00:55:42Z<p>X46yan: </p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= 1. Introduction =<br />
Open domain question answering is a task that finds question answers from a large collection of documents. Nowadays open domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one (1) is usually done through bag-of-words models, which count overlapping words and their frequencies in documents. Each document is represented by a high-dimensional, sparse vector. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear, and then in stage two, a reader would read the subset and locate the answer spans. Stage two is usually done through neural models, like Bert. While stage two benefits a lot from the recent advancement of neural language models, stage one still relies on traditional term-based models. This paper tries to improve stage one by using dense retrieval methods that generate dense, latent semantic document embedding, and demonstrates that dense retrieval methods can not only outperform BM25, but also improve the end-to-end QA accuracies. <br />
<br />
= 2. Background =<br />
The following example clearly shows what problems open domain QA systems tackle. Given a question: "What is Uranus?", a system should find the answer spans from a large corpus. The corpus size can be billions of documents. In stage one, a retriever would select a small set of potentially relevant documents, which then would be fed to a neural reader in stage two for the answer spans extraction. Only a filtered subset of documents is processed by a neural reader since neural reading comprehension is expensive. It's impractical to process billions of documents using a neural reader.<br />
<br />
= 3. Dense Passage Retriever =<br />
This paper focuses on improving the retrieval component and proposed a framework called Dense Passage Retriever (DPR) which aims to efficiently retrieve the top K most relevant passages from a large passage collection. The key component of DPR is a dual-BERT model which encodes queries and passages in a vector space where relevant pairs of queries and passages are closer than irrelevant ones. <br />
<br />
== 3.1 Model Architecture Overview ==<br />
DPR has two independent BERT encoders: a query encoder Eq, and a passage encoder Ep. They map each input sentence to a d dimensional real-valued vector, and the similarity between a query and a passage is defined as the dot product of their vectors. DPR uses the [CLS] token output as embedding vectors, so d = 768.<br />
<br />
== 3.2 Training ==<br />
The training data can be viewed as m instances of (query, positive passage, negative passages) pairs. The loss function is defined as the negative log likelihood of the positive passages. [[File: dpr_loss_fn.png | 400px]]<br />
<br />
While positive passage selection is simple, where the passage contains the answer is selected, negative passage selection is less explicit. The authors experimented with three types of negative passages: (1) Random passage from the corpus; (2) false positive passages returned by BM25; (3) Gold positive passages from the training set — i.e., a positive passage for one query is considered as a negative passage for another query. The authors got the best model by using gold positive passages from the same batch as negatives. This trick is called in-batch negatives. Assume there are B pairs of (query q_i, positive passage p_i) in a mini-batch, then the negative passages for query q_i are the passages p_j when j is not equal to i. <br />
<br />
= 4. Experimental Setup =<br />
The authors pre-processed Wikipedia documents and split each document into passages of length 100 words. These passages form a candidate pool. Five QA datasets are used: Natural Questions (NQ), TriviaQA, WebQuestions (WQ), CuratedTREC (TREC), and SQuAD v1.1. To build the training data, the authors match each question in the five datasets with a passage that contains the correct answer. The dataset statistics are summaries below. <br />
<br />
[[File: DPR_datasets.png | 600px]]<br />
<br />
= 5. Retrieval Performance Evaluation =<br />
The authors trained DPR on five datasets separately, and on the combined dataset. They compared DPR performance with the performance of the term-frequency based model BM25, and BM25+DPR. <br />
<br />
== 5.1 Main Results ==<br />
The table below shows the retrieval accuracy on test sets. DPR generally performs better on datasets excluding SQuAD. The authors speculated that the high lexical overlap between queries and passages in SQuAD results in BM25 outperforming DPR.<br />
<br />
[[ File: retrieval_accuracy.png | 800px]]<br />
<br />
<br />
== 5.2 Ablation Study on Model Training ==<br />
<br />
== 5.3 Qualitative Analysis ==<br />
<br />
== 5.4 Run-time Efficiency ==<br />
<br />
= 6. Experiments: Question Answering =<br />
<br />
= 7. Related Work =<br />
<br />
= 8. Conclusion =<br />
<br />
= Critiques =<br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=44280Dense Passage Retrieval for Open-Domain Question Answering2020-11-15T00:55:20Z<p>X46yan: </p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= 1. Introduction =<br />
Open domain question answering is a task that finds question answers from a large collection of documents. Nowadays open domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one (1) is usually done through bag-of-words models, which count overlapping words and their frequencies in documents. Each document is represented by a high-dimensional, sparse vector. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear, and then in stage two, a reader would read the subset and locate the answer spans. Stage two is usually done through neural models, like Bert. While stage two benefits a lot from the recent advancement of neural language models, stage one still relies on traditional term-based models. This paper tries to improve stage one by using dense retrieval methods that generate dense, latent semantic document embedding, and demonstrates that dense retrieval methods can not only outperform BM25, but also improve the end-to-end QA accuracies. <br />
<br />
= 2. Background =<br />
The following example clearly shows what problems open domain QA systems tackle. Given a question: "What is Uranus?", a system should find the answer spans from a large corpus. The corpus size can be billions of documents. In stage one, a retriever would select a small set of potentially relevant documents, which then would be fed to a neural reader in stage two for the answer spans extraction. Only a filtered subset of documents is processed by a neural reader since neural reading comprehension is expensive. It's impractical to process billions of documents using a neural reader.<br />
<br />
= 3. Dense Passage Retriever =<br />
This paper focuses on improving the retrieval component and proposed a framework called Dense Passage Retriever (DPR) which aims to efficiently retrieve the top K most relevant passages from a large passage collection. The key component of DPR is a dual-BERT model which encodes queries and passages in a vector space where relevant pairs of queries and passages are closer than irrelevant ones. <br />
<br />
== 3.1 Model Architecture Overview ==<br />
DPR has two independent BERT encoders: a query encoder Eq, and a passage encoder Ep. They map each input sentence to a d dimensional real-valued vector, and the similarity between a query and a passage is defined as the dot product of their vectors. DPR uses the [CLS] token output as embedding vectors, so d = 768.<br />
<br />
== 3.2 Training ==<br />
The training data can be viewed as m instances of (query, positive passage, negative passages) pairs. The loss function is defined as the negative log likelihood of the positive passages. [[File: dpr_loss_fn.png | 400px]]<br />
<br />
While positive passage selection is simple, where the passage contains the answer is selected, negative passage selection is less explicit. The authors experimented with three types of negative passages: (1) Random passage from the corpus; (2) false positive passages returned by BM25; (3) Gold positive passages from the training set — i.e., a positive passage for one query is considered as a negative passage for another query. The authors got the best model by using gold positive passages from the same batch as negatives. This trick is called in-batch negatives. Assume there are B pairs of (query q_i, positive passage p_i) in a mini-batch, then the negative passages for query q_i are the passages p_j when j is not equal to i. <br />
<br />
= 4. Experimental Setup =<br />
The authors pre-processed Wikipedia documents and split each document into passages of length 100 words. These passages form a candidate pool. Five QA datasets are used: Natural Questions (NQ), TriviaQA, WebQuestions (WQ), CuratedTREC (TREC), and SQuAD v1.1. To build the training data, the authors match each question in the five datasets with a passage that contains the correct answer. The dataset statistics are summaries below. [[File: DPR_datasets.png | 600px]]<br />
<br />
= 5. Retrieval Performance Evaluation =<br />
The authors trained DPR on five datasets separately, and on the combined dataset. They compared DPR performance with the performance of the term-frequency based model BM25, and BM25+DPR. <br />
<br />
== 5.1 Main Results ==<br />
The table below shows the retrieval accuracy on test sets. DPR generally performs better on datasets excluding SQuAD. The authors speculated that the high lexical overlap between queries and passages in SQuAD results in BM25 outperforming DPR.<br />
<br />
[[ File: retrieval_accuracy.png | 800px]]<br />
<br />
<br />
== 5.2 Ablation Study on Model Training ==<br />
<br />
== 5.3 Qualitative Analysis ==<br />
<br />
== 5.4 Run-time Efficiency ==<br />
<br />
= 6. Experiments: Question Answering =<br />
<br />
= 7. Related Work =<br />
<br />
= 8. Conclusion =<br />
<br />
= Critiques =<br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:DPR_datasets.png&diff=44279File:DPR datasets.png2020-11-15T00:54:45Z<p>X46yan: </p>
<hr />
<div></div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=44270Dense Passage Retrieval for Open-Domain Question Answering2020-11-15T00:45:27Z<p>X46yan: </p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= 1. Introduction =<br />
Open domain question answering is a task that finds question answers from a large collection of documents. Nowadays open domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one (1) is usually done through bag-of-words models, which count overlapping words and their frequencies in documents. Each document is represented by a high-dimensional, sparse vector. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear, and then in stage two, a reader would read the subset and locate the answer spans. Stage two is usually done through neural models, like Bert. While stage two benefits a lot from the recent advancement of neural language models, stage one still relies on traditional term-based models. This paper tries to improve stage one by using dense retrieval methods that generate dense, latent semantic document embedding, and demonstrates that dense retrieval methods can not only outperform BM25, but also improve the end-to-end QA accuracies. <br />
<br />
= 2. Background =<br />
The following example clearly shows what problems open domain QA systems tackle. Given a question: "What is Uranus?", a system should find the answer spans from a large corpus. The corpus size can be billions of documents. In stage one, a retriever would select a small set of potentially relevant documents, which then would be fed to a neural reader in stage two for the answer spans extraction. Only a filtered subset of documents is processed by a neural reader since neural reading comprehension is expensive. It's impractical to process billions of documents using a neural reader.<br />
<br />
= 3. Dense Passage Retriever =<br />
This paper focuses on improving the retrieval component and proposed a framework called Dense Passage Retriever (DPR) which aims to efficiently retrieve the top K most relevant passages from a large passage collection. The key component of DPR is a dual-BERT model which encodes queries and passages in a vector space where relevant pairs of queries and passages are closer than irrelevant ones. <br />
<br />
== 3.1 Model Architecture Overview ==<br />
DPR has two independent BERT encoders: a query encoder Eq, and a passage encoder Ep. They map each input sentence to a d dimensional real-valued vector, and the similarity between a query and a passage is defined as the dot product of their vectors. DPR uses the [CLS] token output as embedding vectors, so d = 768.<br />
<br />
== 3.2 Training ==<br />
The training data can be viewed as m instances of (query, positive passage, negative passages) pairs. The loss function is defined as the negative log likelihood of the positive passages. [[File: dpr_loss_fn.png | 400px]]<br />
<br />
While positive passage selection is simple, where the passage contains the answer is selected, negative passage selection is less explicit. The authors experimented with three types of negative passages: (1) Random passage from the corpus; (2) false positive passages returned by BM25; (3) Gold positive passages from the training set — i.e., a positive passage for one query is considered as a negative passage for another query. The authors got the best model by using gold positive passages from the same batch as negatives. This trick is called in-batch negatives. Assume there are B pairs of (query q_i, positive passage p_i) in a mini-batch, then the negative passages for query q_i are the passages p_j when j is not equal to i. <br />
<br />
= 4. Experimental Setup =<br />
The authors pre-processed Wikipedia documents and split each document into passages of length 100 words. These passages form a candidate pool. Five QA datasets are used: Natural Questions (NQ), TriviaQA, WebQuestions (WQ), CuratedTREC (TREC), and SQuAD v1.1. The authors match each question in the five datasets with a passage that contains the correct answer. <br />
<br />
= 5. Retrieval Performance Evaluation =<br />
The authors trained DPR on five datasets separately, and on the combined dataset. They compared DPR performance with the performance of the term-frequency based model BM25, and BM25+DPR. The table below shows the retrieval accuracy on test sets. DPR generally performs better on datasets except SQuAD.<br />
<br />
[[ File: retrieval_accuracy.png | 800px]]<br />
<br />
== 5.1 Main Results ==<br />
<br />
== 5.2 Ablation Study on Model Training ==<br />
<br />
== 5.3 Qualitative Analysis ==<br />
<br />
== 5.4 Run-time Efficiency ==<br />
<br />
= 6. Experiments: Question Answering =<br />
<br />
= 7. Related Work =<br />
<br />
= 8. Conclusion =<br />
<br />
= Critiques =<br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=44267Dense Passage Retrieval for Open-Domain Question Answering2020-11-15T00:43:53Z<p>X46yan: </p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= 1. Introduction =<br />
Open domain question answering is a task that finds question answers from a large collection of documents. Nowadays open domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one (1) is usually done through bag-of-words models, which count overlapping words and their frequencies in documents. Each document is represented by a high-dimensional, sparse vector. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear, and then in stage two, a reader would read the subset and locate the answer spans. Stage two is usually done through neural models, like Bert. While stage two benefits a lot from the recent advancement of neural language models, stage one still relies on traditional term-based models. This paper tries to improve stage one by using dense retrieval methods that generate dense, latent semantic document embedding, and demonstrates that dense retrieval methods can not only outperform BM25, but also improve the end-to-end QA accuracies. <br />
<br />
= 2. Background =<br />
The following example clearly shows what problems open domain QA systems tackle. Given a question: "What is Uranus?", a system should find the answer spans from a large corpus. The corpus size can be billions of documents. In stage one, a retriever would select a small set of potentially relevant documents, which then would be fed to a neural reader in stage two for the answer spans extraction. Only a filtered subset of documents is processed by a neural reader since neural reading comprehension is expensive. It's impractical to process billions of documents using a neural reader.<br />
<br />
= 3. Dense Passage Retriever =<br />
This paper focuses on improving the retrieval component and proposed a framework called Dense Passage Retriever (DPR) which aims to efficiently retrieve the top K most relevant passages from a large passage collection. The key component of DPR is a dual-BERT model which encodes queries and passages in a vector space where relevant pairs of queries and passages are closer than irrelevant ones. <br />
<br />
== 3.1 Model Architecture Overview ==<br />
DPR has two independent BERT encoders: a query encoder Eq, and a passage encoder Ep. They map each input sentence to a d dimensional real-valued vector, and the similarity between a query and a passage is defined as the dot product of their vectors. DPR uses the [CLS] token output as embedding vectors, so d = 768.<br />
<br />
== 3.2 Training ==<br />
The training data can be viewed as m instances of (query, positive passage, negative passages) pairs. The loss function is defined as the negative log likelihood of the positive passages. [[File: dpr_loss_fn.png | 400px]]<br />
<br />
While positive passage selection is simple, where the passage contains the answer is selected, negative passage selection is less explicit. The authors experimented with three types of negative passages: (1) Random passage from the corpus; (2) false positive passages returned by BM25; (3) Gold positive passages from the training set — i.e., a positive passage for one query is considered as a negative passage for another query. The authors got the best model by using gold positive passages from the same batch as negatives. This trick is called in-batch negatives. Assume there are B pairs of (query q_i, positive passage p_i) in a mini-batch, then the negative passages for query q_i are the passages p_j when j is not equal to i. <br />
<br />
= 4. Experimental Setup =<br />
The authors pre-processed Wikipedia documents and split each document into passages of length 100 words. These passages form a candidate pool. Five QA datasets are used: Natural Questions (NQ), TriviaQA, WebQuestions (WQ), CuratedTREC (TREC), and SQuAD v1.1. The authors match each question in the five datasets with a passage that contains the correct answer. <br />
<br />
= 5. Retrieval Performance Evaluation =<br />
The authors trained DPR on five datasets separately, and on the combined dataset. They compared DPR performance with the performance of the term-frequency based model BM25, and BM25+DPR. The table below shows the retrieval accuracy on test sets. DPR generally performs better on datasets except SQuAD.<br />
<br />
[[ File: retrieval_accuracy.png | 500px]]<br />
<br />
== 5.1 Main Results ==<br />
<br />
== 5.2 Ablation Study on Model Training ==<br />
<br />
== 5.3 Qualitative Analysis ==<br />
<br />
== 5.4 Run-time Efficiency ==<br />
<br />
= 6. Experiments: Question Answering =<br />
<br />
= 7. Related Work =<br />
<br />
= 8. Conclusion =<br />
<br />
= Critiques =<br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=44265Dense Passage Retrieval for Open-Domain Question Answering2020-11-15T00:43:34Z<p>X46yan: </p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= 1. Introduction =<br />
Open domain question answering is a task that finds question answers from a large collection of documents. Nowadays open domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one (1) is usually done through bag-of-words models, which count overlapping words and their frequencies in documents. Each document is represented by a high-dimensional, sparse vector. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear, and then in stage two, a reader would read the subset and locate the answer spans. Stage two is usually done through neural models, like Bert. While stage two benefits a lot from the recent advancement of neural language models, stage one still relies on traditional term-based models. This paper tries to improve stage one by using dense retrieval methods that generate dense, latent semantic document embedding, and demonstrates that dense retrieval methods can not only outperform BM25, but also improve the end-to-end QA accuracies. <br />
<br />
= 2. Background =<br />
The following example clearly shows what problems open domain QA systems tackle. Given a question: "What is Uranus?", a system should find the answer spans from a large corpus. The corpus size can be billions of documents. In stage one, a retriever would select a small set of potentially relevant documents, which then would be fed to a neural reader in stage two for the answer spans extraction. Only a filtered subset of documents is processed by a neural reader since neural reading comprehension is expensive. It's impractical to process billions of documents using a neural reader.<br />
<br />
= 3. Dense Passage Retriever =<br />
This paper focuses on improving the retrieval component and proposed a framework called Dense Passage Retriever (DPR) which aims to efficiently retrieve the top K most relevant passages from a large passage collection. The key component of DPR is a dual-BERT model which encodes queries and passages in a vector space where relevant pairs of queries and passages are closer than irrelevant ones. <br />
<br />
== 3.1 Model Architecture Overview ==<br />
DPR has two independent BERT encoders: a query encoder Eq, and a passage encoder Ep. They map each input sentence to a d dimensional real-valued vector, and the similarity between a query and a passage is defined as the dot product of their vectors. DPR uses the [CLS] token output as embedding vectors, so d = 768.<br />
<br />
== 3.2 Training ==<br />
The training data can be viewed as m instances of (query, positive passage, negative passages) pairs. The loss function is defined as the negative log likelihood of the positive passages. [[File: dpr_loss_fn.png | 400px]]<br />
<br />
While positive passage selection is simple, where the passage contains the answer is selected, negative passage selection is less explicit. The authors experimented with three types of negative passages: (1) Random passage from the corpus; (2) false positive passages returned by BM25; (3) Gold positive passages from the training set — i.e., a positive passage for one query is considered as a negative passage for another query. The authors got the best model by using gold positive passages from the same batch as negatives. This trick is called in-batch negatives. Assume there are B pairs of (query q_i, positive passage p_i) in a mini-batch, then the negative passages for query q_i are the passages p_j when j is not equal to i. <br />
<br />
= 4. Experimental Setup =<br />
The authors pre-processed Wikipedia documents and split each document into passages of length 100 words. These passages form a candidate pool. Five QA datasets are used: Natural Questions (NQ), TriviaQA, WebQuestions (WQ), CuratedTREC (TREC), and SQuAD v1.1. The authors match each question in the five datasets with a passage that contains the correct answer. <br />
<br />
= 5. Retrieval Performance Evaluation =<br />
The authors trained DPR on five datasets separately, and on the combined dataset. They compared DPR performance with the performance of the term-frequency based model BM25, and BM25+DPR. The table below shows the retrieval accuracy on test sets. DPR generally performs better on datasets except SQuAD.<br />
<br />
[[ File: retrieval_accuracy.png ]]<br />
<br />
== 5.1 Main Results ==<br />
<br />
== 5.2 Ablation Study on Model Training ==<br />
<br />
== 5.3 Qualitative Analysis ==<br />
<br />
== 5.4 Run-time Efficiency ==<br />
<br />
= 6. Experiments: Question Answering =<br />
<br />
= 7. Related Work =<br />
<br />
= 8. Conclusion =<br />
<br />
= Critiques =<br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:retrieval_accuracy.png&diff=44263File:retrieval accuracy.png2020-11-15T00:41:49Z<p>X46yan: </p>
<hr />
<div></div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=44219Dense Passage Retrieval for Open-Domain Question Answering2020-11-14T22:50:03Z<p>X46yan: </p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= 1. Introduction =<br />
Open domain question answering is a task that finds question answers from a large collection of documents. Nowadays open domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one (1) is usually done through bag-of-words models, which count overlapping words and their frequencies in documents. Each document is represented by a high-dimensional, sparse vector. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear, and then in stage two, a reader would read the subset and locate the answer spans. Stage two is usually done through neural models, like Bert. While stage two benefits a lot from the recent advancement of neural language models, stage one still relies on traditional term-based models. This paper tries to improve stage one by using dense retrieval methods that generate dense, latent semantic document embedding, and demonstrates that dense retrieval methods can not only outperform BM25, but also improve the end-to-end QA accuracies. <br />
<br />
= 2. Background =<br />
The following example clearly shows what problems open domain QA systems tackle. Given a question: "What is Uranus?", a system should find the answer spans from a large corpus. The corpus size can be billions of documents. In stage one, a retriever would select a small set of potentially relevant documents, which then would be fed to a neural reader in stage two for the answer spans extraction. Only a filtered subset of documents is processed by a neural reader since neural reading comprehension is expensive. It's impractical to process billions of documents using a neural reader.<br />
<br />
= 3. Dense Passage Retriever =<br />
This paper focuses on improving the retrieval component and proposed a framework called Dense Passage Retriever (DPR) which aims to efficiently retrieve the top K most relevant passages from a large passage collection. The key component of DPR is a dual-BERT model which encodes queries and passages in a vector space where relevant pairs of queries and passages are closer than irrelevant ones. <br />
<br />
== 3.1 Model Architecture Overview ==<br />
DPR has two independent BERT encoders: a query encoder Eq, and a passage encoder Ep. They map each input sentence to a d dimensional real-valued vector, and the similarity between a query and a passage is defined as the dot product of their vectors. DPR uses the [CLS] token output as embedding vectors, so d = 768.<br />
<br />
== 3.2 Training ==<br />
The training data can be viewed as m instances of (query, positive passage, negative passages) pairs. The loss function is defined as the negative log likelihood of the positive passages. [[File: dpr_loss_fn.png | 400px]]<br />
<br />
= 4. Experimental Setup =<br />
<br />
= 5. Retrieval Performance Evaluation =<br />
<br />
== 5.1 Main Results ==<br />
<br />
== 5.2 Ablation Study on Model Training ==<br />
<br />
== 5.3 Qualitative Analysis ==<br />
<br />
== 5.4 Run-time Efficiency ==<br />
<br />
= 6. Experiments: Question Answering =<br />
<br />
= 7. Related Work =<br />
<br />
= 8. Conclusion =<br />
<br />
= Critiques =<br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=44218Dense Passage Retrieval for Open-Domain Question Answering2020-11-14T22:49:42Z<p>X46yan: </p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= 1. Introduction =<br />
Open domain question answering is a task that finds question answers from a large collection of documents. Nowadays open domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one (1) is usually done through bag-of-words models, which count overlapping words and their frequencies in documents. Each document is represented by a high-dimensional, sparse vector. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear, and then in stage two, a reader would read the subset and locate the answer spans. Stage two is usually done through neural models, like Bert. While stage two benefits a lot from the recent advancement of neural language models, stage one still relies on traditional term-based models. This paper tries to improve stage one by using dense retrieval methods that generate dense, latent semantic document embedding, and demonstrates that dense retrieval methods can not only outperform BM25, but also improve the end-to-end QA accuracies. <br />
<br />
= 2. Background =<br />
The following example clearly shows what problems open domain QA systems tackle. Given a question: "What is Uranus?", a system should find the answer spans from a large corpus. The corpus size can be billions of documents. In stage one, a retriever would select a small set of potentially relevant documents, which then would be fed to a neural reader in stage two for the answer spans extraction. Only a filtered subset of documents is processed by a neural reader since neural reading comprehension is expensive. It's impractical to process billions of documents using a neural reader.<br />
<br />
= 3. Dense Passage Retriever =<br />
This paper focuses on improving the retrieval component and proposed a framework called Dense Passage Retriever (DPR) which aims to efficiently retrieve the top K most relevant passages from a large passage collection. The key component of DPR is a dual-BERT model which encodes queries and passages in a vector space where relevant pairs of queries and passages are closer than irrelevant ones. <br />
<br />
== 3.1 Model Architecture Overview ==<br />
DPR has two independent BERT encoders: a query encoder Eq, and a passage encoder Ep. They map each input sentence to a d dimensional real-valued vector, and the similarity between a query and a passage is defined as the dot product of their vectors. DPR uses the [CLS] token output as embedding vectors, so d = 768.<br />
<br />
== 3.2 Training ==<br />
The training data can be viewed as m instances of (query, positive passage, negative passages) pairs. The loss function is defined as the negative log likelihood of the positive passages. [[File: dpr_loss_fn.png | 200px]]<br />
<br />
= 4. Experimental Setup =<br />
<br />
= 5. Retrieval Performance Evaluation =<br />
<br />
== 5.1 Main Results ==<br />
<br />
== 5.2 Ablation Study on Model Training ==<br />
<br />
== 5.3 Qualitative Analysis ==<br />
<br />
== 5.4 Run-time Efficiency ==<br />
<br />
= 6. Experiments: Question Answering =<br />
<br />
= 7. Related Work =<br />
<br />
= 8. Conclusion =<br />
<br />
= Critiques =<br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:dpr_loss_fn.png&diff=44215File:dpr loss fn.png2020-11-14T22:47:37Z<p>X46yan: </p>
<hr />
<div></div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=44194Dense Passage Retrieval for Open-Domain Question Answering2020-11-14T22:02:08Z<p>X46yan: </p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= 1. Introduction =<br />
Open domain question answering is a task that finds question answers from a large collection of documents. Nowadays open domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one (1) is usually done through bag-of-words models, which count overlapping words and their frequencies in documents. Each document is represented by a high-dimensional, sparse vector. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear, and then in stage two, a reader would read the subset and locate the answer spans. Stage two is usually done through neural models, like Bert. While stage two benefits a lot from the recent advancement of language models, stage one still relies on traditional term-based models. This paper tries to improve stage one by using dense retrieval methods that generate dense, latent semantic document embedding, and demonstrates that dense retrieval methods can not only outperform BM25, but also improve the end-to-end QA accuracies. <br />
<br />
= 2. Background =<br />
The following example clearly shows what problems open domain QA systems tackle. Given a question: "What is Uranus?", a system should find the answer spans from a large corpus. The corpus size can be billions of documents. In stage one, a retriever would select a small set of potentially relevant documents, which then would be fed to a neural reader in stage two for the answer spans extraction. Only a filtered subset of documents is processed by a neural reader since neural reading comprehension is expensive. It's impractical to process billions of documents using a neural reader.<br />
<br />
= 3. Dense Passage Retriever =<br />
<br />
== 3.1 Model Architecture Overview ==<br />
<br />
== 3.2 Training ==<br />
<br />
= 4. Experimental Setup =<br />
<br />
= 5. Retrieval Performance Evaluation =<br />
<br />
== 5.1 Main Results ==<br />
<br />
== 5.2 Ablation Study on Model Training ==<br />
<br />
== 5.3 Qualitative Analysis ==<br />
<br />
== 5.4 Run-time Efficiency ==<br />
<br />
= 6. Experiments: Question Answering =<br />
<br />
= 7. Related Work =<br />
<br />
= 8. Conclusion =<br />
<br />
= Critiques =<br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=44116Dense Passage Retrieval for Open-Domain Question Answering2020-11-14T17:46:31Z<p>X46yan: </p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= 1. Introduction =<br />
Open domain question answering is a task that finds question answers from a large collection of documents. Nowadays open domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one (1) is usually done through bag-of-words models, which count overlapping words and their frequencies in documents. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear, and then in stage two, a reader would read the subset and locate the answer spans. Stage two is usually done through neural models, like Bert. While stage two benefits a lot from the recent advancement of language models, stage one still relies on traditional term-based models. This paper tries to improve stage one by using dense retrieval methods, and demonstrates that dense retrieval methods can not only outperform BM25, but also improve the end-to-end QA accuracies. <br />
<br />
= 2. Background =<br />
<br />
<br />
= 3. Dense Passage Retriever =<br />
<br />
== 3.1 Model Architecture Overview ==<br />
<br />
== 3.2 Training ==<br />
<br />
= 4. Experimental Setup =<br />
<br />
= 5. Retrieval Performance Evaluation =<br />
<br />
== 5.1 Main Results ==<br />
<br />
== 5.2 Ablation Study on Model Training ==<br />
<br />
== 5.3 Qualitative Analysis ==<br />
<br />
== 5.4 Run-time Efficiency ==<br />
<br />
= 6. Experiments: Question Answering =<br />
<br />
= 7. Related Work =<br />
<br />
= 8. Conclusion =<br />
<br />
= Critiques =<br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=43983Dense Passage Retrieval for Open-Domain Question Answering2020-11-14T04:25:46Z<p>X46yan: </p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= 1. Introduction =<br />
<br />
<br />
= 2. Background =<br />
<br />
<br />
= 3. Dense Passage Retriever =<br />
<br />
== 3.1 Model Architecture Overview ==<br />
<br />
== 3.2 Training ==<br />
<br />
= 4. Experimental Setup =<br />
<br />
= 5. Retrieval Performance Evaluation =<br />
<br />
== 5.1 Main Results ==<br />
<br />
== 5.2 Ablation Study on Model Training ==<br />
<br />
== 5.3 Qualitative Analysis ==<br />
<br />
== 5.4 Run-time Efficiency ==<br />
<br />
= 6. Experiments: Question Answering =<br />
<br />
= 7. Related Work =<br />
<br />
= 8. Conclusion =<br />
<br />
= Critiques =<br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=43982Dense Passage Retrieval for Open-Domain Question Answering2020-11-14T04:21:33Z<p>X46yan: </p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= Introduction =<br />
<br />
<br />
= Background =<br />
<br />
<br />
= Dense Passage Retriever =<br />
<br />
== Model Architecture Overview ==<br />
<br />
== Training ==<br />
<br />
= Datasets =<br />
<br />
= Retrieval Performance Evaluation =<br />
<br />
== Main Results ==<br />
<br />
== Ablation Study on Model Training ==<br />
<br />
= Experiments: Question Answering =<br />
<br />
= Conclusion =<br />
<br />
= Critiques =<br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=43981Dense Passage Retrieval for Open-Domain Question Answering2020-11-14T04:16:41Z<p>X46yan: </p>
<hr />
<div>== Presented by == <br />
Nicole Yan<br />
<br />
== Introduction == <br />
<br />
<br />
== Background == <br />
<br />
<br />
== Dense Passage Retriever == <br />
<br />
<br />
= Model Architecture Overview =<br />
<br />
= Training =<br />
<br />
== Experimental Setup ==<br />
<br />
<br />
<br />
<br />
== Conclusion ==<br />
<br />
== Critiques ==<br />
<br />
== References ==<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=43980Dense Passage Retrieval for Open-Domain Question Answering2020-11-14T04:09:39Z<p>X46yan: Created page with "== Presented by == Nicole Yan == Introduction == == Previous Work == == Motivation == == Model Architecture == == ILSVRC 2014 Challenge Results == == Conclusio..."</p>
<hr />
<div>== Presented by == <br />
Nicole Yan<br />
<br />
== Introduction == <br />
<br />
<br />
== Previous Work == <br />
<br />
<br />
== Motivation == <br />
<br />
<br />
== Model Architecture == <br />
<br />
<br />
== ILSVRC 2014 Challenge Results ==<br />
<br />
<br />
== Conclusion ==<br />
<br />
== Critiques ==<br />
<br />
== References ==<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=43979stat940F212020-11-14T04:06:35Z<p>X46yan: </p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] <br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Procession method to Improve Robustness And Uncertainity || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] || [[https://youtu.be/epBzlXHFNlY Presentation ]]<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||<br />
|-<br />
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] <br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || [https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration Summary] ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning_For_Domain_Generalization Summary]|| [https://youtu.be/b9MU5cc3-m0 Presentation Video]<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIRCOMPARISON OFGRAPHNEURALNETWORKSFORGRAPHCLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification Summary] || [https://drive.google.com/file/d/1Dx6mFL_zBAJcfPQdOWAuPn0_HkvTL_0z/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| BREAKING CERTIFIED DEFENSES: SEMANTIC ADVERSARIAL EXAMPLES WITH SPOOFED ROBUSTNESS CERTIFICATES || [https://openreview.net/pdf?id=HJxdTxHYvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates Summary] || [[https://drive.google.com/file/d/1amkWrR8ZQKnnInjedRZ7jbXTqCA8Hy1r/view?usp=sharing Presentation ]] ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Genralization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION Summary] ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Dense Passage Retrieval for Open-Domain Question Answering || [https://arxiv.org/abs/2004.04906 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering Summary] ||<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| AdaCompress: Adaptive Compression for Online Computer Vision Services || [https://arxiv.org/pdf/1909.08148.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || |-<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || https://arxiv.org/pdf/1904.04232.pdf || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=43978stat940F212020-11-14T03:34:15Z<p>X46yan: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] <br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Procession method to Improve Robustness And Uncertainity || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] || [[https://youtu.be/epBzlXHFNlY Presentation ]]<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||<br />
|-<br />
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] <br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || [https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration Summary] ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning_For_Domain_Generalization Summary]|| [https://youtu.be/b9MU5cc3-m0 Presentation Video]<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIRCOMPARISON OFGRAPHNEURALNETWORKSFORGRAPHCLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification Summary] || [https://drive.google.com/file/d/1Dx6mFL_zBAJcfPQdOWAuPn0_HkvTL_0z/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| BREAKING CERTIFIED DEFENSES: SEMANTIC ADVERSARIAL EXAMPLES WITH SPOOFED ROBUSTNESS CERTIFICATES || [https://openreview.net/pdf?id=HJxdTxHYvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates Summary] || [[https://drive.google.com/file/d/1amkWrR8ZQKnnInjedRZ7jbXTqCA8Hy1r/view?usp=sharing Presentation ]] ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Genralization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION Summary] ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Dense Passage Retrieval for Open-Domain Question Answering || [https://arxiv.org/abs/2004.04906 Paper] || ||<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| AdaCompress: Adaptive Compression for Online Computer Vision Services || [https://arxiv.org/pdf/1909.08148.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || |-<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || https://arxiv.org/pdf/1904.04232.pdf || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=43977stat940F212020-11-14T03:25:57Z<p>X46yan: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] <br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Procession method to Improve Robustness And Uncertainity || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] || [[https://youtu.be/epBzlXHFNlY Presentation ]]<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||<br />
|-<br />
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] <br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || [https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration Summary] ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning_For_Domain_Generalization Summary]|| [https://youtu.be/b9MU5cc3-m0 Presentation Video]<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIRCOMPARISON OFGRAPHNEURALNETWORKSFORGRAPHCLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification Summary] || [https://drive.google.com/file/d/1Dx6mFL_zBAJcfPQdOWAuPn0_HkvTL_0z/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| BREAKING CERTIFIED DEFENSES: SEMANTIC ADVERSARIAL EXAMPLES WITH SPOOFED ROBUSTNESS CERTIFICATES || [https://openreview.net/pdf?id=HJxdTxHYvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates Summary] || [[https://drive.google.com/file/d/1amkWrR8ZQKnnInjedRZ7jbXTqCA8Hy1r/view?usp=sharing Presentation ]] ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Genralization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION Summary] ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| RepBERT: Contextualized Text Embeddings for First-Stage Retrieval || [https://arxiv.org/abs/2006.15498 Paper] || ||<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| AdaCompress: Adaptive Compression for Online Computer Vision Services || [https://arxiv.org/pdf/1909.08148.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || |-<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || https://arxiv.org/pdf/1904.04232.pdf || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification&diff=43832a fair comparison of graph neural networks for graph classification2020-11-12T17:23:32Z<p>X46yan: </p>
<hr />
<div>== Presented By ==<br />
Jaskirat Singh Bhatia<br />
<br />
==Background==<br />
Experimental reproducibility and replicability are critical topics in machine learning.<br />
Authors have often raised concerns about their lack in scientific publications<br />
to improve the quality of the field. Recently, the graph representation learning<br />
field has attracted the attention of a wide research community, which resulted in<br />
a large stream of works. As such, several Graph Neural Network models have<br />
been developed to effectively tackle graph classification. However, experimental<br />
procedures often lack rigorousness and are hardly reproducible. The authors tried to reproduce <br />
the results from such experiments to tackle the problem of ambiguity in experimental procedures <br />
and the impossibility of reproducing results. They also Standardized the experimental environment <br />
so that the results could be reproduced while using this environment.<br />
<br />
==Graph Neural Networks==<br />
A graph is a data structure consisting of nodes and edges. Graph neural networks are models which take graph-structured data as input, and capture information of the input graph, such as relation and interaction between nodes. In graph neural networks, nodes aggregate information from their neighbors. The key idea is to generate representations of nodes depending on the graph structure. <br />
<br />
Graph neural networks can perform various tasks and have been used in many applications. Some simple and typical tasks include classifying the input graph or finding a missing edge/ node in the graph. One example of real applications where GNNs are used is social network prediction and recommendation, where the input data is naturally structural. <br />
<br />
==Problems in Papers==<br />
Some of the most common reproducibility problems encountered in this field concern hyperparameters<br />
selection and the correct usage of data splits for model selection versus model assessment.<br />
Moreover, the evaluation code is sometimes missing or incomplete, and experiments are not<br />
standardized across different works in terms of node and edge features.<br />
<br />
These issues easily generate doubts and confusion among practitioners that need a fully transparent<br />
and reproducible experimental setting. As a matter of fact, the evaluation of a model goes through<br />
two different phases, namely model selection on the validation set and model assessment on the<br />
test set. Clearly, to fail in keeping these phases well separated could lead to over-optimistic and<br />
biased estimates of the true performance of a model, making it hard for other researchers to present<br />
competitive results without following the same ambiguous evaluation procedures.<br />
<br />
==Risk Assessment and Model Selection==<br />
'''Risk Assesment<br />
<br />
The goal of risk assessment is to provide an estimate of the performance of a class of models.<br />
When a test set is not explicitly given, a common way to proceed is to use k-fold Cross Validation.<br />
As model selection is performed independently for<br />
each training/test split, they obtain different “best” hyper-parameter configurations; this is why they<br />
refer to the performance of a class of models.<br />
<br />
'''Model Selection<br />
<br />
The goal of model selection, or hyper-parameter tuning, is to choose among a set of candidate hyperparameter<br />
configurations the one that works best on a specific validation set.<br />
<br />
==Reproducibility Issues==<br />
===The GNN's were selected based on the following criteria===<br />
<br />
1. Performances obtained with 10-fold CV<br />
<br />
2. Peer reviews<br />
<br />
3. Strong architectural differences<br />
<br />
4. Popularity<br />
<br />
===Criteria to assess the quality of evaluation and reproducibility was as follows===<br />
<br />
1. Code for data pre-processing<br />
<br />
2. Code for model selection<br />
<br />
3. Data splits are provided<br />
<br />
4. Data is split by means of a stratification technique<br />
<br />
5. Results of the 10-fold CV are reported correctly using standard deviations<br />
<br />
Using the following criteria, 4 different papers were selected and their assessment on the quality of evaluation and reproducibility is as follows:<br />
<br />
[[File:table_3.png|700px|Image: 700 pixels]]<br />
<br />
==Experiments==<br />
They re-evaluate the above-mentioned models<br />
on 9 datasets (4 chemical, 5 social), using a model selection and assessment framework that closely<br />
follows the rigorous practices as described earlier.<br />
In addition, they implemented two baselines<br />
whose purpose is to understand the extent to which GNNs are able to exploit structural information.<br />
<br />
===Datasets===<br />
<br />
All graph datasets used are publicly available (Kersting et al., 2016) and represent a relevant<br />
subset of those most frequently used in literature to compare GNNs.<br />
<br />
===Features===<br />
<br />
In GNN literature, it is common practice to augment node descriptors with structural<br />
features. In general, good experimental practices suggest that all models should be consistently compared to<br />
the same input representations. This is why they re-evaluate all models using the same node features.<br />
In particular, they use one common setting for the chemical domain and two alternative settings<br />
as regards the social domain.<br />
<br />
===Baseline Model===<br />
<br />
They adopted two distinct baselines, one for chemical and one for social datasets. On all<br />
chemical datasets but for ENZYMES, they follow Ralaivola et al. (2005); Luzhnica et al. (2019)<br />
and implement the Molecular Fingerprint technique. On social domains<br />
and ENZYMES (due to the presence of additional features), they take inspiration from the work of<br />
Zaheer et al. (2017) to learn permutation-invariant functions over sets of nodes.<br />
<br />
===Experimental Setting===<br />
<br />
1. Used a 10-fold CV for model assessment<br />
and an inner holdout technique with a 90%/10% training/validation split for model selection.<br />
<br />
2. After each model selection, they train three times on the whole training fold, holding out a random fraction<br />
(10%) of the data to perform early stopping.<br />
<br />
3. The final test fold score is<br />
obtained as the mean of these three runs<br />
<br />
4. To be consistent with literature, they implemented early stopping with patience parameter<br />
n, where training stops if n epochs have passed without improvement on the validation set.<br />
<br />
[[File:image_1.png|900px|Image: 900 pixels]]<br />
<br />
===Hyper-Parameters===<br />
<br />
1. Hyper-parameter tuning was performed via grid search.<br />
<br />
2. They always included the hyper-parameters used by<br />
other authors in their respective papers.<br />
<br />
===Computational Considerations===<br />
<br />
As their research included a large number of training-testing cycles, they had to limit some of the models by:<br />
<br />
1. For all models, grid sizes ranged from 32 to 72 possible configurations, depending on the number of<br />
hyper-parameters to choose from.<br />
<br />
2. Limited the time to complete a single training to 72 hours.<br />
<br />
[[File:table_1.png|900px|Image: 900 pixels]]<br />
[[File:table_2.png|900px|Image: 900 pixels]]<br />
<br />
==Conclusion==<br />
<br />
1. Highlighted ambiguities in the experimental settings of different papers<br />
<br />
2. Proposed a clear and reproducible procedure for future comparisons<br />
<br />
3. Provided a complete re-evaluation of four GNNs<br />
<br />
4. Found out that structure-agnostic baselines outperform GNNs on some chemical datasets, thus suggesting that structural properties have not been exploited yet.<br />
<br />
==References==<br />
<br />
- Davide Bacciu, Federico Errica, and Alessio Micheli. Contextual graph Markov model: A deep<br />
and generative approach to graph processing. In Proceedings of the International Conference<br />
on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pp.<br />
294–303. PMLR, 2018.<br />
<br />
- Paul D Dobson and Andrew J Doig. Distinguishing enzyme structures from non-enzymes without<br />
alignments. Journal of molecular biology, 330(4):771–783, 2003.<br />
<br />
- Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In<br />
Advances in Neural Information Processing Systems (NIPS), pp. 1024–1034. Curran Associates,<br />
Inc., 2017.<br />
<br />
- Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark<br />
data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.<br />
de.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification&diff=43831a fair comparison of graph neural networks for graph classification2020-11-12T17:21:59Z<p>X46yan: </p>
<hr />
<div>== Presented By ==<br />
Jaskirat Singh Bhatia<br />
<br />
==Background==<br />
Experimental reproducibility and replicability are critical topics in machine learning.<br />
Authors have often raised concerns about their lack in scientific publications<br />
to improve the quality of the field. Recently, the graph representation learning<br />
field has attracted the attention of a wide research community, which resulted in<br />
a large stream of works. As such, several Graph Neural Network models have<br />
been developed to effectively tackle graph classification. However, experimental<br />
procedures often lack rigorousness and are hardly reproducible. The authors tried to reproduce <br />
the results from such experiments to tackle the problem of ambiguity in experimental procedures <br />
and the impossibility of reproducing results. They also Standardized the experimental environment <br />
so that the results could be reproduced while using this environment.<br />
<br />
==Graph Neural Networks==<br />
A graph is a data structure consisting of nodes and edges. Graph neural networks are models which take graph-structured data as input, and capture information of the input graphs, such as relation and interaction between nodes. In graph neural networks, nodes aggregate information from their neighbors. The key idea is to generate representations of nodes depending on the graph structure. <br />
<br />
Graph neural networks can perform various tasks and have been used in many applications. Some simple and typical tasks include classifying the input graph or finding a missing edge/ node in the graph. One example of real applications where GNNs are used is social network prediction and recommendation, where the input data is naturally structural. <br />
<br />
==Problems in Papers==<br />
Some of the most common reproducibility problems encountered in this field concern hyperparameters<br />
selection and the correct usage of data splits for model selection versus model assessment.<br />
Moreover, the evaluation code is sometimes missing or incomplete, and experiments are not<br />
standardized across different works in terms of node and edge features.<br />
<br />
These issues easily generate doubts and confusion among practitioners that need a fully transparent<br />
and reproducible experimental setting. As a matter of fact, the evaluation of a model goes through<br />
two different phases, namely model selection on the validation set and model assessment on the<br />
test set. Clearly, to fail in keeping these phases well separated could lead to over-optimistic and<br />
biased estimates of the true performance of a model, making it hard for other researchers to present<br />
competitive results without following the same ambiguous evaluation procedures.<br />
<br />
==Risk Assessment and Model Selection==<br />
'''Risk Assesment<br />
<br />
The goal of risk assessment is to provide an estimate of the performance of a class of models.<br />
When a test set is not explicitly given, a common way to proceed is to use k-fold Cross Validation.<br />
As model selection is performed independently for<br />
each training/test split, they obtain different “best” hyper-parameter configurations; this is why they<br />
refer to the performance of a class of models.<br />
<br />
'''Model Selection<br />
<br />
The goal of model selection, or hyper-parameter tuning, is to choose among a set of candidate hyperparameter<br />
configurations the one that works best on a specific validation set.<br />
<br />
==Reproducibility Issues==<br />
===The GNN's were selected based on the following criteria===<br />
<br />
1. Performances obtained with 10-fold CV<br />
<br />
2. Peer reviews<br />
<br />
3. Strong architectural differences<br />
<br />
4. Popularity<br />
<br />
===Criteria to assess the quality of evaluation and reproducibility was as follows===<br />
<br />
1. Code for data pre-processing<br />
<br />
2. Code for model selection<br />
<br />
3. Data splits are provided<br />
<br />
4. Data is split by means of a stratification technique<br />
<br />
5. Results of the 10-fold CV are reported correctly using standard deviations<br />
<br />
Using the following criteria, 4 different papers were selected and their assessment on the quality of evaluation and reproducibility is as follows:<br />
<br />
[[File:table_3.png|700px|Image: 700 pixels]]<br />
<br />
==Experiments==<br />
They re-evaluate the above-mentioned models<br />
on 9 datasets (4 chemical, 5 social), using a model selection and assessment framework that closely<br />
follows the rigorous practices as described earlier.<br />
In addition, they implemented two baselines<br />
whose purpose is to understand the extent to which GNNs are able to exploit structural information.<br />
<br />
===Datasets===<br />
<br />
All graph datasets used are publicly available (Kersting et al., 2016) and represent a relevant<br />
subset of those most frequently used in literature to compare GNNs.<br />
<br />
===Features===<br />
<br />
In GNN literature, it is common practice to augment node descriptors with structural<br />
features. In general, good experimental practices suggest that all models should be consistently compared to<br />
the same input representations. This is why they re-evaluate all models using the same node features.<br />
In particular, they use one common setting for the chemical domain and two alternative settings<br />
as regards the social domain.<br />
<br />
===Baseline Model===<br />
<br />
They adopted two distinct baselines, one for chemical and one for social datasets. On all<br />
chemical datasets but for ENZYMES, they follow Ralaivola et al. (2005); Luzhnica et al. (2019)<br />
and implement the Molecular Fingerprint technique. On social domains<br />
and ENZYMES (due to the presence of additional features), they take inspiration from the work of<br />
Zaheer et al. (2017) to learn permutation-invariant functions over sets of nodes.<br />
<br />
===Experimental Setting===<br />
<br />
1. Used a 10-fold CV for model assessment<br />
and an inner holdout technique with a 90%/10% training/validation split for model selection.<br />
<br />
2. After each model selection, they train three times on the whole training fold, holding out a random fraction<br />
(10%) of the data to perform early stopping.<br />
<br />
3. The final test fold score is<br />
obtained as the mean of these three runs<br />
<br />
4. To be consistent with literature, they implemented early stopping with patience parameter<br />
n, where training stops if n epochs have passed without improvement on the validation set.<br />
<br />
[[File:image_1.png|900px|Image: 900 pixels]]<br />
<br />
===Hyper-Parameters===<br />
<br />
1. Hyper-parameter tuning was performed via grid search.<br />
<br />
2. They always included the hyper-parameters used by<br />
other authors in their respective papers.<br />
<br />
===Computational Considerations===<br />
<br />
As their research included a large number of training-testing cycles, they had to limit some of the models by:<br />
<br />
1. For all models, grid sizes ranged from 32 to 72 possible configurations, depending on the number of<br />
hyper-parameters to choose from.<br />
<br />
2. Limited the time to complete a single training to 72 hours.<br />
<br />
[[File:table_1.png|900px|Image: 900 pixels]]<br />
[[File:table_2.png|900px|Image: 900 pixels]]<br />
<br />
==Conclusion==<br />
<br />
1. Highlighted ambiguities in the experimental settings of different papers<br />
<br />
2. Proposed a clear and reproducible procedure for future comparisons<br />
<br />
3. Provided a complete re-evaluation of four GNNs<br />
<br />
4. Found out that structure-agnostic baselines outperform GNNs on some chemical datasets, thus suggesting that structural properties have not been exploited yet.<br />
<br />
==References==<br />
<br />
- Davide Bacciu, Federico Errica, and Alessio Micheli. Contextual graph Markov model: A deep<br />
and generative approach to graph processing. In Proceedings of the International Conference<br />
on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pp.<br />
294–303. PMLR, 2018.<br />
<br />
- Paul D Dobson and Andrew J Doig. Distinguishing enzyme structures from non-enzymes without<br />
alignments. Journal of molecular biology, 330(4):771–783, 2003.<br />
<br />
- Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In<br />
Advances in Neural Information Processing Systems (NIPS), pp. 1024–1034. Curran Associates,<br />
Inc., 2017.<br />
<br />
- Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark<br />
data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.<br />
de.</div>X46yan