BERTScore: Evaluating Text Generation with BERT


Presented by

Gursimran Singh

Introduction

In recent times, various machine learning approaches for text generation have gained popularity. The goal of this paper is to develop an automatic metric that judges the quality of generated text. Commonly used state-of-the-art metrics rely on either n-gram matching or word embeddings to calculate the similarity between the reference and the candidate sentence. BERTScore, in contrast, calculates this similarity using contextual embeddings. The authors carry out various experiments in machine translation and image captioning to show that BERTScore is more reliable and robust than previous approaches.
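
As a rough illustration of this idea, the following is a minimal sketch of BERTScore's core computation: embed both sentences with a pretrained BERT model, compute pairwise cosine similarities between tokens, and greedily match each token to its most similar counterpart. The full metric in the paper additionally uses idf importance weighting and baseline rescaling, which are omitted here; the model choice and helper names are illustrative.

<syntaxhighlight lang="python">
import torch
from transformers import AutoModel, AutoTokenizer

# Any BERT-style encoder works for this sketch.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence):
    """One contextual embedding per token, with [CLS]/[SEP] dropped."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    return hidden[1:-1]

def bertscore_f1(candidate, reference):
    c, r = embed(candidate), embed(reference)
    # Normalize so the dot product below is cosine similarity.
    c = c / c.norm(dim=-1, keepdim=True)
    r = r / r.norm(dim=-1, keepdim=True)
    sim = c @ r.T  # pairwise token similarities, shape (len_c, len_r)
    precision = sim.max(dim=1).values.mean()  # best reference match per candidate token
    recall = sim.max(dim=0).values.mean()     # best candidate match per reference token
    return (2 * precision * recall / (precision + recall)).item()

print(bertscore_f1("the weather is cold today", "it is freezing today"))
</syntaxhighlight>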

Previous Work

Previous approaches for evaluating text generation can be broadly divided into several categories. The most commonly used techniques are based on n-gram matching: they compare the n-grams in the reference and candidate sentences and thereby capture the ordering of words. The most popular n-gram matching metric is BLEU. It follows the underlying principle of n-gram matching, and its distinctiveness comes from three main factors, illustrated in the sketch after this list.
• Each n-gram is matched at most once.
• The total of exact matches is accumulated over all reference-candidate pairs.
• Very short candidates are penalized.
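
To make these factors concrete, here is a minimal single-reference, sentence-level BLEU sketch (no smoothing), assuming whitespace tokenization. The clipped counts implement the first two factors, and the brevity penalty implements the third.

<syntaxhighlight lang="python">
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Minimal sentence-level BLEU: clipped n-gram precision plus brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipping: each reference n-gram can be matched at most once.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty keeps very short candidates from scoring high.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat is on the mat", max_n=2))  # ~0.71
</syntaxhighlight>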

Other n-gram approaches include METEOR, NIST, ΔBLEU, etc.

Other categories include edit-distance-based metrics, embedding-based metrics, and learned metrics. Most of these techniques do not capture the context of a word in the sentence. Moreover, learned metrics require costly human judgments as supervision for each dataset. A minimal edit-distance example is sketched below.
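
For reference, a minimal word-level edit-distance (Levenshtein) sketch, assuming whitespace tokenization:

<syntaxhighlight lang="python">
def edit_distance(candidate, reference):
    """Minimum number of word insertions, deletions, and substitutions
    needed to turn the candidate into the reference."""
    c, r = candidate.split(), reference.split()
    # dp[i][j] = distance between the first i candidate words and first j reference words.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i in range(len(c) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(c) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if c[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(c)][len(r)]

print(edit_distance("the cat sat on the mat", "the cat is on the mat"))  # 1
</syntaxhighlight>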

Motivation

Model Architecture

Results

Conclusion

References