BERTScore: Evaluating Text Generation with BERT

Presented by

Gursimran Singh

Introduction

In recent times, various machine learning approaches for text generation have gained popularity. This paper develops an automatic metric that judges the quality of generated text. Commonly used metrics rely on either n-gram matching or word embeddings to calculate the similarity between the reference and the candidate sentence. BERTScore, on the other hand, calculates the similarity using contextual embeddings. The authors carried out experiments in machine translation and image captioning to show that BERTScore is more reliable and robust than previous approaches.

Previous Work

Previous approaches for evaluating text generation can be broadly divided into several categories. The most commonly used techniques are based on n-gram matching: they compare the n-grams in the reference and candidate sentences, and in doing so take some of the word ordering into account. The most popular n-gram matching metric is BLEU. It follows the underlying principle of n-gram matching and is distinguished by three main factors.
• Each n-gram is matched at most once.
• The total number of exact matches is accumulated over all reference-candidate pairs.
• Very short candidates are penalized.
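
The first and third points can be illustrated for a single reference-candidate pair with a minimal sketch. This is not the full BLEU metric: it leaves out the corpus-level accumulation and the combination of several n-gram orders, and the function names and example sentences are made up for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, where each
    reference n-gram can be matched at most as often as it occurs."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0

def brevity_penalty(candidate_len, reference_len):
    """Factor below 1 for candidates that are shorter than the reference."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

# Hypothetical example sentences, not taken from the paper.
reference = "the cat is on the mat".split()
candidate = "the the the".split()

p1 = clipped_precision(candidate, reference, 1)  # "the" is credited only twice
bp = brevity_penalty(len(candidate), len(reference))
print(p1, bp)
```

Without clipping, the degenerate candidate would reach a unigram precision of 1; with clipping it gets 2/3, and the brevity penalty lowers the score further because the candidate is only half as long as the reference.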

Other n-gram approaches include METEOR, NIST, and ΔBLEU.

Other categories include edit-distance-based metrics, embedding-based metrics, and learned metrics. Most of these techniques do not capture the context of a word in the sentence. Moreover, learned metrics require costly human judgments as supervision for each dataset.

Motivation

N-gram approaches such as BLEU do not capture the position and context of a word and rely on exact matching for evaluation. The following example shows how BLEU can lead to an incorrect judgment.
Reference: people like foreign cars
Candidate 1: people like visiting places abroad
Candidate 2: consumers prefer imported cars

BLEU gives a higher score to Candidate 1 than to Candidate 2, even though Candidate 2 is the closer paraphrase of the reference. This misrepresents the performance of text generation models: semantically correct sentences are penalized, whereas semantically different ones score higher merely because they are closer to the surface form of the reference sentence.
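
The effect can be reproduced with a small sketch that computes plain n-gram precision, the quantity BLEU builds on, for the two candidates. It is not a full BLEU implementation, and the helper names are chosen here for illustration.

```python
from collections import Counter

def ngram_counts(sentence, n):
    """Count the n-grams (as tuples of words) in a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n):
    """Share of candidate n-grams that also occur in the reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    overlap = sum((cand & ref).values())  # Counter intersection keeps minimum counts
    total = sum(cand.values())
    return overlap / total if total else 0.0

reference = "people like foreign cars"
candidate_1 = "people like visiting places abroad"
candidate_2 = "consumers prefer imported cars"

for name, cand in [("Candidate 1", candidate_1), ("Candidate 2", candidate_2)]:
    p1 = ngram_precision(cand, reference, 1)
    p2 = ngram_precision(cand, reference, 2)
    print(f"{name}: unigram precision {p1:.2f}, bigram precision {p2:.2f}")
```

Candidate 1 shares "people like" with the reference and gets 0.40 and 0.25, while Candidate 2 shares only "cars" and gets 0.25 and 0.00, although it is the one that preserves the meaning.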

BERTScore, on the other hand, computes similarity using contextual token embeddings, which helps it detect semantically correct paraphrases. It also captures changes in semantically critical ordering, such as a swapped cause and effect (A gives B in place of B gives A), that the BLEU score does not detect.

BERTScore Architecture

Fig 1 summarizes the steps for calculating the BERTScore. Next, we will see details about each step. Here, the reference sentence is given by [math]\displaystyle{ x = \langle x_1, \ldots, x_k \rangle }[/math] and the candidate sentence by [math]\displaystyle{ \hat{x} = \langle \hat{x}_1, \ldots, \hat{x}_l \rangle }[/math].

Fig 1: Illustration of the computation of BERTScore.

Token Representation

The reference and candidate sentences are represented using contextual embeddings. The idea is inspired by word embedding techniques, but in contrast to static word embeddings, the contextual embedding of a word depends on the surrounding words in the sentence. These contextual embeddings are calculated using BERT, which utilizes self-attention and nonlinear transformations.
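
To make the difference from static embeddings concrete, here is a toy sketch of a single self-attention step. It is not BERT: there are no learned projections, layers, or nonlinearities, and the 2-dimensional vectors in `STATIC` are invented for illustration. It only shows how the same word can end up with different vectors in different sentences.

```python
import math

# Hypothetical 2-d static embeddings, chosen by hand for illustration only.
STATIC = {
    "river": [1.0, 0.0],
    "bank": [0.5, 0.5],
    "money": [0.0, 1.0],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def contextualize(tokens):
    """One attention step without learned weights: each output vector is a
    softmax-weighted average of all static vectors in the sentence."""
    vecs = [STATIC[t] for t in tokens]
    out = []
    for query in vecs:
        scores = [math.exp(dot(query, key)) for key in vecs]
        total = sum(scores)
        out.append([
            sum(s * key[d] for s, key in zip(scores, vecs)) / total
            for d in range(len(query))
        ])
    return out

# The static vector for "bank" is identical in both sentences,
# but its contextualized vector is pulled toward its neighbour.
bank_near_river = contextualize(["river", "bank"])[1]
bank_near_money = contextualize(["money", "bank"])[1]
print(bank_near_river, bank_near_money)
```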

Cosine Similarity

Pairwise cosine similarity is calculated between each token [math]\displaystyle{ x_{i} }[/math] in the reference sentence and each token [math]\displaystyle{ \hat{x}_{j} }[/math] in the candidate sentence. Because pre-normalized vectors are used, the pairwise similarity reduces to the inner product [math]\displaystyle{ x_{i}^\top \hat{x}_{j} }[/math].
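
A minimal sketch of this step, using plain Python lists in place of real BERT embeddings (the vectors below are made up): once every vector is scaled to unit length, the cosine similarity of a pair is just their inner product.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in vec))
    return [a / norm for a in vec]

def similarity_matrix(ref_vecs, cand_vecs):
    """sim[i][j] is the cosine similarity between reference token i and
    candidate token j, computed as an inner product of unit vectors."""
    ref_unit = [normalize(v) for v in ref_vecs]
    cand_unit = [normalize(v) for v in cand_vecs]
    return [[sum(a * b for a, b in zip(r, c)) for c in cand_unit] for r in ref_unit]

# Hypothetical 2-d embeddings: 2 reference tokens, 3 candidate tokens.
ref_vecs = [[3.0, 4.0], [1.0, 0.0]]
cand_vecs = [[6.0, 8.0], [0.0, 2.0], [-1.0, 0.0]]

sim = similarity_matrix(ref_vecs, cand_vecs)
for row in sim:
    print([round(value, 2) for value in row])
```

The result has one row per reference token and one column per candidate token, which is the shape the matching step operates on.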

BERTScore

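Each token in [math]\displaystyle{ x }[/math] is matched to the most similar token in [math]\displaystyle{ \hat{x} }[/math] to compute recall, and each token in [math]\displaystyle{ \hat{x} }[/math] is matched to the most similar token in [math]\displaystyle{ x }[/math] to compute precision. The matching is greedy and isolated: each token is matched independently of the others. Precision and recall are then combined into an F1 score. The equations for precision, recall, and F1 score are as follows:

[math]\displaystyle{ R_{BERT} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^\top \hat{x}_j }[/math]

[math]\displaystyle{ P_{BERT} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^\top \hat{x}_j }[/math]

[math]\displaystyle{ F_{BERT} = 2 \frac{P_{BERT} \cdot R_{BERT}}{P_{BERT} + R_{BERT}} }[/math]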
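
The greedy matching can be sketched directly on a similarity matrix. The matrix below is invented for illustration; in practice it would hold the pairwise cosine similarities of the BERT embeddings, with one row per reference token and one column per candidate token.

```python
def bert_score_from_similarities(sim):
    """Greedy matching on sim[i][j], the similarity between reference
    token i and candidate token j. Returns (precision, recall, f1)."""
    n_ref, n_cand = len(sim), len(sim[0])
    # Recall: every reference token takes its best-matching candidate token.
    recall = sum(max(row) for row in sim) / n_ref
    # Precision: every candidate token takes its best-matching reference token.
    precision = sum(max(row[j] for row in sim) for j in range(n_cand)) / n_cand
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical similarities: 2 reference tokens (rows), 3 candidate tokens (columns).
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
]

precision, recall, f1 = bert_score_from_similarities(sim)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Here recall averages the row maxima (0.9 and 0.8), while precision averages the column maxima (0.9, 0.8, and 0.4), so the unmatched third candidate token lowers precision but not recall.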

Importance Weighting (optional)

In some cases, rare words can be highly indicative of sentence similarity. Therefore, inverse document frequency (idf) weights can be incorporated into the BERTScore equations above. This step is optional, and the authors report that it provides little to no benefit to the final results. A better understanding of importance weighting thus remains an open area of research.
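
As a sketch of how such weights could enter the recall computation: the code below assumes the standard definition idf(w) = -log(df(w) / M), where df(w) is the number of reference sentences containing w and M is the number of reference sentences, and it weights each reference token's best match by its idf. The corpus, the similarity values, and the fallback for an all-zero weight sum are assumptions made for illustration.

```python
import math

def idf_weights(reference_corpus):
    """idf(w) = -log(df(w) / M) over a list of tokenized reference sentences."""
    m = len(reference_corpus)
    df = {}
    for sentence in reference_corpus:
        for token in set(sentence):
            df[token] = df.get(token, 0) + 1
    return {token: -math.log(count / m) for token, count in df.items()}

def weighted_recall(ref_tokens, sim, idf):
    """Recall where reference token i contributes max_j sim[i][j] with weight idf."""
    weights = [idf.get(token, 0.0) for token in ref_tokens]
    total = sum(weights)
    if total == 0:  # assumption: fall back to the unweighted average
        return sum(max(row) for row in sim) / len(sim)
    return sum(w * max(row) for w, row in zip(weights, sim)) / total

# Hypothetical reference corpus and similarity matrix (rows: reference tokens).
corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
idf = idf_weights(corpus)

ref_tokens = ["the", "cat", "sat"]
sim = [[1.0, 0.1], [0.2, 0.9], [0.1, 0.3]]

plain = sum(max(row) for row in sim) / len(sim)
weighted = weighted_recall(ref_tokens, sim, idf)
print(f"unweighted recall {plain:.3f}, idf-weighted recall {weighted:.3f}")
```

In this toy case "the" occurs in every reference sentence and gets weight 0, so its perfect match no longer inflates the score, while the poorly matched rare token "sat" dominates the weighted recall.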

Baseline Rescaling

Rescaling is done only to make the score more readable for humans. In theory, cosine similarity values lie between -1 and 1, but in practice they are confined to a much smaller range. A baseline value [math]\displaystyle{ b }[/math], computed using Common Crawl monolingual datasets, is used to linearly rescale the BERTScore. The rescaled recall [math]\displaystyle{ \hat{R}_{BERT} }[/math] is given by

[math]\displaystyle{ \hat{R}_{BERT} = \frac{R_{BERT} - b}{1 - b} }[/math]

[math]\displaystyle{ P_{BERT} }[/math] and [math]\displaystyle{ F_{BERT} }[/math] are rescaled in the same way.
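
A minimal sketch of the linear rescaling, mapping a raw score s to (s - b) / (1 - b). The baseline value used here is made up for illustration and is not a value reported by the authors.

```python
def rescale(score, baseline):
    """Linearly rescale a raw score so that the baseline maps to 0 and 1 stays 1."""
    return (score - baseline) / (1 - baseline)

b = 0.85    # hypothetical baseline value
raw = 0.93  # hypothetical raw recall, precision, or F1

print(f"raw {raw:.2f} -> rescaled {rescale(raw, b):.3f}")
```

Because the transformation is linear and increasing, it spreads the scores over a wider range without changing how candidates rank against each other.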
