BERTScore: Evaluating Text Generation with BERT

Presented by

Gursimran Singh


Machine learning has recently popularized automated approaches to text generation. This paper develops a metric for judging the quality of generated text. Commonly used state-of-the-art metrics rely on either n-gram matching or word embeddings to calculate the similarity between the reference and the candidate sentence. BERTScore, in contrast, calculates this similarity as the cosine similarity of BERT [6] contextual embeddings. BERTScore addresses two common pitfalls of n-gram-based metrics. First, n-gram models fail to robustly match paraphrases, which leads to performance underestimation when semantically correct phrases are penalized because they differ from the surface form of the reference. BERTScore instead computes similarity using contextualized token embeddings, which have been shown to be effective for paraphrase detection. Second, n-gram models fail to capture distant dependencies and fail to penalize semantically critical ordering changes. In contrast, contextualized embeddings capture distant dependencies and ordering effectively. Finally, BERTScore is a task-independent evaluation metric, which the authors argue makes it a better choice than other state-of-the-art metrics. The authors carried out various experiments in Machine Translation and Image Captioning to show that BERTScore is more reliable and robust than previous approaches.

Word versus Context Embeddings

Both word embeddings and contextual embeddings aim to reduce the sparsity of a bag-of-words (BoW) representation of text, which arises from high-dimensional vocabularies. Both produce embeddings of much lower dimensionality than sparse BoW vectors and aim to capture semantics and context. Word embeddings are deterministic in the sense that a given word embedding model always produces the same embedding for a word, regardless of the surrounding words. Contextual embeddings, in contrast, produce different embeddings for the same word depending on the surrounding words in the given text.

BERT Background

An excellent source for gaining intuition about BERT and transformers is provided by Jay Alammar.

Previous Work

Previous approaches for evaluating text generation can be broadly divided into several categories. The most commonly used techniques are based on n-gram matching. The main idea is to compare the n-grams in the reference and candidate sentences, which also takes local word order into account. Most of these methods use or slightly modify the exact match precision (Exact-[math]\displaystyle{ P_n }[/math]) and recall (Exact-[math]\displaystyle{ R_n }[/math]) scores. These scores can be formalized as follows:

Exact- [math]\displaystyle{ P_n = \frac{\sum_{w \in S^{n}_{\hat{x}}} \mathbb{I}[w \in S^{n}_{x}]}{|S^{n}_{\hat{x}}|} }[/math]
Exact- [math]\displaystyle{ R_n = \frac{\sum_{w \in S^{n}_{x}} \mathbb{I}[w \in S^{n}_{\hat{x}}]}{|S^{n}_{x}|} }[/math]

Here [math]\displaystyle{ \mathbb{I}[\cdot] }[/math] is an indicator function, and [math]\displaystyle{ S^{n}_{x} }[/math] and [math]\displaystyle{ S^{n}_{\hat{x}} }[/math] are the lists of token [math]\displaystyle{ n }[/math]-grams in the reference [math]\displaystyle{ x }[/math] and candidate [math]\displaystyle{ \hat{x} }[/math] sentences respectively.
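The two scores above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization; the function names `ngrams` and `exact_precision_recall` and the example sentences are hypothetical and are not taken from the paper.

```python
def ngrams(tokens, n):
    """Return the list of token n-grams in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def exact_precision_recall(reference, candidate, n=1):
    """Exact-P_n and Exact-R_n for one reference-candidate pair.

    Assumption: sentences are tokenized by splitting on whitespace.
    """
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    ref_set, cand_set = set(ref), set(cand)
    # Precision: fraction of candidate n-grams that occur in the reference.
    precision = sum(w in ref_set for w in cand) / len(cand) if cand else 0.0
    # Recall: fraction of reference n-grams that occur in the candidate.
    recall = sum(w in cand_set for w in ref) / len(ref) if ref else 0.0
    return precision, recall


p1, r1 = exact_precision_recall("the cat sat on the mat", "the cat is on the mat", n=1)
p2, r2 = exact_precision_recall("the cat sat on the mat", "the cat is on the mat", n=2)
```

In this example a single substituted word lowers the unigram scores only slightly but removes two of the five bigram matches, which shows how sensitive higher-order n-gram matching is to surface changes.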

The most popular n-gram matching metric is BLEU (Bilingual Evaluation Understudy) [4]. Its output lies between 0.0 and 1.0, where 0.0 denotes no match and 1.0 denotes a perfect match between the candidate sentence and the reference sentence. It follows the underlying principle of n-gram matching and makes the following three modifications to Exact-[math]\displaystyle{ P_n }[/math]:
• Each n-gram in the reference is matched at most once.
• The total number of exact matches is accumulated over all reference-candidate pairs and divided by the total number of [math]\displaystyle{ n }[/math]-grams in all candidate sentences.
• Very short candidates are penalized (the brevity penalty).

Further, BLEU is generally calculated for several values of [math]\displaystyle{ n }[/math] and the results are averaged geometrically. Other n-gram approaches include METEOR, NIST, ΔBLEU, etc. METEOR (Banerjee & Lavie, 2005) computes Exact- [math]\displaystyle{ P_1 }[/math] and Exact- [math]\displaystyle{ R_1 }[/math] with the modification that, when exact unigram matching is not possible, matches to word stems, synonyms, and paraphrases are used instead. For example, running may be matched with run if no exact match is found. This non-exact matching relies on external tools such as a paraphrase table. In newer versions of METEOR, an external paraphrase resource is used and different weights are assigned to different matching types.
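The BLEU modifications described above (matching each reference n-gram at most once, penalizing short candidates, and geometric averaging over n) can be sketched as follows. This is a simplified single-sentence, single-reference illustration and not the official BLEU implementation: real BLEU accumulates counts over a whole corpus and usually uses n up to 4. The name `simple_bleu`, the whitespace tokenization, and the default `max_n=2` are assumptions made for this example.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count each token n-gram in the sentence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, candidate, max_n=2):
    """Simplified sentence-level BLEU (illustrative only)."""
    ref, cand = reference.split(), candidate.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        ref_counts = ngram_counts(ref, n)
        cand_counts = ngram_counts(cand, n)
        total = sum(cand_counts.values())
        # Clipping: a reference n-gram can be matched at most as often as it occurs.
        matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # the geometric mean is zero if any precision is zero
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions.
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
score_paraphrase = simple_bleu(reference, "the cat is on the mat")
score_repeated = simple_bleu(reference, "the the the the the the")
score_short = simple_bleu(reference, "the cat")
```

Clipping prevents the degenerate candidate that repeats "the" from scoring well, and the brevity penalty keeps the two-word candidate from receiving a perfect score even though all of its n-grams appear in the reference.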

Other categories include edit-distance-based metrics, which compare two strings by calculating the minimum number of operations needed to transform one into the other; embedding-based metrics, which are derived by mapping the strings into an embedding space; and learned metrics, which construct task-specific metrics by applying machine learning to a supervised dataset. Most of these techniques do not capture the context of a word in the sentence. Moreover, learned metrics require costly human judgments as supervision for each dataset.


The [math]\displaystyle{ n }[/math]-gram approaches like BLEU do not capture the position and context of a word and simply rely on exact matching for evaluation. Consider the following example, which shows how BLEU can produce an incorrect judgment.
Reference: people like foreign cars
Candidate 1: people like visiting places abroad
Candidate 2: consumers prefer imported cars

BLEU gives a higher score to Candidate 1 than to Candidate 2, even though Candidate 2 is the faithful paraphrase of the reference. This leads to underestimating the performance of text generation models, since semantically correct sentences are penalized, while some semantically different phrases are scored higher just because they are closer to the surface form of the reference sentence.

On the other hand, BERTScore computes similarity using contextual token embeddings, which helps in detecting semantically correct paraphrases. It also captures cause-and-effect relationships (A gives B as opposed to B gives A) that BLEU fails to detect.

BERTScore Architecture

Fig 1 summarizes the steps for calculating the BERTScore. Each step is described in detail next. Here, the reference sentence is given by [math]\displaystyle{ x = \langle x_1, \ldots, x_k \rangle }[/math] and the candidate sentence by [math]\displaystyle{ \hat{x} = \langle \hat{x}_1, \ldots, \hat{x}_l \rangle }[/math].

Illustration of the computation of BERTScore.
Fig 1

Token Representation

Reference and candidate sentences are represented using contextual embeddings. The idea is inspired by word embedding techniques, but in contrast to word embeddings, the contextual embedding of a word depends on the surrounding words in the sentence. These contextual embeddings are calculated using models such as BERT, RoBERTa [7], XLNet, and XLM, which use self-attention and nonlinear transformations.

Pearson Correlation for Contextual Embedding
Fig 2

Cosine Similarity

Pairwise cosine similarity is calculated between each token [math]\displaystyle{ x_{i} }[/math] in the reference sentence and each token [math]\displaystyle{ \hat{x}_{j} }[/math] in the candidate sentence. Prenormalized vectors are used, so the pairwise similarity reduces to the inner product [math]\displaystyle{ x_{i}^T \hat{x}_{j} }[/math].


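Each token in [math]\displaystyle{ x }[/math] is matched to the most similar token in [math]\displaystyle{ \hat{x} }[/math] to calculate Recall, and each token in [math]\displaystyle{ \hat{x} }[/math] is matched to the most similar token in [math]\displaystyle{ x }[/math] to calculate Precision. The matching is greedy and isolated: every token is matched independently of the others. Precision and Recall are then combined into the F1 score. The equations for calculating Precision, Recall, and the F1 score are as follows: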

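[math]\displaystyle{ R_{BERT} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^T \hat{x}_j }[/math]
[math]\displaystyle{ P_{BERT} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^T \hat{x}_j }[/math]
[math]\displaystyle{ F_{BERT} = 2 \frac{P_{BERT} \cdot R_{BERT}}{P_{BERT} + R_{BERT}} }[/math]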
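The greedy matching step can be sketched in plain Python, assuming the contextual token embeddings have already been computed by some model and are passed in as lists of non-zero vectors. The function names and the toy two-dimensional vectors are hypothetical; real embeddings would come from BERT or a similar model and have hundreds of dimensions.

```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(a * a for a in vector))
    return [a / norm for a in vector]


def bertscore_from_embeddings(ref_emb, cand_emb):
    """Greedy-matching precision, recall, and F1 from token embeddings.

    Assumption: ref_emb and cand_emb are lists of token vectors that were
    produced beforehand by a contextual embedding model.
    """
    ref = [normalize(v) for v in ref_emb]
    cand = [normalize(v) for v in cand_emb]
    # sim[i][j] is the cosine similarity of reference token i and candidate token j.
    sim = [[sum(a * b for a, b in zip(x, y)) for y in cand] for x in ref]
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(row) for row in sim) / len(ref)
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(sim[i][j] for i in range(len(ref)))
                    for j in range(len(cand))) / len(cand)
    total = precision + recall
    f1 = 2 * precision * recall / total if total != 0 else 0.0
    return precision, recall, f1


# Toy example: two reference tokens, one candidate token.
precision, recall, f1 = bertscore_from_embeddings([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
```

In the toy example the single candidate token matches the first reference token perfectly, so precision is 1, while the second reference token has no similar counterpart, so recall drops to 0.5.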

Importance Weighting (optional)

In some cases, rare words can be highly indicative of sentence similarity. Therefore, inverse document frequency (idf) weights can be incorporated into the above equations of BERTScore. This step is optional: depending on the domain of the text and the available data, it may or may not improve the final results. Understanding importance weighting better therefore remains an open area of research.
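One way to picture importance weighting is to weight each reference token's best-match similarity by its idf before averaging. The sketch below assumes that idf is estimated from the reference sentences of the test corpus and that plus-one smoothing is applied so that unseen tokens receive a finite weight; both choices, along with the function names and toy data, are assumptions of this example rather than details confirmed by this summary.

```python
import math


def idf_weights(reference_corpus):
    """Estimate idf from tokenized reference sentences.

    Assumption: plus-one smoothing, idf(w) = log((M + 1) / (df(w) + 1)),
    where M is the number of sentences and df(w) the number containing w.
    Returns the weights and the weight used for unseen tokens.
    """
    m = len(reference_corpus)
    doc_freq = {}
    for sentence in reference_corpus:
        for token in set(sentence):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    weights = {tok: math.log((m + 1) / (df + 1)) for tok, df in doc_freq.items()}
    return weights, math.log(m + 1)


def weighted_recall(ref_tokens, best_sims, weights, unseen_weight):
    """idf-weighted average of each reference token's best-match similarity."""
    idf = [weights.get(tok, unseen_weight) for tok in ref_tokens]
    total = sum(idf)
    if total == 0:
        # All tokens appear in every sentence: fall back to the plain average.
        return sum(best_sims) / len(best_sims)
    return sum(w * s for w, s in zip(idf, best_sims)) / total


corpus = [["the", "cat"], ["the", "dog"], ["the", "bird"]]
weights, unseen = idf_weights(corpus)
# Hypothetical best-match similarities for the tokens "the" and "cat".
score = weighted_recall(["the", "cat"], [0.2, 0.9], weights, unseen)
```

Because "the" occurs in every sentence it receives zero weight, so the score is determined entirely by the rarer word "cat" (0.9 instead of the unweighted average of 0.55).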

Baseline Rescaling

Rescaling is done only to make the score more readable for humans. In theory, cosine similarity values lie between -1 and 1, but in practice they are confined to a much smaller range. A baseline value [math]\displaystyle{ b }[/math], computed using Common Crawl monolingual datasets, is used to linearly rescale the BERTScore. The rescaled recall [math]\displaystyle{ \hat{R}_{BERT} }[/math] is given by

[math]\displaystyle{ \hat{R}_{BERT} = \frac{R_{BERT} - b}{1 - b} }[/math]

Similarly, [math]\displaystyle{ P_{BERT} }[/math] and [math]\displaystyle{ F_{BERT} }[/math] are rescaled as well.
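As a small worked example, linear rescaling with a baseline b maps a raw score equal to b to 0 and leaves a raw score of 1 at 1. The baseline value 0.84 and the raw score 0.92 below are made-up numbers chosen for illustration, not values reported by the authors.

```python
def rescale(score, baseline):
    """Linearly rescale a raw score so that the baseline maps to 0 and 1 stays at 1."""
    return (score - baseline) / (1 - baseline)


b = 0.84                        # hypothetical baseline value
rescaled = rescale(0.92, b)     # a raw score halfway between b and 1
```

Here a raw score of 0.92, which looks high on the original scale, becomes roughly 0.5 after rescaling, making differences between candidates easier to read. The rescaling is monotonic, so it does not change how candidates are ranked.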

Experiment & Results

The authors experimented with different pre-trained contextual embedding models, such as BERT and RoBERTa, and reported the results of the best-performing model. In addition to the standard evaluation, they designed model selection experiments: from 10K hybrid systems super-sampled from WMT18, they randomly select 100 and rank them using the automatic metrics. The evaluation covers Machine Translation and Image Captioning tasks.

Machine Translation

The metric evaluation dataset consists of 149 translation systems, gold references, and two types of human judgments: segment-level and system-level. The former assigns a score to each reference-candidate pair, and the latter assigns a single score to the whole system. Segment-level outputs for BERTScore are calculated as explained in the previous section on architecture, and system-level outputs are calculated by averaging the BERTScore over all reference-candidate pairs. Absolute Pearson correlation [math]\displaystyle{ \lvert \rho \rvert }[/math] and Kendall rank correlation [math]\displaystyle{ \tau }[/math] are used to measure metric quality, with the Williams test [1] for the significance of [math]\displaystyle{ \lvert \rho \rvert }[/math] and the bootstrap resampling method of Graham & Baldwin [2] for the significance of [math]\displaystyle{ \tau }[/math]. The authors also created hybrid systems by randomly sampling, for each reference sentence, one candidate sentence from one of the systems. This increases the number of systems available for system-level experiments. Further, the authors randomly selected 100 out of the 10K hybrid systems and ranked them using the automatic metrics. They repeated this process multiple times and reported Hits@1, the percentage of trials in which the metric ranking agrees with the human ranking on the best system.
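The system-level evaluation described above can be sketched as follows: segment scores are averaged per system, correlated with human scores, and Hits@1 is estimated by repeatedly sampling subsets of systems. The function names, the fixed random seed, and the toy scores are assumptions of this example; significance testing (Williams test, bootstrap resampling) is omitted.

```python
import math
import random


def system_score(segment_scores):
    """System-level score: the average of the segment-level scores."""
    return sum(segment_scores) / len(segment_scores)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def hits_at_1(metric_scores, human_scores, sample_size, trials, seed=0):
    """Fraction of random subsets in which metric and humans agree on the best system."""
    rng = random.Random(seed)
    names = list(metric_scores)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(names, sample_size)
        best_by_metric = max(sample, key=metric_scores.get)
        best_by_human = max(sample, key=human_scores.get)
        hits += best_by_metric == best_by_human
    return hits / trials


# Toy data: three systems with hypothetical segment-level metric scores.
segments = {"A": [0.9, 0.8], "B": [0.7, 0.7], "C": [0.5, 0.6]}
metric = {name: system_score(scores) for name, scores in segments.items()}
human = {"A": 80.0, "B": 72.0, "C": 55.0}
names = sorted(metric)
abs_rho = abs(pearson([metric[n] for n in names], [human[n] for n in names]))
agreement = hits_at_1(metric, human, sample_size=2, trials=50)
```

In this toy data the metric ranks the systems in the same order as the human scores, so every sampled subset yields agreement on the best system.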

The following 4 tables show the result of the experiments mentioned above.

Table1 Machine Translation Table2 Machine Translation
Table3 Machine Translation Table4 Machine Translation

In all four tables, BERTScore is consistently a top performer. It also gives a large improvement over the widely used BLEU score. In to-English translation, RUSE shows competitive results, but it is a learned metric and requires costly human judgments as supervision.

Image Captioning

For Image Captioning, human judgments for twelve submission entries from the COCO 2015 Captioning Challenge are used. Following Cui et al. (2018) [3], the Pearson correlation with two system-level metrics is calculated: the percentage of captions that are better than or equal to human captions (M1) and the percentage of captions that are indistinguishable from human captions (M2). Each image has approximately five reference captions, and the BERTScore of a candidate is taken to be the maximum of its BERTScores computed against each reference caption individually. BERTScore is compared with eight task-agnostic metrics (shown under the Metric column in Table 5) and two task-specific metrics, Semantic Propositional Image Caption Evaluation (SPICE) [8] and Learning to Evaluate Image Caption (LEIC) [3]. Given an input image, LEIC predicts whether a caption was written by a human, whereas SPICE compares scene graphs parsed from the reference and candidate captions.
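Taking the maximum over several references can be sketched as below. The helper `multi_reference_score` accepts any pairwise scoring function; the token-overlap scorer `toy_overlap` is only a hypothetical stand-in so that the example runs without a model, and it is not BERTScore.

```python
def multi_reference_score(candidate, references, score_fn):
    """Score a candidate against each reference and keep the best score."""
    return max(score_fn(reference, candidate) for reference in references)


def toy_overlap(reference, candidate):
    """Stand-in scorer (NOT BERTScore): Jaccard overlap of the token sets."""
    ref_tokens, cand_tokens = set(reference.split()), set(candidate.split())
    return len(ref_tokens & cand_tokens) / len(ref_tokens | cand_tokens)


references = ["a cat sleeps", "a dog runs fast"]
best = multi_reference_score("a dog runs", references, toy_overlap)
```

Using the maximum rewards a caption that agrees well with any one of the human references, which suits image captioning because valid captions of the same image can differ considerably.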

Table5 Image Captioning
Table 5: Pearson correlation on the 2015 COCO Captioning Challenge.

BERTScore is again a top performer, and n-gram metrics like BLEU show a weak correlation with human judgments. For this task, importance weighting yields a significant improvement, which highlights the importance of content words.

Speed: BERTScore is slower than BLEU, but the difference is small in absolute terms. For example, on the same hardware, the Machine Translation test takes 15.6 seconds with BERTScore compared to 5.4 seconds with BLEU. Both run times are short, so the difference is marginal in practice.

Robustness Analysis

The authors tested BERTScore's robustness using two adversarial paraphrase classification datasets, QQP and PAWS. The table below summarizes the results. Most metrics perform well on QQP, but their performance drops significantly on PAWS. In contrast, BERTScore performs competitively on PAWS, which suggests that BERTScore is better at distinguishing harder adversarial examples.

Source Code

The code for this paper is available at BERTScore.

Critique & Future Prospects

A text evaluation metric, BERTScore, is proposed; it outperforms previous approaches because it uses contextual embeddings for evaluation. It is simple, easy to use, and more robust than previous approaches, as shown by the experiments carried out on datasets consisting of paraphrased sentences. There are variants of BERTScore depending on the contextual embedding model, the use of importance weighting, and the evaluation metric (Precision, Recall, or F1 score).

The main reason behind the success of BERTScore is the use of contextual embeddings. The remaining architecture is straightforward in itself. Some word-embedding-based approaches use more complex metrics for calculating similarity. Combining those metrics with contextual embeddings instead of word embeddings might result in more reliable performance than BERTScore.

BERT can also be used for other Natural Language Processing tasks such as text classification and named entity recognition (NER). In the NER task, the IOB tagging scheme is applied to the prediction model. The model and tagging system can be found in the spaCy package, and a performance metric called through Keras is sufficient to evaluate the model. Some drawbacks of BERTScore are its higher memory consumption and higher time complexity compared to its predecessor BLEU.

The paper is interesting, but the proposed approach clearly lacks technical novelty. The method is a natural application of BERT combined with traditional cosine similarity measures, precision, recall, and F1-based computations, and simple idf-based importance weighting. In the future, the authors should consider extending the metric to pairs of languages where the words are not directly comparable. The metric should also be able to distinguish a bad output from the worst output and clearly identify the best output among the available options.


[1] Evan James Williams. Regression analysis. Wiley, 1959.

[2] Yvette Graham and Timothy Baldwin. Testing for significance of increased correlation with human judgment. In EMNLP, 2014.

[3] Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge J. Belongie. Learning to evaluate image captioning. In CVPR, 2018.

[4] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.

[5] Qingsong Ma, Ondrej Bojar, and Yvette Graham. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In WMT, 2018.

[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.

[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv, abs/1907.11692, 2019.

[8] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic propositional image caption evaluation. In ECCV, 2016.