SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
Natural Language Processing (NLP) has improved immensely over the past two years. Pretrained models such as ELMo, OpenAI GPT, and BERT, most of them built on the Transformer architecture, have revolutionized the field. These models have made GLUE, the standard benchmark for NLP tasks, largely ineffective. The GLUE benchmark was released just over a year ago and assesses NLP models with a single-number metric that summarizes performance across a diverse set of tasks. However, recent models now outperform non-expert humans on several of these tasks. With these models achieving near-perfect scores on almost all GLUE tasks and surpassing humans on some, there is a need for a new benchmark with harder and even more diverse language tasks. The authors release SuperGLUE as a new benchmark with a more rigorous set of language understanding tasks.
Several earlier benchmarks have attempted to standardize the evaluation of language understanding. SentEval evaluated fixed-size sentence embeddings on a set of tasks. DecaNLP converts tasks into a general question-answering format. GLUE offers a much more flexible and extensible benchmark since it imposes no restrictions on model architectures or parameter sharing.
GLUE has been the gold standard for language understanding tests since its release. In fact, the benchmark has driven progress in language models, with many transformer-based models setting out to achieve high scores on GLUE. The original GPT and BERT models scored 72.8 and 80.2 on GLUE, respectively. The latest GPT and BERT models, however, far exceed these scores, which creates the need for a more robust and difficult benchmark.
Transformer-based models allow NLP systems to be trained with transfer learning, which was previously seen mostly in Computer Vision tasks and was notoriously difficult for language because of the discrete nature of words. Transfer learning in NLP allows models to be pretrained on terabytes of language data in a self-supervised fashion. These models can then be fine-tuned for downstream tasks such as sentiment classification, fake news detection, etc. The fine-tuned models beat many human labellers who weren’t experts in the domain. This creates a need for a newer, more robust benchmark that can stay relevant amid the rapid improvements in the field of NLP.
SuperGLUE has 8 language understanding tasks, all of which test a model’s understanding of English text. The tasks are designed to be solvable by most college-educated English speakers while remaining beyond the capabilities of most state-of-the-art systems today.
BoolQ (Boolean Questions, Clark et al.): QA task consisting of a short passage and a yes/no question about that passage.
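CB (CommitmentBank): Corpus of short texts in which at least one sentence contains an embedded clause. Each clause is annotated with how strongly the writer appears committed to its truth, and the task is framed as three-class textual entailment.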
COPA (Choice of Plausible Alternatives): Causal reasoning task where, given a premise sentence, the system must choose its cause or effect from two candidate choices.
MultiRC (Multi-Sentence Reading Comprehension): QA task where, given a passage, a question, and a list of candidate answers, the model should label each answer as true or false.
ReCoRD (Reading Comprehension with Commonsense Reasoning Dataset, Zhang et al.): A multiple-choice, question answering task, where given a passage with a masked entity, the model should be able to select the correct entity from the choices.
RTE (Recognizing Textual Entailment): Deciding whether a given hypothesis can be plausibly inferred from a given passage.
WiC (Word in Context): Identifying whether a polysemous word used in multiple sentences is being used with the same sense across sentences or not.
WSC (Winograd Schema Challenge): A coreference resolution task where each sentence contains a pronoun and several candidate noun phrases. The goal is to identify which noun phrase the pronoun refers to.
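To make the differences between the eight tasks easier to compare, the descriptions above can be sketched as a small lookup table of inputs and expected outputs. This is only an illustrative summary: the field names (`passage`, `premise`, `candidate_answer`, and so on) and the `TaskFormat` and `describe` helpers are hypothetical and are not the official SuperGLUE data schema.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TaskFormat:
    """What a model reads and what it must predict (field names are illustrative)."""
    inputs: Tuple[str, ...]
    output: str


# Hypothetical summary of the task list above, not the official data format.
TASK_FORMATS: Dict[str, TaskFormat] = {
    "BoolQ": TaskFormat(("passage", "question"), "yes / no"),
    "CB": TaskFormat(("premise", "hypothesis"),
                     "entailment / contradiction / neutral"),
    "COPA": TaskFormat(("premise", "choice1", "choice2", "cause_or_effect"),
                       "the more plausible choice"),
    "MultiRC": TaskFormat(("passage", "question", "candidate_answer"),
                          "true / false"),
    "ReCoRD": TaskFormat(("passage", "masked_query", "candidate_entities"),
                         "entity that fills the mask"),
    "RTE": TaskFormat(("premise", "hypothesis"),
                      "entailment / not entailment"),
    "WiC": TaskFormat(("word", "sentence1", "sentence2"),
                      "same sense / different sense"),
    "WSC": TaskFormat(("sentence", "pronoun", "noun_phrase"),
                      "pronoun refers to noun phrase: yes / no"),
}


def describe(task: str) -> str:
    """Render one task as a single readable line."""
    fmt = TASK_FORMATS[task]
    return f"{task}: ({', '.join(fmt.inputs)}) -> {fmt.output}"


if __name__ == "__main__":
    for name in TASK_FORMATS:
        print(describe(name))
```

Seen this way, most of the tasks reduce to a classification decision over a short text, while ReCoRD asks the model to pick an entity from the passage.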
Table 1 offers a summary of the results from SuperGLUE across different models. CBOW baselines are generally close to chance performance. BERT, on the other hand, increased the SuperGLUE score by 25 points and improved substantially on most tasks, especially MultiRC, ReCoRD, and RTE. WSC is trickier for BERT, potentially owing to the small dataset size.
BERT++ increases BERT’s performance even further. However, as the benchmark intends, the best model still lags well behind human performance. These large gaps should be relatively hard for models to close. The biggest margin is on WSC, at 35 points, while CB, RTE, BoolQ, and WiC each have margins of around 10 points.
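The single-number score and the per-task margins discussed above can be sketched in a few lines. This is a minimal sketch, assuming the overall score is an unweighted average of per-task scores and that a task reporting several metrics is first averaged into one task score. The task names and numbers in the usage example are placeholders invented for illustration, not values from Table 1.

```python
from statistics import mean
from typing import Dict

Results = Dict[str, Dict[str, float]]  # task name -> {metric name: score}


def task_score(metrics: Dict[str, float]) -> float:
    """Collapse one task's metrics into a single score (assumed: plain average)."""
    return mean(metrics.values())


def benchmark_score(results: Results) -> float:
    """Unweighted average of the per-task scores."""
    return mean(task_score(m) for m in results.values())


def human_gaps(model: Results, human: Results) -> Dict[str, float]:
    """Per-task margin between human and model performance, in points."""
    return {t: task_score(human[t]) - task_score(model[t]) for t in model}


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    model = {"TaskA": {"accuracy": 70.0},
             "TaskB": {"f1": 60.0, "exact_match": 40.0}}
    human = {"TaskA": {"accuracy": 90.0},
             "TaskB": {"f1": 80.0, "exact_match": 80.0}}

    gaps = human_gaps(model, human)
    print(benchmark_score(model), benchmark_score(human))
    print(max(gaps, key=gaps.get), gaps)
```

Reporting the per-task gaps alongside the average matters here, because a single average would hide that one task (WSC) accounts for a much larger margin than the others.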
SuperGLUE fills the gap left by GLUE, which could no longer keep up with the state of the art (SOTA) in NLP. The new language tasks that the benchmark offers are built to be more robust and more difficult for NLP models to solve. With the gap between the best model and human performance being around 10-35 points across tasks, SuperGLUE is likely to be around for some time before the models catch up to it as well. Overall, this is a significant contribution toward improving general-purpose natural language understanding.
This is quite a fascinating read in which the authors of the gold-standard benchmark have essentially conceded to the progress in NLP. Bowman’s team resorting to creating a new benchmark altogether to keep up with the rapid pace of progress in NLP makes me wonder if these benchmarks are inherently flawed. Applying the idea of Wittgenstein’s Ruler, are we measuring the performance of models using the benchmark, or the quality of benchmarks using the models?
I’m curious how long SuperGLUE will stay relevant given the advances in NLP. GPT-3, released in June 2020, has outperformed GPT-2 and BERT by a huge margin, given the roughly 100x increase in parameters (175B parameters trained over ~600GB for GPT-3, compared to 1.5B parameters over 40GB for GPT-2). In October 2020, a new deep learning technique (Pattern Exploiting Training) managed to train a Transformer NLP model with 223M parameters (roughly 0.1% of GPT-3’s parameters) that outperformed GPT-3 by 3 points on SuperGLUE. With the field improving so rapidly, I think SuperGLUE is nothing but a band-aid for benchmarking that will turn obsolete in no time.
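The size comparisons in this paragraph are easy to sanity-check with quick arithmetic. The sketch below uses only the parameter counts quoted above; the constant and variable names are my own.

```python
# Parameter counts as quoted in the text above.
GPT3_PARAMS = 175_000_000_000  # 175B
GPT2_PARAMS = 1_500_000_000    # 1.5B
PET_PARAMS = 223_000_000       # 223M (model trained with Pattern Exploiting Training)

# How many times larger GPT-3 is than GPT-2.
gpt3_over_gpt2 = GPT3_PARAMS / GPT2_PARAMS

# The PET model's size as a percentage of GPT-3's size.
pet_share_percent = PET_PARAMS / GPT3_PARAMS * 100

# How many times larger GPT-3 is than the PET model.
gpt3_over_pet = GPT3_PARAMS / PET_PARAMS

if __name__ == "__main__":
    print(f"GPT-3 vs GPT-2: {gpt3_over_gpt2:.0f}x")          # 117x
    print(f"PET model vs GPT-3: {pet_share_percent:.2f}%")   # 0.13%
    print(f"GPT-3 vs PET model: {gpt3_over_pet:.0f}x")       # 785x
```

So the smaller model sits at roughly a tenth of a percent of GPT-3's parameter count, which is what makes its SuperGLUE result notable.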