RoBERTa: A Robustly Optimized BERT Pretraining Approach

Presented by

Danial Maleki

Introduction

Self-training methods in Natural Language Processing (NLP) such as ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements on a variety of NLP tasks. However, it is difficult to determine which parts of these methods contribute most to their success. This paper proposes RoBERTa, a replication study of BERT pretraining that investigates the effects of hyperparameter tuning and training set size. In summary, the main contributions of this paper are:

(1) Introducing alternative design choices and training schemes for BERT that lead to better downstream task performance.

(2) Introducing a new dataset and validating the positive effect of using more data for pretraining on the performance of downstream tasks.

Both categories of modification improve performance on downstream tasks.

Background

In this section, the authors give an overview of BERT, since RoBERTa uses the same architecture. The architecture of BERT is the transformer encoder[6], and it is trained in two stages: unsupervised pretraining on unlabeled text, followed by fine-tuning on labeled downstream data. Very briefly, the transformer architecture defines attention over the embeddings in a layer such that the weights used to combine them are themselves a function of the embeddings at that layer, i.e. they are dynamic.

BERT takes a concatenation of two token sequences as input. The two sequences are of length M (x1, x2, ..., xM) and N (y1, y2, ..., yN), where M + N < T (the maximum length of the input sequence during training). They are concatenated along with special tokens: [CLS] marks the start of the input, [SEP] separates the two sequences, and [EOS] marks the end of the input.

BERT has two training objectives: masked language modeling (MLM) and next sentence prediction (NSP). The MLM objective randomly selects 15% of the tokens in the input sequence. Of the selected tokens, 80% are replaced with the special token [MASK] so that the model cannot cheat by simply copying the input, 10% are replaced with a random token from the vocabulary, and the remaining 10% are left unchanged. The model then tries to predict the selected tokens from the surrounding context. NSP employs a binary classification loss for predicting whether two sentences are adjacent to each other. Positive examples are created by taking consecutive sentences from the text corpus. It is noteworthy that while selecting positive next sentences is trivial, generating good negative ones is often much more difficult. Originally, a random sentence was used as the negative example.

BERT was trained with the Adam optimizer, using a specific set of hyperparameters, on a combination of datasets. Finally, the authors describe the evaluation tasks, GLUE[7], SQuAD, and RACE, on which they report downstream performance.
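As a rough illustration of the input format and the MLM corruption step, here is a minimal Python sketch. It assumes the standard BERT recipe (15% of tokens selected; of those, 80% masked, 10% replaced at random, 10% kept). The function names `build_input` and `mlm_corrupt` and the toy vocabulary are hypothetical choices for illustration, not the authors' code.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[EOS]"}


def build_input(x, y, max_len):
    """Concatenate two token sequences with BERT-style special tokens."""
    if len(x) + len(y) >= max_len:
        raise ValueError("M + N must be smaller than T")
    return ["[CLS]"] + x + ["[SEP]"] + y + ["[EOS]"]


def mlm_corrupt(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, {position: original token to predict})."""
    corrupted = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        # Special tokens are never selected (an assumption of this sketch).
        if tok in SPECIAL_TOKENS or rng.random() >= select_prob:
            continue
        targets[i] = tok
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, targets


if __name__ == "__main__":
    rng = random.Random(0)
    toy_vocab = ["the", "cat", "sat", "on", "mat", "dog"]
    tokens = build_input(["the", "cat", "sat"], ["on", "the", "mat"], max_len=512)
    corrupted, targets = mlm_corrupt(tokens, toy_vocab, rng)
    print(tokens)
    print(corrupted)
    print(targets)
```

The model is only asked to predict the positions recorded in `targets`; every other position is passed through unchanged.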

Experimental Setup

Implementation

They primarily follow the original BERT optimization hyperparameters, except for the peak learning rate and the number of warmup steps, which are tuned separately for each setting. They found the training to be very sensitive to the Adam epsilon term, and in some cases, they obtained better performance or improved stability after tuning it. They also set β2 = 0.98, a hyperparameter of the Adam optimizer [10], to improve stability when training with large batch sizes.
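To make the roles of the peak learning rate and the warmup steps concrete, the following is a small sketch of a learning rate schedule. It assumes the linear warmup followed by linear decay that BERT-style training commonly uses; the function name `learning_rate` and the numeric values in the usage lines are illustrative assumptions, not the tuned settings from the paper.

```python
def learning_rate(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    # Illustrative values only (assumed for this sketch).
    for step in (0, 5_000, 10_000, 55_000, 100_000):
        print(step, learning_rate(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=100_000))
```

Tuning the two quantities separately for each setting, as described above, amounts to changing `peak_lr` and `warmup_steps` in a schedule of this shape.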

Data

They consider five English-language corpora of varying sizes and domains, totaling over 160GB of uncompressed text: BOOKCORPUS (Zhu et al., 2015) plus English WIKIPEDIA, which is the original data used to train BERT (16GB); CC-NEWS, which they collected from the English portion of the CommonCrawl News dataset (Nagel, 2016), containing 63 million English news articles crawled between September 2016 and February 2019 (76GB after filtering); OPENWEBTEXT (Gokaslan & Cohen, 2019), an open-source recreation of the WebText corpus described in Radford et al. (2019), containing web content extracted from URLs shared on Reddit with at least three upvotes (38GB); and STORIES, a dataset introduced in Trinh & Le (2018) containing a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas (31GB).

Training Procedure Analysis

In this section, they examine which choices are important for successfully pretraining BERT. The authors compared four factors to optimize the RoBERTa model: the masking strategy, the pretraining input format and objective, the batch size, and the tokenization method.

Static vs. Dynamic Masking

First, they discussed static vs. dynamic masking. As mentioned in the previous section, the masked language modeling objective in BERT pretraining masks a fraction of the tokens in each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once, during preprocessing. This implies that the same masking pattern is used for the same sequence in all training steps. To avoid reusing a single static mask in every epoch, the authors duplicated the training data 10 times so that each sequence was masked in 10 different ways. The model was trained on this data for 40 epochs, so each sequence was seen with the same mask 4 times.

They compared this with dynamic masking, in which a new masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking performs slightly better than static masking.
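The difference between the two strategies can be sketched in a few lines of Python. This is a simplified illustration that only chooses which positions to mask; the function names, the 15% masking rate, and the sequence length of 128 are assumptions made for the example.

```python
import random


def sample_mask(num_tokens, rng, mask_prob=0.15):
    """Pick the set of positions to mask in one sequence."""
    return frozenset(i for i in range(num_tokens) if rng.random() < mask_prob)


def static_masks(num_tokens, epochs, rng, num_copies=10):
    """Precompute num_copies masks once, then cycle through them across epochs."""
    copies = [sample_mask(num_tokens, rng) for _ in range(num_copies)]
    return [copies[epoch % num_copies] for epoch in range(epochs)]


def dynamic_masks(num_tokens, epochs, rng):
    """Sample a fresh mask every time the sequence is fed to the model."""
    return [sample_mask(num_tokens, rng) for _ in range(epochs)]


if __name__ == "__main__":
    static = static_masks(128, epochs=40, rng=random.Random(0))
    dynamic = dynamic_masks(128, epochs=40, rng=random.Random(0))
    # With 10 copies and 40 epochs, each static mask is reused 4 times.
    print(len({id(m) for m in static}), "distinct static mask objects")
    print(len(set(dynamic)), "distinct dynamic masks")
```

Dynamic masking matters more as the number of training steps or the dataset size grows, since the precomputed copies would otherwise be repeated more often.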

Input Representation and Next Sentence Prediction

Next, they investigated whether the next sentence prediction objective is necessary. They compared the following input settings to test whether removing the NSP loss from pretraining helps.

(1) Segment-Pair + NSP: Each input is a pair of segments (segments, not sentences, so each may contain several sentences). The second segment comes from either the same document or a different document chosen at random, each with a probability of 0.5, and the model is trained with the NSP objective, which was designed to help downstream tasks such as textual entailment or Natural Language Inference (NLI). The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the original BERT implementation.

(2) Sentence-Pair + NSP: This is the same as the segment-pair representation but with pairs of sentences instead of segments. The total length of these inputs is much less than 512 tokens. Hence, a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.

(3) Full-Sentences: This setting does not use any NSP loss. Input sequences consist of full sentences taken contiguously from one or more documents, such that the total length is at most 512 tokens. If one document ends, sentences from the next document are added, with an extra separator token between the documents.

(4) Doc-Sentences: Like Full-Sentences, this setting does not use the NSP loss. The difference is that a sequence does not cross document boundaries, i.e. once a document is over, sentences from the next one are not added to the sequence. Inputs sampled near the end of a document may therefore be shorter than 512 tokens, so the batch size is increased dynamically in these cases to keep the number of tokens per batch similar to Full-Sentences.

The following table shows each setting's performance on each downstream task. The best result is achieved in the DOC-SENTENCES setting, with the NSP loss removed. However, the authors chose to use FULL-SENTENCES for convenience, since DOC-SENTENCES results in variable batch sizes.
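The two settings without NSP differ only in whether a sequence may continue into the next document. A minimal sketch of such a packing routine, assuming greedy packing of already tokenized sentences, is given below. The function name `pack_sentences`, the `cross_documents` flag, and the truncation of overlong sentences are hypothetical choices for this example.

```python
def pack_sentences(documents, max_len, cross_documents, sep="[SEP]"):
    """Greedily pack sentences into sequences of at most max_len tokens.

    documents: list of documents; each document is a list of sentences;
    each sentence is a list of tokens.
    cross_documents=True mimics FULL-SENTENCES, False mimics DOC-SENTENCES.
    """
    sequences, current = [], []
    for doc in documents:
        if cross_documents and current and len(current) < max_len:
            current = current + [sep]  # extra separator marks the document boundary
        for sentence in doc:
            sentence = sentence[:max_len]  # assumption: truncate overlong sentences
            if current and len(current) + len(sentence) > max_len:
                sequences.append(current)
                current = []
            current = current + sentence
        if not cross_documents and current:
            # DOC-SENTENCES: never continue into the next document
            sequences.append(current)
            current = []
    if current:
        sequences.append(current)
    return sequences


if __name__ == "__main__":
    docs = [[["a", "b"], ["c"]], [["d", "e"]]]
    print(pack_sentences(docs, max_len=8, cross_documents=True))
    # [['a', 'b', 'c', '[SEP]', 'd', 'e']]
    print(pack_sentences(docs, max_len=8, cross_documents=False))
    # [['a', 'b', 'c'], ['d', 'e']]
```

In the second call the sequences end wherever a document ends, so their lengths vary; this is why the number of sequences per batch has to vary to keep the token count per batch comparable.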

Large Batch Sizes

Next, they investigated the importance of large batch sizes. They tried several batch sizes and found that a batch size of 2k performed best among those tested. The table below shows their results for the different batch sizes. The results suggest that the original BERT batch size was too small. The authors used a batch size of 8k in the remainder of their experiments.
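When comparing batch sizes, the number of training steps has to shrink as the batch grows so that the total number of sequences processed stays the same. The short calculation below illustrates this, assuming a baseline of 1M steps at batch size 256 and taking "2k" and "8k" to mean 2048 and 8192 sequences; the function name `equivalent_steps` is made up for the example.

```python
def equivalent_steps(base_batch, base_steps, new_batch):
    """Steps needed at new_batch to process as many sequences as the baseline."""
    return base_batch * base_steps // new_batch


if __name__ == "__main__":
    for batch in (256, 2048, 8192):
        print(batch, equivalent_steps(256, 1_000_000, batch))
    # 256 1000000
    # 2048 125000
    # 8192 31250
```

Larger batches are also easier to parallelize across devices, which is a practical reason for preferring them even when the end-task differences are small.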

Tokenization

In RoBERTa, they use byte-level Byte-Pair Encoding (BPE) for tokenization, whereas BERT uses character-level BPE. BPE is a hybrid between character-level and word-level modeling, based on sub-word units built from frequently repeated character sequences. The authors also used a vocabulary size of 50k rather than the 30k of the original BERT implementation, which increases the total number of parameters by approximately 15M and 20M for BERT-base and BERT-large respectively. This change actually results in a slight degradation of end-task performance in some cases. However, the authors preferred having the ability to universally encode text without introducing the "unknown" token.
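Both claims in this paragraph can be checked with a little arithmetic. The sketch below assumes hidden sizes of 768 for BERT-base and 1024 for BERT-large, and counts only the extra rows of the token embedding matrix. It also shows why a byte-level base alphabet never needs an "unknown" token: any string becomes a sequence of UTF-8 bytes with values from 0 to 255. The function names are hypothetical.

```python
def extra_embedding_params(old_vocab, new_vocab, hidden_size):
    """Extra embedding parameters from enlarging the vocabulary."""
    return (new_vocab - old_vocab) * hidden_size


def to_byte_symbols(text):
    """Base symbols of a byte-level tokenizer: the UTF-8 bytes of the text."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    print(extra_embedding_params(30_000, 50_000, 768))   # 15360000, about 15M
    print(extra_embedding_params(30_000, 50_000, 1024))  # 20480000, about 20M
    symbols = to_byte_symbols("naïve 日本語")
    print(all(0 <= s < 256 for s in symbols))             # True
```

A real byte-level BPE then merges frequent byte pairs into larger units; that merging step is omitted here.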

RoBERTa

They claim that applying all of the aforementioned modifications to BERT and pretraining the model on a larger dataset yields higher performance on downstream tasks. They used several datasets for pretraining, listed below.

(1) BookCorpus + English Wikipedia (16GB): This is the data on which BERT is trained.

(2) CC-News (76GB): The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.

(3) OpenWebText (38GB): An open-source recreation of the WebText dataset used to train OpenAI GPT-2.

(4) Stories (31GB): A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.

These datasets are used in conjunction with the modifications described above: dynamic masking, FULL-SENTENCES without an NSP loss, large mini-batches, and a larger byte-level BPE vocabulary.

Results

Table 5: RoBERTa's development set results as it is pretrained over more data (from 16GB to 160GB) and for longer (from 100K to 300K to 500K steps). Each row accumulates the improvements from the rows above it.

RoBERTa outperforms state-of-the-art models, including ensemble models, on almost all GLUE tasks. In addition, the authors compare RoBERTa with other methods on the RACE and SQuAD evaluations and show the results in the table below.

Table 3 presents the results of the experiments. RoBERTa offers major improvements over BERT (Large). First, the three additional datasets are combined with the original dataset while keeping the original number of steps (100K), so that 160GB in total is used for pretraining. Then RoBERTa is pretrained for far more steps: the number of steps is increased from 100K to 300K and then to 500K. Improvements were observed across all downstream tasks. With 500K steps, RoBERTa also outperforms XLNet (Large) on most tasks.

Conclusion

The results confirm that using large batches over more data, together with a longer training time, improves performance. In conclusion, the authors suggest that the source of the gains reported by more recent methods may be questionable: if BERT is pretrained carefully, it reaches the performance that RoBERTa reports.

The comparison at a glance

Critique

While the results are outstanding and appreciable (plausibly due to using more data and resources), the technical novelty of the paper is only marginally incremental, as the architecture is largely unchanged from BERT. Another question worth asking about RoBERTa, or any language model, is: have language models in general successfully acquired commonsense reasoning, or are we overestimating the true capabilities of machine commonsense? WINOGRANDE [9] is a large-scale dataset that contains 44k problems, inspired by the Winograd Schema Challenge (WSC) design. The key steps in constructing this dataset are a crowdsourcing procedure followed by systematic bias reduction that generalizes human-detectable word associations to machine-detectable embedding associations. I am curious to test RoBERTa and BERT on this dataset to see how much improvement they have made.

Source Code

The code for this paper is freely available at RoBERTa. The original repository for RoBERTa is in PyTorch. If you are a TensorFlow user, you may be interested in the PyTorch to TensorFlow repository as well.

References

[1] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In the North American Association for Computational Linguistics (NAACL).

[2] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.

[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In the North American Association for Computational Linguistics (NAACL).

[4] Guillaume Lample and Alexis Conneau. 2019. Cross lingual language model pretraining. arXiv preprint arXiv:1901.07291.

[5] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems.

[7] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).

[8] BERT, RoBERTa, DistilBERT, XLNet - which one to use? Suleiman Khan link

[9] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Winogrande: An adversarial winograd schema challenge at scale. ArXiv, abs/1907.10641.

[10] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
