RoBERTa
Presented by
Danial Maleki
Introduction
Self-training methods in the Natural Language Processing (NLP) domain, such as ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5], have shown significant improvements, but it is difficult to determine which parts of these methods contribute most to their performance. RoBERTa is a replication study of BERT pretraining that investigates the effects of hyperparameter tuning and training set size. In summary, their contributions can be categorized as follows: (1) they modified some of BERT's design choices and training schemes, and (2) they used a new set of datasets. These two categories of modifications help them improve performance on downstream tasks.
Background
This section gives an overview of BERT, since RoBERTa uses the same architecture. In short, BERT uses the transformer architecture[6] with two training objectives: masked language modelling (MLM) and next sentence prediction (NSP). The MLM objective randomly samples some of the tokens in the input sequence and replaces them with the special token [MASK]; the model then tries to predict these tokens based on the surrounding context. NSP is a binary classification loss that predicts whether two sentences follow each other in the original text. The network is trained with the Adam optimizer using specific hyperparameters and several datasets. Finally, the authors run experiments on evaluation benchmarks such as GLUE[7], SQuAD[7], and RACE[8] to report performance on those downstream tasks.
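To make the two objectives concrete, below is a minimal sketch (not the authors' code) of how an MLM head and an NSP head could be combined into a single pretraining loss; the tensor names and shapes are illustrative assumptions, and PyTorch is assumed to be available.

<pre>
import torch.nn.functional as F

# Illustrative shapes (not from the paper):
#   mlm_logits: (batch, seq_len, vocab_size) -- token predictions at every position
#   mlm_labels: (batch, seq_len)             -- original ids at masked positions, -100 elsewhere
#   nsp_logits: (batch, 2)                   -- "does sentence B follow sentence A?" scores
#   nsp_labels: (batch,)                     -- 1 if B follows A, else 0

def bert_style_pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels):
    # MLM term: cross-entropy only over masked positions (ignore_index skips the rest)
    mlm_loss = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,
    )
    # NSP term: binary classification over the sentence pair
    nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)
    # BERT optimizes the sum of the two terms; RoBERTa later drops the NSP term
    return mlm_loss + nsp_loss
</pre>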
Training Procedure Analysis
In this section, they elaborate on which choices are important for successfully pretraining BERT.
Static vs. Dynamic Masking
First, they discuss static vs. dynamic masking. As mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. In the original implementation of BERT, however, the sequences are masked just once, during preprocessing. This means that the same masking pattern is used for the same sequence in every training epoch.
In contrast, with dynamic masking a new masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking performs slightly better than static masking.
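The contrast can be illustrated with a small sketch (simplified, not the reference implementation; the token ids are made up, the 15% masking rate follows BERT's setup, and the 80/10/10 replacement rule is omitted): static masking fixes the masked positions once during preprocessing, while dynamic masking re-samples them every time a sequence is served.

<pre>
import random

MASK_ID = 103        # illustrative [MASK] token id
MASK_PROB = 0.15     # fraction of tokens selected for prediction, as in BERT

def mask_tokens(token_ids):
    """Randomly replace ~15% of tokens with [MASK]; return (inputs, labels)."""
    inputs, labels = [], []
    for tok in token_ids:
        if random.random() < MASK_PROB:
            inputs.append(MASK_ID)   # simplified: always substitute [MASK]
            labels.append(tok)       # the model must predict the original token here
        else:
            inputs.append(tok)
            labels.append(-100)      # position ignored by the MLM loss
    return inputs, labels

corpus = [[7, 42, 13, 99, 5], [8, 21, 64, 3]]

# Static masking: mask once during preprocessing, reuse the same pattern every epoch.
static_masked = [mask_tokens(seq) for seq in corpus]

# Dynamic masking: re-mask on the fly, so each epoch sees a different pattern.
for epoch in range(3):
    for seq in corpus:
        inputs, labels = mask_tokens(seq)   # new masking pattern each time
</pre>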
Input Representation and Next Sentence Prediction
The next thing they investigated was the necessity of the next sentence prediction objective. They compared several input formats to test whether the NSP loss can be eliminated during pretraining without hurting performance.
(1) Segment-Pair + NSP: Each input has a pair of segments (segments, not sentences), which come either from the same document or, with probability 0.5, from a different document, and the model is trained with the NSP objective, a binary classification similar in spirit to textual entailment / Natural Language Inference (NLI). The total combined length must be < 512 tokens (the maximum sequence length for BERT). This is the input representation used in the original BERT implementation.
(2) Sentence-Pair + NSP: Same as the segment-pair representation, but with pairs of natural sentences. Since the total length of these sequences is much less than 512 tokens, a larger batch size is used so that the number of tokens processed per training step is similar to that of the segment-pair representation.
(3) Full-Sentences: In this setting, no NSP loss is used. Input sequences consist of full sentences sampled contiguously from one or more documents; if one document ends, sentences from the next document are added, separated by an extra separator token, until the sequence length reaches at most 512 tokens (a rough sketch of this packing is given after the table discussion below).
(4) Doc-Sentences: Like Full-Sentences, this setting drops the NSP loss, but sequences are not allowed to cross document boundaries, i.e. once a document ends, no sentences from the next one are added. Because inputs sampled near the end of a document can be shorter than 512 tokens, the batch size is adjusted in these cases so that the total number of tokens per training step stays comparable to Full-Sentences.
The following table shows each setting's performance on the downstream tasks; the best results are achieved in the DOC-SENTENCES setting, with the NSP loss removed.
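As a rough sketch of the Full-Sentences packing described above (assumed behaviour, not the authors' code; the token ids and the helper name are made up for illustration), sentences are appended until the 512-token budget is reached, and an extra separator token is inserted when the packing crosses a document boundary:

<pre>
SEP_ID = 102     # illustrative separator token id
MAX_LEN = 512    # maximum sequence length

def pack_full_sentences(documents):
    """Pack tokenized sentences into sequences of at most MAX_LEN tokens.

    `documents` is a list of documents; each document is a list of sentences,
    and each sentence is a list of token ids. Sequences may cross document
    boundaries, with an extra separator token added between documents.
    (Sentences longer than MAX_LEN are not handled in this sketch.)
    """
    sequences, current = [], []
    for d, doc in enumerate(documents):
        for sent in doc:
            if len(current) + len(sent) > MAX_LEN:
                sequences.append(current)
                current = []
            current.extend(sent)
        # Document boundary: insert a separator before continuing with the next document.
        if d < len(documents) - 1 and current and len(current) < MAX_LEN:
            current.append(SEP_ID)
    if current:
        sequences.append(current)
    return sequences
</pre>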
Large Batch Sizes
The next thing they investigated was the importance of large batch sizes. They tried several batch sizes and found that a batch size of 2K gives the best performance. The table below shows their results for the different batch sizes.
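The paper trains with these large batches directly on many accelerators; when the hardware cannot fit a 2K-sequence batch, a comparable update is commonly approximated with gradient accumulation. A minimal PyTorch-style sketch under that assumption (not part of the paper; the model, optimizer, data loader, and loss function are placeholders):

<pre>
ACCUM_STEPS = 64   # e.g. 64 micro-batches of 32 sequences ~ an effective batch of 2K

def train_epoch(model, optimizer, data_loader, loss_fn):
    optimizer.zero_grad()
    for step, (inputs, labels) in enumerate(data_loader):
        loss = loss_fn(model(inputs), labels)
        # Scale the loss so the accumulated gradient matches one large-batch update.
        (loss / ACCUM_STEPS).backward()
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()        # one parameter update per "large" batch
            optimizer.zero_grad()
</pre>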
Tokenization
In RoBERTa, a byte-level Byte-Pair Encoding (BPE) vocabulary is used for tokenization, whereas BERT uses a character-level BPE vocabulary.
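For illustration, the two schemes can be compared with the HuggingFace transformers library (assuming it is installed; this snippet is not part of the paper):

<pre>
from transformers import BertTokenizer, RobertaTokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")    # character-level (WordPiece) vocabulary, ~30K entries
roberta_tok = RobertaTokenizer.from_pretrained("roberta-base")   # byte-level BPE vocabulary, ~50K entries

text = "RoBERTa uses byte-level BPE."
print(bert_tok.tokenize(text))      # character-based subwords; rare symbols can fall back to [UNK]
print(roberta_tok.tokenize(text))   # byte-based subwords; any input can be encoded without unknown tokens
</pre>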
RoBERTa
They claim that by applying all these modifications to BERT and pretraining the model on a larger dataset, they can achieve higher performance on downstream tasks. They used several different datasets for pretraining; a list of them is given below.
(1) BookCorpus + English Wikipedia (16GB): This is the data on which BERT was trained.
(2) CC-News (76GB): The authors collected this data from the English portion of the CommonCrawl News dataset. It contains 63M English news articles crawled between September 2016 and February 2019.
(3) OpenWebText (38GB): An open-source recreation of the WebText dataset used to train OpenAI GPT-2.
(4) Stories (31GB): A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.