RoBERTa
Presented by
Danial Maleki
Introduction
Self-training methods in Natural Language Processing (NLP), such as ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5], have shown significant improvements, but it is challenging to determine which parts of these methods contribute most to their performance. RoBERTa is a replication study of BERT pretraining that investigates the effects of hyperparameter tuning and training set size. In summary, the authors (1) modified some of BERT's design choices and training schemes and (2) trained on a new set of datasets. These two categories of modifications helped improve performance on downstream tasks.
Background
This section gives an overview of BERT, since RoBERTa builds on the same architecture. In short, BERT uses the Transformer architecture[6] with two training objectives: masked language modelling (MLM) and next sentence prediction (NSP). The MLM objective randomly samples some of the tokens in the input sequence and replaces them with the special token [MASK]; the model is then trained to predict these tokens based on the surrounding context. NSP is a binary classification loss that predicts whether two sentences follow each other in the original text. BERT is trained with the Adam optimizer under specific hyperparameter settings and on several different datasets. Finally, the authors evaluate performance on downstream tasks such as GLUE[7], SQuAD[7], and RACE[8].
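To make the MLM objective concrete, below is a minimal sketch of BERT-style input corruption in Python. The 15% masking rate and the 80/10/10 split between [MASK], random-token, and keep-original replacements follow the published BERT recipe; the function name mask_tokens and the toy vocabulary are illustrative assumptions, not the authors' code.

import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels); labels are None where no prediction is needed."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                       # model must recover the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the original token
        else:
            corrupted.append(tok)
            labels.append(None)                      # position excluded from the MLM loss
    return corrupted, labels

vocab = ["the", "dog", "cat", "sat", "on", "mat", "quick", "brown"]
print(mask_tokens("the cat sat on the mat".split(), vocab, seed=0))

Only the positions with non-None labels contribute to the MLM loss; everything else is passed through unchanged.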
Training Procedure Analysis
In this section, the authors examine which choices are important for successfully pretraining BERT. First, they compare static and dynamic masking. As mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens of each sequence at random and then predicts them. However, in the original BERT implementation, sequences are masked only once during preprocessing, which means the same masking pattern is reused for a given sequence in every training epoch. In contrast, with dynamic masking a new masking pattern is generated every time a sequence is fed to the model.
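The difference between the two strategies can be illustrated with a small sketch. Here static masking corrupts each sequence once up front and reuses the result every epoch, while dynamic masking redraws the pattern each time the sequence is used; the simplified random_mask helper (which omits the 80/10/10 refinement) and the toy sentence are assumptions for illustration only, not the paper's implementation.

import random

MASK = "[MASK]"

def random_mask(tokens, rng, p=0.15):
    # Simplified corruption: swap roughly 15% of tokens for [MASK]
    # (the 80/10/10 refinement is omitted to keep the contrast clear).
    return [MASK if rng.random() < p else t for t in tokens]

sequence = "the quick brown fox jumps over the lazy dog".split()

# Static masking (original BERT): the pattern is drawn once during preprocessing
# and the same corrupted sequence is fed to the model in every epoch.
static_example = random_mask(sequence, random.Random(0))
static_epochs = [static_example for _ in range(3)]

# Dynamic masking: a fresh pattern is drawn each time the sequence is used,
# so successive epochs see different corruptions of the same sentence.
dynamic_rng = random.Random(0)
dynamic_epochs = [random_mask(sequence, dynamic_rng) for _ in range(3)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs), start=1):
    print(f"epoch {epoch} static : {' '.join(s)}")
    print(f"epoch {epoch} dynamic: {' '.join(d)}")

Running the sketch shows identical static lines across epochs but varying dynamic lines, which is exactly the property that lets dynamic masking expose the model to more masking patterns over long training runs.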