Roberta: Difference between revisions

From statwiki
Jump to navigation Jump to search
No edit summary
No edit summary
Line 3: Line 3:


== Introduction ==
== Introduction ==
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta is a replication of BERT pretraining which is trying to investigate the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which is trying to investigate the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.


== Background ==
== Background ==
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture with 2 training objectives, they uses maskes language modeling(MLM) and next sentence prediction(NSP) as their objectives. In the MLM objectives
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP

Revision as of 18:03, 29 November 2020

Presented by

Danial Maleki

Introduction

Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which is trying to investigate the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.

Background

In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP