ALBERT: A Lite BERT for Self-supervised Learning of Language Representations


Presented by

Maziar Dadbin


In this paper, the authors make several changes to the BERT model; the result is ALBERT, a model that outperforms BERT on the GLUE, SQuAD, and RACE benchmarks while having fewer parameters than BERT-large. The changes are factorized embedding parameterization and cross-layer parameter sharing, which are two parameter-reduction techniques. They also introduce a new self-supervised loss, sentence-order prediction (SOP), which replaces the next-sentence prediction (NSP) loss used in BERT. The last change is removing dropout from the model.


Model details

Factorized embedding parameterization
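
In BERT, the WordPiece embedding size E is tied to the hidden layer size H, so the vocabulary embedding table costs V×H parameters (V is the vocabulary size). ALBERT unties them: token ids are first mapped into a small embedding space of size E and then projected up to H, reducing the cost from O(V×H) to O(V×E + E×H), a large saving when H >> E. Below is a minimal PyTorch sketch of the idea (the class name is illustrative; the example sizes follow the base configuration in the paper, V=30000, E=128, H=768):

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Factorize the V x H embedding matrix into V x E and E x H.

    With vocabulary size V, hidden size H, and a small embedding size
    E << H, parameters drop from O(V*H) to O(V*E + E*H).
    """
    def __init__(self, vocab_size: int, embed_size: int, hidden_size: int):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embed_size)  # V x E
        self.projection = nn.Linear(embed_size, hidden_size)         # E x H

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.projection(self.word_embeddings(token_ids))

# Untied: V=30000, H=768 costs ~23.0M embedding parameters; factorizing
# with E=128 costs ~3.9M (30000*128 + 128*768).
emb = FactorizedEmbedding(vocab_size=30000, embed_size=128, hidden_size=768)
```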

Cross-layer parameter sharing
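
Instead of learning separate weights for each of the L transformer layers, ALBERT shares all parameters (both the attention and the feed-forward sub-layers) across layers, so the encoder's parameter count no longer grows with depth. A minimal sketch of this setting, using PyTorch's stock TransformerEncoderLayer as a stand-in for ALBERT's layer (names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Apply one transformer layer repeatedly instead of stacking
    L independently parameterized layers (the all-shared setting)."""
    def __init__(self, hidden_size: int = 768, num_heads: int = 12,
                 ffn_size: int = 3072, num_layers: int = 12):
        super().__init__()
        # A single set of weights, reused at every depth.
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads,
            dim_feedforward=ffn_size, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.num_layers):  # same parameters on each pass
            x = self.layer(x)
        return x
```

Here the encoder costs the same number of parameters as a single layer regardless of num_layers; the authors also observe that sharing stabilizes the network's parameters across depth.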

Inter-sentence coherence loss
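
BERT's NSP loss asks whether two segments appear consecutively, with negatives drawn from a different document; since a topic shift alone often reveals the answer, NSP largely reduces to topic prediction. ALBERT's sentence-order prediction (SOP) loss keeps the positives (two consecutive segments from the same document) but builds negatives by swapping the order of the same two segments, forcing the model to learn discourse-level coherence. A small sketch of how such training pairs could be built (the helper function is hypothetical, not from the paper's code):

```python
import random

def make_sop_example(seg_a: str, seg_b: str):
    """Build a sentence-order prediction (SOP) training example.

    seg_a and seg_b must be consecutive segments from the same document.
    Label 1: original order (coherent); label 0: swapped order.
    Unlike NSP, negatives come from the same document, so the model
    cannot solve the task from topic cues alone.
    """
    if random.random() < 0.5:
        return seg_a, seg_b, 1  # positive: segments in original order
    return seg_b, seg_a, 0      # negative: same segments, order swapped

pair = make_sop_example("The cat sat on the mat.", "Then it fell asleep.")
```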


Removing dropout
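
The authors report that even their largest models do not overfit after a million training steps, so they remove dropout entirely, which further improves masked-language-model accuracy. In a standard PyTorch layer this amounts to setting the dropout probability to zero; a one-line sketch continuing the encoder example above (the sizes are illustrative):

```python
import torch.nn as nn

# Disabling dropout in the encoder layer from the sketch above
# (dropout=0.0 turns off dropout inside the attention and
# feed-forward sub-layers of TransformerEncoderLayer).
layer_no_dropout = nn.TransformerEncoderLayer(
    d_model=768, nhead=12, dim_feedforward=3072,
    dropout=0.0, batch_first=True)
```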