ALBERT: A Lite BERT for Self-supervised Learning of Language Representations


Presented by

Maziar Dadbin

Introduction

In this paper, the authors make some changes to the BERT model, and the result is ALBERT, a model that outperforms BERT on the GLUE, SQuAD, and RACE benchmarks. The important point is that ALBERT has fewer parameters than BERT-large, yet it still achieves better results. The aforementioned changes are factorized embedding parameterization and cross-layer parameter sharing, which are two methods of parameter reduction. The authors also introduce a new loss function that replaces one of the loss functions used in BERT, namely next-sentence prediction (NSP). The last change is removing dropout from the model.


Motivation

In natural language representation, larger models often result in improved performance. However, at some point GPU/TPU memory and training time constraints limit our ability to increase the model size any further. There have been attempts to reduce memory consumption, but they come at the cost of speed (see Chen et al. (2016), Gomez et al. (2017), and Raffel et al. (2019)). The authors of this paper claim that their parameter reduction techniques reduce memory consumption and increase training speed.



Model details

The fundamental structure of ALBERT is the same as BERT, i.e. it uses a transformer encoder with GELU nonlinearities. The authors set the feed-forward/filter size to 4H and the number of attention heads to H/64, where H is the size of the hidden layer. Below we explain the changes they have made to BERT.
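As an illustration, the following minimal Python sketch (not the authors' code; the value of H is an assumed example) shows how the feed-forward size and the number of attention heads follow from the hidden size under these rules.

hidden_size = 768                        # H: size of the hidden layer (example value)
feed_forward_size = 4 * hidden_size      # feed-forward/filter size = 4H -> 3072
num_attention_heads = hidden_size // 64  # number of heads = H/64 -> 12
print(feed_forward_size, num_attention_heads)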


Factorized embedding parameterization

In BERT (as well as subsequent models like XLNet and RoBERTa) we have E = H, i.e. the size of the vocabulary embedding (E) and the size of the hidden layer (H) are tied together. This is not an efficient choice, because we may need a large hidden layer but not a large vocabulary embedding. This is actually the case in many applications, because the vocabulary embedding E is meant to learn context-independent representations, while the hidden-layer embedding H is meant to learn context-dependent representations, which is usually harder. However, if we increase H and E together, the number of parameters grows dramatically, because the size of the vocabulary embedding matrix is V×E, where V is the vocabulary size and is usually quite large (30,000 in both BERT and ALBERT). The authors proposed the following solution: instead of projecting one-hot vectors directly into the hidden space, first project them into a lower-dimensional space of size E and then project that into the hidden layer. This reduces the embedding parameters from O(V×H) to O(V×E + E×H), which is significant when H is much larger than E.
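Below is a minimal sketch of this factorization, assuming PyTorch; V, E, and H are example values consistent with ALBERT-base (V = 30,000, E = 128, H = 768), and the parameter counts printed at the end illustrate the reduction from V×H to V×E + E×H.

import torch.nn as nn

V, E, H = 30000, 128, 768

# BERT-style embedding: project the one-hot vocabulary directly into the hidden space
# (V x H parameters).
bert_embedding = nn.Embedding(V, H)

# ALBERT-style factorized embedding: vocabulary -> size-E space, then a linear
# projection E -> H (V x E + E x H parameters).
albert_embedding = nn.Sequential(
    nn.Embedding(V, E),
    nn.Linear(E, H, bias=False),
)

def num_params(module):
    return sum(p.numel() for p in module.parameters())

print(num_params(bert_embedding))    # 30000 * 768             = 23,040,000
print(num_params(albert_embedding))  # 30000 * 128 + 128 * 768 =  3,938,304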



Cross-layer parameter sharing

Inter-sentence coherence loss

Removing dropout