ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
In this paper, the authors have made some changes to the BERT model and the result is ALBERT, a model that out-performs BERT on GLUE, SQuAD, and RACE benchmarks. The important point is that ALBERT has fewer number of parameters than BERT-large, but still it gets better results. The above mentioned changes are Factorized embedding parameterization and Cross-layer parameter sharing which are two methods of parameter reduction. They also introduced a new loss function and replaced it with one of the loss functions being used in BERT (i.e. NSP). The last change is removing dropout from the model.
In natural language representations, larger models often result in improved performance. However, at some point GPU/TPU memory and training time constraints limit our ability to increase the model size any further. There exist some attempts to reduce the memory consumption but at the cost of speed (look at Chen et al. (2016), Gomez et al. (2017), and also Raffel et al. (2019)). The authors of this paper claim that there parameter reduction techniques reduce memory consumption and increase training speed.
The fundamental structure of ALBERT is the same as BERT i.e. it uses transformer encoder with GELU nonlinearities. The authors set the feed-forward/filter size to be 4*H and the number of attention heads to be H/64 where H is the size of the hidden layer. Next we explain the changes the have been applied to the BERT.
Factorized embedding parameterization
In BERT (as well as subsequent models like XLNet and RoBERTa) we have E=H i.e. the size of the vocabulary embedding (E) and the size of the hidden layer (H) are tied together. This is not an efficient choice because we may need to have a large hidden layer but not a large vocabulary embedding layer. This is actually the case in many applications because the vocabulary embedding ‘E’ is meant to learn context-independent representations while the hidden-layer embedding ‘H’ is meant to learn context-dependent representation which usually is harder. However if we increase H and E together, it will result in a huge increase in the number of parameters because the size of the vocabulary embedding matrix is V×E where V is usually quite large (V is the size of the vocabulary and equals 30000 in both BERT and ALBERT). The authors proposed the following solution to the problem: Do not project one-hot vectors directly into hidden space, instead first project one-hot vectors into a lower dimensional space of size E and then project it to the hidden layer. This reduces embedding parameters from O(V×H) to O(V×E+E×H) which is significant when H is much larger than E.
Cross-layer parameter sharing
Another method the authors used for reducing the number of parameters is to share the parameters across layers. There are different strategies for parameter sharing for example one may only share feed-forward network parameters or only sharing attention parameters. However, the default choice for ALBERT is to simply share all parameters across layers. The following table shows the effect of different parameter sharing strategies in two setting for the vocabulary embedding size. As we can see in both cases, sharing all the parameters have a negative effect on the accuracy where most of this effect comes from sharing the FFN parameters not from sharing attention parameters. Given this, the authors have decided to share all the parameters across the layers which result in much smaller number of parameters which in turn enable them to have larger hidden layers and this is how they compensate what they have lost from parameter sharing.