STAT946F20/BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Presented by

Wenyu Shen

Introduction

This paper introduces the structure of the BERT model. BERT stands for Bidirectional Encoder Representations from Transformers, and this language model set new records on eleven natural language processing tasks.

Transformer and BERT

Let us start with the encoder-decoder framework. As introduced in class, the encoder-decoder model is applied to seq2seq problems: given an input sequence x, the model generates an output sequence y conditioned on x (as in machine translation or question answering). However, when an RNN or a similar model is used as the basic architecture of the encoder-decoder, performance tends to degrade when the input sequence is long. One remedy is the encoder-decoder with attention, which does not compress the whole input into a single context vector; going further, the paper Attention Is All You Need [1] introduces the Transformer, a framework that relies only on attention inside the encoder-decoder to perform machine translation. The Transformer uses Scaled Dot-Product Attention, applies a sequential (look-ahead) mask in the decoder, and typically performs Multi-Head Attention so that each token can gather features from different representation subspaces of the sentence. It also adds a positional encoding, with the same dimension as the word embedding, so that the model captures the order of the inputs. BERT is built by stacking N Transformer encoder units.

Table 1: Transformer Structure
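To make the attention mechanism concrete, below is a minimal NumPy sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, with an optional mask such as the sequential mask used in the decoder. The shapes, names, and toy data are illustrative only, not the paper's implementation.

import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) similarity scores
    if mask is not None:
        scores = np.where(mask, -1e9, scores)       # masked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                              # weighted sum of the value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                         # 4 tokens, d_k = d_v = 8
print(scaled_dot_product_attention(x, x, x).shape)  # self-attention -> (4, 8)

Multi-head attention simply runs several such attention functions in parallel on learned projections of Q, K, and V and concatenates the results.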

BERT

BERT works well under both the feature-based and the fine-tuning approach. Both approaches begin with unsupervised pre-training on a large unlabeled corpus. In the feature-based approach, the pre-trained parameters are kept fixed and the pre-trained representations are used as additional features when a task-specific model is trained on labeled data; in the fine-tuning approach, all parameters are tuned while training on the downstream task. BERT follows the fine-tuning approach. A standard Transformer language model reads text in left-to-right order, so BERT instead uses a masked language model (MLM) objective to pre-train deep bidirectional Transformers. BERT also performs a Next Sentence Prediction task so that the model learns the relationship between sentence pairs. The input/output representation combines Token Embeddings, Segment Embeddings, and Position Embeddings, which lets BERT handle a variety of downstream tasks. Finally, the randomly selected tokens in MLM are not always replaced by the [MASK] token; this mitigates the mismatch between pre-training and fine-tuning, since [MASK] never appears during fine-tuning. A sketch of this masking procedure is given below.
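The following is a minimal sketch of the MLM corruption step described in the paper: roughly 15% of token positions are chosen as prediction targets, and of these, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The token ids, MASK_ID, and VOCAB_SIZE below are placeholders for illustration.

import random

MASK_ID = 103          # hypothetical id of the [MASK] token
VOCAB_SIZE = 30522     # hypothetical vocabulary size

def mask_tokens(token_ids, mask_prob=0.15):
    # Select ~15% of positions as MLM prediction targets. Of the selected tokens,
    # 80% become [MASK], 10% become a random token, and 10% keep their original id,
    # so the model cannot rely on [MASK] always marking the positions to predict.
    inputs, labels = list(token_ids), [None] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                                   # predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                           # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)      # 10%: replace with a random token
            # remaining 10%: keep the original token
    return inputs, labels

corrupted, targets = mask_tokens([2023, 2003, 1037, 7099, 6251])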

Table 2: Token embedding
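As a complement to Table 2, here is a small NumPy sketch of how the input representation is formed: every position is the element-wise sum of its token, segment, and position embeddings. The randomly initialized tables and example token ids are placeholders; BERT-Base uses learned embedding tables with hidden size 768.

import numpy as np

VOCAB, SEGMENTS, MAX_POS, HIDDEN = 30522, 2, 512, 768   # BERT-Base-like sizes
rng = np.random.default_rng(0)
token_emb = rng.normal(size=(VOCAB, HIDDEN))       # placeholder for the learned token table
segment_emb = rng.normal(size=(SEGMENTS, HIDDEN))  # segment A vs. segment B
position_emb = rng.normal(size=(MAX_POS, HIDDEN))  # learned position embeddings

def bert_input_representation(token_ids, segment_ids):
    # Each position's input vector is the sum: token + segment + position embedding.
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]

# "[CLS] sentence A [SEP] sentence B [SEP]"-style input: segment 0 for A, 1 for B.
x = bert_input_representation(np.array([101, 2023, 102, 2008, 102]),
                              np.array([0, 0, 0, 1, 1]))
print(x.shape)   # (5, 768)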

Conclusion

In summary, BERT is a powerful model pre-trained on a large amount of unlabeled data, and it is especially useful when we want to perform NLP tasks with only a small amount of labeled data.


Table 3: Performance of BERT in multiple datasets


References

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. "Attention Is All You Need". (2017)

[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". (2019)