Convolutional Neural Networks for Sentence Classification
This is a summary of the paper, Convolutional Neural Networks for Sentence Classification by Yoon Kim [1].
Presented by
- M.Bayati
- S.Malekmohammadi
- V.Luong
Introduction
In this paper, sentence classification using convolutional neural networks is studied. Each sentence is encoded by concatenation of the word vectors of its words and the encoded representation is fed to a model consisting of a convolutional layer followed by a dense layer for doing the classification task. Different variants of this model have been introduced by the authors, two of which try to learn task-specific word vectors for words. It is observed that learning task-specific vectors (instead of using the pre-trained vectors without any change) offers further gains in performance.
The Used Model and Results
Using neural models to learn vector representation for words is one of the most important contributions within natural language processing. The vector representations are obtained from projecting 1-hot representation of words (a sparse representation) onto a lower dimensional space. Continuous Bag of Words (COBW) neural language model, trained by Mikolov et al. (2013), is one of the unsupervised algorithms providing such a low dimensional vector representation in which semantic features of words are encoded. Having the low-dimensional vector representations, one can feed them to different models for doing different tasks. For instance, they can be fed to CNNs for document or sentence classification. The vector representations used in this paper are obtained from the CBOW.
The model that the authors use constitutes of a convolutional layer followed by a dense layer. In the convolutional layer, there are $m$ different filters (kernels) each resulting in one feature map. The resulting $m$ feature maps form the penultimate layer which is passed to a fully connected softmax layer, as shown in the following:
The authors introduce several variants of the model that are briefly explained in the following:
They first consider a baseline model, which is called CNN-rand. In this variant, they do not use any pretrained vector representation for words. All words are assigned a random vector representation and the assigned vectors are fed to the model. The random vectors get modified during training. The authors observed that the variant does not perform well compared to the other variants. \newline
In the second variant of the model, which is called CNN-static, they use the pretrained word2vec vectors. They keep the pretrained vectors static; i.e., during training, the vectors do not change and only the other parameters of the model (edge weights and kernels) get learned. They observed that this simple model achieves excellent results on multiple benchmarks. Note that the used word vectors are pre-trained (regardless of the given classification task and its data set) and the model achieves excellent results when using them, while when feeding another set of publicly available word vectors (trained by Collobert et al. (2011) on Wikipedia) to the same model, the performance of the model is not as good as when word2vec word vectors are used. Based on this observation, the authors stated that the pre-trained word2vec vectors are good encoded representation for words and they can be utilized for different classification tasks.
The third variant that the authors consider is called CNN-non-static. Here is the first time that the authors try to learn task-specific word vectors for words. In the model, they use the pre-trained word2vec vectors to initialize the task-specific vectors, but after getting initialized, the vectors get fine-tuned during training via backpropagation. They observed that learning task-specific vectors through fine-tuning results in further improvements compared the case when the pre-trained word2vec vectors are used without any change. \newline
In the fourth variant introduced for the model (CNN-multichannel), the authors add a simple modification to the structure of the model to make it capable of using both the pre-trained and the task-specific vectors. What they have done, is adding another channel of inputs to the model structure as shown in the following:
In this model, as shown, there are two sets (channels) of word vectors. The first one is the pre-trained word2vec vectors that are static and do not change during training. The other channel is initialized with the pretrained word2vec vectors, but it gets fine-tuned during training via backpropagation. They observed that this model, similar to CNN-non-static, results in further improvement of the model when doing classification tasks on different data sets.
Conclusion
In both the CNN-non-static and the CNN-multichannel, they observed that the models are able to fine-tune the word vectors to make them more specific for each given task. For example, they observed that in word2vec, the most similar word to “good” is “great”, while “nice” seems closer to that as long as the goal is expressing sentiment. This can be observed in the learned vectors reflected by CNN-non-static and CNN-multichannel: for the word vectors in CNN-non-static and those in the second channel of CNN-multichannel, the most similar word to “good” is “nice”. So fine-tuning allows the model to learn more meaningful representation for words depending on the task in hand. This can be counted as the most important contribution of the paper, which adds improvement to the performance of the model compared to when it uses the pre-trained word2vec vectors in a static way and regardless of the given task.
References
- [1]Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). doi:10.3115/v1/d14-1181.