Bag of Tricks for Efficient Text Classification
Neural network have been utilized more recently for Text-Classifications and demosntated very good performances. However, it is slow at both training and testing time, therefore limiting their usage for very large dataset. The authors suggests that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data set while maintaining its good performances. The basis of the analysis for this paper were the approach of fastText on the two tasks: tag predictions, and sentiment analysis. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”
Model Architecture of fastText
- PLACE HOLDER FOR IMAGE FROM ARTICLE*
Linear classifier is limited by its inability to share parameters among features and classes. As a result, classes with very few examples (low frequency) will often get classified in a large output field. The model in the paper is built on top of a linear model with a rank constraint and a fast loss approximation.
Each N represents a seperate N-th gram features in the sentence. This feature will be explained in a coming section.
Softmax and Hierarchy Softmax
Softmax function f is used to compute the probability density over the predefined classes. The softmax output layer with log-likelihood is given in the article as:
- PLACE HOLDER for log-likelihood error function found in article*
In this formula. A and B are weight matrix which will be calculated in the training set. Xn is the normalizefeature of the n-th documentation. Y n is the label.
Remark: Negatively log-likelihood is a multiclass cross-entropy. What this means is that for a binary problem (dog or not dog), it will output two values between [0,1] where the sum of the two values equates to 1. (Dog = 0.6, Cat = 0.4). This can further be expanded into larger dimensions. In contrast, sigmoid outputs one value and in the binary case, the other value can be derived via 1 - p.
Softmax will have a complexity of O(kh) where k is the number of classes and h is the number of dimensions of text representation. The function that the authors used for their model was a variation of the softmax function, known as Hiearchy Softmax. The hiearchy softmax is based on the Huffman Coding Tree and will reduce complexity to O(H*log2(k)).