Bag of Tricks for Efficient Text Classification

From statwiki
Revision as of 22:33, 19 March 2018 by Ashchow (talk | contribs) (N-Gram, Bag of Words, and TFIDF)

Introduction and Motivation

Neural networks have recently been applied to text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their use on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, inexpensive in both training and test time, can approximate the performance of these more complex neural networks.

The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to scale to very large datasets while maintaining good performance. The analysis in this paper applies the classifier fastText to two tasks, tag prediction and sentiment analysis, and compares its performance and efficiency with other text classifiers. The paper claims that this method “can train on a billion words within ten minutes, while achieving performance on par with the state-of-the-art.”


  • PLACEHOLDER: we should look at when this

Natural-Language Processing

  • Briefly describe the difference between NLP and text-mining. Maybe comment later about whether fastText accomplishes NLP.

  • might want to briefly mention types of models that are used in the experiment for comparison


Model Architecture of fastText

model image.png

A simple and efficient baseline for sentence classification is to represent sentences as a bag of words (BoW) and train a linear classifier, such as a logistic regression or a support vector machine (SVM).

Linear classifiers are limited by their inability to share parameters among features and classes. As a result, they may generalize poorly when the output space is large and some classes have very few examples (low frequency). The most common solutions to this problem are to either factorize the linear classifier into low-rank matrices or to use multi-layer neural networks.

The model in the paper is built on top of a linear model with a rank constraint and a fast loss approximation. The image above illustrates a simple linear model with a rank constraint.

Each input [math] x_1, \dots, x_N [/math] represents a separate n-gram feature of the sentence; the features are embedded and averaged to form the hidden representation. N-gram features are explained in a later section.

Softmax and Hierarchical Softmax

The softmax function [math] f [/math] is used to compute the probability distribution over the predefined classes. The negative log-likelihood with a softmax output layer is given in the article as:

  • [math] - \frac{1}{N} \sum_{n=1}^N y_n \log ( f(BAx_n))[/math]

In this formula, [math] A [/math] and [math] B [/math] are the weight matrices, which are learned from the training set, [math] x_n [/math] is the normalized bag of features of the [math] n [/math]-th document, and [math] y_n [/math] is its label.

Remark: The negative log-likelihood is the multiclass cross-entropy. For a binary problem (dog or not dog), softmax outputs two values in [0,1] that sum to 1 (e.g. dog = 0.6, not dog = 0.4). This extends directly to more classes. In contrast, a sigmoid outputs a single value p, and in the binary case the other value can be derived as 1 - p.
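The loss above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the matrix sizes and values are made up, the label y_n is stored as a class index rather than a one-hot vector, and the helper names (softmax, matvec, predict, negative_log_likelihood) are hypothetical, not taken from the paper or the fastText library.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def predict(A, B, x):
    """Class probabilities f(B A x) for one normalized feature vector x."""
    hidden = matvec(A, x)  # h-dimensional text representation
    return softmax(matvec(B, hidden))

def negative_log_likelihood(A, B, xs, labels):
    """Average of -log f(B A x_n)[y_n] over all documents."""
    total = 0.0
    for x, y in zip(xs, labels):
        total -= math.log(predict(A, B, x)[y])
    return total / len(xs)

# Hypothetical toy sizes: 3 input features, h = 2 hidden units, k = 2 classes.
A = [[0.5, -0.2, 0.1],
     [0.0, 0.3, -0.4]]
B = [[1.0, -1.0],
     [-1.0, 1.0]]
xs = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]  # normalized feature vectors
labels = [0, 1]                           # class index of each document

probs = predict(A, B, xs[0])
loss = negative_log_likelihood(A, B, xs, labels)
```

Because A has only h rows, every class shares the same low-dimensional text representation, which is the rank constraint mentioned in the architecture section.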

Softmax has a complexity of O(kh), where k is the number of classes and h is the dimension of the text representation. The authors instead use a variation of the softmax function known as hierarchical softmax. Hierarchical softmax is based on a Huffman coding tree and reduces the complexity to O(h log2(k)).
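To see why a Huffman tree helps, the sketch below builds one from class frequencies and reports the depth of each class leaf. Scoring a class then touches one node per level instead of all k classes. The class names and counts are hypothetical, and huffman_depths is an illustrative helper, not the fastText implementation.

```python
import heapq
import itertools

def huffman_depths(class_counts):
    """Return {class: depth of its leaf} in a Huffman tree built from counts."""
    tie = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(count, next(tie), {label: 0}) for label, count in class_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every leaf below moves one level down.
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        merged = {label: depth + 1 for label, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, next(tie), merged))
    return heap[0][2]

# Hypothetical class frequencies.
counts = {"sports": 50, "politics": 30, "tech": 10, "food": 5, "travel": 5}
depths = huffman_depths(counts)

# Average number of tree nodes visited per training example.
expected_depth = sum(counts[c] * depths[c] for c in counts) / sum(counts.values())
```

Frequent classes end up near the root, so the average path length is shorter than in a balanced tree of the same size.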

N-Gram, Bag of Words, and TFIDF

Bag of Words

Bag of words is a method for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used to prepare text for input. Each vector of word counts is normalized so that its elements add up to one (that is, each count is replaced by the word's relative frequency). If these frequencies exceed a certain level, they activate nodes in the neural network and influence the classification.

The main weakness of bag of words is that it loses information, because it considers single words only and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the training set does not cover the entire dictionary of the testing set.
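A minimal sketch of the procedure described above, assuming whitespace tokenization and lowercasing (neither is specified in the text). The corpus and the function names build_vocabulary and bag_of_words are hypothetical.

```python
from collections import Counter

def build_vocabulary(train_docs, n):
    """Keep the n most frequent words of the training documents."""
    counts = Counter(word for doc in train_docs for word in doc.lower().split())
    return [word for word, _ in counts.most_common(n)]

def bag_of_words(doc, vocabulary):
    """Normalized count vector; words outside the vocabulary are dropped."""
    counts = Counter(doc.lower().split())
    raw = [counts[word] for word in vocabulary]
    total = sum(raw)
    return [c / total for c in raw] if total else [0.0] * len(vocabulary)

train = ["the cat sat on the mat", "the dog sat"]  # hypothetical training corpus
vocab = build_vocabulary(train, n=2)               # ['the', 'sat']
vec = bag_of_words("the cat sat", vocab)           # [0.5, 0.5]
unseen = bag_of_words("a bird flew", vocab)        # [0.0, 0.0]
```

The last call illustrates the weakness noted above: a test document whose words are missing from the training dictionary yields an all-zero vector and carries no information for the classifier.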


N-Gram

N-gram is another model for simplifying text representation: it stores the n words adjacent to the initial word (or characters, since N-grams can also be character based). Each word in the document is read one at a time, just as in bag of words, but a certain range of its neighbours is scanned as well. This range is known as the n-gram. Compared to bag of words, any N over 1 (N = 1 being the unigram) will contain more information.

The picture above gives an example of a unigram (1-gram), the simplest possible version of this model. A unigram model does not consider previous words and simply chooses words at random according to how common they are in general. It also shows a bigram (2-gram), where the previous word is considered. A trigram considers the previous two words, and so on, up to N-grams, which consider the N-1 previous words.

Let a sentence be denoted as a sequence of words from word 1 to word n: 1aaa.png. By the properties of probability, we can model the probability of this word sequence with a bigram model as 2aaa.png. For example, take the sentence "How long can this go on?" We can model it as follows:

         P("How long can this go on?") = P(How) P(long | How) P(can | long) P(this | can) P(go | this) P(on | go) P(? | on)

Going back to the chain rule of probability, we can reduce the above equation to a product of conditional probabilities, as follows for the bigram case: 3aaa.png.

We can generalize this to the N-gram case as 4aaa.png.
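The bigram factorization can be tried out on a toy corpus. The sketch below assumes the equation images show the standard bigram approximation [math] P(w_1 \dots w_n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1}) [/math] and estimates each factor by relative frequency (maximum likelihood). The corpus, the start marker and the function names are hypothetical, and punctuation is dropped for simplicity.

```python
from collections import Counter

START = "<s>"  # hypothetical sentence-start marker, so the first word has a context

def train_bigram(corpus):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = [START] + sentence.split()
        contexts.update(tokens[:-1])           # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams

def sentence_probability(sentence, contexts, bigrams):
    """Product of P(w_k | w_{k-1}) = count(w_{k-1}, w_k) / count(w_{k-1})."""
    tokens = [START] + sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0  # context never seen in training
        prob *= bigrams[(prev, word)] / contexts[prev]
    return prob

corpus = ["how long can this go on", "how long is this", "how can this be"]
contexts, bigrams = train_bigram(corpus)
p_seen = sentence_probability("how long can this go on", contexts, bigrams)
p_unseen = sentence_probability("on how", contexts, bigrams)
```

A sentence containing a bigram that never occurred in training gets probability zero, which is why practical n-gram models add smoothing.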

The weakness of N-grams is that local context often does not provide any useful predictive clues. For example, suppose you want the model to learn the singular and plural agreement in the following sentences:

         The woman who lives on the fifth floor of the apartment is pretty.
         The women who live on the fifth floor of the apartment are pretty.

You would need an 11-gram, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.

BoW, Unigram, Bigram Example

Consider the following three sentences:

A = “I love apple”

B = “apple love I”

C = “I love sentence”

Caption: Unigram.
Term A B C
I 1 1 1
love 1 1 1
apple 1 1 0
sentence 0 0 1

Notice that A and B have the same vector. This is exactly the bag-of-words behaviour and illustrates the aforementioned problem: word order does not matter!

Caption: Bigram.
Term A B C
I love 1 0 1
love apple 1 0 0
apple love 0 1 0
love I 0 1 0
love sentence 0 0 1

Notice that A and B are now distinct, because the bigram takes one neighbouring word into account. However, A and C still share an element, namely "I love". If we were to increase N further, it would become easier to distinguish between the two. However, the consequence of using a higher N is that the run time will increase.
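The count vectors for the three sentences can be reproduced with a short sketch. It assumes whitespace tokenization; the helper names ngrams and count_vectors are hypothetical, and features are sorted alphabetically, so the row order differs from the tables.

```python
def ngrams(sentence, n):
    """List the n-grams (space-joined) of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_vectors(sentences, n):
    """Return (features, vectors): one count vector per sentence over all n-grams seen."""
    grams = [ngrams(s, n) for s in sentences]
    features = sorted({g for gs in grams for g in gs})
    vectors = [[gs.count(f) for f in features] for gs in grams]
    return features, vectors

sent_a, sent_b, sent_c = "I love apple", "apple love I", "I love sentence"

uni_features, uni_vecs = count_vectors([sent_a, sent_b, sent_c], 1)
bi_features, bi_vecs = count_vectors([sent_a, sent_b, sent_c], 2)

same_as_unigrams = uni_vecs[0] == uni_vecs[1]  # True: word order is lost
same_as_bigrams = bi_vecs[0] == bi_vecs[1]     # False: bigrams keep local order
```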

Feature Hashing

Feature hashing, also known as the hashing trick, can be used in sentence classification. It maps words to indices of a fixed-size array or matrix by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, which reduces the dimension of the input vector and therefore the cost of storing large vectors.

A hash function is any function that maps keys of arbitrary size to values in a fixed range. For example, consider a hash function [math] h [/math] that maps each feature to the index of the corresponding dictionary key:

Key Index
I 0
love 1
hate 2
cats 3
dogs 4
but 5
Mary 6

In this case, [math] h(\text{"cats"}) = 3 [/math]. Consider the sentence [math] \text{"I love cats, but Mary hate cats"} [/math], which we will map to a hash table of length 7. After tokenizing it, we have the list of all words in the sentence, [math] x = ["I", "love", "cats", "but", "Mary", "hate", "cats"] [/math]. The hashed feature map [math] \phi [/math] is calculated by

[math] \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 [/math], where [math] i [/math] is the corresponding index of the hashed feature map.

By applying the hash function to each word of this sentence, we get the list of indices [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map (index in the first row, count in the second) is

0 1 2 3 4 5 6
1 1 1 2 0 1 1

There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.
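The worked example above can be checked with a short sketch. The dictionary INDEX simply encodes the toy hash function from the table; it stands in for a real hash function and is not how a production system would compute h.

```python
# Toy hash function taken from the table above: each word maps to a fixed index.
INDEX = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 6}

def h(word):
    """Look up the index of a word in the toy table."""
    return INDEX[word]

def hashed_feature_map(words, hash_fn, size):
    """phi_i = number of words whose hash equals i."""
    phi = [0] * size
    for word in words:
        phi[hash_fn(word) % size] += 1
    return phi

x = ["I", "love", "cats", "but", "Mary", "hate", "cats"]
indices = [h(w) for w in x]             # [0, 1, 3, 5, 6, 2, 3]
phi = hashed_feature_map(x, h, size=7)  # [1, 1, 1, 2, 0, 1, 1]
```

In practice no dictionary is stored: h would be a general-purpose hash of the word's bytes taken modulo the table size, which is what keeps memory use fixed regardless of vocabulary size.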

A hash collision happens when two distinct keys are mapped to the same index. In the example above, suppose both "Mary" and "I" are mapped to index 0. The hashed feature map then becomes:

0 1 2 3 4 5 6
2 1 1 2 0 1 0

In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in Weinberger et al., 2009, which introduces a second hash function [math] \xi [/math] to determine the sign of each feature's contribution. The hashed feature map [math] \phi [/math] now becomes

[math] \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 [/math]

If [math] \xi("I") = 1 \text{ and } \xi("Mary") = -1 [/math], the signed hashed feature map becomes:

0 1 2 3 4 5 6
0 1 1 2 0 1 0

Ideally, collisions will "cancel out", and therefore, achieve an unbiased estimate.
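The collision example and its signed variant can be reproduced as follows. Both lookup tables are toy stand-ins for real hash functions: INDEX encodes the colliding h from the example, and SIGN encodes a hypothetical [math] \xi [/math] that returns -1 for "Mary" and +1 for every other word.

```python
def signed_feature_map(words, hash_fn, sign_fn, size):
    """phi_i = sum of sign_fn(word) over the words whose hash equals i."""
    phi = [0] * size
    for word in words:
        phi[hash_fn(word) % size] += sign_fn(word)
    return phi

# Colliding toy hash: "Mary" and "I" both map to index 0.
INDEX = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 0}
# Hypothetical sign function: -1 for "Mary", +1 for every other word.
SIGN = {"Mary": -1}

x = ["I", "love", "cats", "but", "Mary", "hate", "cats"]

# Unsigned map: the collision inflates index 0 to a count of 2.
unsigned = signed_feature_map(x, INDEX.__getitem__, lambda w: 1, size=7)
# Signed map: the two colliding words cancel at index 0.
signed = signed_feature_map(x, INDEX.__getitem__, lambda w: SIGN.get(w, 1), size=7)
```

With a sign function that is +1 or -1 with equal probability, colliding features cancel on average rather than always adding up, which is the sense in which the estimate is unbiased.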


TFIDF

For normal N-grams, word counts are used as features. Another way to represent the features is TFIDF, short for term frequency–inverse document frequency, which represents the importance of a word to a document.

Term frequency (TF) measures the number of times a word occurs in a document. Inverse document frequency (IDF) can be considered an adjustment to the term frequency, so that a word is not deemed important if it is common across documents in general, for example, "the".

TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as [math]\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)[/math]

In this paper, TFIDF is calculated in the same way as Zhang et al., 2015, with

  • [math] \mathrm{tf}(t,d) = f_{t,d} [/math], where [math] f_{t,d} [/math] is the raw count of [math] t [/math] for document [math] d [/math].
  • [math] \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) [/math], where [math] N [/math] is the total number of documents and [math] | \{d\in D:t\in d \} | [/math] is the number of documents that contain the word [math] t [/math].
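The two definitions above translate directly into code. This is a minimal sketch assuming whitespace tokenization and the natural logarithm (the base of the log is not stated in the text); the corpus and function names are hypothetical.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Raw count of term in one tokenized document."""
    return Counter(doc_tokens)[term]

def idf(term, corpus_tokens):
    """log(N / number of documents that contain term)."""
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(corpus_tokens) / n_containing)

def tfidf(term, doc_tokens, corpus_tokens):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Hypothetical three-document corpus.
corpus = [doc.split() for doc in ["the cat sat", "the dog sat", "the cat ate the fish"]]

score_the = tfidf("the", corpus[2], corpus)    # 2 * log(3/3) = 0.0
score_fish = tfidf("fish", corpus[2], corpus)  # 1 * log(3/1)
```

Although "the" appears twice in the third document, it occurs in every document, so its IDF is zero and it receives no weight, whereas the rarer word "fish" does.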


fastText was compared with various other text classifiers in two classification problems:

  • Sentiment Analysis
  • Tag prediction


Commentary and Criticism

Further Reading

  • List of previous paper presentations in chronological order relating to text classification/fastText