Bag of Tricks for Efficient Text Classification (statwiki revision of 2018-03-22T21:42:38Z, http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35305)<p>Cs3yang: /* Dataset */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text classification is used by millions of web users every day, for example in web search and content ranking. When a user searches for a word that best describes the content they are looking for, text classification helps categorize the appropriate content.<br />
<br />
Neural networks have recently been applied to text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their use on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks.<br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data sets while maintaining good performance.<br />
<br />
The analysis in this paper applies the classifier fastText to two tasks, tag prediction and sentiment analysis, and compares its performance and efficiency with other text classifiers. The paper reports that fastText can be trained on more than one billion words in less than ten minutes on a standard multicore CPU, while achieving performance on par with the state of the art.<br />
<br />
== Background ==<br />
<br />
<br />
=== Natural-Language Processing and Text Classification ===<br />
[https://en.wikipedia.org/wiki/Natural-language_processing Natural Language Processing] (NLP) is concerned with processing large amounts of natural language data, and involves speech recognition, natural language understanding, and natural-language generation. Text understanding involves being able to understand the explicit or implicit meaning of elements of text such as words, phrases, sentences, and paragraphs, and making inferences about these properties of texts (Norvig, 1987). One of the main topics in NLP is text classification, which assigns predefined categories to free-text documents; research in this field ranges from designing the best features to choosing the best machine learning classifiers (Zhang et al. 2016). Traditionally, techniques for text classification have been based on simple word statistics, such as bag of words and N-grams, fed to linear classifiers.<br />
<br />
With the advancement of deep learning and the availability of large data sets, methods of handling text understanding using deep learning techniques have become popular in recent years. Several studies have shown that these deep learning models perform significantly better than the traditional models. The following deep learning models are compared to fastText in the Experiment section:<br />
<br />
=== Char-CNN ===<br />
<br />
=== Char-CRNN ===<br />
<br />
A character-level convolutional recurrent neural network (Char-CRNN) is a convolutional neural network (CNN) with a recurrent layer on top. A recurrent layer can efficiently capture long-term dependencies, so only a very small number of convolutional layers is needed.<br />
<br />
The recurrent layer consists of a recursive function <math>f</math> which takes as input one input vector and the previous hidden state, and returns the new hidden state <math> h_t = f(x_t, h_{t-1})</math>, where <math> x_t \in \mathbb{R}^d</math> is one time step from the input sequence <math> (x_1, x_2, \ldots, x_T)</math>.<br />
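The recursion above can be sketched in a few lines of Python. This is only an illustration: the choice of <math>f</math> as a tanh of an affine map, the weight names <code>W_x</code>, <code>W_h</code>, <code>b</code>, and the toy values are assumptions and are not taken from Xiao and Cho (2016).<br />

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One update h_t = f(x_t, h_{t-1}); here f is assumed to be tanh(W_x x_t + W_h h_{t-1} + b)."""
    h_new = []
    for i in range(len(b)):
        s = b[i]
        s += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        s += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_new.append(math.tanh(s))
    return h_new

def run_rnn(xs, W_x, W_h, b):
    """Feed the sequence (x_1, ..., x_T) through the recursion, starting from a zero state."""
    h = [0.0] * len(b)
    for x_t in xs:
        h = rnn_step(x_t, h, W_x, W_h, b)
    return h

# Hypothetical toy parameters: input dimension d = 2, hidden dimension 2, T = 3 steps.
W_x = [[0.5, -0.2], [0.1, 0.3]]
W_h = [[0.0, 0.4], [-0.3, 0.2]]
b = [0.0, 0.1]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_T = run_rnn(xs, W_x, W_h, b)
```

The final state <code>h_T</code> depends on every earlier input through the chain of hidden states, which is how a recurrent layer can carry long-range information.<br />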
<br />
[[File:xiao and cho 2016 pic.png]]<br />
Graphical illustration of (a) a CNN and (b) proposed CRNN for character-level document classification <br />
<br />
Source: Xiao, Y. and Cho, K. (2016). Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv:1602.00367.<br />
<br />
=== VDCNN ===<br />
<br />
=== Tagspace ===<br />
<br />
[http://emnlp2014.org/papers/pdf/EMNLP2014194.pdf Tagspace] is a convolutional neural network which aims to predict hashtags in social network posts. In this context, hashtags are used in diverse ways: as identifiers, sentiments, topic annotations, and more. Tagspace uses "embeddings", vector representations of text that are combined by some function to produce a point in the embedding space. Because Tagspace also captures the semantic context of hashtags, it is a strong model for NLP tasks. Tagspace was also applied to a document recommendation problem, where the next item a user will interact with was predicted from their previous history.<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly when the output space is large and some classes have very few examples. The authors of this paper attempt to improve on the performance of linear classifiers with two key features: a rank constraint and a fast loss approximation. Before getting into that, we first review how a linear classifier is trained.<br />
<br />
Consider each text and each label as a vector in space. Training adjusts the coordinates of the text vector so that it is close to the vector of its associated label. The dot product of the text vector and a label vector gives a score. The softmax function normalizes this score against the scores of the same text with every other possible label. The result is the probability that the text has its associated label. Stochastic gradient descent is then used to keep updating the coordinates until the probability of the correct label for every text is maximized. This is computationally expensive, because the score for every possible label in the training set must be computed for each text.<br />
<br />
With K labels in the training set, the softmax function returns the probability that a text is associated with label j as:<br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> is the text vector and <math>\mathbf{w}_j</math> is the vector of label <math>j</math>.<br />
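As a minimal sketch of the formula above, the following Python computes the softmax probabilities for one text vector against K label vectors. The function name <code>softmax_probs</code> and the toy vectors are hypothetical; subtracting the maximum score before exponentiating is a standard numerical safeguard that does not change the result.<br />

```python
import math

def softmax_probs(x, label_vectors):
    """Return P(y = j | x) for every label j, given text vector x and label vectors w_j."""
    # Score of each label: the dot product x^T w_j.
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in label_vectors]
    # Subtract the maximum score for numerical stability; the ratio is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical example with K = 3 labels in a 2-dimensional space.
x = [1.0, 2.0]
W = [[0.5, 0.5], [1.0, -1.0], [0.0, 1.0]]
probs = softmax_probs(x, W)
predicted_label = probs.index(max(probs))
```

Note that every one of the K scores must be computed before any single probability is known, which is the cost the text describes.<br />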
<br />
Now let us look at the improved linear classifier with a rank constraint and a fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
The n-gram features of the input are first looked up in a weight matrix to obtain their vector representations, which are then averaged into a hidden text representation. This hidden representation is fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log likelihood over the classes. The classifier trains on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate.<br />
<br />
[[File:formula explained.png]]<br />
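The pipeline described above (look up, average, linear classifier, softmax, negative log likelihood) can be sketched as follows. This is a toy illustration and not the fastText implementation: the matrices <code>A</code> and <code>B</code>, the feature ids, and all function names are assumptions, and <code>A</code> is stored as one row per feature so that a lookup is a simple index.<br />

```python
import math

def average_embedding(feature_ids, A):
    """Look up each n-gram feature in A and average the rows into a hidden vector."""
    dim = len(A[0])
    hidden = [0.0] * dim
    for fid in feature_ids:
        for k in range(dim):
            hidden[k] += A[fid][k]
    return [v / len(feature_ids) for v in hidden]

def class_probabilities(hidden, B):
    """Linear classifier B followed by a softmax over the classes."""
    scores = [sum(b_k * h_k for b_k, h_k in zip(row, hidden)) for row in B]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def negative_log_likelihood(docs, labels, A, B):
    """Average negative log likelihood of the true labels over N documents."""
    total = 0.0
    for feature_ids, y in zip(docs, labels):
        probs = class_probabilities(average_embedding(feature_ids, A), B)
        total -= math.log(probs[y])
    return total / len(docs)

# Hypothetical toy setup: 4 features embedded in 2 dimensions, 2 classes.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
B = [[1.0, 0.0], [0.0, 1.0]]
docs = [[0, 2], [1, 3]]   # each document is a list of feature ids
labels = [0, 1]
loss = negative_log_likelihood(docs, labels, A, B)
```

Training would adjust <code>A</code> and <code>B</code> by stochastic gradient descent to reduce <code>loss</code>; that step is omitted here.<br />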
<br />
Two changes that were applied in this model architecture are the hierarchical softmax function, which improves performance with a large number of classes, and the hashing trick, which maps n-grams (used to capture local word order) to a fixed number of indices. These two techniques are explained more thoroughly in the upcoming sections.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over the predefined classes. For each document it computes the probability that its label <math>y_n</math> is each possible class. The denominator of the softmax normalizes these values so that the probabilities over all labels sum to one. The class with the highest probability can then be chosen as the predicted label <math>y_n</math>.<br />
<br />
However, the softmax function has a computational complexity of <math> O(Kd) </math>, where <math>K</math> is the number of classes and <math>d</math> is the dimension of the hidden text representation. This is because each evaluation normalizes the probabilities over all <math>K</math> classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. The following set-up for hierarchical softmax shows where the gain in computational efficiency comes from.<br />
<br />
Suppose we have a binary tree based on Huffman coding for the softmax function, where each inner node has two children and each class is a leaf. A Huffman coding tree places the classes with the lowest frequencies deeper in the tree and the classes with the highest frequencies near the root, which minimizes the path length for the more frequent classes. With such a tree, the computational complexity is reduced to <math> O(d \log_2(K)) </math>.<br />
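To make the effect of Huffman coding concrete, the sketch below builds a Huffman tree from class frequencies and reports the depth of each class. It uses the standard greedy construction with Python's <code>heapq</code>; the class names and frequencies are made up for illustration.<br />

```python
import heapq
from itertools import count

def huffman_depths(frequencies):
    """Return the depth of each label in a Huffman tree built from label frequencies."""
    tie = count()  # unique tie-breaker so the heap never has to compare dictionaries
    heap = [(freq, next(tie), {label: 0}) for label, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every label inside moves one level deeper.
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {label: depth + 1 for label, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

# Hypothetical class frequencies.
freqs = {"a": 50, "b": 25, "c": 15, "d": 10}
depths = huffman_depths(freqs)
```

In this toy case the most frequent class ends up directly below the root and the two rarest classes are deepest, so the frequent class needs the fewest binary decisions.<br />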
<br />
At each inner node, a probability is calculated for travelling right or left. This is done by applying the sigmoid function to the product of the vector <math>v_{n_i}</math> of the inner node <math> n_i </math> and the output of the hidden layer of the model, <math>h</math>. The output classes are the leaves of this tree, and a random walk from the root assigns each class a probability based on the path taken to reach its leaf. The probability of a certain class is then calculated as:<br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> is the leaf node that a class is located on, at depth <math> l+1 </math>, and <math> n_1, n_2, \ldots, n_l </math> are the parent nodes on the path to that leaf </i><br />
<br />
The figure below shows an example of a binary tree for the hierarchical softmax model. An example path from the root node <math>n_1</math> to label 2 is highlighted in blue. Each step on the path has an associated probability, and the total probability of label 2 is their product, in line with the class probability calculation above.<br />
<br />
[[File:figure2.jpg|Binary Tree Example for the Hierarchical Softmax Model]]<br />
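The class probability as a product of per-node decisions can be sketched as follows. The tree layout, node vectors, and hidden vector are hypothetical and do not reproduce the figure; the convention that the sigmoid gives the probability of going left is also an assumption.<br />

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def path_probability(path, node_vectors, h):
    """Multiply the left/right probabilities along a root-to-leaf path.

    path is a list of (inner_node, go_left) pairs.
    """
    prob = 1.0
    for node, go_left in path:
        score = sum(v * x for v, x in zip(node_vectors[node], h))
        p_left = sigmoid(score)
        prob *= p_left if go_left else 1.0 - p_left
    return prob

# Hypothetical balanced tree: root n1 with inner children n2 and n3, four leaf labels.
node_vectors = {"n1": [0.2, -0.1], "n2": [0.5, 0.3], "n3": [-0.4, 0.1]}
h = [1.0, 2.0]
paths = {
    "label1": [("n1", True), ("n2", True)],
    "label2": [("n1", True), ("n2", False)],
    "label3": [("n1", False), ("n3", True)],
    "label4": [("n1", False), ("n3", False)],
}
probs = {label: path_probability(p, node_vectors, h) for label, p in paths.items()}
```

Because each inner node splits its probability mass between its two children, the leaf probabilities sum to one without ever normalizing over all classes.<br />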
<br />
<br />
Note that during training, the vector representations of the inner nodes, <math>v_{n_i}</math>, are updated by introducing an error function, <math>Err</math>, and differentiating it to obtain the following:<br />
<div style="text-align: center;"> <math> \frac{\partial Err}{\partial v_{n_i}^{'}}=\frac{\partial Err}{\partial v_{n_i}^{'}h} \cdot \frac{\partial v_{n_i}^{'}h }{\partial v_{n_i}^{'}} </math> <br></div><br />
<br />
We can use this to update the vector values with the following:<br />
<br />
<div style="text-align: center;"> <math> v_{n_i}'(new) = v_{n_i}'(old) - \eta \frac{\partial Err}{\partial v_{n_i}'} </math><br></div><br />
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is a method for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized so that its elements add up to one (that is, each count is converted to a relative frequency). If these frequencies exceed a certain level, they activate nodes in the neural network and influence the classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information, because it considers single words only and is invariant to word order. We will demonstrate that shortly. Bag of words also has a high error rate if the training set does not cover the entire dictionary of the testing set.<br />
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation by storing the n-local words adjacent to the initial word (or character, since N-grams can also be character based). Each word in the document is read one at a time, just as in bag of words, but a certain range of its neighbors is scanned as well. This range is known as the n-gram. Compared to bag of words, which corresponds to N = 1 (the unigram), any N greater than 1 retains more information.<br />
<br />
A unigram (1-gram) is the simplest possible version of this model. A unigram model does not consider previous words and just chooses words based on how common they are in general. A bigram (2-gram) considers the previous word, a trigram considers the previous two words, and so on, up to an N-gram, which considers the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. A bigram model approximates each factor by conditioning only on the previous word. For example, take the sentence, "How long can this go on?" Under the bigram model it factorizes as follows:<br />
<br />
P("How long can this go on?") = P(How) P(long | How) P(can | long) P(this | can) P(go | this) P(on | go) P(? | on)<br />
<br />
Returning to the chain rule, the bigram case reduces the expression above to a product of conditional probabilities, each conditioned on the previous word only (the first factor <math>P(\omega_1 | \omega_0)</math> is read as <math>P(\omega_1)</math>):<br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
This generalizes to the N-gram case, where each word is conditioned on the previous N-1 words:<br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
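As a sketch of the bigram case, the code below estimates each conditional probability by relative counts (the maximum-likelihood estimate) and multiplies them along a sentence. The three-sentence corpus and the start marker <code>&lt;s&gt;</code> playing the role of <math>\omega_0</math> are assumptions made for illustration; no smoothing is applied.<br />

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams; '<s>' is an assumed start-of-sentence marker."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    return context_counts, bigram_counts

def sentence_probability(sentence, context_counts, bigram_counts):
    """Product over k of P(w_k | w_{k-1}), each estimated as count(w_{k-1}, w_k) / count(w_{k-1})."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: the unsmoothed estimate is zero
        prob *= bigram_counts[(prev, word)] / context_counts[prev]
    return prob

# Hypothetical toy corpus.
corpus = ["how long can this go on", "how long is this", "how can this be"]
context_counts, bigram_counts = train_bigram(corpus)
p = sentence_probability("how long can this go on", context_counts, bigram_counts)
```

Any sentence containing a bigram that never occurred in the corpus receives probability zero here, which is one reason practical N-gram models add smoothing.<br />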
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide any useful predictive clues. For example, suppose you want the model to learn subject-verb agreement from the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
Linking the verb to its subject here requires an 11-gram, which is infeasible on ordinary machines. This leads to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the length of the prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
Consider the following three example sentences:<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B have the same vector. This is exactly the bag-of-words problem mentioned earlier: '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because the bigram takes one neighbouring word into account. However, A and C still share an element, namely "I love". If we were to increase N further, it would become easier to distinguish the two. However, the cost of a higher N is a higher-dimensional feature space and a longer run time.<br />
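The two tables can be reproduced with a short n-gram counter. This is a minimal sketch using whitespace tokenization; the helper name <code>ngram_counts</code> is hypothetical.<br />

```python
from collections import Counter

def ngram_counts(text, n):
    """Count the word n-grams of a whitespace-tokenized text."""
    tokens = text.split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

docs = {"A": "I love apple", "B": "apple love I", "C": "I love sentence"}
unigrams = {name: ngram_counts(text, 1) for name, text in docs.items()}
bigrams = {name: ngram_counts(text, 2) for name, text in docs.items()}

# Unigram counts cannot tell A from B, bigram counts can.
same_unigrams = unigrams["A"] == unigrams["B"]
same_bigrams = bigrams["A"] == bigrams["B"]
```

Here <code>same_unigrams</code> is true and <code>same_bigrams</code> is false, matching the observation that only the bigram features separate A from B.<br />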
<br />
=== Feature Hashing ===<br />
From the above, we can see that although N-grams preserve word-order features, the dimension of the feature vector or matrix increases as N increases. Not all combinations of words are common in real life, so the feature matrix may become very sparse and computationally expensive. To address this problem, we introduce the '''hashing trick''', also known as '''feature hashing'''.<br />
<br />
Feature hashing can be used in sentence classification. It maps features to indices of an array or matrix of fixed size by applying a hash function to them. The general idea is to map feature vectors from a high-dimensional space to a lower-dimensional one, which reduces the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
<br />
A hash function is any function that maps keys from an arbitrarily large set to a range of fixed size. It can be generally expressed as <math> h:\{1,\ldots,N\} \rightarrow \{1,\ldots,M\} </math>. A very simple hash function is the modulo operation: when mapping an integer key to a hash table of M slots, a simple hash function can be defined as<br />
<br />
h(key) = key % M<br />
<br />
<br />
In N-gram models, word counts are usually used as features. The hashed feature map can therefore be calculated as<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} x_j </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
<br />
For example, suppose that applying the hash function to each word of a seven-word sentence returns the list of indices [0, 1, 3, 5, 6, 2, 3]. The corresponding hashed feature map is then<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
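The table above can be reproduced by counting how often each index is returned. This is a minimal sketch; the function name is hypothetical, and the table size of 7 is read off the seven columns of the table.<br />

```python
def hashed_feature_map(indices, num_buckets):
    """Accumulate a count of 1 per word into the bucket its hash index points to."""
    buckets = [0] * num_buckets
    for i in indices:
        buckets[i] += 1
    return buckets

# Hash indices of the words, as listed in the text.
feature_map = hashed_feature_map([0, 1, 3, 5, 6, 2, 3], 7)
```

Index 3 is returned twice, so its bucket holds 2, and no word maps to index 4, so that bucket stays at 0.<br />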
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. For example, suppose that in the example above both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces a second hash function <math> \xi </math> that determines the sign with which each feature is added to its index. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
Consider if <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, then our signed hash map now becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, collisions will "cancel out" in expectation, which yields an unbiased estimate.<br />
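A signed hashed feature map can be sketched as below. This is an illustration under stated assumptions, not the scheme used in fastText or in Weinberger et al.: both hash functions are derived from MD5 with different salts purely to get deterministic, reproducible values, and the sentence and bucket count are made up.<br />

```python
import hashlib

def _digest(token, salt):
    """Deterministic integer hash of a token; the salt makes the two hashes independent-looking."""
    return int(hashlib.md5((salt + token).encode("utf-8")).hexdigest(), 16)

def signed_hashed_features(tokens, num_buckets):
    """Add xi(token) in {+1, -1} to bucket h(token) for every token."""
    buckets = [0] * num_buckets
    for token in tokens:
        index = _digest(token, "index:") % num_buckets          # plays the role of h
        sign = 1 if _digest(token, "sign:") % 2 == 0 else -1    # plays the role of xi
        buckets[index] += sign
    return buckets

# Hypothetical sentence and table size.
tokens = "the quick brown fox jumps over the lazy dog".split()
features = signed_hashed_features(tokens, 8)
```

When two different words collide in one bucket with opposite signs, their contributions cancel, which is the effect shown in the table above for "I" and "Mary".<br />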
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. However, another way to represent the features is TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to a document.<br />
<br />
'''Term Frequency (TF)''' measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' is an adjustment to the term frequency so that a word is not deemed important if it is common across documents in general, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the total number of documents that contain word <math> t </math>.<br />
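The two definitions above translate directly into code. The sketch below uses raw counts for tf and the natural logarithm for idf (the base of the logarithm is an assumption, as the formula does not state it); the three toy documents are made up.<br />

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: tf * idf} dictionary per document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)  # raw term counts f_{t,d}
        weights.append({t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()})
    return weights

# Hypothetical toy corpus.
docs = ["the cat sat", "the dog sat", "the cat ran"]
weights = tfidf(docs)
```

The word "the" occurs in every document, so its idf is log(1) = 0 and it carries no weight, while a word that appears in only one document receives the largest weight.<br />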
<br />
== Experiment ==<br />
In the experiments, fastText was compared with various other text classifiers on two classification problems.<br />
<br />
The first classification problem is sentiment analysis, where fastText is compared to existing text classifiers. The second evaluates how fastText scales to a larger output space on a tag prediction dataset.<br />
<br />
The Vowpal Wabbit library, written in C++, can also be used to implement the model. However, compared to this library, the authors' tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
* Tag Prediction<br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. To test it, the evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images with captions, titles and tags. Here, the focus was on using the fastText classifier to predict the tags associated with each image without using the image itself, but rather the information associated with it, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur fewer than 100 times and split the data into a train, validation and test set. The train set consists of approximately 90% of the dataset, the validation set of approximately 1%, and the test set of approximately 0.5%. Table 4 shows an example of 5 items in the validation set with their Inputs (title and caption), Prediction (the tag class they are classified to) and Tags (real image tags), highlighting when the predicted class is in fact one of the tags.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Input<br />
! Prediction<br />
! Tags<br />
|-<br />
| taiyoucon 2011 digitals: individuals digital photos from the anime convention taiyoucon 2011 in mesa, arizona. if you know the model and/or the character, please comment.<br />
| #cosplay<br />
| #24mm #anime #animeconvention #arizona #canon #con #convention #cos '''#cosplay''' #costume #mesa #play #taiyou #taiyoucon<br />
|-<br />
| 2012 twin cities pride 2012 twin cities pride parade<br />
| #minneapolis <br />
| #2012twincitiesprideparade '''#minneapolis''' #mn #usa<br />
|-<br />
| beagle enjoys the snowfall<br />
| #snow<br />
| #2007 #beagle #hillsboro #january #maddison #maddy #oregon '''#snow'''<br />
|-<br />
| christmas <br />
| #christmas <br />
| #cameraphone #mobile<br />
|-<br />
| euclid avenue <br />
| #newyorkcity <br />
| #cleveland #euclidavenue<br />
|}<br />
<br />
<br />
<br />
Removing infrequently occurring tags and words eliminates some noise and helps the model learn better. A [https://github.com/facebookresearch/fastText script] has been released that documents how the data was split. After cleaning, the vocabulary size is 297,141 and there are 312,116 unique tags.<br />
<br />
==== Baselines====<br />
A frequency-based approach which simply predicts the most frequent tag is used as a baseline.<br />
<br />
The model that fastText is compared to in tag prediction is [http://www.aclweb.org/anthology/D14-1194 Tagspace]. It is similar to fastText but is based on [http://www.thespermwhale.com/jaseweston/papers/wsabie-ijcai.pdf WSABIE], which predicts tags based on both images and their text annotations. The Tagspace model is a convolutional neural network that predicts hashtags from the semantic context of social network posts. For faster and more comparable performance, the experiment uses a linear version of this model. Tagspace can also predict multiple hashtags, so this experiment does not fully exercise the capabilities of the Tagspace model.<br />
<br />
==== Results ====<br />
<br />
{| class="wikitable"<br />
|-<br />
! Model <br />
! prec@1<br />
! Running time - Train<br />
! Running time - Test<br />
|- <br />
| Freq. baseline <br />
| 2.2 <br />
| - <br />
| -<br />
|- <br />
| Tagspace, h = 50 <br />
| 30.1 <br />
| 3h8 <br />
| 6h<br />
|- <br />
| Tagspace, h = 200 <br />
| 35.6 <br />
| 5h32 <br />
| 15h<br />
|- <br />
| fastText, h = 50 <br />
| 30.8 <br />
| 6m40 <br />
| 48s<br />
|- <br />
| fastText, h = 50, bigram <br />
| 35.6 <br />
| 7m47 <br />
| 50s<br />
|- <br />
| fastText, h = 200 <br />
| 40.7 <br />
| 10m34 <br />
| 1m29<br />
|-<br />
| fastText, h = 200, bigram <br />
| 45.1 <br />
| 13m38 <br />
| 1m37<br />
|}<br />
<br />
<br />
<br />
The above table presents a comparison of the fastText model to the baselines. The "precision-at-one" (prec@1) metric reports the proportion of items for which the highest-ranking tag predicted by the model is in fact one of the item's real tags.<br />
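The metric can be sketched in a few lines. The function name is hypothetical, and the example data is an abbreviated version of the five validation items shown in the Dataset table (only a few of the real tags are listed for each item).<br />

```python
def precision_at_one(ranked_predictions, true_tags):
    """Fraction of items whose top-ranked predicted tag is among the item's real tags."""
    hits = sum(
        1
        for ranked, tags in zip(ranked_predictions, true_tags)
        if ranked and ranked[0] in tags
    )
    return hits / len(true_tags)

# Top predictions and (abbreviated) real tags for the five items in the Dataset table.
predictions = [["#cosplay"], ["#minneapolis"], ["#snow"], ["#christmas"], ["#newyorkcity"]]
tags = [
    {"#anime", "#cosplay", "#taiyoucon"},
    {"#2012twincitiesprideparade", "#minneapolis", "#mn", "#usa"},
    {"#beagle", "#oregon", "#snow"},
    {"#cameraphone", "#mobile"},
    {"#cleveland", "#euclidavenue"},
]
score = precision_at_one(predictions, tags)
```

Three of the five top predictions appear among the real tags, so the score on these five items is 0.6; the figures in the results table are the same ratio expressed as a percentage over the whole test set.<br />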
<br />
<br />
The frequency baseline has the lowest accuracy. fastText is run for 5 epochs and compared with Tagspace for 50 and 200 hidden units, with and without bigrams. With 50 hidden units, fastText without bigrams and Tagspace perform similarly (30.8 versus 30.1), while fastText with bigrams at 50 hidden units matches Tagspace at 200 (both 35.6). At 200 hidden units, fastText performs better still (40.7), and adding bigrams improves accuracy further (45.1).<br />
<br />
<br />
Finally, both the training and test times of fastText are significantly shorter than those of Tagspace, which must compute the scores for each class. fastText, on the other hand, has fast inference, which gives it an advantage when the number of classes is large. The gap is largest at test time: with 200 hidden units and bigrams, fastText takes 1m37, compared with 15h for Tagspace with 200 hidden units.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, and the result is fed to a linear classifier. The authors test their model, fastText, against other models using an evaluation protocol similar to Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction data set. fastText achieved comparable results in terms of accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
<br />
=== fastText and baseline models ===<br />
<br />
The performance of fastText was compared to several baselines related to text classification. While this paper focused on the ways fastText outperformed these classifiers, the baseline classifiers were not always evaluated at their optimal performance.<br />
<br />
* '''Tagspace''' was used as a comparison classifier in a tag prediction problem. However, the goal of this model is to predict hashtags as semantic embeddings rather than image tags as in the experiment. Additionally, only a linear version of Tagspace was used as a comparison.<br />
<br />
== Sources ==<br />
<br />
Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of Tricks for Efficient Text Classification. ''Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers''. doi:10.18653/v1/e17-2068<br />
<br />
Weston, J., Bengio, S., Usunier, N. (2011). Wsabie: Scaling Up To Large Vocabulary Image Annotation. ''Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI (2011)''. url: http://www.thespermwhale.com/jaseweston/papers/wsabie-ijcai.pdf<br />
<br />
Weston, J., Chopra, S., & Adams, K. (2014). #TagSpace: Semantic Embeddings from Hashtags. ''Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)''. doi:10.3115/v1/d14-1194<br />
<br />
== Further Reading ==<br />
<br />
* [https://fasttext.cc/ fastText official website] - Resources for using fastText<br />
<br />
* [https://research.fb.com/fasttext/ Facebook Research: fastText] - Overview of fastText released by Facebook Research.<br />
<br />
* [https://github.com/facebookresearch/fastText fastText GitHub] <br />
<br />
* [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441w18/A_New_Method_of_Region_Embedding_for_Text_Classification A New Method of Region Embedding for Text Classification (Summary)] - Paper summary describing a method of preserving local structure information with small text regions for text classification tasks. <br />
<br />
* [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441w18/Convolutional_Neural_Networks_for_Sentence_Classification Convolution Neural Networks for Sentence Classification (Summary)] - Paper summary describing applying four variations of Convolutional Neural Networks to several NLP tasks such as sentiment analysis, customer review prediction, movie reviews, and more.</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35304Bag of Tricks for Efficient Text Classification2018-03-22T21:41:37Z<p>Cs3yang: /* Sources */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have been utilized more recently for Text-Classifications and demonstrated very good performances. However, it is slow at both training and testing time, therefore limiting their usage for very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data set while maintaining its good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to the two tasks: tag predictions and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
<br />
=== Natural-Language Processing and Text Classification ===<br />
[https://en.wikipedia.org/wiki/Natural-language_processing Natural Language Processing] (NLP) is concerned with being able to process large amounts of natural language data, involving speech recognition, natural language understanding, and natural-language generation. Text understanding involves being able to understand the explicit or implicit meaning of elements of text such as words, phrases, sentences, and paragraphs, and making inferences about these properties of texts (Norvig, 1987). One of the main topics in NLP is text classification, which is assigning predefined categories to free-text documents, with research ranging in this field from designing the best features to choosing the best machine learning classifiers (Zhang et al. 2016). Traditionally, techniques for text classification are based on simple statistics on words that use linear classifiers such as Bag of Words and N-grams. <br />
<br />
With the advancement of deep learning and the availability of large data sets, methods of handling text understanding using deep learning techniques have become popular in recent years. These deep learning models have been shown to perform significantly better than traditional models in several studies. The following are the deep learning models that are compared to fastText in the Experiment section:<br />
<br />
=== Char-CNN ===<br />
<br />
=== Char-CRNN ===<br />
<br />
A character-level convolutional recurrent neural network (Char-CRNN) is a convolutional neural network (CNN) with a recurrent layer on top. Because a recurrent layer can efficiently capture long-term dependencies, only a very small number of convolutional layers is needed.<br />
<br />
The recurrent layer consists of a recursive function <math>f</math> which takes as input one input vector and the previous hidden state, and returns the new hidden state <math> h_t = f(x_t, h_{t-1})</math>, where <math> x_t \in \mathbb{R}^d</math> is one time step from the input sequence <math> (x_1, x_2, \ldots, x_T)</math>.<br />
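<br />
The recursion above can be sketched in a few lines of Python. This is only an illustration: the choice of <code>tanh</code> for <math>f</math> and the names <code>W_x</code>, <code>W_h</code>, and <code>b</code> are assumptions made here, not the cell used by Xiao and Cho.<br />
<br />
```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent step h_t = f(x_t, h_prev), with f assumed to be tanh(W_x x_t + W_h h_prev + b)."""
    h_t = []
    for i in range(len(b)):
        s = b[i]
        s += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        s += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(s))
    return h_t

def run_rnn(xs, h0, W_x, W_h, b):
    """Apply the recursion over the whole input sequence (x_1, ..., x_T) and return h_T."""
    h = h0
    for x_t in xs:
        h = rnn_step(x_t, h, W_x, W_h, b)
    return h
```
<br />
The final hidden state summarizes the whole sequence, which is why few convolutional layers are needed underneath it.<br />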
<br />
[[File:xiao and cho 2016 pic.png]]<br />
Graphical illustration of (a) a CNN and (b) proposed CRNN for character-level document classification <br />
<br />
Source: [Xiao and Cho 2016] Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv:1602.00367.<br />
<br />
=== VDCNN ===<br />
<br />
=== Tagspace ===<br />
<br />
[http://emnlp2014.org/papers/pdf/EMNLP2014194.pdf Tagspace] is a convolutional neural network which aims to predict hashtags in social network posts. In this context, hashtags are used diversely, as identifiers, sentiments, topic annotations, and more. Tagspace uses "embeddings", which are vector representations of text; these are combined by some function to produce a point in the embedding space. Because Tagspace also captures the semantic context of hashtags, it is a strong model in terms of NLP learning. Tagspace was also applied to a document recommendation problem, where the next item a user will interact with was predicted based on their previous history.<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly when the output space is large and some classes have very few examples. The authors of this paper attempt to improve on the performance of linear classifiers with two key features: a rank constraint and a fast loss approximation. Before getting into that, we must better understand how linear classifiers are trained.<br />
<br />
Consider each text and each label as a vector in space. The model trains the coordinates of the text vector so that the text vector is close to the vector of its associated label. A score is computed for the text vector and its label vector (in the formula below, their dot product). The softmax function then normalizes this score against the scores of the same text with every other possible label. The result is the probability that the text has its associated label. A stochastic gradient descent algorithm is then used to keep updating the coordinates until the probability of the correct label for every text is maximized. This is clearly computationally expensive, as the score for every possible label in the training set must be computed for each text.<br />
<br />
The softmax function gives the probability that a text is associated with label <math>j</math>, given <math>K</math> labels in the training set:<br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label <math>j</math>.<br />
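<br />
As a minimal sketch, the softmax formula can be written directly in Python. The function name <code>softmax_probs</code> and the list-of-lists layout of the label vectors are assumptions for illustration; subtracting the maximum score is a standard numerical-stability step that does not change the result.<br />
<br />
```python
import math

def softmax_probs(x, W):
    """Return P(y=j | x) for every label j, where W[j] is the label vector w_j."""
    scores = [sum(xi * wi for xi, wi in zip(x, w_j)) for w_j in W]
    m = max(scores)  # subtracting the max leaves the probabilities unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)  # the denominator sums over all K labels
    return [e / total for e in exps]

probs = softmax_probs([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
```
<br />
Note that the denominator touches every one of the <math>K</math> label vectors, which is the source of the cost discussed in this section.<br />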
<br />
Now let’s look at the improved method of linear classifiers with a rank constraint and fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the n-gram features of the input are first looked up to find word representations, which are then averaged into a hidden text representation. This representation is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log likelihood over the classes. The classifier trains on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate.<br />
<br />
[[File:formula explained.png]]<br />
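<br />
The pipeline described above (look up, average, linear classifier, softmax, negative log likelihood) can be sketched as follows. The names <code>A</code> for the lookup table and <code>B</code> for the classifier weights, and the use of integer feature ids, are assumptions for illustration; this uses the full softmax rather than the hierarchical softmax and is not the authors' implementation.<br />
<br />
```python
import math

def fasttext_forward(feature_ids, A, B):
    """Average the rows of A for the given features, apply B, and return class probabilities."""
    dim = len(A[0])
    hidden = [0.0] * dim
    for fid in feature_ids:
        for d in range(dim):
            hidden[d] += A[fid][d]
    hidden = [v / len(feature_ids) for v in hidden]  # hidden text representation
    scores = [sum(b * h for b, h in zip(row, hidden)) for row in B]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def negative_log_likelihood(docs, labels, A, B):
    """Average negative log likelihood of the true labels over N documents."""
    total = 0.0
    for feature_ids, y in zip(docs, labels):
        total -= math.log(fasttext_forward(feature_ids, A, B)[y])
    return total / len(docs)
```
<br />
Because the hidden representation has a small fixed dimension, the two matrices together act as a low-rank factorization of a plain linear classifier, which is the rank constraint mentioned earlier.<br />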
<br />
Two changes that were applied in this model architecture are the hierarchical softmax function, which improves running time when the number of classes is large, and the hashing trick, which keeps the mapping of n-grams (used to capture local word order) fast and memory efficient. These two techniques are explained more thoroughly in the upcoming sections.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over the predefined classes. It calculates the probability that <math>y_n</math> belongs to each possible class and outputs a probability estimate for each class. The denominator of the softmax function normalizes the probabilities, so that a single <math>y_n</math> receives a probability for each label and these probabilities sum to one. This provides a means to choose the label with the highest probability as the label for <math>y_n</math>.<br />
<br />
However, the softmax function has a computational complexity of <math> O(Kd) </math>, where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions of the hidden layer of the model. This is because each evaluation of the softmax function requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the difference in computational efficiency in the following set-up for hierarchical softmax.<br />
<br />
Suppose we have a binary tree structure based on Huffman coding for the softmax function, where each inner node has at most two children. Huffman coding trees provide a means to optimize binary trees: the classes with the lowest frequencies are placed deeper in the tree and the highest frequency classes are placed near the root, which minimizes the path length for the most frequent classes. With such a tree, the computational runtime is reduced to <math> O(d \log_2(K)) </math>.<br />
<br />
A probability is calculated for each step along a path, depending on whether we travel right or left from a node. This is done by applying the sigmoid function to the product of the output vector <math>v_{n_i}</math> of each inner node <math> n_i </math> and the output value of the hidden layer of the model, <math>h</math>. The idea of this method is to represent the output classes as the leaves of this tree; a random walk then assigns probabilities to these classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as:<br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> represents the leaf node that a class is located on, at depth <math> l+1 </math>, and <math> n_1, n_2, \ldots, n_l </math> represent the parent nodes of that leaf </i><br />
<br />
In the figure below, we can see an example of a binary tree for the hierarchical softmax model. An example path from the root node <math>n_1</math> to label 2 is highlighted in blue. Each step on the path has an associated probability, and the total probability of label 2 is the product of these, in line with the class probability calculation above.<br />
<br />
[[File:figure2.jpg|Binary Tree Example for the Hierarchical Softmax Model]]<br />
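<br />
The path-product calculation can be sketched as below. The convention that "left" has probability <math>\sigma(v_{n_i} \cdot h)</math> and "right" has the complement, and the representation of a path as a list of (vector, direction) pairs, are assumptions made for this illustration.<br />
<br />
```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def leaf_probability(path, h):
    """Probability of a leaf (class) as the product of branch probabilities.

    path: list of (inner_node_vector, go_left) pairs from the root to the leaf's parent.
    """
    prob = 1.0
    for v, go_left in path:
        p_left = sigmoid(sum(vi * hi for vi, hi in zip(v, h)))
        prob *= p_left if go_left else 1.0 - p_left
    return prob

# A hypothetical tree with two inner nodes (v1 at the root, v2 as its left child) and three leaves.
h = [0.3, -0.2]
v1, v2 = [1.0, 2.0], [-0.5, 0.4]
leaf_probs = [
    leaf_probability([(v1, False)], h),
    leaf_probability([(v1, True), (v2, True)], h),
    leaf_probability([(v1, True), (v2, False)], h),
]
```
<br />
Only the inner nodes on one root-to-leaf path are evaluated for a given class, and the leaf probabilities still sum to one without an explicit normalization over all classes.<br />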
<br />
<br />
Note that during the training process, the vector representations of the inner nodes, <math>v_{n_i}</math>, are updated by introducing an error function, <math>Err</math>, and differentiating it with the chain rule to obtain the following:<br />
<div style="text-align: center;"> <math> \frac{\partial Err}{\partial v_{n_i}^{'}}=\frac{\partial Err}{\partial (v_{n_i}^{'}h)} \cdot \frac{\partial (v_{n_i}^{'}h) }{\partial v_{n_i}^{'}} </math> <br></div><br />
<br />
We can use this to update the vector values with the following:<br />
<br />
<div style="text-align: center;"> <math> v_{n_i}'(new) = v_{n_i}'(old) - \eta \frac{\partial Err}{\partial v_{n_i}'} </math><br></div><br />
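<br />
As a hedged sketch of this update, assume that <math>Err</math> is the negative log likelihood of taking the correct branch at the inner node. Under that assumption the first factor of the chain rule is <math>\sigma(v \cdot h) - t</math>, where <math>t</math> is 1 for "left" and 0 for "right", and the second factor is <math>h</math>. The function name and the left/right encoding are illustrative.<br />
<br />
```python
import math

def update_inner_node(v, h, target, eta):
    """One gradient step on an inner-node vector.

    Assumes Err = -log P(correct branch); target is 1.0 for 'left' and 0.0 for 'right'.
    """
    p_left = 1.0 / (1.0 + math.exp(-sum(vi * hi for vi, hi in zip(v, h))))
    # dErr/d(v.h) = p_left - target, and d(v.h)/dv = h
    grad = [(p_left - target) * hi for hi in h]
    return [vi - eta * gi for vi, gi in zip(v, grad)]

v_new = update_inner_node([0.0, 0.0], [1.0, 2.0], 1.0, 0.1)
```
<br />
The step moves the node vector in the direction that makes the correct branch more likely.<br />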
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is an algorithm for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized so that its elements add up to one (that is, each element is the relative frequency of the word). If these frequencies exceed a certain level, they will activate nodes in a neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information, because it works on single words and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error percentage if the training set does not include the entire dictionary of the testing set.<br />
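<br />
A minimal sketch of the procedure described above, assuming whitespace tokenization; the helper names <code>build_dictionary</code> and <code>bow_vector</code> are hypothetical.<br />
<br />
```python
from collections import Counter

def build_dictionary(train_docs, n):
    """Keep the n most frequent words of the training documents."""
    counts = Counter(word for doc in train_docs for word in doc.split())
    return [w for w, _ in counts.most_common(n)]

def bow_vector(doc, dictionary):
    """Word counts over the dictionary, normalized so that the elements sum to one."""
    counts = Counter(doc.split())
    total = sum(counts[w] for w in dictionary)
    return [counts[w] / total if total else 0.0 for w in dictionary]

dictionary = build_dictionary(["I love apple", "I love sentence"], 2)
vec_a = bow_vector("I love apple", dictionary)
vec_b = bow_vector("apple love I", dictionary)
```
<br />
Here <code>vec_a</code> and <code>vec_b</code> are identical even though the word order differs, and "apple" is dropped because it is not in the dictionary; both effects are the weaknesses noted above.<br />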
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation: for each word (or character, since N-grams can also be character based), it also stores the adjacent words. Each word in the document is read one at a time, just as in bag of words, but a certain range of its neighbors is scanned as well. This range is the N in N-gram. Compared to bag of words, any N greater than 1 (N = 1 is the unigram) retains more information.<br />
<br />
The unigram (1-gram) is the simplest possible version of this model: it does not consider previous words and treats each word based only on how common it is in general. A bigram (2-gram) considers the previous word, a trigram considers the previous two words, and so on, up to the N-gram, which considers the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math>\omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) =P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. A bigram model approximates each factor by conditioning only on the previous word. For example, the sentence "How long can this go on?" is modeled as follows:<br />
<br />
P("How long can this go on?") = P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
In general, the chain rule above reduces to the following product of conditional probabilities in the bigram case:<br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case as:<br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
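<br />
The bigram product can be sketched with simple counts. The maximum likelihood estimate (pair count divided by context count), the start token <code>&lt;s&gt;</code>, and the absence of smoothing are all assumptions made for this illustration.<br />
<br />
```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and (previous word, word) pairs; '<s>' marks the sentence start."""
    context_counts, pair_counts = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split()
        for prev, cur in zip(words, words[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, cur)] += 1
    return context_counts, pair_counts

def sentence_probability(sentence, context_counts, pair_counts):
    """Product over k of P(w_k | w_{k-1}), estimated as count(prev, cur) / count(prev)."""
    words = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        if context_counts[prev] == 0:
            return 0.0  # unseen context, no smoothing in this sketch
        prob *= pair_counts[(prev, cur)] / context_counts[prev]
    return prob

ctx, pairs = train_bigram(["I love apple", "I love sentence"])
p_seen = sentence_probability("I love apple", ctx, pairs)
p_unseen = sentence_probability("apple love I", ctx, pairs)
```
<br />
With these two training sentences, "I love apple" gets probability 1 × 1 × 1/2, while the reordered sentence gets probability zero, which shows that the bigram model is sensitive to word order.<br />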
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide useful predictive clues. For example, suppose you want the model to learn singular and plural agreement in the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
Capturing the agreement between the subject and the verb here requires an 11-gram, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The following example illustrates the difference between these representations.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice how A and B are the same vector. This is the same as bag of words, and it shows the aforementioned problem: '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because the bigram takes one neighboring word into consideration. However, A and C also share an element, namely "I love". If we were to further increase N in the N-gram, it would be easier to distinguish between the two. However, the consequence of operating with a higher N is that the run time will increase.<br />
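<br />
The rows of the two tables above can be reproduced with a short helper. Whitespace tokenization and the function name <code>ngrams</code> are assumptions for this sketch.<br />
<br />
```python
def ngrams(text, n):
    """Return the list of word n-grams of a whitespace-tokenized text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

A, B, C = "I love apple", "apple love I", "I love sentence"
same_unigrams = sorted(ngrams(A, 1)) == sorted(ngrams(B, 1))
bigrams_a = ngrams(A, 2)
bigrams_b = ngrams(B, 2)
shared_a_c = set(ngrams(A, 2)) & set(ngrams(C, 2))
```
<br />
A and B contain the same unigrams but share no bigram, while A and C share the bigram "I love".<br />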
<br />
=== Feature Hashing ===<br />
From the above, we can see that although N-grams keep word order information, the dimension of the feature vector or matrix increases as N increases. Not all combinations of words are common in real life, which may make the feature matrix very sparse and computationally expensive. In order to address this problem, we introduce the '''hashing trick''', also known as '''feature hashing'''.<br />
<br />
Feature hashing can be used in sentence classification. It maps features to indices of an array or matrix of fixed size by applying a hash function to the features. The general idea is to map feature vectors from a high dimensional space to a lower dimensional one, which reduces the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
<br />
A hash function is any function that maps an arbitrary set of keys to a range of fixed size. It can be generally expressed as <math> h:\{1,\ldots,N\} \rightarrow \{1,\ldots,M\} </math>. A very simple hash function is the modulo operation. If we are mapping a key to a hash table of M slots, then a simple hash function can be defined as<br />
<br />
h(key) = key % M<br />
<br />
<br />
For N-gram models, word counts are usually used as features. The hashed feature map can therefore be easily calculated as<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} x_j </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
<br />
As an example, suppose that applying the hash function to each word of a seven-word sentence returns the indices [0, 1, 3, 5, 6, 2, 3], with M = 7 slots. The corresponding hashed feature map is then<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
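<br />
The table above can be reproduced by counting how often each index is returned. In this sketch, <code>hash_index</code> uses CRC32 reduced modulo M as one possible stable string hash; that choice is an assumption, not the hash function used by the authors.<br />
<br />
```python
import zlib

def hash_index(token, M):
    """Map a string token to one of M slots (CRC32 is an illustrative choice of hash)."""
    return zlib.crc32(token.encode("utf-8")) % M

def hashed_feature_map(indices, M):
    """phi[i] counts the features whose hash value is i."""
    phi = [0] * M
    for i in indices:
        phi[i] += 1
    return phi

phi = hashed_feature_map([0, 1, 3, 5, 6, 2, 3], 7)
print(phi)  # [1, 1, 1, 2, 0, 1, 1]
```
<br />
The size of the output is fixed by M, no matter how many distinct words or n-grams occur in the data.<br />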
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. In the example above, suppose that both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al. 2009], which introduces a second hash function <math> \xi </math> that determines the sign with which each feature is added to its index. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
Consider if <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, then our signed hash map now becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, collisions will "cancel out" on average, which yields an unbiased estimate.<br />
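<br />
A minimal sketch of the signed hashed feature map, with the two hash functions passed in as plain Python callables. The toy dictionaries standing in for <math>h</math> and <math>\xi</math>, and the token "went", are hypothetical.<br />
<br />
```python
def signed_hashed_feature_map(tokens, h, xi, M):
    """phi[i] is the sum of xi(token) over the tokens with h(token) == i."""
    phi = [0] * M
    for tok in tokens:
        phi[h(tok)] += xi(tok)
    return phi

# Toy stand-ins for the two hash functions (hypothetical values).
index_of = {"I": 0, "Mary": 0, "went": 1}
sign_of = {"I": 1, "Mary": -1, "went": 1}
phi_signed = signed_hashed_feature_map(
    ["I", "Mary", "went"], index_of.get, sign_of.get, 3
)
```
<br />
Here "I" and "Mary" collide at index 0 with opposite signs, so their contributions cancel, as in the table above.<br />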
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. However, another way to represent the features is TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to the document.<br />
<br />
'''Term Frequency (TF)''' generally measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency, such that a word is not deemed important if it is a generally common word, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the total number of documents that contain word <math> t </math>.<br />
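<br />
The two definitions above can be combined in a short sketch. Whitespace tokenization and the natural logarithm are assumptions here, since the base of the logarithm is not stated; the function name <code>tf_idf</code> is illustrative.<br />
<br />
```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: tf * idf} dictionary per document (raw counts, natural log)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents that contain each word.
    doc_freq = Counter(w for words in tokenized for w in set(words))
    result = []
    for words in tokenized:
        tf = Counter(words)
        result.append({w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf})
    return result

scores = tf_idf(["I love apple", "apple love I", "I love sentence"])
```
<br />
Words that occur in every document, such as "I" and "love" here, receive a score of zero, while a word that occurs in a single document, such as "sentence", receives the largest weight.<br />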
<br />
== Experiment ==<br />
In this experiment, fastText was compared with various other text classifiers on two classification problems.<br />
<br />
The first classification problem is sentiment analysis, where fastText is compared to existing text classifiers. The second evaluates how fastText scales to a larger output space, using a tag prediction dataset.<br />
<br />
The Vowpal Wabbit library, written in C++, can also be used to implement this model. However, the authors report that their tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
* Tag Prediction<br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. In order to test it, evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images, each with captions, titles and tags. Here, the focus was on using the fastText text classifier to predict the tags associated with each image without using the image itself, but rather the information associated with the image, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur fewer than 100 times and to split the data into a train, validation and test set. The train set consists of approximately 90% of the dataset, the validation set of approximately 1%, and the test set of approximately 0.5%. Table 4 shows an example of 5 items in the validation set with their Inputs (title and caption), Prediction (the tag class they are classified to) and Tags (real image tags), highlighting when the predicted class is in fact one of the tags.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Input<br />
! Prediction<br />
! Tags<br />
|-<br />
| taiyoucon 2011 digitals: individuals digital photos from the anime convention taiyoucon 2011 in mesa, arizona. if you know the model and/or the character, please comment.<br />
| #cosplay<br />
| #24mm #anime #animeconvention #arizona #canon #con #convention #cos '''#cosplay''' #costume #mesa #play #taiyou #taiyoucon<br />
|-<br />
| 2012 twin cities pride 2012 twin cities pride parade<br />
| #minneapolis <br />
| #2012twincitiesprideparade '''#minneapolis''' #mn #usa<br />
|-<br />
| beagle enjoys the snowfall<br />
| #snow<br />
| #2007 #beagle #hillsboro #january #maddison #maddy #oregon '''#snow'''<br />
|-<br />
| christmas <br />
| #christmas <br />
| #cameraphone #mobile<br />
|-<br />
| euclid avenue <br />
| #newyorkcity <br />
| #cleveland #euclidavenue<br />
|}<br />
<br />
<br />
<br />
Removing infrequently occurring tags and words eliminates some noise and helps the model learn better. A [https://github.com/facebookresearch/fastText script] has been released explaining the breakup of the data. After cleaning, the vocabulary size is 297,141 and there are 312,116 unique tags. Results on the test set are reported as precision at 1 (prec@1), that is, the proportion of items for which the top-ranked predicted tag is in fact one of the real tags (see [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia] for the general definition of precision).<br />
<br />
==== Baselines====<br />
A frequency-based approach which simply predicts the most frequent tag is used as a baseline.<br />
<br />
The model that fastText is compared to in tag prediction is [http://www.aclweb.org/anthology/D14-1194 Tagspace]. It is similar to fastText but is based on [http://www.thespermwhale.com/jaseweston/papers/wsabie-ijcai.pdf WSABIE], which predicts tags based on both images and their text annotations. The Tagspace model is a convolutional neural network that predicts hashtags in the semantic context of social network posts. The experiment uses the linear version of this model, which achieves comparable performance but is much faster. Tagspace also predicts multiple hashtags, so this experiment does not fully compare the capabilities of the Tagspace model.<br />
<br />
==== Results ====<br />
<br />
{| class="wikitable"<br />
|-<br />
! Model <br />
! prec@1<br />
! Running time - Train<br />
! Running time - Test<br />
|- <br />
| Freq. baseline <br />
| 2.2 <br />
| - <br />
| -<br />
|- <br />
| Tagspace, h = 50 <br />
| 30.1 <br />
| 3h8 <br />
| 6h<br />
|- <br />
| Tagspace, h = 200 <br />
| 35.6 <br />
| 5h32 <br />
| 15h<br />
|- <br />
| fastText, h = 50 <br />
| 30.8 <br />
| 6m40 <br />
| 48s<br />
|- <br />
| fastText, h = 50, bigram <br />
| 35.6 <br />
| 7m47 <br />
| 50s<br />
|- <br />
| fastText, h = 200 <br />
| 40.7 <br />
| 10m34 <br />
| 1m29<br />
|-<br />
| fastText, h = 200, bigram <br />
| 45.1 <br />
| 13m38 <br />
| 1m37<br />
|}<br />
<br />
<br />
<br />
The above table presents a comparison of the fastText model to the baselines. The "Precision-at-one" (prec@1) metric reports the proportion of items for which the highest ranking tag that the model predicts is in fact one of the real tags.<br />
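<br />
The prec@1 metric can be sketched as follows, using the five validation items from the example table in the Dataset section with their tag sets abbreviated. The function name and data layout are assumptions for illustration.<br />
<br />
```python
def precision_at_one(top_predictions, true_tags):
    """Fraction of items whose top-ranked predicted tag is one of the item's real tags."""
    hits = sum(1 for pred, tags in zip(top_predictions, true_tags) if pred in tags)
    return hits / len(top_predictions)

# Top prediction and (abbreviated) real tags for the five example items.
predictions = ["#cosplay", "#minneapolis", "#snow", "#christmas", "#newyorkcity"]
tags = [
    {"#anime", "#cosplay", "#costume"},
    {"#2012twincitiesprideparade", "#minneapolis", "#mn", "#usa"},
    {"#beagle", "#oregon", "#snow"},
    {"#cameraphone", "#mobile"},
    {"#cleveland", "#euclidavenue"},
]
score = precision_at_one(predictions, tags)
```
<br />
Three of the five predictions are among the real tags, so prec@1 on these five items alone is 0.6; the values in the results table are percentages over the full test set.<br />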
<br />
<br />
We can see that the frequency baseline has the lowest prec@1. fastText is run for 5 epochs and compared with Tagspace, with and without bigrams, for hidden layer sizes of 50 and 200. With 50 hidden units, fastText without bigrams (30.8) and Tagspace (30.1) perform similarly, while fastText with bigrams at 50 hidden units (35.6) matches Tagspace at 200 hidden units (35.6). At 200 hidden units, fastText performs better still (40.7), and adding bigrams improves it further (45.1).<br />
<br />
<br />
Finally, both the training and test times of fastText are significantly lower than those of Tagspace, which must compute the scores for each class. fastText, on the other hand, has fast inference, which gives it an advantage when the number of classes is large. In particular, test times for fastText are far shorter than those of Tagspace (seconds to minutes rather than hours).<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, which are then fed to a linear classifier. The authors test their model, fastText, against other models with an evaluation protocol similar to Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction data set. fastText achieved comparable results in terms of accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
<br />
=== fastText and baseline models ===<br />
<br />
The performance of fastText was compared to several baselines related to text classification. While this paper focused on the ways fastText outperformed these classifiers, the baseline classifiers were not always evaluated at their optimal performance.<br />
<br />
* '''Tagspace''' was used as a comparison classifier in a tag prediction problem. However, the goal of this model is to predict hashtags as semantic embeddings rather than image tags as in the experiment. Additionally, only a linear version of Tagspace was used as a comparison.<br />
<br />
== Sources ==<br />
<br />
Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of Tricks for Efficient Text Classification. ''Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers''. doi:10.18653/v1/e17-2068<br />
<br />
Weston, J., Bengio, S., Usunier, N. (2011). Wsabie: Scaling Up To Large Vocabulary Image Annotation. ''Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI (2011)''. url: http://www.thespermwhale.com/jaseweston/papers/wsabie-ijcai.pdf<br />
<br />
Weston, J., Chopra, S., & Adams, K. (2014). #TagSpace: Semantic Embeddings from Hashtags. ''Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)''. doi:10.3115/v1/d14-1194<br />
<br />
== Further Reading ==<br />
<br />
* [https://fasttext.cc/ fastText official website] - Resources for using fastText<br />
<br />
* [https://research.fb.com/fasttext/ Facebook Research: fastText] - Overview of fastText released by Facebook Research.<br />
<br />
* [https://github.com/facebookresearch/fastText fastText GitHub] <br />
<br />
* [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441w18/A_New_Method_of_Region_Embedding_for_Text_Classification A New Method of Region Embedding for Text Classification (Summary)] - Paper summary describing a method of preserving local structure information with small text regions for text classification tasks. <br />
<br />
* [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441w18/Convolutional_Neural_Networks_for_Sentence_Classification Convolution Neural Networks for Sentence Classification (Summary)] - Paper summary describing applying four variations of Convolutional Neural Networks to several NLP tasks such as sentiment analysis, customer review prediction, movie reviews, and more.</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35300Bag of Tricks for Efficient Text Classification2018-03-22T21:33:47Z<p>Cs3yang: /* Sources */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have more recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their use on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks.<br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large datasets while maintaining good performance.<br />
<br />
The basis of the analysis in this paper is applying the classifier fastText to two tasks, tag prediction and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on a billion words within ten minutes, while achieving performance on par with the state-of-the-art.”<br />
<br />
== Background ==<br />
<br />
<br />
=== Natural-Language Processing and Text Classification ===<br />
[https://en.wikipedia.org/wiki/Natural-language_processing Natural Language Processing] (NLP) is concerned with being able to process large amounts of natural language data, involving speech recognition, natural language understanding, and natural-language generation. Text understanding involves being able to understand the explicit or implicit meaning of elements of text such as words, phrases, sentences, and paragraphs, and making inferences about these properties of texts (Norvig, 1987). One of the main topics in NLP is text classification, which is assigning predefined categories to free-text documents, with research in this field ranging from designing the best features to choosing the best machine learning classifiers (Zhang et al. 2016). Traditionally, techniques for text classification are based on simple word statistics, such as Bag of Words and N-gram features, that are fed to linear classifiers.<br />
<br />
With the advancement of deep learning and the availability of large data sets, methods of handling text understanding using deep learning techniques have become popular in recent years. These deep learning models have been shown to perform significantly better than traditional models in several studies. The following are the deep learning models that are compared to fastText in the Experiment section:<br />
<br />
=== Char-CNN ===<br />
<br />
=== Char-CRNN ===<br />
<br />
A character-level convolutional recurrent neural network (Char-CRNN) is a convolutional neural network (CNN) with a recurrent layer on top. Because a recurrent layer can efficiently capture long-term dependencies, only a very small number of convolutional layers is needed.<br />
<br />
The recurrent layer consists of a recursive function <math>f</math> which takes as input one input vector and the previous hidden state, and returns the new hidden state <math> h_t = f(x_t, h_{t-1})</math>, where <math> x_t \in \mathbb{R}^d</math> is one time step from the input sequence <math> (x_1, x_2, \ldots, x_T)</math>.<br />
<br />
[[File:xiao and cho 2016 pic.png]]<br />
Graphical illustration of (a) a CNN and (b) proposed CRNN for character-level document classification <br />
<br />
Source: [Xiao and Cho 2016] Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv:1602.00367.<br />
<br />
=== VDCNN ===<br />
<br />
=== Tagspace ===<br />
<br />
[http://emnlp2014.org/papers/pdf/EMNLP2014194.pdf Tagspace] is a convolutional neural network which aims to predict hashtags in social network posts. In this context, hashtags are used diversely, as identifiers, sentiments, topic annotations, and more. Tagspace uses "embeddings", which are vector representations of text; these are combined by some function to produce a point in the embedding space. Because Tagspace also captures the semantic context of hashtags, it is a strong model in terms of NLP learning. Tagspace was also applied to a document recommendation problem, where the next item a user will interact with was predicted based on their previous history.<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly in a large output space, where some classes have very few examples. The authors of this paper attempt to improve on the performance of linear classifiers with the key features of a rank constraint and a fast loss approximation. Before we get into that, we must better understand how linear classifiers are trained.<br />
<br />
Consider each text and each label as a vector in space. The model trains the coordinates of the text vector so that the text vector ends up close to the vector of its associated label. The text vector and its label vector are combined into a score, and the softmax function normalizes this score across the scores for that same text with every other possible label. The result is the probability that the text has its associated label. The stochastic gradient descent algorithm is then used to keep updating the coordinates until the probability of the correct label for every text is maximized. This is clearly computationally expensive, as the score for every possible label in the training set must be computed for each text. <br />
<br />
With K labels in the training set, the softmax function gives the probability that a text is associated with label j as: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label <math>j</math>.<br />
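As a minimal sketch of the formula above, the following computes the softmax probabilities for one text vector against a list of label vectors. The function names and the toy vectors are illustrative assumptions; subtracting the maximum score before exponentiating is a standard numerical-stability step that does not change the result.<br />

```python
import math

def dot(u, v):
    """Inner product x^T w of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax_probs(x, W):
    """Return P(y = j | x) for every label vector w_j in W."""
    scores = [dot(x, w) for w in W]
    m = max(scores)  # shift scores for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy usage: one 2-dimensional text vector and K = 3 label vectors.
probs = softmax_probs([1.0, 2.0], [[0.5, 0.1], [0.2, 0.9], [-1.0, 0.3]])
```

Note that every one of the K scores must be computed to obtain a single probability, which is the cost the text refers to.<br />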
<br />
Now let’s look at the improved method of linear classifiers with a rank constraint and fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the n-gram features of the input are first looked up to find word representations, which are then averaged into a hidden text representation. This representation is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log likelihood over the classes. The classifier trains on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
<br />
[[File:formula explained.png]]<br />
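The forward pass described above (look up, average, linear classifier, softmax, negative log likelihood) can be sketched as follows. This is a simplified illustration and not the fastText implementation: the matrix names <code>A</code> (embedding lookup table) and <code>B</code> (classifier weights), the function names, and the toy values are assumptions, and training by stochastic gradient descent is left out.<br />

```python
import math

def average_embeddings(feature_ids, A):
    """Look up each n-gram feature in A and average the rows into a hidden vector."""
    dim = len(A[0])
    hidden = [0.0] * dim
    for idx in feature_ids:
        for d in range(dim):
            hidden[d] += A[idx][d]
    return [h / len(feature_ids) for h in hidden]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(feature_ids, A, B):
    """Class probabilities: softmax of the linear classifier B applied to the hidden vector."""
    hidden = average_embeddings(feature_ids, A)
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in B]
    return softmax(scores)

def negative_log_likelihood(docs, labels, A, B):
    """Average negative log likelihood of the correct labels over N documents."""
    total = 0.0
    for feature_ids, y in zip(docs, labels):
        total -= math.log(predict_proba(feature_ids, A, B)[y])
    return total / len(docs)

# Toy usage: 3 features with 2-dimensional embeddings, 2 classes.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
loss = negative_log_likelihood([[0, 1], [0, 2]], [0, 0], A, B)
```

Because the hidden vector is a plain average, the cost of a forward pass depends on the number of features in the document and the number of classes, not on any recurrent or convolutional computation.<br />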
<br />
Two changes that were applied in this model architecture are the hierarchical softmax function, which improves performance with a large number of classes, and the hashing trick, which keeps the mapping of the n-grams used to capture local word order fast and memory efficient. These two techniques are explained more thoroughly in the upcoming sections.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over predefined classes. It calculates the probability of <math>y_n</math> being each possible class and outputs a probability estimate for each class. The denominator of the softmax function serves to normalize the probabilities, so that a single <math>y_n</math> receives a probability for each label and these probabilities sum to one. This provides a means to choose the label with the highest probability as the label for <math>y_n</math>. <br />
<br />
However, the softmax function does have a computational complexity of <math> O(Kd) </math> where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is due to the nature of the softmax function since each function calculation requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the differences in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose we have a binary tree structure based on Huffman coding for the softmax function, where each inner node has at most two children and the classes sit at the leaves. Huffman coding trees provide a means to optimize binary trees: the classes with the lowest frequencies are placed deeper in the tree and the highest-frequency classes are placed near the root, which minimizes the path length for the more frequently occurring classes. With such a tree, the computational cost is reduced to <math> O(d \log_2(K)) </math>. <br />
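To get a feel for the difference between the two costs, the sketch below plugs in the number of unique tags reported later on this page (312,116) and a hidden dimension of 200. It only counts operations up to constant factors, and it assumes a balanced tree of depth <math>\log_2 K</math>; a Huffman tree has unequal path lengths, so the second figure is a rough guide rather than an exact count.<br />

```python
import math

def flat_softmax_cost(num_classes, hidden_dim):
    """Operation count O(K d) for the full softmax."""
    return num_classes * hidden_dim

def hierarchical_softmax_cost(num_classes, hidden_dim):
    """Operation count O(d log2 K), assuming a balanced binary tree."""
    return hidden_dim * math.log2(num_classes)

K = 312116  # unique tags in the tag prediction dataset
d = 200     # hidden dimension used in the larger models

print(flat_softmax_cost(K, d))
print(round(hierarchical_softmax_cost(K, d)))
```

Under these assumptions the hierarchical version needs several thousand operations per example where the flat softmax needs tens of millions.<br />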
<br />
A probability is calculated for each step along a path, depending on whether we travel right or left from a node. This is done by applying the sigmoid function to the product of the output vector <math>v_{n_i}</math> of each inner node <math> n_i </math> and the output value of the hidden layer of the model, <math>h</math>. The idea of this method is to represent the output classes as the leaves of this tree; a random walk then assigns probabilities to these classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> represents the leaf node that a class is located on, at depth <math> l+1 </math>, and <math> n_1, n_2, \ldots, n_l </math> represent the parent nodes of that leaf </i><br />
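The class probability above can be sketched as a product of sigmoid terms along the path from the root to a leaf. The sign convention used here (probability <math>\sigma(v \cdot h)</math> for going left and <math>\sigma(-v \cdot h)</math> for going right) is a common choice assumed for illustration, and the tree, vectors and function names are hypothetical.<br />

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def leaf_probability(path, h):
    """Multiply the branch probabilities along a root-to-leaf path.

    path is a list of (inner node vector, direction) pairs, where direction
    is +1 for a left turn and -1 for a right turn (assumed convention).
    """
    prob = 1.0
    for v, direction in path:
        score = sum(vi * hi for vi, hi in zip(v, h))
        prob *= sigmoid(direction * score)
    return prob

# Toy tree with 3 inner nodes and 4 leaves (labels).
v1, v2, v3 = [0.4, -0.1], [0.3, 0.8], [-0.5, 0.2]
h = [1.0, 2.0]
paths = {
    "label 1": [(v1, +1), (v2, +1)],
    "label 2": [(v1, +1), (v2, -1)],
    "label 3": [(v1, -1), (v3, +1)],
    "label 4": [(v1, -1), (v3, -1)],
}
leaf_probs = {label: leaf_probability(p, h) for label, p in paths.items()}
```

Since the two branch probabilities at every inner node sum to one, the leaf probabilities sum to one without any normalization over all classes.<br />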
<br />
In the below figure, we can see an example of a binary tree for the hierarchical softmax model. An example path from root node <math>n_1</math> to label 2 is highlighted in blue. In this case we can see that each path has an associated probability calculation and the total probability of label 2 is in line with the class probability calculation above.<br />
<br />
[[File:figure2.jpg|Binary Tree Example for the Hierarchical Softmax Model]]<br />
<br />
<br />
Note that during the training process, the vector representations of the inner nodes, <math>v_{n_i}</math>, are updated by introducing an error function, <math>Err</math>, and differentiating it with the chain rule to obtain the following:<br />
<div style="text-align: center;"> <math> \frac{\partial Err}{\partial v_{n_i}^{'}}=\frac{\partial Err}{\partial v_{n_i}^{'}h} \cdot \frac{\partial v_{n_i}^{'}h }{\partial v_{n_i}^{'}} </math> <br></div><br />
<br />
We can use this to update the vector values with the following:<br />
<br />
<div style="text-align: center;"> <math> v_{n_i}'(new) = v_{n_i}'(old) - \eta \frac{\partial Err}{\partial v_{n_i}'} </math><br></div><br />
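A minimal sketch of this update for a single inner node is given below. It assumes the error function is the negative log of the sigmoid branch probability, in which case the gradient with respect to the node vector is <math>(\sigma(v \cdot h) - t)\,h</math>, with <math>t = 1</math> for a left turn and <math>t = 0</math> for a right turn. The error function, the convention and the names are assumptions made for illustration; the page does not specify them.<br />

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def update_inner_node(v, h, went_left, eta):
    """One gradient step v(new) = v(old) - eta * dErr/dv for one inner node.

    Assumes Err = -log P(correct branch), so dErr/dv = (sigmoid(v.h) - t) * h.
    """
    score = sum(vi * hi for vi, hi in zip(v, h))
    t = 1.0 if went_left else 0.0
    error = sigmoid(score) - t
    return [vi - eta * error * hi for vi, hi in zip(v, h)]

# Toy usage: the correct path goes left at this node.
v_old = [0.0, 0.0]
h = [1.0, 2.0]
v_new = update_inner_node(v_old, h, went_left=True, eta=0.1)
```

After the step, the probability assigned to the correct branch is higher than before, which is the intended effect of the update.<br />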
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is an algorithm for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized such that all the elements of the vector add up to one (taking the frequency percentage of each word). If these frequencies exceed a certain level, they will activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information, because it looks at single words and is invariant to their order. We will demonstrate that shortly. Bag of words will also have a high error percentage if the training set does not include the entire dictionary of the testing set. <br />
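The bag of words procedure described above can be sketched as follows. The function names, the lower-casing and the whitespace tokenization are simplifying assumptions made for illustration.<br />

```python
from collections import Counter

def build_dictionary(train_docs, n):
    """Keep the n most frequent words of the training documents."""
    counts = Counter(word for doc in train_docs for word in doc.lower().split())
    return [word for word, _ in counts.most_common(n)]

def bow_vector(doc, dictionary):
    """Count dictionary words in doc and normalize the counts to sum to one."""
    counts = Counter(doc.lower().split())
    raw = [counts[word] for word in dictionary]
    total = sum(raw)
    if total == 0:
        return [0.0] * len(dictionary)  # no dictionary word occurs in doc
    return [c / total for c in raw]

dictionary = build_dictionary(["I love apple", "I love sentence"], 4)
vec_a = bow_vector("I love apple", dictionary)
vec_b = bow_vector("apple love I", dictionary)
```

Here <code>vec_a</code> and <code>vec_b</code> are identical even though the word order differs, which is the loss of information mentioned above. A document made up only of words outside the dictionary gets an all-zero vector.<br />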
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation by storing the n local words adjacent to the initial word (or characters, since N-grams can also be character based). Each word in the document is read one at a time just like in bag of words, but a certain range of its neighbors is scanned as well. This range is known as the n-gram. A unigram (N = 1) is equivalent to bag of words, while any N greater than 1 will contain more information than bag of words. <br />
<br />
The picture above gives an example of a unigram (1-gram), which is the simplest possible version of this model. A unigram does not consider previous words and just chooses words based on how common they are in general. It also shows a bigram (2-gram), where the previous word is considered. A trigram would consider the previous two words, and so on up to N-grams, which consider the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. A bigram model approximates each factor by conditioning only on the previous word. For example, take the sentence, "How long can this go on?" We can model it as follows:<br />
<br />
P(“How long can this go on?”) ≈ P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
Going back to the chain rule, in the bigram case each word depends only on the word before it, so the probability of the sequence is approximated by the following product of conditional probabilities (with <math>\omega_0</math> denoting the start of the sentence): <br />
<br />
<math>P(\omega_1^n) \approx \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case, where each word depends on the N-1 words before it: <br />
<br />
<math>P(\omega_1^n) \approx \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
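The bigram product can be sketched with probabilities estimated from counts. Estimating <math>P(\omega_k | \omega_{k-1})</math> as a ratio of counts (maximum likelihood, without smoothing), the <code>&lt;s&gt;</code> start marker and the toy corpus are assumptions added here for illustration.<br />

```python
from collections import Counter

START = "<s>"  # hypothetical start-of-sentence marker

def train_bigram(sentences):
    """Count how often each word follows each context word."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = [START] + sentence.split()
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts

def sentence_probability(sentence, context_counts, bigram_counts):
    """Product over k of P(w_k | w_{k-1}), each estimated as a count ratio."""
    tokens = [START] + sentence.split()
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        prob *= bigram_counts[(prev, cur)] / context_counts[prev]
    return prob

contexts, bigrams = train_bigram(["how long", "how far"])
p_seen = sentence_probability("how long", contexts, bigrams)
p_unseen = sentence_probability("how fast", contexts, bigrams)
```

Any sentence containing a bigram that never occurred in the training text receives probability zero under this estimate, which is one reason practical N-gram models add smoothing.<br />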
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide any useful predictive clues. For example, suppose you want the model to learn the singular and plural agreement in the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
You would need an 11-gram to capture the agreement between the subject and the verb, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The following example illustrates this.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice how A and B are the same vector. This is just like bag of words and the aforementioned problem that '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love i<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because the bigram takes into consideration one neighboring word. However, A and C also share an element, namely "I love". If we were to further increase N, we would have an easier time distinguishing between the two. However, the consequence of operating with a higher N is that the run time will increase.<br />
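The unigram and bigram tables above can be reproduced with a short sketch. The function name <code>ngram_counts</code> and the whitespace tokenization are assumptions made for illustration.<br />

```python
from collections import Counter

def ngram_counts(text, n):
    """Count the word n-grams of a text, tokenized on whitespace."""
    tokens = text.split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

A = "I love apple"
B = "apple love I"
C = "I love sentence"

unigrams = {name: ngram_counts(text, 1) for name, text in [("A", A), ("B", B), ("C", C)]}
bigrams = {name: ngram_counts(text, 2) for name, text in [("A", A), ("B", B), ("C", C)]}

same_unigrams = unigrams["A"] == unigrams["B"]  # order is lost with unigrams
same_bigrams = bigrams["A"] == bigrams["B"]     # order is kept with bigrams
```

With unigrams A and B are indistinguishable, while with bigrams they differ; A and C still share the bigram "I love".<br />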
<br />
=== Feature Hashing ===<br />
From above, we can see that although N-grams preserve word-order features, the dimension of our feature vector or matrix increases as N increases. Not all combinations of words are common in real life, so the feature matrix may be very sparse and computationally expensive to work with. In order to address this problem, we introduce the '''hash trick''', aka '''feature hashing'''. <br />
<br />
Feature hashing can be used in sentence classification. It maps feature vectors to indices of an array or a matrix of fixed size by applying a hash function to the features. The general idea is to map feature vectors from a high-dimensional space to a lower-dimensional one, reducing the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
<br />
A hash function is any function that maps an arbitrary set of keys to a set of fixed size. It can be generally expressed as <math> h(n):\{1,...,N\} \rightarrow \{1,...,M\} </math>. A very simple hash function is the modulo operation. Suppose we are mapping a key to a hash table of M slots. Then a simple hash function can be defined as <br />
<br />
h(key) = key % M<br />
<br />
<br />
Since word counts are usually used as features in N-gram models, the hashed feature map can be easily calculated as<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} x_j </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
<br />
By applying a hash function to each word of a seven-word example sentence (one that contains the words "I" and "Mary"), we get a list of returned indexes [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map will be <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
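The hashed feature map above can be sketched with the modulo hash <code>h(key) = key % M</code>. The integer keys below are hypothetical stand-ins for word ids, chosen only so that they hash to the indexes [0, 1, 3, 5, 6, 2, 3] used in the example; the function names are also assumptions.<br />

```python
def modulo_hash(key, num_slots):
    """The simple hash function h(key) = key % M."""
    return key % num_slots

def hashed_feature_map(keys, num_slots):
    """Count how many keys land in each of the M slots."""
    features = [0] * num_slots
    for key in keys:
        features[modulo_hash(key, num_slots)] += 1
    return features

M = 7
word_ids = [7, 8, 10, 12, 13, 9, 17]  # hypothetical ids of the seven words
indexes = [modulo_hash(k, M) for k in word_ids]
feature_map = hashed_feature_map(word_ids, M)
```

Whatever the size of the vocabulary, the resulting feature vector always has exactly M entries.<br />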
<br />
A hash collision happens when two distinct keys are mapped to the same index. In the example above, suppose both "Mary" and "I" are mapped to the same index 0. The output hash table will then become:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al. 2009], which introduces another hash function <math> \xi </math> to determine the sign applied at the returned index. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
Consider if <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, then our signed hash map now becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, collisions will "cancel out", and therefore, achieve an unbiased estimate.<br />
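The signed hashed feature map can be sketched as follows. The indexes and signs are written out by hand to match the collision example above, where "I" and "Mary" both land on index 0 with signs +1 and -1; the assignment of +1 to all other words and the function name are assumptions made for illustration.<br />

```python
def signed_hashed_feature_map(indexes, signs, num_slots):
    """Add xi(x_j) to slot h(x_j) for every feature x_j."""
    features = [0] * num_slots
    for index, sign in zip(indexes, signs):
        features[index] += sign
    return features

M = 7
# Seven words after the collision: "I" and "Mary" both hash to index 0.
indexes = [0, 1, 3, 5, 0, 2, 3]
# xi("I") = +1 and xi("Mary") = -1; every other word is assumed to get +1.
signs = [+1, +1, +1, +1, -1, +1, +1]

signed_map = signed_hashed_feature_map(indexes, signs, M)
unsigned_map = signed_hashed_feature_map(indexes, [1] * len(indexes), M)
```

In the unsigned map the collision inflates slot 0 to 2, while in the signed map the two contributions cancel to 0.<br />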
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. However, another way to represent the features is TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to the document.<br />
<br />
'''Term Frequency (TF)''' generally measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered as an adjustment to the term frequency, such that a word won't be deemed important if it is a generally common word, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the total number of documents that contain word <math> t </math>.<br />
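The two definitions above combine into the following sketch. The natural logarithm is an assumption (the formula does not state a base), as are the function names, the pre-tokenized toy documents, and the choice to return 0 for a term that occurs in no document.<br />

```python
import math

def tf(term, doc):
    """Raw count of term in a tokenized document."""
    return doc.count(term)

def idf(term, docs):
    """log(N / number of documents containing term), natural log assumed."""
    doc_freq = sum(1 for doc in docs if term in doc)
    if doc_freq == 0:
        return 0.0  # assumed convention for terms that never occur
    return math.log(len(docs) / doc_freq)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [["the", "cat", "sat"], ["the", "dog", "ran"]]
score_common = tfidf("the", docs[0], docs)  # appears in every document
score_rare = tfidf("cat", docs[0], docs)    # appears in one document only
```

A word such as "the" that occurs in every document gets a score of zero, while a word that is specific to one document keeps a positive score.<br />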
<br />
== Experiment ==<br />
In this experiment, fastText was compared with various other text classifiers on two classification problems. <br />
<br />
The first classification problem is sentiment analysis, where fastText is compared to existing text classifiers. The second evaluates how fastText scales to a larger output space on a tag prediction dataset.<br />
<br />
The Vowpal Wabbit library, written in C++ can also be used to implement our model. However, compared to this library, our tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
*Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. To test it, the evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images with captions, titles and tags. Here, the spotlight was on using the fastText text classifier to predict the tags associated with each image without using the image itself, but rather the information associated with it, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur fewer than 100 times and split the data into a train, validation and test set. The train set consists of approximately 90% of the dataset, the validation set of approximately 1%, and the test set of 0.5%. Table 4 shows an example of 5 items in the validation set with their Inputs (title and caption), Prediction (the tag class they are classified to) and Tags (real image tags), highlighting when the predicted class is in fact one of the tags. <br />
<br />
{| class="wikitable"<br />
|-<br />
! Input<br />
! Prediction<br />
! Tags<br />
|-<br />
| taiyoucon 2011 digitals: individuals digital photos from the anime convention taiyoucon 2011 in mesa, arizona. if you know the model and/or the character, please comment.<br />
| #cosplay<br />
| #24mm #anime #animeconvention #arizona #canon #con #convention #cos '''#cosplay''' #costume #mesa #play #taiyou #taiyoucon<br />
|-<br />
| 2012 twin cities pride 2012 twin cities pride parade<br />
| #minneapolis <br />
| #2012twincitiesprideparade '''#minneapolis''' #mn #usa<br />
|-<br />
| beagle enjoys the snowfall<br />
| #snow<br />
| #2007 #beagle #hillsboro #january #maddison #maddy #oregon '''#snow'''<br />
|-<br />
| christmas <br />
| #christmas <br />
| #cameraphone #mobile<br />
|-<br />
| euclid avenue <br />
| #newyorkcity <br />
| #cleveland #euclidavenue<br />
|}<br />
<br />
<br />
<br />
Removing infrequently occurring tags and words eliminates some noise and helps our model learn better. A [https://github.com/facebookresearch/fastText script] has been released explaining the breakup of the data. After cleaning, the vocabulary size is 297,141 and there are 312,116 unique tags. Performance on the test set is reported as precision at 1 (Prec@1), the proportion of items whose highest-ranked predicted tag is in fact one of their real tags (for the general definition of precision, see [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia]).<br />
<br />
==== Baselines====<br />
A frequency-based approach which simply predicts the most frequent tag is used as a baseline.<br />
<br />
The model we compare fastText to in tag prediction is [http://www.aclweb.org/anthology/D14-1194 Tagspace]. It is similar to our model but is based on [http://www.thespermwhale.com/jaseweston/papers/wsabie-ijcai.pdf WSABIE], which predicts tags based on both images and their text annotations. The Tagspace model is a convolutional neural network which predicts hashtags in the semantic contexts of social network posts. For faster and more comparable performance, the experiment uses a linear version of this model. Tagspace also predicts multiple hashtags, so this experiment does not fully compare the capabilities of the Tagspace model.<br />
<br />
==== Results ====<br />
<br />
{| class="wikitable"<br />
|-<br />
! Model <br />
! prec@1<br />
! Running time - Train<br />
! Running time - Test<br />
|- <br />
| Freq. baseline <br />
| 2.2 <br />
| - <br />
| -<br />
|- <br />
| Tagspace, h = 50 <br />
| 30.1 <br />
| 3h8 <br />
| 6h<br />
|- <br />
| Tagspace, h = 200 <br />
| 35.6 <br />
| 5h32 <br />
| 15h<br />
|- <br />
| fastText, h = 50 <br />
| 30.8 <br />
| 6m40 <br />
| 48s<br />
|- <br />
| fastText, h = 50, bigram <br />
| 35.6 <br />
| 7m47 <br />
| 50s<br />
|- <br />
| fastText, h = 200 <br />
| 40.7 <br />
| 10m34 <br />
| 1m29<br />
|-<br />
| fastText, h = 200, bigram <br />
| 45.1 <br />
| 13m38 <br />
| 1m37<br />
|}<br />
<br />
<br />
<br />
The above table presents a comparison of our fastText model to other baselines. The "Precision-at-one" (Prec@1) metric reports the proportion of items for which the highest-ranking tag that the model predicts is in fact one of the real tags for that item.<br />
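As a small illustration of the metric, the sketch below computes Prec@1 on the five validation examples shown in the Dataset section. The tag sets are abbreviated to a few of the real tags of each item, and the function name is an assumption; this is not the evaluation code used in the paper.<br />

```python
def precision_at_one(predictions, true_tags):
    """Fraction of items whose top predicted tag is among their real tags."""
    hits = sum(1 for pred, tags in zip(predictions, true_tags) if pred in tags)
    return hits / len(predictions)

# Top predictions for the five example items.
predictions = ["#cosplay", "#minneapolis", "#snow", "#christmas", "#newyorkcity"]

# Real tags of the same items (abbreviated).
true_tags = [
    {"#anime", "#arizona", "#cosplay", "#costume"},
    {"#2012twincitiesprideparade", "#minneapolis", "#mn", "#usa"},
    {"#2007", "#beagle", "#oregon", "#snow"},
    {"#cameraphone", "#mobile"},
    {"#cleveland", "#euclidavenue"},
]

score = precision_at_one(predictions, true_tags)
```

Three of the five top predictions are among the real tags, so Prec@1 on this small sample is 0.6, or 60 on the percentage scale used in the results table.<br />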
<br />
<br />
We can see that the frequency baseline has the lowest accuracy. fastText was run for 5 epochs and compared to the Tagspace results. fastText with bigrams is also used, and the models are tested with 50 and 200 hidden units. With 50 hidden units, fastText (without bigrams) and Tagspace performed similarly, while fastText with bigrams at 50 hidden units matched the score that Tagspace reached at 200. At 200 hidden units, fastText performs better still, and adding bigrams improves accuracy further. <br />
<br />
<br />
Finally, both the train and test times of our model were significantly shorter than those of Tagspace, which must compute the scores for each class. fastText, on the other hand, has fast inference, which gives it an advantage when the number of classes is large. In particular, test times for fastText are far shorter than those of Tagspace.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, and this representation is then fed to a linear classifier. The authors test their model fastText against other models with an evaluation protocol similar to Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction data set. fastText achieved comparable results in terms of accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
<br />
=== fastText and baseline models ===<br />
<br />
The performance of fastText was compared to several baselines related to text classification. While this paper focused on the ways fastText outperformed these classifiers, the baseline classifiers were not always evaluated at their optimal performance. <br />
<br />
* '''Tagspace''' was used as a comparison classifier in a tag prediction problem. However, the goal of this model is to predict hashtags as semantic embedding rather than image tags as in the experiment. Additionally, only a linear version of Tagspace was used as a comparison.<br />
<br />
== Sources ==<br />
<br />
Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of Tricks for Efficient Text Classification. ''Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers''. doi:10.18653/v1/e17-2068<br />
<br />
Weston, J., Chopra, S., & Adams, K. (2014). #TagSpace: Semantic Embeddings from Hashtags. ''Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)''. doi:10.3115/v1/d14-1194<br />
<br />
== Further Reading ==<br />
<br />
* [https://fasttext.cc/ fastText official website] - Resources for using fastText<br />
<br />
* [https://research.fb.com/fasttext/ Facebook Research: fastText] - Overview of fastText released by Facebook Research.<br />
<br />
* [https://github.com/facebookresearch/fastText fastText GitHub] <br />
<br />
* [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441w18/A_New_Method_of_Region_Embedding_for_Text_Classification A New Method of Region Embedding for Text Classification (Summary)] - Paper summary describing a method of preserving local structure information with small text regions for text classification tasks. <br />
<br />
* [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441w18/Convolutional_Neural_Networks_for_Sentence_Classification Convolution Neural Networks for Sentence Classification (Summary)] - Paper summary describing applying four variations of Convolutional Neural Networks to several NLP tasks such as sentiment analysis, customer review prediction, movie reviews, and more.</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35298Bag of Tricks for Efficient Text Classification2018-03-22T21:26:40Z<p>Cs3yang: /* Further Reading */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have more recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and testing time, which limits their usage for very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data sets while maintaining its good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to the two tasks: tag predictions and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
<br />
=== Natural-Language Processing and Text Classification ===<br />
[https://en.wikipedia.org/wiki/Natural-language_processing Natural Language Processing] (NLP) is concerned with being able to process large amounts of natural language data, involving speech recognition, natural language understanding, and natural-language generation. Text understanding involves being able to understand the explicit or implicit meaning of elements of text such as words, phrases, sentences, and paragraphs, and making inferences about these properties of texts (Norvig, 1987). One of the main topics in NLP is text classification, which is assigning predefined categories to free-text documents, with research ranging in this field from designing the best features to choosing the best machine learning classifiers (Zhang et al. 2016). Traditionally, techniques for text classification are based on simple statistics on words that use linear classifiers such as Bag of Words and N-grams. <br />
<br />
With the advancement of deep learning and the availability of large data sets, deep learning methods for text understanding have become popular in recent years. Several studies have shown these models to perform significantly better than the traditional models. The following deep learning models are compared to fastText in the Experiment section:<br />
<br />
=== Char-CNN ===<br />
<br />
=== Char-CRNN ===<br />
<br />
A character-level convolutional recurrent neural network (Char-CRNN) is a convolutional neural network (CNN) with a recurrent layer on top. Because a recurrent layer can efficiently capture long-term dependencies, only a very small number of convolutional layers is needed.<br />
<br />
The recurrent layer consists of a recursive function <math>f</math> which takes as input one input vector and the previous hidden state, and returns the new hidden state <math> h_t = f(x_t, h_{t-1})</math>, where <math> x_t \in \mathbb{R}^d</math> is one time step from the input sequence <math> (x_1, x_2, \ldots, x_T)</math>. <br />
<br />
[[File:xiao and cho 2016 pic.png]]<br />
Graphical illustration of (a) a CNN and (b) proposed CRNN for character-level document classification <br />
<br />
Source: [Xiao and Cho 2016] Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv:1602.00367.<br />
<br />
=== VDCNN ===<br />
<br />
=== Tagspace ===<br />
<br />
[http://emnlp2014.org/papers/pdf/EMNLP2014194.pdf Tagspace] is a convolutional neural network which aims to predict hashtags in social network posts. In this context, hashtags are used diversely, as identifiers, sentiments, topic annotations, and more. Tagspace uses "embeddings", vector representations of text, which are combined by some function to produce a point in the embedding space. Because Tagspace also captures the semantic context of hashtags, it is a strong model in terms of NLP learning. Tagspace was also applied to a document recommendation problem, where the next item a user will interact with was predicted based on their previous history.<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly in a large output space, where some classes have very few examples. The authors of this paper attempt to improve on the performance of linear classifiers with the key features of a rank constraint and a fast loss approximation. Before we get into that, we must better understand how linear classifiers are trained.<br />
<br />
Consider each text and each label as a vector in space. The model trains the coordinates of the text vector so that the text vector ends up close to the vector of its associated label. The text vector and its label vector are combined into a score, and the softmax function normalizes this score across the scores for that same text with every other possible label. The result is the probability that the text has its associated label. The stochastic gradient descent algorithm is then used to keep updating the coordinates until the probability of the correct label for every text is maximized. This is clearly computationally expensive, as the score for every possible label in the training set must be computed for each text. <br />
<br />
With K labels in the training set, the softmax function gives the probability that a text is associated with label j as: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label <math>j</math>.<br />
<br />
Now let’s look at the improved method of linear classifiers with a rank constraint and fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the n-gram features of the input are first looked up to find word representations, which are then averaged into a hidden text representation. This representation is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log likelihood over the classes. The classifier trains on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
<br />
[[File:formula explained.png]]<br />
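<br />
The forward pass described above (look up, average, linear classifier, softmax, negative log likelihood) can be sketched as follows. This is a minimal sketch and not the authors' implementation: the function names, the tiny embedding table and the output weights are all hypothetical, and training by stochastic gradient descent is left out.<br />

```python
import math

def average_embeddings(feature_ids, embedding_table):
    """Average the embedding rows of the n-gram features of one document."""
    # Assumes the document has at least one feature.
    dim = len(embedding_table[0])
    hidden = [0.0] * dim
    for fid in feature_ids:
        for d in range(dim):
            hidden[d] += embedding_table[fid][d]
    return [value / len(feature_ids) for value in hidden]

def class_probabilities(hidden, output_weights):
    """Linear classifier followed by softmax over the classes."""
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in output_weights]
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def negative_log_likelihood(documents, labels, embedding_table, output_weights):
    """Mean of -log P(correct class) over the N documents."""
    total = 0.0
    for feature_ids, label in zip(documents, labels):
        hidden = average_embeddings(feature_ids, embedding_table)
        probs = class_probabilities(hidden, output_weights)
        total -= math.log(probs[label])
    return total / len(documents)

# Hypothetical toy model: 4 features, hidden size 2, 2 classes.
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
output_weights = [[1.0, -1.0], [-1.0, 1.0]]
documents = [[0, 2], [1, 3]]  # feature ids of each document
labels = [0, 1]
loss = negative_log_likelihood(documents, labels, embedding_table, output_weights)
print(loss)
```

Because the hidden representation is shared by all classes, information learned for one class can help another, which is what a plain linear classifier cannot do.<br />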
<br />
Two changes applied in this model architecture are the hierarchical softmax function, which reduces the computational cost when the number of classes is large, and the hashing trick, which keeps the mapping of n-grams (used to capture local word order) fast and memory efficient. These two changes are explained more thoroughly in the upcoming sections.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over predefined classes. It calculates the probability of <math>y_n</math> being each possible class and outputs a probability estimate for each class. The denominator of the softmax function normalizes these estimates, so that the probabilities a single <math>y_n</math> receives over all labels sum to one. The label with the highest probability can then be chosen as the label for <math>y_n</math>. <br />
<br />
However, the softmax function does have a computational complexity of <math> O(Kd) </math> where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is due to the nature of the softmax function since each function calculation requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the differences in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose we have a binary tree structure based on Huffman coding for the softmax function, where each node has at most two children or leaves. Huffman coding trees provide a means to optimize binary trees where the classes with lowest frequencies are placed deeper into the tree and the highest frequency classes are placed near the root of the tree, which minimizes the path for more frequently labelled classes. Huffman coding trees are efficient since computational runtime is reduced to <math> O(d \log_2(K)) </math>. <br />
<br />
A probability for each path, whether we are travelling right or left from a node, is calculated. This is done by applying the sigmoid function to the product of the output vector <math>v_{n_i}</math> of each inner node <math> n </math> and the output value of the hidden layer of the model, <math>h</math>. The idea of this method is to represent the output classes as the leaves on this tree and a random walk then assigns probabilities for these classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> represents the leaf node, at depth <math> l+1 </math>, on which a class is located and <math> n_1, n_2, \ldots, n_l </math> represent the parent nodes of that leaf </i><br />
<br />
In the below figure, we can see an example of a binary tree for the hierarchical softmax model. An example path from root node <math>n_1</math> to label 2 is highlighted in blue. In this case we can see that each path has an associated probability calculation and the total probability of label 2 is in line with the class probability calculation above.<br />
<br />
[[File:figure2.jpg|Binary Tree Example for the Hierarchical Softmax Model]]<br />
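<br />
The class probability above can be sketched as a product of sigmoid decisions along the path from the root to a leaf. The sketch below makes several assumptions that are not stated in the text: going left at an inner node has probability sigmoid(v·h) and going right has the complementary probability, the tree is a small balanced tree with four labels rather than a Huffman tree, and all vectors are made up.<br />

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def leaf_probability(path, node_vectors, hidden):
    """Probability of reaching a leaf.

    path is a list of (inner_node_id, direction) pairs from the root to the
    leaf, where direction is "L" or "R".
    """
    prob = 1.0
    for node_id, direction in path:
        score = sum(v * h for v, h in zip(node_vectors[node_id], hidden))
        p_left = sigmoid(score)  # assumed convention: left = sigmoid(v . h)
        prob *= p_left if direction == "L" else 1.0 - p_left
    return prob

# Hypothetical tree: inner node 0 is the root, 1 and 2 are its children,
# and the four leaves are the labels 0..3.
node_vectors = {0: [0.3, -0.2], 1: [1.0, 0.5], 2: [-0.7, 0.1]}
paths = {
    0: [(0, "L"), (1, "L")],
    1: [(0, "L"), (1, "R")],
    2: [(0, "R"), (2, "L")],
    3: [(0, "R"), (2, "R")],
}
hidden = [0.4, 1.2]
label_probs = {label: leaf_probability(p, node_vectors, hidden)
               for label, p in paths.items()}
print(label_probs)
```

Each label needs only as many sigmoid evaluations as its depth in the tree, and the leaf probabilities still sum to one without an explicit normalization over all classes.<br />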
<br />
<br />
Note that during the training process, the vector representations of the inner nodes, <math>v_{n_i}</math>, are updated by introducing an error function, <math>Err</math>, and differentiating it with the chain rule to obtain the following:<br />
<div style="text-align: center;"> <math> \frac{\partial Err}{\partial v_{n_i}^{'}}=\frac{\partial Err}{\partial v_{n_i}^{'}h} \cdot \frac{\partial v_{n_i}^{'}h }{\partial v_{n_i}^{'}} </math> <br></div><br />
<br />
We can use this to update the vector values with the following:<br />
<br />
<div style="text-align: center;"> <math> v_{n_i}'(new) = v_{n_i}'(old) - \eta \frac{\partial Err}{\partial v_{n_i}'} </math><br></div><br />
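<br />
The update rule above can be illustrated with one gradient step for a single inner node. The text does not say which error function is used, so this sketch assumes a cross-entropy error on the left/right decision, for which the derivative with respect to the score v·h is sigmoid(v·h) minus the target. The function name and the numbers are hypothetical.<br />

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def update_inner_node(v, h, went_left, eta):
    """One gradient step on the vector v of an inner node.

    Assumes Err = -log P(correct direction), with P(left) = sigmoid(v . h).
    """
    score = sum(vi * hi for vi, hi in zip(v, h))
    target = 1.0 if went_left else 0.0
    # Chain rule: dErr/dv = dErr/d(v . h) * d(v . h)/dv
    grad_wrt_score = sigmoid(score) - target
    return [vi - eta * grad_wrt_score * hi for vi, hi in zip(v, h)]

v_old = [0.0, 0.0]
h = [1.0, 2.0]
v_new = update_inner_node(v_old, h, went_left=True, eta=0.1)
print(v_new)
```

After the step, the node assigns a higher probability to the direction that leads to the correct label.<br />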
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is an algorithm for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized such that all the elements of the vector add up to one (taking the frequency percentage of each word). If these frequencies exceed a certain level, they will activate nodes in the neural network and influence classification.<br />
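<br />
The procedure described above can be sketched as follows. The function names and the two training sentences are assumptions made for illustration, and the tokenizer is a plain lowercase whitespace split.<br />

```python
from collections import Counter

def build_dictionary(training_docs, n):
    """Keep the n most frequent words of the training documents."""
    counts = Counter(word for doc in training_docs for word in doc.lower().split())
    return [word for word, _ in counts.most_common(n)]

def bag_of_words(doc, dictionary):
    """Normalized count vector of a document over the dictionary."""
    counts = Counter(doc.lower().split())
    raw = [counts[word] for word in dictionary]
    total = sum(raw)
    # Words outside the dictionary are ignored; an all-zero vector stays zero.
    return [c / total if total else 0.0 for c in raw]

training_docs = ["I love apple", "I love sentence"]
dictionary = build_dictionary(training_docs, 3)
vector_a = bag_of_words("I love apple", dictionary)
vector_b = bag_of_words("apple love I", dictionary)
print(dictionary, vector_a, vector_b)
```

The two sentences contain the same words in a different order and receive the same vector, since word order is not recorded.<br />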
<br />
The main '''weakness''' of bag of words is that it loses information, because it considers single words and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the training set does not include the entire dictionary of the testing set. <br />
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation by storing the n local words adjacent to the initial word (or characters, since N-grams can also be character based). Each word in the document is read one at a time just like bag of words, but a certain range of its neighbors is scanned as well. This range is known as the n-gram. Compared to bag of words, any N greater than 1 (N = 1 is known as the Unigram) will contain more information than bag of words. <br />
<br />
The picture above gives an example of a Unigram (1-gram), which is the simplest possible version of this model. A Unigram does not consider previous words and just chooses words at random based on how common they are in general. It also shows a Bigram (2-gram), where the previous word is considered. A Trigram considers the previous two words, and so on up to N-grams, which consider the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. A Bigram model approximates each factor by conditioning only on the previous word. For example, take the sentence, "How long can this go on?" With a Bigram we can model it as follows:<br />
<br />
P("How long can this go on?") = P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
Going back to the chain rule, we can reduce the above equation to a product of conditional probabilities for the Bigram case, where <math>P(\omega_1 | \omega_0)</math> is read as <math>P(\omega_1)</math>: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case as: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
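<br />
The Bigram product above can be sketched with probabilities estimated from counts. The text does not say how the conditional probabilities are obtained, so this sketch assumes the usual relative-frequency estimate count(previous word, word) / count(previous word). The two-sentence corpus, the START marker and the function names are made up for the example.<br />

```python
from collections import Counter

START = "START"  # hypothetical start-of-sentence marker

def train_bigram_counts(sentences):
    """Count each context word and each (previous word, word) pair."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = [START] + sentence.split()
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts

def sentence_probability(sentence, context_counts, bigram_counts):
    """Product over k of P(word_k | word_{k-1}) using relative frequencies."""
    tokens = [START] + sentence.split()
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        prob *= bigram_counts[(prev, cur)] / context_counts[prev]
    return prob

corpus = ["how long can this go on", "how long is this"]
context_counts, bigram_counts = train_bigram_counts(corpus)
p_seen = sentence_probability("how long can this go on", context_counts, bigram_counts)
p_unseen = sentence_probability("how can", context_counts, bigram_counts)
print(p_seen, p_unseen)
```

Any sentence containing a word pair that never occurred in the corpus receives probability zero under this estimate, which is one reason larger N needs much more data.<br />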
<br />
<br />
The '''weakness''' of N-gram is that local context often does not provide any useful predictive clues. For example, suppose you want the model to learn subject-verb agreement from the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
You would need an 11-gram, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The following example illustrates this.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice how A and B are the same vector. This is just like bag of words, with the aforementioned problem that '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because a bigram takes one neighboring word into consideration. However, A and C still share an element, namely "I love". If we were to further increase N, it would be easier to tell the two apart. However, the cost of operating with a higher N is that the run time will increase.<br />
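<br />
The counts in the two tables above can be reproduced with a short sketch. The helper name is hypothetical and the tokenizer is a plain whitespace split.<br />

```python
def ngram_counts(sentence, n):
    """Count the word n-grams of a sentence."""
    tokens = sentence.split()
    counts = {}
    for i in range(len(tokens) - n + 1):
        gram = " ".join(tokens[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
    return counts

A = "I love apple"
B = "apple love I"
C = "I love sentence"

# Unigram counts do not record word order, so A and B look identical.
same_unigrams = ngram_counts(A, 1) == ngram_counts(B, 1)
# Bigram counts separate A from B, but A and C still share "I love".
same_bigrams = ngram_counts(A, 2) == ngram_counts(B, 2)
shared_a_c = set(ngram_counts(A, 2)) & set(ngram_counts(C, 2))
print(same_unigrams, same_bigrams, shared_a_c)
```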
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification. It maps words to indices of an array or matrix of fixed size by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, which reduces the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps keys of arbitrary size to indices of a table of fixed size. For example, consider a hash function <math> h </math> that maps each feature to the index of its corresponding dictionary key.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(``\text{cats"}) = 3 </math>. Consider the sentence <math> `` \text{ I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have a list of all the words in that sentence, <math> x = [``I", ``love", ``cats", ``but", ``Mary", ``hate", ``cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
By applying the hash function to each word of this sentence, we get the list of returned indices [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map is <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
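<br />
The hashed feature map above can be sketched as follows. The lookup table standing in for the hash function is the one from the example; the function names are hypothetical, and a real system would use an actual hash function taken modulo the table size.<br />

```python
# Toy hash function from the example: each word maps to a fixed index.
INDEX = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 6}

def h(word):
    return INDEX[word]

def hashed_feature_map(words, hash_function, table_size):
    """phi_i = number of words in the sentence whose hash equals i."""
    phi = [0] * table_size
    for word in words:
        phi[hash_function(word)] += 1
    return phi

words = ["I", "love", "cats", "but", "Mary", "hate", "cats"]
indices = [h(word) for word in words]
phi = hashed_feature_map(words, h, 7)
print(indices, phi)
```

The vector length is fixed by the table size, not by the number of distinct words seen, which is what keeps the memory cost bounded.<br />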
<br />
A hash collision happens when two distinct keys are mapped to the same index. For example, suppose that in the example above both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces another hash function <math> \xi </math> to determine the sign of each word's contribution to its index. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
If <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, and <math> \xi </math> is 1 for every other word, then our signed hash map becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, colliding features will "cancel out" on average, which makes the estimate unbiased.<br />
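<br />
The collision and the signed variant above can be sketched in the same way. As in the example, "Mary" is made to collide with "I" at index 0 and is given sign -1, while every other word is assumed to have sign +1. The lookup tables and function names are hypothetical stand-ins for real hash functions.<br />

```python
# Toy hash function with a collision: "Mary" and "I" both map to index 0.
COLLIDING_INDEX = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 0}
NEGATIVE_WORDS = {"Mary"}  # xi = -1 for these words, +1 for all others

def h(word):
    return COLLIDING_INDEX[word]

def xi(word):
    return -1 if word in NEGATIVE_WORDS else 1

def signed_hashed_feature_map(words, hash_function, sign_function, table_size):
    """phi_i = sum of xi(word) over the words whose hash equals i."""
    phi = [0] * table_size
    for word in words:
        phi[hash_function(word)] += sign_function(word)
    return phi

words = ["I", "love", "cats", "but", "Mary", "hate", "cats"]
unsigned = signed_hashed_feature_map(words, h, lambda word: 1, 7)
signed = signed_hashed_feature_map(words, h, xi, 7)
print(unsigned, signed)
```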
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. However, another way to represent the features is TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to the document.<br />
<br />
'''Term Frequency (TF)''' generally measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency such that a word is not deemed important if it is a generally common word, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log(\frac{N}{| \{d\in D:t\in d \} |}) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the total number of documents that contain word <math> t </math>.<br />
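<br />
The two definitions above can be combined in a short sketch. The three-document corpus and the function names are assumptions for illustration, and the sketch assumes that the term occurs in at least one document so that the denominator is not zero.<br />

```python
import math

def tf(term, doc_tokens):
    """Raw count of the term in one document."""
    return doc_tokens.count(term)

def idf(term, corpus):
    """log(N / number of documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tfidf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "the", "end"],
]
score_the = tfidf("the", corpus[2], corpus)
score_dog = tfidf("dog", corpus[1], corpus)
score_cat = tfidf("cat", corpus[0], corpus)
print(score_the, score_dog, score_cat)
```

The word "the" occurs twice in the third document but appears in every document, so its inverse document frequency is log(1) = 0 and its TFIDF score is zero.<br />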
<br />
== Experiment ==<br />
In this experiment fastText was compared on two classification problems with various other text classifiers. <br />
<br />
The first classification problem is Sentiment Analysis, where fastText is compared to existing text classifiers. The second is tag prediction, where fastText is evaluated on its ability to scale to a larger output space.<br />
<br />
The Vowpal Wabbit library, written in C++, can also be used to implement the model. However, compared to this library, the authors' tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
* Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. In order to test that, evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images, each with captions, titles and tags. Here, the spotlight was on using the fastText text classifier to predict the tags associated with each image without actually using the image itself, but rather the information associated with the image, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur fewer than 100 times and split the data into a train, validation and test set. The train set consists of approximately 90% of the dataset, the validation set of approximately 1%, and the test set of 0.5%. The table below (Table 4 in the paper) shows an example of 5 items in the validation set with their Inputs (title and caption), Prediction (the tag class they are classified to) and Tags (real image tags), highlighting when the predicted class is in fact one of the tags. <br />
<br />
{| class="wikitable"<br />
|-<br />
! Input<br />
! Prediction<br />
! Tags<br />
|-<br />
| taiyoucon 2011 digitals: individuals digital photos from the anime convention taiyoucon 2011 in mesa, arizona. if you know the model and/or the character, please comment.<br />
| #cosplay<br />
| #24mm #anime #animeconvention #arizona #canon #con #convention #cos '''#cosplay''' #costume #mesa #play #taiyou #taiyoucon<br />
|-<br />
| 2012 twin cities pride 2012 twin cities pride parade<br />
| #minneapolis <br />
| #2012twincitiesprideparade '''#minneapolis''' #mn #usa<br />
|-<br />
| beagle enjoys the snowfall<br />
| #snow<br />
| #2007 #beagle #hillsboro #january #maddison #maddy #oregon '''#snow'''<br />
|-<br />
| christmas <br />
| #christmas <br />
| #cameraphone #mobile<br />
|-<br />
| euclid avenue <br />
| #newyorkcity <br />
| #cleveland #euclidavenue<br />
|}<br />
<br />
<br />
<br />
Removing infrequently occurring tags and words eliminates some noise and helps the model learn better. A [https://github.com/facebookresearch/fastText script] has been released explaining the breakup of the data. After cleaning, the vocabulary size is 297,141 and there are 312,116 unique tags. Performance on the test set is reported as precision at 1 (Prec@1): the proportion of items whose highest-ranked predicted tag is in fact one of the item's real tags (for background on precision, see [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia]).<br />
<br />
==== Baselines====<br />
A frequency-based approach which simply predicts the most frequent tag is used as a baseline.<br />
<br />
The model that fastText is compared to in tag prediction is [http://www.aclweb.org/anthology/D14-1194 Tagspace]. It is similar to fastText but is based on [http://www.thespermwhale.com/jaseweston/papers/wsabie-ijcai.pdf WSABIE], which predicts tags based on both images and their text annotations. The Tagspace model is a convolutional neural network, which predicts hashtags in the semantic context of social network posts. For a faster and more comparable baseline, the experiment uses a linear version of this model. Tagspace also predicts multiple hashtags, so this experiment does not fully compare the capabilities of the Tagspace model.<br />
<br />
==== Results ====<br />
<br />
{| class="wikitable"<br />
|-<br />
! Model <br />
! prec@1<br />
! Running time - Train<br />
! Running time - Test<br />
|- <br />
| Freq. baseline <br />
| 2.2 <br />
| - <br />
| -<br />
|- <br />
| Tagspace, h = 50 <br />
| 30.1 <br />
| 3h8 <br />
| 6h<br />
|- <br />
| Tagspace, h = 200 <br />
| 35.6 <br />
| 5h32 <br />
| 15h<br />
|- <br />
| fastText, h = 50 <br />
| 30.8 <br />
| 6m40 <br />
| 48s<br />
|- <br />
| fastText, h = 50, bigram <br />
| 35.6 <br />
| 7m47 <br />
| 50s<br />
|- <br />
| fastText, h = 200 <br />
| 40.7 <br />
| 10m34 <br />
| 1m29<br />
|-<br />
| fastText, h = 200, bigram <br />
| 45.1 <br />
| 13m38 <br />
| 1m37<br />
|}<br />
<br />
<br />
<br />
The above table presents a comparison of the fastText model to other baselines. The "Precision-at-one" (Prec@1) metric reports the proportion of items for which the highest-ranking tag that the model predicts is in fact one of the real tags.<br />
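<br />
The Prec@1 metric can be sketched as follows. The function name is hypothetical; the five predictions and (abbreviated) tag sets are taken from the validation examples shown in the Dataset section, so the resulting value describes only those five items and not the test set.<br />

```python
def precision_at_one(top_predictions, true_tags):
    """Fraction of items whose top-ranked predicted tag is one of its true tags."""
    hits = sum(1 for pred, tags in zip(top_predictions, true_tags) if pred in tags)
    return hits / len(true_tags)

# Top prediction and real tags of the five validation examples (tags abbreviated).
top_predictions = ["#cosplay", "#minneapolis", "#snow", "#christmas", "#newyorkcity"]
true_tags = [
    {"#anime", "#cosplay", "#costume", "#taiyoucon"},
    {"#2012twincitiesprideparade", "#minneapolis", "#mn", "#usa"},
    {"#beagle", "#oregon", "#snow"},
    {"#cameraphone", "#mobile"},
    {"#cleveland", "#euclidavenue"},
]
score = precision_at_one(top_predictions, true_tags)
print(score)
```

Three of the five top predictions appear among the real tags, so Prec@1 on these five items is 0.6.<br />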
<br />
<br />
We can see that the frequency baseline has the lowest accuracy. fastText was run for 5 epochs and compared to the Tagspace results. fastText with bigrams is also used, and models are tested with 50 and 200 hidden units. With 50 hidden units, fastText (without bigrams) and Tagspace performed similarly, while in the table above fastText with bigrams at 50 hidden units matched Tagspace at 200 hidden units. At 200 hidden units, fastText performs better still, and adding bigrams improves accuracy further. <br />
<br />
<br />
Finally, both the train and test times of fastText were significantly shorter than those of Tagspace, which must compute the scores for each class. fastText, on the other hand, has fast inference, which gives it an advantage when the number of classes is large. In particular, test times for fastText are significantly shorter than those of Tagspace.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, which are then fed to a linear classifier. The authors test their model fastText against other models, using an evaluation protocol similar to Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction data set. fastText achieved comparable results in terms of accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
<br />
=== fastText and baseline models ===<br />
<br />
The performance of fastText was compared to several baselines related to text classification. While this paper focused on the ways fastText outperformed these classifiers, the baseline classifiers were not always evaluated at their optimal performance. <br />
<br />
* '''Tagspace''' was used as a comparison classifier in a tag prediction problem. However, the goal of this model is to predict hashtags as semantic embedding rather than image tags as in the experiment. Additionally, only a linear version of Tagspace was used as a comparison.<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* [https://fasttext.cc/ fastText official website] - Resources for using fastText<br />
<br />
* [https://research.fb.com/fasttext/ Facebook Research: fastText] - Overview of fastText released by Facebook Research.<br />
<br />
* [https://github.com/facebookresearch/fastText fastText GitHub] <br />
<br />
* [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441w18/A_New_Method_of_Region_Embedding_for_Text_Classification A New Method of Region Embedding for Text Classification (Summary)] - Paper summary describing a method of preserving local structure information with small text regions for text classification tasks. <br />
<br />
* [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441w18/Convolutional_Neural_Networks_for_Sentence_Classification Convolution Neural Networks for Sentence Classification (Summary)] - Paper summary describing applying four variations of Convolutional Neural Networks to several NLP tasks such as sentiment analysis, customer review prediction, movie reviews, and more.</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35283Bag of Tricks for Efficient Text Classification2018-03-22T21:07:02Z<p>Cs3yang: /* Tagspace */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have been utilized more recently for text classification and have demonstrated very good performance. However, they are slow at both training and testing time, which limits their usage for very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data set while maintaining its good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to two tasks, tag prediction and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on a billion words within ten minutes, while achieving performance on par with the state-of-the-art.”<br />
<br />
== Background ==<br />
<br />
<br />
=== Natural-Language Processing and Text Classification ===<br />
[https://en.wikipedia.org/wiki/Natural-language_processing Natural Language Processing] (NLP) is concerned with being able to process large amounts of natural language data, involving speech recognition, natural language understanding, and natural-language generation. Text understanding involves being able to understand the explicit or implicit meaning of elements of text such as words, phrases, sentences, and paragraphs, and making inferences about these properties of texts (Norvig, 1987). One of the main topics in NLP is text classification, which is assigning predefined categories to free-text documents, with research ranging in this field from designing the best features to choosing the best machine learning classifiers (Zhang et al. 2016). Traditionally, techniques for text classification are based on simple statistics on words that use linear classifiers such as Bag of Words and N-grams. <br />
<br />
With the advancement of deep learning and the availability of large data sets, methods of handling text understanding using deep learning techniques have become popular in recent years. These deep learning models have been shown to perform significantly better than the traditional models in several studies. The following are the deep learning models that are compared to fastText in the Experiment Section:<br />
<br />
=== Char-CNN ===<br />
<br />
<br />
=== Char-CRNN ===<br />
<br />
A character-level convolutional recurrent neural network (CRNN) is a convolutional neural network (CNN) with a recurrent layer on top. A recurrent layer can efficiently capture long-term dependencies, so a model with only a very small number of convolutional layers is needed.<br />
<br />
The recurrent layer consists of a recursive function <math>f</math> which takes as input one input vector and the previous hidden state, and returns the new hidden state <math> h_t = f(x_t, h_{t-1})</math>, where <math> x_t \in \mathbb{R}^d</math> is one time step from the input sequence <math> (x_1, x_2, \ldots, x_T)</math>. <br />
<br />
[[File:xiao and cho 2016 pic.png]]<br />
Graphical illustration of (a) a CNN and (b) proposed CRNN for character-level document classification <br />
<br />
Source: [Xiao and Cho2016] Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv: 1602.00367.<br />
<br />
=== VDCNN ===<br />
<br />
=== Tagspace ===<br />
<br />
[http://emnlp2014.org/papers/pdf/EMNLP2014194.pdf Tagspace] is a convolutional neural network which aims to predict hashtags in social network posts. In this context, hashtags are used diversely, as identifiers, sentiments, topic annotations, and more. Tagspace uses "embeddings", which are vector representations of text that are combined by some function to produce a point in the embedding space. Because Tagspace also captures the semantic context of hashtags, it is a strong model in terms of NLP learning. Tagspace was also applied to a document recommendation problem, where the next item a user will interact with was predicted based on their previous history.<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly when the output space is large and some classes have very few examples. The authors of this paper attempt to improve on the performance of linear classifiers with two key features: a rank constraint and a fast loss approximation. Before we get into that, we must better understand how linear classifiers are trained.<br />
<br />
Consider each text and each label as a vector in space. The model learns the coordinates of the text vector so that the text vector is close to the vector of its associated label. The dot product of the text vector and a label vector gives a score, which the softmax function normalizes against the scores of that same text with every other possible label. The result is the probability that the text has its associated label. The stochastic gradient descent algorithm is then used to keep updating the coordinates until the probability of the correct label for every text is maximized. This is clearly computationally expensive, as the score for every possible label in the training set must be computed for each text. <br />
<br />
The softmax function gives the probability that a text is associated with label <math>j</math>, given <math>K</math> labels in the training set: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label <math>j</math>.<br />
<br />
Now let’s look at the improved method of linear classifiers with a rank constraint and fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the n-gram features of the input are first looked up to find word representations, which are then averaged into a hidden text representation. This representation is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log likelihood over the classes. The classifier trains on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
<br />
[[File:formula explained.png]]<br />
<br />
Two changes applied in this model architecture are the hierarchical softmax function, which reduces the computational cost when the number of classes is large, and the hashing trick, which keeps the mapping of n-grams (used to capture local word order) fast and memory efficient. These two changes are explained more thoroughly in the upcoming sections.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over predefined classes. It calculates the probability of <math>y_n</math> being each possible class and outputs a probability estimate for each class. The denominator of the softmax function normalizes these estimates, so that the probabilities a single <math>y_n</math> receives over all labels sum to one. The label with the highest probability can then be chosen as the label for <math>y_n</math>. <br />
<br />
However, the softmax function does have a computational complexity of <math> O(Kd) </math> where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is due to the nature of the softmax function since each function calculation requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the differences in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose we have a binary tree structure based on Huffman coding for the softmax function, where each node has at most two children or leaves. Huffman coding trees provide a means to optimize binary trees where the classes with lowest frequencies are placed deeper into the tree and the highest frequency classes are placed near the root of the tree, which minimizes the path for more frequently labelled classes. Huffman coding trees are efficient since computational runtime is reduced to <math> O(d \log_2(K)) </math>. <br />
<br />
A probability for each path, whether we are travelling right or left from a node, is calculated. This is done by applying the sigmoid function to the product of the output vector <math>v_{n_i}</math> of each inner node <math> n </math> and the output value of the hidden layer of the model, <math>h</math>. The idea of this method is to represent the output classes as the leaves on this tree and a random walk then assigns probabilities for these classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> is the leaf node that the class is located on, at depth <math> l+1 </math>, and <math> n_1, n_2, \ldots, n_l </math> are the nodes on the path from the root to that leaf</i><br />
<br />
The figure below shows an example of a binary tree for the hierarchical softmax model. An example path from root node <math>n_1</math> to label 2 is highlighted in blue. Each branch on this path has an associated probability, and the probability of label 2 is the product of these branch probabilities, in line with the class probability calculation above.<br />
<br />
[[File:figure2.jpg|Binary Tree Example for the Hierarchical Softmax Model]]<br />
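<br />
The path-probability product above can be sketched in a few lines of Python. This is a minimal illustration rather than the fastText implementation: the toy tree, the vectors, and the convention that a left branch has probability <math>\sigma(v_{n_i} \cdot h)</math> and a right branch <math>\sigma(-v_{n_i} \cdot h)</math> are all assumptions made for the example.<br />
<br />
```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def leaf_probability(h, path):
    """Probability of reaching a leaf.

    path is a list of (inner_node_vector, direction) pairs, where direction
    is +1 for a left branch and -1 for a right branch (assumed convention).
    """
    prob = 1.0
    for v, direction in path:
        prob *= sigmoid(direction * dot(v, h))
    return prob

# Hypothetical hidden-layer output and inner-node vectors for a tree
# with three inner nodes and four leaves.
h = [0.5, -1.0]
v1, v2, v3 = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
paths = {
    "label1": [(v1, +1), (v2, +1)],
    "label2": [(v1, +1), (v2, -1)],
    "label3": [(v1, -1), (v3, +1)],
    "label4": [(v1, -1), (v3, -1)],
}
probs = {label: leaf_probability(h, path) for label, path in paths.items()}
total = sum(probs.values())  # the leaf probabilities form a distribution
```
<br />
Only the inner nodes on one path are evaluated per class, and because the left and right probabilities at each node sum to one, the leaf probabilities sum to one without an explicit normalization over all classes.<br />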
<br />
<br />
Note that during the training process, the vector representations of the inner nodes, <math>v_{n_i}</math>, are updated by introducing an error function, <math>Err</math>, and differentiating it with the chain rule to obtain the following:<br />
<div style="text-align: center;"> <math> \frac{\partial Err}{\partial v_{n_i}^{'}}=\frac{\partial Err}{\partial v_{n_i}^{'}h} \cdot \frac{\partial v_{n_i}^{'}h }{\partial v_{n_i}^{'}} </math> <br></div><br />
<br />
We can use this to update the vector values with the following:<br />
<br />
<div style="text-align: center;"> <math> v_{n_i}'(new) = v_{n_i}'(old) - \eta \frac{\partial Err}{\partial v_{n_i}'} </math><br></div><br />
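<br />
As a rough sketch of this update, assume that <math>Err</math> is the cross-entropy error of the binary left/right decision at the inner node (an assumption, since the error function is not specified here). Then <math>\partial Err / \partial (v_{n_i}' h) = \sigma(v_{n_i}' h) - t</math>, where <math>t</math> is 1 if the true path goes left and 0 otherwise, and <math>\partial (v_{n_i}' h) / \partial v_{n_i}' = h</math>. The function name and the numbers are hypothetical.<br />
<br />
```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def update_inner_vector(v, h, target, eta):
    """One gradient step for the vector of a single inner node.

    target is 1 if the true path goes left at this node and 0 if it goes
    right (assumed convention); eta is the learning rate.
    """
    score = sum(vi * hi for vi, hi in zip(v, h))
    error = sigmoid(score) - target          # dErr / d(v'h) for cross-entropy
    return [vi - eta * error * hi for vi, hi in zip(v, h)]  # d(v'h)/dv' = h

v_old = [0.2, -0.1]
h = [1.0, 2.0]
v_new = update_inner_vector(v_old, h, target=1, eta=0.5)
```
<br />
With these numbers the score is 0, the predicted probability of going left is 0.5, and the step moves the vector towards <math>h</math> so that the correct branch becomes more likely.<br />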
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is a method for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized so that its elements add up to one (each entry is the relative frequency of the word). If these frequencies exceed a certain level, they activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information, because it treats each word in isolation and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the dictionary built from the training set does not cover the words of the testing set.<br />
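<br />
The procedure described above can be sketched as follows. This is an illustrative sketch, not the authors' preprocessing code: the function names, the whitespace tokenizer, and the lowercasing step are assumptions.<br />
<br />
```python
from collections import Counter

def build_vocabulary(train_docs, n):
    """Return the n most frequent words of the training documents."""
    counts = Counter(word for doc in train_docs for word in doc.lower().split())
    return [word for word, _ in counts.most_common(n)]

def bow_vector(doc, vocab):
    """Normalized word-count vector of doc over the given vocabulary."""
    counts = Counter(doc.lower().split())
    raw = [counts[word] for word in vocab]
    total = sum(raw)
    return [c / total if total else 0.0 for c in raw]

train_docs = ["I love apple", "apple love I", "I love sentence"]
vocab = build_vocabulary(train_docs, 4)
vec_a = bow_vector("I love apple", vocab)
vec_b = bow_vector("apple love I", vocab)
```
<br />
The two documents contain the same words in a different order, so <code>vec_a</code> and <code>vec_b</code> are identical, which is exactly the loss of order information described above.<br />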
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation: along with each word (or character, since N-grams can also be character based) it stores the adjacent words in a local window. Each word in the document is read one at a time, just as in bag of words, but a certain range of its neighbors is scanned as well. This range is known as the n-gram. With N = 1 (a Unigram) the model carries the same information as bag of words; any N greater than 1 contains more information than bag of words.<br />
<br />
A Unigram (1-gram) is the simplest possible version of this model. It does not consider previous words and just chooses words at random based on how common they are in general. A Bigram (2-gram) considers the previous word, a Trigram considers the previous two words, and so on up to N-grams, which consider the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence can be written as <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. A Bigram model approximates each factor by conditioning only on the previous word. For example, take the sentence, "How long can this go on?" With a Bigram we can model it as follows:<br />
<br />
P(“How long can this go on?”) = P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
Returning to the chain rule, the Bigram case reduces the above equation to a product of conditional probabilities that each depend only on the previous word:<br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case, where each word is conditioned on the previous N-1 words:<br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
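<br />
The Bigram factorization can be illustrated with a small count-based model. This is a hedged sketch: estimating <math>P(\omega_k | \omega_{k-1})</math> by relative frequencies, the start symbol <code>&lt;s&gt;</code> used as the context of the first word, and the toy corpus are all assumptions made for the example.<br />
<br />
```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams; '<s>' marks the start of a sentence."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        contexts.update(tokens[:-1])              # every word that has a successor
        bigrams.update(zip(tokens, tokens[1:]))   # (previous word, current word)
    return contexts, bigrams

def sentence_probability(sentence, contexts, bigrams):
    """Product over k of P(w_k | w_{k-1}), estimated by relative frequency."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, cur)] / contexts[prev]
    return prob

corpus = ["how long can this go on", "how long is this", "how can this go on"]
contexts, bigrams = train_bigram(corpus)
p = sentence_probability("how long can this go on", contexts, bigrams)
```
<br />
In this toy corpus the only uncertain steps are "how" followed by "long" (2 out of 3) and "long" followed by "can" (1 out of 2), so the sentence probability is 1/3.<br />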
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide useful predictive clues. For example, suppose you want the model to learn subject-verb agreement from the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
Capturing the subject and its verb in a single window requires an 11-gram, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The following three short sentences illustrate the difference between these representations:<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B have the same vector. The Unigram representation is the same as bag of words and has the aforementioned problem: '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because the Bigram takes one neighboring word into account. However, A and C still share an element, namely "I love". If we were to further increase N, it would be easier to distinguish between the two. However, the consequence of operating with a higher N is that the run time will increase.<br />
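<br />
The two tables can be reproduced with a short script. This is a minimal sketch in which the helper name <code>ngrams</code> and the whitespace tokenizer are assumptions.<br />
<br />
```python
from collections import Counter

def ngrams(text, n):
    """Return the list of word n-grams of text, in order of appearance."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

docs = {"A": "I love apple", "B": "apple love I", "C": "I love sentence"}

# One Counter per document and per n: these are the columns of the tables.
unigram_counts = {name: Counter(ngrams(text, 1)) for name, text in docs.items()}
bigram_counts = {name: Counter(ngrams(text, 2)) for name, text in docs.items()}

same_unigrams = unigram_counts["A"] == unigram_counts["B"]
same_bigrams = bigram_counts["A"] == bigram_counts["B"]
shared_a_c = set(bigram_counts["A"]) & set(bigram_counts["C"])
```
<br />
Here <code>same_unigrams</code> is true while <code>same_bigrams</code> is false, and the only Bigram shared by A and C is "I love".<br />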
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification. It maps words to indices of an array or matrix of fixed size by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, which reduces the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps keys of arbitrary size to values in a fixed-size range. For example, consider a hash function <math> h </math> that maps each feature to the index listed for its key in the following table.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(``\text{cats"}) = 3 </math>. Consider the sentence <math> `` \text{ I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have a list of all words in that sentence, <math> x = [``I", ``love", ``cats", ``but", ``Mary", ``hate", ``cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
By applying the hash function to each word of this sentence, we get the list of indices [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map is<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. For example, suppose that in the example above both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al. 2009], which adds a second hash function <math> \xi </math> that determines the sign of each feature's contribution. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
If <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, and <math> \xi </math> is 1 for the remaining words, our signed hash map becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, collisions will "cancel out", and therefore, achieve an unbiased estimate.<br />
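<br />
The three hashed feature maps in this section can be reproduced with the sketch below. The dictionary lookups stand in for the toy hash function <math>h</math> of the example, and <code>stable_hash</code> is a hypothetical illustration of how a real hash function could be plugged in; none of this is taken from the fastText source.<br />
<br />
```python
import hashlib

def hashed_features(tokens, h, size, xi=None):
    """Sum the (optionally signed) contribution of every token into a
    fixed-size vector: phi_i = sum over j with h(x_j) == i of xi(x_j)."""
    phi = [0] * size
    for token in tokens:
        sign = xi(token) if xi is not None else 1
        phi[h(token)] += sign
    return phi

def stable_hash(token, size):
    """Hypothetical real-world hash: deterministic index in [0, size)."""
    return int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % size

index = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 6}
tokens = ["I", "love", "cats", "but", "Mary", "hate", "cats"]

plain = hashed_features(tokens, index.get, 7)

colliding = dict(index, Mary=0)           # "Mary" now collides with "I"
collided = hashed_features(tokens, colliding.get, 7)

signs = {"Mary": -1}                      # xi is +1 for every other word
signed = hashed_features(tokens, colliding.get, 7,
                         xi=lambda token: signs.get(token, 1))
```
<br />
The three results match the tables above: <code>plain</code> is [1, 1, 1, 2, 0, 1, 1], <code>collided</code> is [2, 1, 1, 2, 0, 1, 0], and <code>signed</code> is [0, 1, 1, 2, 0, 1, 0].<br />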
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. However, another way to represent the features is TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to the document.<br />
<br />
'''Term Frequency (TF)''' generally measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency, such that a word is not deemed important if it is a generally common word, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the number of documents that contain word <math> t </math>.<br />
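<br />
The two definitions above translate directly into code. This is a small sketch with a hypothetical three-document corpus; the whitespace tokenizer and the function name are assumptions.<br />
<br />
```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, with
    tf = raw count and idf = log(N / number of documents containing the term)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(doc).items()}
        for doc in tokenized
    ]

docs = ["the cat sat", "the dog sat", "the cat ran"]
weights = tfidf(docs)
```
<br />
The word "the" occurs in every document, so its weight is <math>\log(3/3) = 0</math>, while "ran" occurs in only one document and receives the largest weight, <math>\log(3)</math>.<br />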
<br />
== Experiment ==<br />
In this experiment fastText was compared with various other text classifiers on two classification problems.<br />
<br />
The first classification problem is Sentiment Analysis, where fastText is compared to existing text classifiers. The second is tag prediction, where fastText is evaluated on a dataset with a much larger output space.<br />
<br />
The Vowpal Wabbit library, written in C++, can also be used to implement the model. However, compared to this library, the authors' tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
*Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. In order to test it, the evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images with captions, titles and tags for each. Here, the spotlight was on using the fastText text classifier to predict the tags associated with each image without using the image itself, but rather the information associated with the image, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur less than 100 times and split the data into a train, validation and test set. The train set consists of approximately 90% of the dataset, the validation set consists of approximately 1%, and the test set of 0.5%. Table 4 shows an example of 5 items in the validation set with their Inputs (title and caption), Prediction (the tag class they are classified to) and Tags (real image tags), highlighting when the predicted class is in fact one of the tags. <br />
<br />
{| class="wikitable"<br />
|-<br />
! Input<br />
! Prediction<br />
! Tags<br />
|-<br />
| taiyoucon 2011 digitals: individuals digital photos from the anime convention taiyoucon 2011 in mesa, arizona. if you know the model and/or the character, please comment.<br />
| #cosplay<br />
| #24mm #anime #animeconvention #arizona #canon #con #convention #cos '''#cosplay''' #costume #mesa #play #taiyou #taiyoucon<br />
|-<br />
| 2012 twin cities pride 2012 twin cities pride parade<br />
| #minneapolis <br />
| #2012twincitiesprideparade '''#minneapolis''' #mn #usa<br />
|-<br />
| beagle enjoys the snowfall<br />
| #snow<br />
| #2007 #beagle #hillsboro #january #maddison #maddy #oregon '''#snow'''<br />
|-<br />
| christmas <br />
| #christmas <br />
| #cameraphone #mobile<br />
|-<br />
| euclid avenue <br />
| #newyorkcity <br />
| #cleveland #euclidavenue<br />
|}<br />
<br />
<br />
<br />
Removing infrequently occurring tags and words eliminates some noise and helps the model learn better. A [https://github.com/facebookresearch/fastText script] has been released explaining the breakup of the data. After cleaning, the vocabulary size is 297,141 and there are 312,116 unique tags. Performance on the test set is reported as precision at 1 (prec@1), that is, the proportion of items whose top-ranked predicted tag is in fact one of their real tags (for the definition of precision, see [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia]).<br />
<br />
==== Baselines====<br />
A frequency-based approach which simply predicts the most frequent tag is used as a baseline.<br />
<br />
The model that fastText is compared to in tag prediction is [http://www.aclweb.org/anthology/D14-1194 Tagspace]. It is similar to fastText but is based on [http://www.thespermwhale.com/jaseweston/papers/wsabie-ijcai.pdf WSABIE], which predicts tags based on both images and their text annotations. The Tagspace model is a convolutional neural network that predicts hashtags in the semantic context of social network posts. For faster and more comparable performance, the experiment uses a linear version of this model. Tagspace can also predict multiple hashtags, so this experiment does not fully exercise the capabilities of the Tagspace model.<br />
<br />
==== Results ====<br />
<br />
{| class="wikitable"<br />
|-<br />
! Model <br />
! prec@1<br />
! Running time - Train<br />
! Running time - Test<br />
|- <br />
| Freq. baseline <br />
| 2.2 <br />
| - <br />
| -<br />
|- <br />
| Tagspace, h = 50 <br />
| 30.1 <br />
| 3h8 <br />
| 6h<br />
|- <br />
| Tagspace, h = 200 <br />
| 35.6 <br />
| 5h32 <br />
| 15h<br />
|- <br />
| fastText, h = 50 <br />
| 30.8 <br />
| 6m40 <br />
| 48s<br />
|- <br />
| fastText, h = 50, bigram <br />
| 35.6 <br />
| 7m47 <br />
| 50s<br />
|- <br />
| fastText, h = 200 <br />
| 40.7 <br />
| 10m34 <br />
| 1m29<br />
|-<br />
| fastText, h = 200, bigram <br />
| 45.1 <br />
| 13m38 <br />
| 1m37<br />
|}<br />
<br />
<br />
<br />
The table above compares the fastText model to the baselines. The "precision-at-one" (prec@1) metric reports the proportion of items for which the highest-ranking tag that the model predicts is in fact one of the real tags.<br />
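<br />
As an illustration of the metric, the sketch below computes prec@1 for the five validation examples listed in the Dataset section (tag lists abbreviated). The function name is an assumption, and the result is only a toy value for these five items, not a number from the paper.<br />
<br />
```python
def precision_at_1(predictions, true_tags):
    """Fraction of items whose top-ranked predicted tag is among the true tags."""
    hits = sum(1 for pred, tags in zip(predictions, true_tags) if pred in tags)
    return hits / len(predictions)

# Top prediction and (abbreviated) real tags of the five example items.
predictions = ["#cosplay", "#minneapolis", "#snow", "#christmas", "#newyorkcity"]
true_tags = [
    {"#anime", "#arizona", "#cosplay", "#taiyoucon"},
    {"#2012twincitiesprideparade", "#minneapolis", "#mn", "#usa"},
    {"#beagle", "#oregon", "#snow"},
    {"#cameraphone", "#mobile"},
    {"#cleveland", "#euclidavenue"},
]
score = precision_at_1(predictions, true_tags)
```
<br />
Three of the five top predictions are among the real tags, so prec@1 for this small sample is 0.6.<br />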
<br />
<br />
We can see that the frequency baseline has the lowest accuracy. fastText was run for 5 epochs and compared to Tagspace, with and without bigrams, for hidden layers of size 50 and 200. With 50 hidden units, fastText (without bigrams) and Tagspace performed similarly, while fastText with bigrams at 50 hidden units matched the result of Tagspace at 200 (35.6 in both cases). With 200 hidden units fastText performs better still, and adding bigrams improves accuracy further.<br />
<br />
<br />
Finally, both the training and test times of fastText are significantly shorter than those of Tagspace, which must compute the scores for each class. fastText, on the other hand, has fast inference, which gives it an advantage when the number of classes is large. In particular, test times drop from hours for Tagspace to under two minutes for fastText.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, and this representation is then fed to a linear classifier. The authors test their model fastText against other models with an evaluation protocol similar to Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction dataset. fastText achieved comparable results in terms of accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
<br />
=== fastText and baseline models ===<br />
<br />
The performance of fastText was compared to several baselines related to text classification. While this paper focused on the ways fastText outperformed these classifiers, the baseline classifiers were not always evaluated at their optimal performance.<br />
<br />
* '''Tagspace''' was used as a comparison classifier in a tag prediction problem. However, the goal of this model is to predict hashtags as semantic embedding rather than image tags as in the experiment. Additionally, only a linear version of Tagspace was used as a comparison.<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35281Bag of Tricks for Efficient Text Classification2018-03-22T20:49:25Z<p>Cs3yang: /* fastText and baseline models */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have recently been utilized for text classification and have demonstrated very good performance. However, they are slow at both training and testing time, which limits their usage on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks.<br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data sets while maintaining its good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to the two tasks: tag predictions and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
<br />
=== Natural-Language Processing and Text Classification ===<br />
[https://en.wikipedia.org/wiki/Natural-language_processing Natural Language Processing] (NLP) is concerned with being able to process large amounts of natural language data, involving speech recognition, natural language understanding, and natural-language generation. Text understanding involves being able to understand the explicit or implicit meaning of elements of text such as words, phrases, sentences, and paragraphs, and making inferences about these properties of texts (Norvig, 1987). One of the main topics in NLP is text classification, which is assigning predefined categories to free-text documents, with research ranging in this field from designing the best features to choosing the best machine learning classifiers (Zhang et al. 2016). Traditionally, techniques for text classification are based on simple statistics on words that use linear classifiers such as Bag of Words and N-grams. <br />
<br />
With the advancement of deep learning and the availability of large data sets, methods of handling text understanding using deep learning techniques have become popular in recent years. In several studies these deep learning models have been shown to perform significantly better than the traditional models. The following are the deep learning models that are compared to fastText in the Experiment Section:<br />
<br />
=== Char-CNN ===<br />
<br />
<br />
=== Char-CRNN ===<br />
<br />
A character-level convolutional recurrent neural network (CRNN) is a convolutional neural network (CNN) with a recurrent layer on top. A recurrent layer can efficiently capture long-term dependencies, so a model with only a very small number of convolutional layers is needed.<br />
<br />
The recurrent layer consists of a recursive function <math>f</math> which takes as input one input vector and the previous hidden state, and returns the new hidden state <math> h_t = f(x_t, h_{t-1})</math>, where <math> x_t \in \mathbb{R}^d</math> is one time step from the input sequence <math> (x_1, x_2, \ldots, x_T)</math>.<br />
<br />
[[File:xiao and cho 2016 pic.png]]<br />
Graphical illustration of (a) a CNN and (b) proposed CRNN for character-level document classification <br />
<br />
Source: [Xiao and Cho2016] Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv: 1602.00367.<br />
<br />
=== VDCNN ===<br />
<br />
=== Tagspace ===<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. As a result, when the output space is large, classes with very few examples are often classified poorly. The authors of this paper attempt to improve on the performance of linear classifiers with the key features of a rank constraint and a fast loss approximation. Before we get into that, we must better understand how linear classifiers are trained.<br />
<br />
Consider each text and each label as a vector in space. The model trains the coordinates of the text vector so that the text vector is close to the vector of its associated label. The text vector and its label vector are input into the softmax function, which returns a score. The score is then normalized across the scores for that same text with every other possible label. The result is the probability that the text has its associated label. A stochastic gradient descent algorithm is then used to keep updating the coordinates until the probability of the correct label for every text is maximized. This is clearly computationally expensive, as the score for every possible label in the training set must be computed for a text.<br />
<br />
With K labels in the training set, the softmax function returns the probability that a text is associated with label j as:<br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
, where x represents the text vector and w represents the label vector.<br />
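<br />
The softmax formula can be evaluated directly, as in the following sketch. The text vector and the three label vectors are hypothetical values, and subtracting the maximum score before exponentiating is a standard numerical-stability step that does not change the result.<br />
<br />
```python
import math

def softmax_probabilities(x, label_vectors):
    """Return P(y = j | x) for every label vector w_j."""
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in label_vectors]
    shift = max(scores)                       # avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)                         # the normalizing denominator
    return [e / total for e in exps]

x = [1.0, 2.0]                                # hypothetical text vector
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]    # one hypothetical vector per label
probs = softmax_probabilities(x, W)
best_label = probs.index(max(probs))
```
<br />
Every label vector has to be visited to compute the denominator, which is the cost that the hierarchical softmax is designed to avoid.<br />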
<br />
Now let’s look at the improved method of linear classifiers with a rank constraint and fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the ngram features of the input are first looked up to find word representations, then averaged into hidden text representation. It is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log likelihood over the classes. The classifier trains on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
<br />
[[File:formula explained.png]]<br />
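<br />
The forward pass described above (look up the feature representations, average them, apply a linear classifier, and take the negative log likelihood of the softmax output) can be sketched as follows. The matrix names <code>A</code> (lookup table) and <code>B</code> (classifier weights), the toy values, and the function names are assumptions for illustration only; this is not the fastText code.<br />
<br />
```python
import math

def average_embeddings(feature_ids, A):
    """Hidden text representation: mean of the looked-up feature vectors."""
    dim = len(A[0])
    return [sum(A[i][k] for i in feature_ids) / len(feature_ids)
            for k in range(dim)]

def negative_log_likelihood(feature_ids, label, A, B):
    """-log softmax(B * hidden)[label] for a single document."""
    hidden = average_embeddings(feature_ids, A)
    scores = [sum(b * h for b, h in zip(row, hidden)) for row in B]
    shift = max(scores)
    log_z = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return log_z - scores[label]

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one row per word or n-gram feature
B = [[1.0, 0.0], [0.0, 1.0]]               # one row per class
doc = [0, 2]                               # indices of the features in the text
loss_class0 = negative_log_likelihood(doc, 0, A, B)
loss_class1 = negative_log_likelihood(doc, 1, A, B)
```
<br />
For this document the hidden representation is [1.0, 0.5], so class 0 has the higher score and the lower loss. Training would average this loss over all N documents and minimize it with stochastic gradient descent.<br />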
<br />
Two changes that were applied in this model architecture are the hierarchical softmax function, which improves performance with a large number of classes, and the hashing trick, which keeps the mapping of the n-grams used to capture local word order fast and memory efficient. These two changes are explained more thoroughly in the upcoming sections.<br />
<br />
=== Softmax and Hierarchy Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over the predefined classes. For each input <math>y_n</math> it outputs a probability estimate for every possible class. The denominator of the softmax normalizes these estimates so that they sum to one, and the class with the highest probability is chosen as the label for <math>y_n</math>.<br />
<br />
However, the softmax function does have a computational complexity of <math> O(Kd) </math> where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is due to the nature of the softmax function since each function calculation requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the differences in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose the classes are arranged as the leaves of a binary tree built with Huffman coding, where each node has at most two children. A Huffman tree places the classes with the lowest frequencies deeper in the tree and the classes with the highest frequencies near the root, which minimizes the path length for frequently occurring classes. Because only the inner nodes along one root-to-leaf path have to be evaluated, the computational runtime is reduced to <math> O(d \log_2(K)) </math>.<br />
<br />
A probability is calculated for each branching decision, that is, for travelling left or right from an inner node. This is done by applying the sigmoid function to the inner product of the vector <math>v_{n_i}</math> of inner node <math> n_i </math> and the output of the hidden layer of the model, <math>h</math>. The output classes are represented as the leaves of the tree, and a random walk from the root assigns a probability to each class based on the path taken to reach it. The probability of a certain class is then calculated as:<br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> is the leaf node that the class is located on, at depth <math> l+1 </math>, and <math> n_1, n_2, \ldots, n_l </math> are the nodes on the path from the root to that leaf</i><br />
<br />
In the below figure, we can see an example of a binary tree for the hierarchical softmax model. An example path from root node <math>n_1</math> to label 2 is highlighted in blue. In this case we can see that each path has an associated probability calculation and the total probability of label 2 is in line with the class probability calculation above.<br />
<br />
[[File:figure2.jpg|Binary Tree Example for the Hierarchical Softmax Model]]<br />
<br />
<br />
Note that during the training process, the vector representations of the inner nodes, <math>v_{n_i}</math>, are updated by introducing an error function, <math>Err</math>, and differentiating it with the chain rule to obtain the following:<br />
<div style="text-align: center;"> <math> \frac{\partial Err}{\partial v_{n_i}^{'}}=\frac{\partial Err}{\partial v_{n_i}^{'}h} \cdot \frac{\partial v_{n_i}^{'}h }{\partial v_{n_i}^{'}} </math> <br></div><br />
<br />
We can use this to update the vector values with the following:<br />
<br />
<div style="text-align: center;"> <math> v_{n_i}'(new) = v_{n_i}'(old) - \eta \frac{\partial Err}{\partial v_{n_i}'} </math><br></div><br />
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is a method for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized so that its elements add up to one (each entry is the relative frequency of the word). If these frequencies exceed a certain level, they activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information, because it treats each word in isolation and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the dictionary built from the training set does not cover the words of the testing set.<br />
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation: along with each word (or character, since N-grams can also be character based) it stores the adjacent words in a local window. Each word in the document is read one at a time, just as in bag of words, but a certain range of its neighbors is scanned as well. This range is known as the n-gram. With N = 1 (a Unigram) the model carries the same information as bag of words; any N greater than 1 contains more information than bag of words.<br />
<br />
A Unigram (1-gram) is the simplest possible version of this model. It does not consider previous words and just chooses words at random based on how common they are in general. A Bigram (2-gram) considers the previous word, a Trigram considers the previous two words, and so on up to N-grams, which consider the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence can be written as <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. A Bigram model approximates each factor by conditioning only on the previous word. For example, take the sentence, "How long can this go on?" With a Bigram we can model it as follows:<br />
<br />
P(“How long can this go on?”) = P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
Returning to the chain rule, the Bigram case reduces the above equation to a product of conditional probabilities that each depend only on the previous word:<br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the stronger case for N-th gram as: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
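<br />
The bigram factorization can be sketched in Python. The conditional probabilities are estimated here by relative counts on a toy corpus, which is an assumption of this sketch (the text above does not say how they are estimated); the corpus, the <code>&lt;s&gt;</code> start marker and the function name are likewise hypothetical.<br />

```python
from collections import Counter

# Assumed toy corpus of tokenized sentences.
corpus = [["how", "long", "can", "this", "go", "on", "?"],
          ["how", "long", "is", "this", "?"]]

context_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    padded = ["<s>"] + sentence               # "<s>" marks the sentence start
    context_counts.update(padded[:-1])        # every word that has a successor
    bigram_counts.update(zip(padded, padded[1:]))

def bigram_probability(sentence):
    """Product of P(w_k | w_{k-1}) using count-based estimates (no smoothing)."""
    prob = 1.0
    padded = ["<s>"] + sentence
    for prev, word in zip(padded, padded[1:]):
        if bigram_counts[(prev, word)] == 0:
            return 0.0                        # unseen bigram
        prob *= bigram_counts[(prev, word)] / context_counts[prev]
    return prob

p = bigram_probability(["how", "long", "can", "this", "go", "on", "?"])
print(p)
```

Any sentence containing an unseen bigram gets probability zero in this sketch; practical language models add smoothing to avoid that.<br />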
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide any useful predictive clues. For example, suppose you want the model to learn subject-verb agreement from the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
You would need an 11-gram to connect the subject to its verb, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The difference between these representations is illustrated by the three sentences below.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B map to the same vector. This is exactly the bag-of-words representation and illustrates the aforementioned problem: '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because the bigram takes one neighboring word into account. However, A and C still share an element, namely “I love”. If we were to further increase N, it would become easier to distinguish between the two. However, the consequence of operating with a higher N is that the run time will increase.<br />
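<br />
The comparison in the two tables can be reproduced with a short Python sketch. The helper name <code>ngrams</code> and the use of sets (presence or absence rather than counts) are choices made here for illustration.<br />

```python
def ngrams(text, n):
    """Return the list of word n-grams of a whitespace-tokenized sentence."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentences = {"A": "I love apple", "B": "apple love I", "C": "I love sentence"}

unigram_sets = {name: set(ngrams(s, 1)) for name, s in sentences.items()}
bigram_sets = {name: set(ngrams(s, 2)) for name, s in sentences.items()}

print(unigram_sets["A"] == unigram_sets["B"])   # unigrams ignore word order
print(bigram_sets["A"] == bigram_sets["B"])     # bigrams keep local order
print(bigram_sets["A"] & bigram_sets["C"])      # the bigram shared by A and C
```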
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification. It maps words to indices of an array or matrix of fixed size by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, which reduces the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps keys of arbitrary size to values in a fixed-size range. For example, consider a hash function <math> h </math> that maps each feature to the index listed for the corresponding dictionary key.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(``\text{cats"}) = 3 </math>. Consider the sentence <math> `` \text{ I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have a list of all the words in that sentence, <math> x = [``I", ``love", ``cats", ``but", ``Mary", ``hate", ``cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
By applying the hash function to each word of this sentence, we get the list of indices [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map is <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
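<br />
The hashed feature map above can be sketched in Python as follows. The dictionary <code>h</code> is the lookup table from this example; the function names are hypothetical, and <code>crc32_bucket</code> is only one assumed way to obtain a real hash function when no lookup table is available.<br />

```python
import zlib

# The toy "hash function" from the table above, stored as a lookup table.
h = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 6}

def hashed_features(tokens, hash_fn, size):
    """phi_i = number of tokens whose hash value equals i."""
    phi = [0] * size
    for token in tokens:
        phi[hash_fn(token)] += 1
    return phi

def crc32_bucket(token, size):
    """Assumed alternative: a deterministic hash of the token, reduced modulo the table size."""
    return zlib.crc32(token.encode("utf-8")) % size

x = ["I", "love", "cats", "but", "Mary", "hate", "cats"]
phi = hashed_features(x, h.__getitem__, 7)
print(phi)  # [1, 1, 1, 2, 0, 1, 1]
```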
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. For example, suppose that in the example above both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces another hash function <math> \xi </math> to determine the sign of each feature's contribution. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
If <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, with <math> \xi = 1 </math> for every other word, then our signed hash map becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, collisions will "cancel out" and therefore yield an unbiased estimate.<br />
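<br />
The collision example and its signed variant can be sketched in Python. The lookup table <code>h_collide</code> encodes the assumed collision of "Mary" and "I" at index 0, and <code>xi</code> assigns the signs used above (+1 for every word except "Mary"); all names are hypothetical.<br />

```python
def signed_hashed_features(tokens, hash_fn, sign_fn, size):
    """phi_i = sum of sign_fn(token) over the tokens whose hash value equals i."""
    phi = [0] * size
    for token in tokens:
        phi[hash_fn(token)] += sign_fn(token)
    return phi

# Same table as before, except that "Mary" now collides with "I" at index 0.
h_collide = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 0}

def xi(token):
    """Sign hash for this example: -1 for "Mary", +1 otherwise."""
    return -1 if token == "Mary" else 1

x = ["I", "love", "cats", "but", "Mary", "hate", "cats"]
unsigned = signed_hashed_features(x, h_collide.__getitem__, lambda t: 1, 7)
signed = signed_hashed_features(x, h_collide.__getitem__, xi, 7)
print(unsigned)  # [2, 1, 1, 2, 0, 1, 0]
print(signed)    # [0, 1, 1, 2, 0, 1, 0]
```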
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. However, another way to represent the features is TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to a document.<br />
<br />
'''Term Frequency (TF)''' generally measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency, such that a word is not deemed important if it is common across documents in general, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math>.<br />
<br />
In this paper, TFIDF is calculated in the same way as in [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> in document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the number of documents that contain word <math> t </math>.<br />
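<br />
The two definitions above translate directly into a small Python sketch. The toy documents and the function name <code>tfidf</code> are assumptions for illustration, and the natural logarithm is used because the text does not fix a base.<br />

```python
import math
from collections import Counter

def tfidf(documents):
    """tf = raw count; idf = log(N / number of documents containing the term)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))             # count each term once per document
    return [{t: c * math.log(n_docs / doc_freq[t]) for t, c in Counter(doc).items()}
            for doc in documents]

docs = [["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["the", "cat", "ran", "the"]]
weights = tfidf(docs)
print(weights[0])
```

Because "the" occurs in every document, its idf is log(1) = 0, so it receives zero weight no matter how often it appears.<br />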
<br />
== Experiment ==<br />
In this experiment, fastText was compared with various other text classifiers on two classification problems. <br />
<br />
The first classification problem is sentiment analysis, where fastText is compared to existing text classifiers. The second evaluates how fastText scales to a larger output space, using a tag prediction dataset.<br />
<br />
The Vowpal Wabbit library, written in C++, can also be used to implement our model. However, compared to this library, our tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
* Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. In order to test it, evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images, each with a caption, a title and tags. Here, the spotlight was on using the fastText text classifier to predict the tags associated with each image without using the image itself, relying instead on the information associated with the image, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur fewer than 100 times and to split the data into a train, validation and test set. The train set consists of approximately 90% of the dataset, the validation set of approximately 1%, and the test set of 0.5%. Table 4 shows an example of 5 items in the validation set with their Inputs (title and caption), Prediction (the tag class they are classified to) and Tags (real image tags), highlighting when the predicted class is in fact one of the tags. <br />
<br />
{| class="wikitable"<br />
|-<br />
! Input<br />
! Prediction<br />
! Tags<br />
|-<br />
| taiyoucon 2011 digitals: individuals digital photos from the anime convention taiyoucon 2011 in mesa, arizona. if you know the model and/or the character, please comment.<br />
| #cosplay<br />
| #24mm #anime #animeconvention #arizona #canon #con #convention #cos '''#cosplay''' #costume #mesa #play #taiyou #taiyoucon<br />
|-<br />
| 2012 twin cities pride 2012 twin cities pride parade<br />
| #minneapolis <br />
| #2012twincitiesprideparade '''#minneapolis''' #mn #usa<br />
|-<br />
| beagle enjoys the snowfall<br />
| #snow<br />
| #2007 #beagle #hillsboro #january #maddison #maddy #oregon '''#snow'''<br />
|-<br />
| christmas <br />
| #christmas <br />
| #cameraphone #mobile<br />
|-<br />
| euclid avenue <br />
| #newyorkcity <br />
| #cleveland #euclidavenue<br />
|}<br />
<br />
<br />
<br />
Removing infrequently occurring tags and words eliminates some noise and helps the model learn better. A [https://github.com/facebookresearch/fastText script] has been released that explains the breakdown of the data. After cleaning, the vocabulary size is 297,141 and there are 312,116 unique tags. Performance on the test set is reported as precision at 1 (prec@1): the proportion of items for which the top-ranked predicted tag is in fact one of the item's true tags (for the general definition of precision, see [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia]).<br />
<br />
==== Baselines====<br />
A frequency-based approach which simply predicts the most frequent tag is used as a baseline.<br />
<br />
The model we compare fastText to in tag prediction is [http://www.aclweb.org/anthology/D14-1194 Tagspace]. It is similar to our model but is based on [http://www.thespermwhale.com/jaseweston/papers/wsabie-ijcai.pdf WSABIE], which predicts tags based on both images and their text annotations. The Tagspace model is a convolutional neural network that predicts hashtags in the semantic context of social network posts. To obtain a faster and more comparable baseline, the experiment uses a linear version of this model. Tagspace also predicts multiple hashtags, so this experiment does not fully compare the capabilities of the Tagspace model.<br />
<br />
==== Results ====<br />
<br />
{| class="wikitable"<br />
|-<br />
! Model <br />
! prec@1<br />
! Running time - Train<br />
! Running time - Test<br />
|- <br />
| Freq. baseline <br />
| 2.2 <br />
| - <br />
| -<br />
|- <br />
| Tagspace, h = 50 <br />
| 30.1 <br />
| 3h8 <br />
| 6h<br />
|- <br />
| Tagspace, h = 200 <br />
| 35.6 <br />
| 5h32 <br />
| 15h<br />
|- <br />
| fastText, h = 50 <br />
| 30.8 <br />
| 6m40 <br />
| 48s<br />
|- <br />
| fastText, h = 50, bigram <br />
| 35.6 <br />
| 7m47 <br />
| 50s<br />
|- <br />
| fastText, h = 200 <br />
| 40.7 <br />
| 10m34 <br />
| 1m29<br />
|-<br />
| fastText, h = 200, bigram <br />
| 45.1 <br />
| 13m38 <br />
| 1m37<br />
|}<br />
<br />
<br />
<br />
The above table presents a comparison of our fastText model to the baselines. The "precision-at-one" (prec@1) metric reports the proportion of items for which the highest-ranking tag predicted by the model is in fact one of the real tags of that item.<br />
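<br />
As a sketch of how this metric is computed, the Python snippet below scores the five validation examples from the table of inputs, predictions and tags earlier in this section. The tag sets are abbreviated, and the function name <code>precision_at_one</code> is hypothetical.<br />

```python
def precision_at_one(top_predictions, true_tags):
    """Fraction of items whose top-ranked predicted tag is among their true tags."""
    hits = sum(1 for pred, tags in zip(top_predictions, true_tags) if pred in tags)
    return hits / len(top_predictions)

predictions = ["#cosplay", "#minneapolis", "#snow", "#christmas", "#newyorkcity"]
tags = [
    {"#anime", "#cosplay", "#taiyoucon"},    # abbreviated tag sets
    {"#minneapolis", "#mn", "#usa"},
    {"#beagle", "#oregon", "#snow"},
    {"#cameraphone", "#mobile"},
    {"#cleveland", "#euclidavenue"},
]
score = precision_at_one(predictions, tags)
print(score)  # 3 of the 5 predictions are correct
```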
<br />
<br />
We can see that the frequency baseline has the lowest accuracy. fastText is run for 5 epochs and compared to the Tagspace results. fastText with bigrams is also evaluated, and the models are tested with hidden layers of size 50 and 200. With 50 hidden units, fastText (without bigrams) and Tagspace perform similarly (30.8 vs. 30.1), while fastText with bigrams at 50 hidden units matches Tagspace at 200 (both 35.6). At 200 hidden units, fastText performs better still (40.7), and adding bigrams improves accuracy further (45.1). <br />
<br />
<br />
Finally, both the training and test times of fastText are significantly shorter than those of Tagspace, which must compute the scores for every class. fastText, on the other hand, has fast inference, which gives it an advantage when the number of classes is large. In particular, test times for fastText are far shorter than those of Tagspace: 1m29 versus 15h at h = 200.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, and this representation is then fed to a linear classifier. The authors test their model, fastText, against other models using an evaluation protocol similar to that of Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction dataset. fastText achieved comparable accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
<br />
=== fastText and baseline models ===<br />
<br />
The performance of fastText was compared to several baselines related to text classification. While the paper focused on the ways fastText outperformed these classifiers, the baseline classifiers were not always evaluated at their optimal performance. <br />
<br />
* '''Tagspace''' was used as a comparison classifier in a tag prediction problem. However, the goal of this model is to predict hashtags as semantic embedding rather than image tags as in the experiment. Additionally, only a linear version of Tagspace was used as a comparison.<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35280Bag of Tricks for Efficient Text Classification2018-03-22T20:41:19Z<p>Cs3yang: /* Background */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have been used more recently for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their usage on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large datasets while maintaining good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to two tasks, tag prediction and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on a billion words within ten minutes, while achieving performance on par with the state-of-the-art.”<br />
<br />
== Background ==<br />
<br />
<br />
=== Natural-Language Processing and Text Classification ===<br />
[https://en.wikipedia.org/wiki/Natural-language_processing Natural Language Processing] (NLP) is concerned with being able to process large amounts of natural language data, involving speech recognition, natural language understanding, and natural-language generation. Text understanding involves being able to understand the explicit or implicit meaning of elements of text such as words, phrases, sentences, and paragraphs, and making inferences about these properties of texts (Norvig, 1987). One of the main topics in NLP is text classification, which is assigning predefined categories to free-text documents, with research in this field ranging from designing the best features to choosing the best machine learning classifiers (Zhang et al. 2016). Traditionally, techniques for text classification are based on simple word statistics, such as bag of words and N-grams, which are fed to linear classifiers. <br />
<br />
With the advancement of deep learning and the availability of large datasets, methods of handling text understanding using deep learning techniques have become popular in recent years. These deep learning models have been shown in several studies to perform significantly better than the traditional models. The following are the deep learning models that are compared to fastText in the Experiment section:<br />
<br />
=== Char-CNN ===<br />
<br />
<br />
=== Char-CRNN ===<br />
<br />
A character-level convolutional recurrent neural network (CRNN) is a convolutional neural network (CNN) with a recurrent layer on top. A recurrent layer can efficiently capture long-term dependencies, so a model with only a very small number of convolutional layers is needed.<br />
<br />
The recurrent layer consists of a recursive function <math>f</math> which takes as input one input vector and the previous hidden state, and returns the new hidden state <math> h_t = f(x_t, h_{t-1})</math>, where <math> x_t \in \mathbb{R}^d</math> is one time step from the input sequence <math> (x_1, x_2, \ldots, x_T)</math>. <br />
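<br />
The recursion <math> h_t = f(x_t, h_{t-1})</math> can be sketched in plain Python. The text above does not specify <math>f</math>; this sketch assumes a simple tanh recurrence with hypothetical weight matrices <code>W</code> and <code>U</code> and bias <code>b</code>, purely for illustration.<br />

```python
import math

def rnn_step(x_t, h_prev, W, U, b):
    """One step h_t = tanh(W x_t + U h_prev + b), using plain Python lists."""
    return [math.tanh(sum(w * x for w, x in zip(W[i], x_t))
                      + sum(u * h for u, h in zip(U[i], h_prev))
                      + b[i])
            for i in range(len(b))]

def run_rnn(inputs, h0, W, U, b):
    """Apply the recurrence over the whole input sequence and return the last hidden state."""
    h = h0
    for x_t in inputs:
        h = rnn_step(x_t, h, W, U, b)
    return h

# Assumed toy parameters: input dimension 2, hidden dimension 2.
W = [[0.5, -0.5], [0.1, 0.2]]
U = [[0.0, 0.3], [0.3, 0.0]]
b = [0.0, 0.0]
h_T = run_rnn([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0], W, U, b)
print(h_T)
```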
<br />
[[File:xiao and cho 2016 pic.png]]<br />
Graphical illustration of (a) a CNN and (b) proposed CRNN for character-level document classification <br />
<br />
Source: [Xiao and Cho2016] Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv: 1602.00367.<br />
<br />
=== VDCNN ===<br />
<br />
=== Tagspace ===<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. As a result, generalization can suffer in a large output space where some classes have very few examples. The authors of this paper attempt to improve on the performance of linear classifiers with two key features: a rank constraint and a fast loss approximation. Before we get into that, we must better understand how linear classifiers are trained.<br />
<br />
Consider each text and each label as a vector in space. The model trains the coordinates of the text vector so that the text vector is close to the vector of its associated label. The dot product of the text vector and a label vector gives a score. The softmax function then normalizes this score across the scores of that same text with every other possible label. The result is the probability that the text has its associated label. The stochastic gradient descent algorithm is then used to keep updating the coordinates until the probability of the correct label for every text is maximized. This is clearly computationally expensive, as the score for every possible label in the training set must be computed for each text. <br />
<br />
The softmax function gives the probability that a text is associated with label j, given K labels in the training set: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label j.<br />
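<br />
The softmax formula above can be sketched in Python as follows. The vectors are toy values chosen here, and the function name is hypothetical; subtracting the maximum score is a standard numerical-stability step that does not change the result.<br />

```python
import math

def softmax_probabilities(x, label_vectors):
    """P(y = j | x) for every label j, from the dot products x . w_j."""
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in label_vectors]
    m = max(scores)                        # subtract the max to avoid overflow
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)                          # the normalizer sums over all K labels
    return [e / z for e in exps]

x = [1.0, 2.0]                                   # assumed text vector
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]         # assumed label vectors w_1..w_3
probs = softmax_probabilities(x, W)
print(probs)
```

The loop over all K label vectors in the normalizer is what makes the plain softmax expensive when the number of labels is large.<br />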
<br />
Now let’s look at the improved method of linear classifiers with a rank constraint and fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the n-gram features of the input are first looked up to find word representations, which are then averaged into a hidden text representation. This representation is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log-likelihood over the classes. The classifier is trained on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
<br />
[[File:formula explained.png]]<br />
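<br />
The forward pass described above (look up, average, linear classifier, softmax) can be sketched in Python. The matrices <code>A</code> (feature embeddings) and <code>B</code> (classifier weights), their values and the function name are assumptions for illustration; this is not the fastText implementation.<br />

```python
import math

def fasttext_forward(feature_ids, A, B):
    """Average the rows of A selected by the input features, apply B, then softmax."""
    dim = len(A[0])
    hidden = [sum(A[i][d] for i in feature_ids) / len(feature_ids) for d in range(dim)]
    scores = [sum(b_kd * h_d for b_kd, h_d in zip(row, hidden)) for row in B]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 n-gram features, hidden size 2 (toy values)
B = [[2.0, 0.0], [0.0, 2.0]]               # 2 classes (toy values)
probs = fasttext_forward([0, 2], A, B)     # a document containing features 0 and 2
nll = -math.log(probs[0])                  # negative log-likelihood if the true class is 0
print(probs, nll)
```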
<br />
Two changes that were applied in this model architecture are the hierarchical softmax function, which improves performance with a large number of classes, and the hashing trick, which manages the mapping of the n-grams used to capture local word order. These two changes are explained more thoroughly in the upcoming sections.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over predefined classes. It calculates the probability that <math>y_n</math> belongs to each possible class and outputs a probability estimate for each class. The denominator of the softmax function serves to normalize the probabilities, so that a single <math>y_n</math> receives a probability for each label and these probabilities sum to one. This provides a means to choose the label with the highest probability as the label for <math>y_n</math>. <br />
<br />
However, the softmax function has a computational complexity of <math> O(Kd) </math>, where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is because each evaluation requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the difference in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose we have a binary tree structure based on Huffman coding for the softmax function, where each node has at most two children or leaves. Huffman coding trees provide a means to optimize binary trees where the classes with lowest frequencies are placed deeper into the tree and the highest frequency classes are placed near the root of the tree, which minimizes the path for more frequently labelled classes. Huffman coding trees are efficient since computational runtime is reduced to <math> O(d \log_2(K)) </math>. <br />
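<br />
To illustrate how Huffman coding places frequent classes near the root, the Python sketch below computes the depth of each class in a Huffman tree built from class frequencies. The frequencies and the function name are assumptions for illustration.<br />

```python
import heapq
from itertools import count

def huffman_depths(frequencies):
    """Return the depth (code length) of each class in a Huffman tree."""
    tiebreak = count()                     # keeps heap entries comparable on equal frequency
    heap = [(f, next(tiebreak), [label]) for label, f in frequencies.items()]
    heapq.heapify(heap)
    depths = {label: 0 for label in frequencies}
    while len(heap) > 1:
        f1, _, labels1 = heapq.heappop(heap)   # the two least frequent subtrees
        f2, _, labels2 = heapq.heappop(heap)
        for label in labels1 + labels2:
            depths[label] += 1                 # merging pushes these leaves one level down
        heapq.heappush(heap, (f1 + f2, next(tiebreak), labels1 + labels2))
    return depths

depths = huffman_depths({"a": 50, "b": 25, "c": 15, "d": 10})
print(depths)  # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
```

The most frequent class "a" sits one step from the root, so its probability needs only one sigmoid evaluation, while the rare classes "c" and "d" need three.<br />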
<br />
A probability is calculated for each step along a path, depending on whether we travel left or right from a node. This is done by applying the sigmoid function to the product of the output vector <math>v_{n_i}</math> of each inner node <math> n_i </math> and the output value of the hidden layer of the model, <math>h</math>. The idea of this method is to represent the output classes as the leaves of this tree; a random walk then assigns probabilities to these classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> represents the leaf node that a class is located on, at depth <math> l+1 </math>, and <math> n_1, n_2, \ldots, n_l </math> represent the parent nodes of that leaf </i><br />
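<br />
The product of per-node probabilities can be sketched in Python for a small complete binary tree with four leaves. The node vectors, the hidden vector and the sign convention (+1 for a left turn, -1 for a right turn) are assumptions made for this illustration.<br />

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def path_probability(path, node_vectors, hidden):
    """path: (inner_node, direction) pairs from the root to a leaf.
    direction is +1 for a left turn and -1 for a right turn (assumed convention)."""
    prob = 1.0
    for node, direction in path:
        score = sum(v * h for v, h in zip(node_vectors[node], hidden))
        prob *= sigmoid(direction * score)   # sigmoid(z) + sigmoid(-z) = 1 at every node
    return prob

node_vectors = {"n1": [0.5, -0.2], "n2": [0.1, 0.4], "n3": [-0.3, 0.2]}  # toy values
hidden = [1.0, 2.0]

# n1 is the root, n2 its left child and n3 its right child.
leaves = {
    "label1": [("n1", 1), ("n2", 1)],
    "label2": [("n1", 1), ("n2", -1)],
    "label3": [("n1", -1), ("n3", 1)],
    "label4": [("n1", -1), ("n3", -1)],
}
probs = {label: path_probability(path, node_vectors, hidden)
         for label, path in leaves.items()}
print(probs)
```

Because the two branch probabilities at every inner node sum to one, the leaf probabilities sum to one without ever normalizing over all classes.<br />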
<br />
In the below figure, we can see an example of a binary tree for the hierarchical softmax model. An example path from root node <math>n_1</math> to label 2 is highlighted in blue. In this case we can see that each path has an associated probability calculation and the total probability of label 2 is in line with the class probability calculation above.<br />
<br />
[[File:figure2.jpg|Binary Tree Example for the Hierarchical Softmax Model]]<br />
<br />
<br />
Note that during the training process, the vector representations of the inner nodes, <math>v_{n_i}</math>, are updated by introducing an error function, <math>Err</math>, and differentiating it to obtain the following:<br />
<div style="text-align: center;"> <math> \frac{\partial Err}{\partial v_{n_i}^{'}}=\frac{\partial Err}{\partial v_{n_i}^{'}h} \cdot \frac{\partial v_{n_i}^{'}h }{\partial v_{n_i}^{'}} </math> <br></div><br />
<br />
We can use this to update the vector values with the following:<br />
<br />
<div style="text-align: center;"> <math> v_{n_i}'(new) = v_{n_i}'(old) - \eta \frac{\partial Err}{\partial v_{n_i}'} </math><br></div><br />
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' (BoW) is a representation that simplifies a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized so that its elements add up to one (that is, each count is replaced by the word's relative frequency). If these frequencies exceed a certain level, they will activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information, because it treats each word in isolation and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the training set does not include the entire dictionary of the testing set. <br />
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation by storing the n-local words adjacent to the initial word (or characters, since N-grams can also be character based). Each word in the document is read one at a time, just like in bag of words, but a certain range of its neighbors is scanned as well. This range is known as the n-gram. Any N over 1 (N = 1 is known as the unigram) contains more information than bag of words. <br />
<br />
The picture above gives an example of a unigram (1-gram), which is the simplest possible version of this model. A unigram model does not consider previous words and just chooses words at random based on how common they are in general. It also shows a bigram (2-gram), where the previous word is considered. A trigram would consider the previous two words, and so on up to N-grams, which consider the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math>\omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. A bigram model approximates each factor by conditioning only on the immediately preceding word. For example, take the sentence "How long can this go on?" Under the bigram model it factorizes as follows:<br />
<br />
P(“How long can this go on?”) ≈ P(How) P(long | How) P(can | long) P(this | can) P(go | this) P(on | go) P(? | on)<br />
<br />
Going back to the chain rule, we can approximate the full expression by a product of conditional probabilities for the bigram case: <br />
<br />
<math>P(\omega_1^n) \approx \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>,<br />
<br />
where the first factor is simply <math>P(\omega_1)</math>. We can generalize this to the N-gram case as: <br />
<br />
<math>P(\omega_1^n) \approx \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide any useful predictive clues. For example, suppose you want the model to learn subject-verb agreement from the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
You would need an 11-gram to connect the subject to its verb, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The difference between these representations is illustrated by the three sentences below.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B map to the same vector. This is exactly the bag-of-words representation and illustrates the aforementioned problem: '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because the bigram takes one neighboring word into account. However, A and C still share an element, namely “I love”. If we were to further increase N, it would become easier to distinguish between the two. However, the consequence of operating with a higher N is that the run time will increase.<br />
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification. It maps words to indices of an array or matrix of fixed size by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, which reduces the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps keys of arbitrary size to values in a fixed-size range. For example, consider a hash function <math> h </math> that maps each feature to the index listed for the corresponding dictionary key.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(``\text{cats"}) = 3 </math>. Consider the sentence <math> `` \text{ I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have a list of all the words in that sentence, <math> x = [``I", ``love", ``cats", ``but", ``Mary", ``hate", ``cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
By applying the hash function to each word of this sentence, we get the list of indices [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map is <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
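<br />
The computation above can be sketched in a few lines of Python. This is only an illustration: the dictionary <code>H</code> stands in for the toy hash function <math> h </math> from the table, and the function name <code>hashed_feature_map</code> is a hypothetical helper, not part of fastText.<br />

```python
# Toy stand-in for the hash function h: the key -> index table shown above.
H = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 6}


def hashed_feature_map(tokens, h, size):
    """Return phi, where phi[i] counts the tokens x_j with h(x_j) == i."""
    phi = [0] * size
    for tok in tokens:
        phi[h[tok]] += 1
    return phi


x = ["I", "love", "cats", "but", "Mary", "hate", "cats"]
phi = hashed_feature_map(x, H, 7)
print(phi)  # [1, 1, 1, 2, 0, 1, 1]
```

In practice the table lookup would be replaced by a real hash function reduced modulo the table size, so that unseen words can be mapped as well.<br />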
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. For example, suppose that in the example above both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces a second hash function <math> \xi </math> that determines the sign (<math>+1</math> or <math>-1</math>) with which each feature is added at its index. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
If <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, with <math> \xi = 1 </math> for every other word, then our signed hash map becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, the contributions of colliding features "cancel out" in expectation, and the hashed representation therefore gives an unbiased estimate.<br />
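<br />
The collision and its signed correction can be reproduced with a short Python sketch. The names <code>h_collide</code>, <code>xi</code> and <code>signed_hashed_feature_map</code> are assumptions made for illustration; the sign function is given as a lookup table that defaults to +1, matching the example above.<br />

```python
# Toy hash table in which "Mary" collides with "I" at index 0 (illustrative only).
h_collide = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 0}
# Toy sign function xi: -1 for "Mary", +1 for every other word.
xi = {"Mary": -1}


def signed_hashed_feature_map(tokens, h, sign, size):
    """Return phi, where phi[i] sums sign(x_j) over the tokens with h(x_j) == i."""
    phi = [0] * size
    for tok in tokens:
        phi[h[tok]] += sign.get(tok, 1)
    return phi


x = ["I", "love", "cats", "but", "Mary", "hate", "cats"]
unsigned = signed_hashed_feature_map(x, h_collide, {}, 7)
signed = signed_hashed_feature_map(x, h_collide, xi, 7)
print(unsigned)  # [2, 1, 1, 2, 0, 1, 0]
print(signed)    # [0, 1, 1, 2, 0, 1, 0]
```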
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. However, another way to represent the features is TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to a document.<br />
<br />
'''Term Frequency (TF)''' generally measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency, so that a word is not deemed important if it is common across documents in general, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the number of documents that contain the word <math> t </math>.<br />
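<br />
As a minimal sketch of the two definitions above, the following Python computes raw-count TF times log-IDF for a tiny corpus. The corpus and the function name <code>tfidf</code> are made up for illustration, and the natural logarithm is assumed.<br />

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: tf * idf} dict per document (raw counts, natural log)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "ran"],
]
weights = tfidf(docs)
print(weights[0]["the"])  # 0.0, because "the" occurs in every document
print(weights[2]["ran"])  # 2 * log(3), a frequent term in one document only
```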
<br />
== Experiment ==<br />
In this experiment, fastText was compared with various other text classifiers on two classification problems.<br />
<br />
The first classification problem is sentiment analysis, where fastText is compared to existing text classifiers. The second evaluates how fastText scales to a larger output space, using a tag prediction dataset.<br />
<br />
The Vowpal Wabbit library, written in C++, can also be used to implement our model. However, compared to this library, our tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
*Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. To test it, the evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images, each with a caption, a title and tags. The focus was on using the fastText text classifier to predict the tags associated with each image without using the image itself, relying instead on the text associated with the image, namely its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur fewer than 100 times and split the data into a train, validation and test set. The train set consists of approximately 90% of the dataset, the validation set consists of approximately 1%, and the test set of 0.5%. Table 4 shows 5 example items from the validation set with their Inputs (title and caption), Prediction (the tag class they are classified to) and Tags (real image tags), highlighting when the predicted class is in fact one of the tags.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Input<br />
! Prediction<br />
! Tags<br />
|-<br />
| taiyoucon 2011 digitals: individuals digital photos from the anime convention taiyoucon 2011 in mesa, arizona. if you know the model and/or the character, please comment.<br />
| #cosplay<br />
| #24mm #anime #animeconvention #arizona #canon #con #convention #cos '''#cosplay''' #costume #mesa #play #taiyou #taiyoucon<br />
|-<br />
| 2012 twin cities pride 2012 twin cities pride parade<br />
| #minneapolis <br />
| #2012twincitiesprideparade '''#minneapolis''' #mn #usa<br />
|-<br />
| beagle enjoys the snowfall<br />
| #snow<br />
| #2007 #beagle #hillsboro #january #maddison #maddy #oregon '''#snow'''<br />
|-<br />
| christmas <br />
| #christmas <br />
| #cameraphone #mobile<br />
|-<br />
| euclid avenue <br />
| #newyorkcity <br />
| #cleveland #euclidavenue<br />
|}<br />
<br />
<br />
<br />
Removing infrequently occurring tags and words eliminates some noise and helps our model learn better. A [https://github.com/facebookresearch/fastText script] has been released that documents how the data was split. After cleaning, the vocabulary size is 297,141 and there are 312,116 unique tags. Performance on the test set is reported as precision at 1 (Prec@1), the proportion of items whose top-ranked predicted tag is in fact one of the item's real tags (for the general definition of precision, see [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia]).<br />
<br />
==== Baselines====<br />
A frequency-based approach which simply predicts the most frequent tag is used as a baseline.<br />
<br />
The model we compare fastText to in tag prediction is [http://www.aclweb.org/anthology/D14-1194 Tagspace]. It is similar to our model but is based on [http://www.thespermwhale.com/jaseweston/papers/wsabie-ijcai.pdf WSABIE], which predicts tags based on both images and their text annotations. The Tagspace model is a convolutional neural network that predicts hashtags from the semantic context of social network posts. For faster and more comparable performance, the experiment uses a linear version of this model. Tagspace can also predict multiple hashtags, so this experiment does not fully compare the capabilities of the Tagspace model.<br />
<br />
==== Results ====<br />
<br />
{| class="wikitable"<br />
|-<br />
! Model <br />
! prec@1<br />
! Running time - Train<br />
! Running time - Test<br />
|- <br />
| Freq. baseline <br />
| 2.2 <br />
| - <br />
| -<br />
|- <br />
| Tagspace, h = 50 <br />
| 30.1 <br />
| 3h8 <br />
| 6h<br />
|- <br />
| Tagspace, h = 200 <br />
| 35.6 <br />
| 5h32 <br />
| 15h<br />
|- <br />
| fastText, h = 50 <br />
| 30.8 <br />
| 6m40 <br />
| 48s<br />
|- <br />
| fastText, h = 50, bigram <br />
| 35.6 <br />
| 7m47 <br />
| 50s<br />
|- <br />
| fastText, h = 200 <br />
| 40.7 <br />
| 10m34 <br />
| 1m29<br />
|-<br />
| fastText, h = 200, bigram <br />
| 45.1 <br />
| 13m38 <br />
| 1m37<br />
|}<br />
<br />
<br />
<br />
The above table presents a comparison of our fastText model to other baselines. The "Precision-at-one" (Prec@1) metric reports the proportion of items for which the highest-ranking tag that the model predicts is in fact one of the real tags.<br />
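<br />
The metric can be sketched as follows. The function name <code>precision_at_1</code> is an assumption, and the data are the five validation examples from the Dataset section with their tag sets shortened for readability.<br />

```python
def precision_at_1(top_predictions, true_tags):
    """Fraction of items whose top-ranked predicted tag is among the true tags."""
    hits = sum(1 for pred, tags in zip(top_predictions, true_tags) if pred in tags)
    return hits / len(top_predictions)


# Top prediction and (shortened) true tag set for each example item.
top_predictions = ["#cosplay", "#minneapolis", "#snow", "#christmas", "#newyorkcity"]
true_tags = [
    {"#anime", "#cosplay", "#costume"},
    {"#minneapolis", "#mn", "#usa"},
    {"#beagle", "#oregon", "#snow"},
    {"#cameraphone", "#mobile"},
    {"#cleveland", "#euclidavenue"},
]
score = precision_at_1(top_predictions, true_tags)
print(score)  # 0.6, since three of the five top predictions are real tags
```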
<br />
<br />
We can see that the frequency baseline has the lowest accuracy. fastText is run for 5 epochs and compared to the Tagspace results, with and without bigrams, for hidden layers of size 50 and 200 (the value of <math>h</math> in the table). With 50 hidden units, fastText without bigrams and Tagspace perform similarly (30.8 vs. 30.1), while fastText with bigrams at 50 hidden units matches Tagspace at 200 (both 35.6). With 200 hidden units, fastText performs better still (40.7), and adding bigrams improves accuracy further (45.1).<br />
<br />
<br />
Finally, both the training and test times of our model are significantly shorter than those of Tagspace, which must compute the scores for every class. fastText, on the other hand, has fast inference, which gives it an advantage when the number of classes is large. In particular, with 200 hidden units the test time for fastText is under two minutes, compared to 15 hours for Tagspace.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, which are then fed to a linear classifier. The authors test their model, fastText, against other models using an evaluation protocol similar to Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction data set. fastText achieved comparable results in terms of accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
<br />
=== fastText and baseline models ===<br />
<br />
The performance of fastText was compared to several baselines related to text classification. While this paper focused on the ways fastText outperformed these classifiers, the baseline classifiers were not always evaluated at their optimal performance.<br />
<br />
* '''Tagspace''' was used as a comparison classifier in a tag prediction problem. However, the goal of this model is to predict hashtags as semantic embedding rather than image tags as in the experiment.<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35278Bag of Tricks for Efficient Text Classification2018-03-22T20:37:59Z<p>Cs3yang: /* Commentary and Criticism */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have more recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their use on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks.<br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data sets while maintaining its good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to the two tasks: tag predictions and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
<br />
=== Natural-Language Processing and Text Classification ===<br />
[https://en.wikipedia.org/wiki/Natural-language_processing Natural Language Processing] (NLP) is concerned with being able to process large amounts of natural language data, involving speech recognition, natural language understanding, and natural-language generation. Text understanding involves being able to understand the explicit or implicit meaning of elements of text such as words, phrases, sentences, and paragraphs, and making inferences about these properties of texts (Norvig, 1987). One of the main topics in NLP is text classification, which is assigning predefined categories to free-text documents, with research ranging in this field from designing the best features to choosing the best machine learning classifiers (Zhang et al. 2016). Traditionally, techniques for text classification are based on simple statistics on words that use linear classifiers such as Bag of Words and N-grams. <br />
<br />
With the advancement of deep learning and the availability of large data sets, methods of handling text understanding using deep learning techniques have become popular in recent years. These deep learning models have been shown to significantly perform better than these traditional models in several studies. The following are the deep learning models that are compared to fastText in the Experiment Section:<br />
<br />
=== Char-CNN ===<br />
<br />
<br />
=== Char-CRNN ===<br />
<br />
A character-level convolutional recurrent neural network (CRNN) is a convolutional neural network (CNN) with a recurrent layer on top. A recurrent layer can efficiently capture long-term dependencies, so a model with only a very small number of convolutional layers is needed.<br />
<br />
The recurrent layer consists of a recursive function <math>f</math> which takes as input one input vector and the previous hidden state, and returns the new hidden state <math> h_t = f(x_t, h_{t-1})</math>, where <math> x_t \in \mathbb{R}^d</math> is one time step from the input sequence <math> (x_1, x_2, \ldots, x_T)</math>.<br />
<br />
[[File:xiao and cho 2016 pic.png]]<br />
Graphical illustration of (a) a CNN and (b) proposed CRNN for character-level document classification <br />
<br />
Source: [Xiao and Cho2016] Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv: 1602.00367.<br />
<br />
=== VDCNN ===<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly when the output space is large and some classes have very few examples. The authors of this paper attempt to improve on the performance of linear classifiers with the key features of a rank constraint and a fast loss approximation. Before we get into that, we must better understand how linear classifiers are trained.<br />
<br />
Consider each text and each label as a vector in space. The model trains the coordinates of the text vector so that the text vector is close to the vector of its associated label. The inner product of the text vector and a label vector gives a score. The softmax function then normalizes this score across the scores of that same text with every other possible label. The result is the probability that the text has its associated label. A stochastic gradient descent algorithm is then used to keep updating the coordinates until the probability of the correct label is maximized for every text. This is clearly computationally expensive, as the score for every possible label in the training set must be computed for each text.<br />
<br />
The softmax function gives the probability that a text is associated with label <math>j</math>, given <math>K</math> labels in the training set:<br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math>,<br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label <math>j</math>.<br />
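<br />
A minimal Python sketch of this formula is given below. The vectors are made-up toy values and <code>softmax_probs</code> is a hypothetical helper name. Subtracting the maximum score before exponentiating is a standard numerical-stability trick that is not part of the formula itself and does not change the result.<br />

```python
import math


def softmax_probs(x, W):
    """Return P(y = j | x) for every label vector w_j in W."""
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in W]
    # Shift by the maximum score for numerical stability (result is unchanged).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


x = [1.0, 2.0]                               # toy text vector
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # one toy vector per label (K = 3)
probs = softmax_probs(x, W)
best = max(range(len(probs)), key=probs.__getitem__)
print(best)  # 1, the label whose vector has the largest inner product with x
```

Note that the loop over <code>W</code> touches every label, which is the <math>O(Kd)</math> cost that motivates the hierarchical softmax.<br />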
<br />
Now let’s look at the improved method of linear classifiers with a rank constraint and fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the ngram features of the input are first looked up to find word representations, then averaged into hidden text representation. It is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log likelihood over the classes. The classifier trains on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
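<br />
The forward pass described above (look up, average, linear classifier, softmax) can be sketched as follows. This is an assumption-laden toy and not the fastText implementation: the embedding table <code>A</code>, the class weight matrix <code>B</code> and the function name <code>forward</code> are all made up for illustration.<br />

```python
import math


def forward(features, A, B):
    """Average the feature embeddings, apply a linear layer, then a softmax."""
    dim = len(next(iter(A.values())))
    # Hidden text representation: mean of the looked-up feature vectors.
    hidden = [0.0] * dim
    for f in features:
        for i, a in enumerate(A[f]):
            hidden[i] += a / len(features)
    # Linear classifier: one score per class.
    scores = [sum(b * h for b, h in zip(row, hidden)) for row in B]
    # Softmax over the classes (shifted by the maximum for stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return hidden, [e / total for e in exps]


# Toy embeddings for the unigram and bigram features of "I love cats".
A = {
    "I": [1.0, 0.0],
    "love": [0.0, 1.0],
    "cats": [1.0, 1.0],
    "I love": [0.5, 0.5],
    "love cats": [0.0, 2.0],
}
B = [[1.0, 0.0], [0.0, 1.0]]  # toy weights for two classes
hidden, probs = forward(["I", "love", "cats", "I love", "love cats"], A, B)
```

Training would adjust <code>A</code> and <code>B</code> by stochastic gradient descent on the negative log likelihood; that step is omitted here.<br />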
<br />
[[File:formula explained.png]]<br />
<br />
Two changes that were applied in this model architecture are the hierarchical softmax function, which improves performance with a large number of classes, and the hashing trick, which manages the mapping of the n-gram features that capture local word order. These two changes are explained more thoroughly in the upcoming sections.<br />
<br />
=== Softmax and Hierarchy Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over the predefined classes. It calculates the probability that <math>y_n</math> belongs to each possible class and outputs a probability estimate for each class. The denominator of the softmax function normalizes the probabilities, so that the probabilities assigned to a single <math>y_n</math> over all labels sum to one. This provides a means to choose the label with the highest probability as the label for <math>y_n</math>.<br />
<br />
However, the softmax function does have a computational complexity of <math> O(Kd) </math> where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is due to the nature of the softmax function since each function calculation requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the differences in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose we have a binary tree structure based on Huffman coding for the softmax function, where each node has at most two children or leaves. Huffman coding trees provide a means to optimize binary trees where the classes with lowest frequencies are placed deeper into the tree and the highest frequency classes are placed near the root of the tree, which minimizes the path for more frequently labelled classes. Huffman coding trees are efficient since computational runtime is reduced to <math> O(d \log_2(K)) </math>. <br />
<br />
A probability for each path, whether we are travelling right or left from a node, is calculated. This is done by applying the sigmoid function to the product of the output vector <math>v_{n_i}</math> of each inner node <math> n </math> and the output value of the hidden layer of the model, <math>h</math>. The idea of this method is to represent the output classes as the leaves on this tree and a random walk then assigns probabilities for these classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n</math> represents the leaf node that a class is located on with depth <math> l+1 </math> and <math> n_1, n_2, …, n_l </math> represents the parent nodes of that leaf </i><br />
<br />
In the below figure, we can see an example of a binary tree for the hierarchical softmax model. An example path from root node <math>n_1</math> to label 2 is highlighted in blue. In this case we can see that each path has an associated probability calculation and the total probability of label 2 is in line with the class probability calculation above.<br />
<br />
[[File:figure2.jpg|Binary Tree Example for the Hierarchical Softmax Model]]<br />
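<br />
The class probability above can be sketched in Python for a tree with three leaves. Everything here is an illustrative assumption: the tree shape, the vectors, the helper name <code>leaf_probability</code>, and the convention that going left at an inner node has probability <math>\sigma(v_{n_i} \cdot h)</math> while going right has the complementary probability.<br />

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def leaf_probability(path, h):
    """Multiply the branch probabilities along the path from the root to a leaf.

    path is a list of (v, go_left) pairs, one per inner node on the path.
    """
    p = 1.0
    for v, go_left in path:
        score = sum(vi * hi for vi, hi in zip(v, h))
        p *= sigmoid(score) if go_left else 1.0 - sigmoid(score)
    return p


h = [0.5, -1.0]       # toy hidden-layer output
v_root = [1.0, 0.0]   # toy vector of the root node
v_inner = [0.0, 2.0]  # toy vector of the root's right child

paths = {
    "label_1": [(v_root, True)],
    "label_2": [(v_root, False), (v_inner, True)],
    "label_3": [(v_root, False), (v_inner, False)],
}
probs = {label: leaf_probability(path, h) for label, path in paths.items()}
total = sum(probs.values())
```

Each leaf only requires one sigmoid per inner node on its path, and the leaf probabilities still sum to one without an explicit normalization over all classes.<br />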
<br />
<br />
Note that during the training process, the vector representations of the inner nodes, <math>v_{n_i}</math>, are updated by introducing an error function <math>Err</math> and differentiating it with the chain rule to obtain the following:<br />
<div style="text-align: center;"> <math> \frac{\partial Err}{\partial v_{n_i}^{'}}=\frac{\partial Err}{\partial v_{n_i}^{'}h} \cdot \frac{\partial v_{n_i}^{'}h }{\partial v_{n_i}^{'}} </math> <br></div><br />
<br />
We can use this to update the vector values with the following:<br />
<br />
<div style="text-align: center;"> <math> v_{n_i}'(new) = v_{n_i}'(old) - &eta; \frac{\partial Err}{\partial v_{n_i}'} </math><br></div><br />
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is an algorithm for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized such that all the elements of the vector add up to one (taking the relative frequency of each word). If these frequencies exceed a certain level, they will activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information, because it considers single words and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the training set does not cover the entire dictionary of the testing set.<br />
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation by storing the n-local words adjacent to the initial word (or character, since N-grams can also be character based). Each word in the document is read one at a time just like bag of words, but a certain range of its neighbors is scanned as well. This range is known as the n-gram. The case N = 1 (the unigram) is equivalent to bag of words, and any N greater than 1 contains more information than bag of words.<br />
<br />
In the picture above, it gives an example of a Unigram (1-gram), which is the simplest possible version of this model. A unigram does not consider previous words and just chooses words based on how common they are in general. It also shows a Bigram (2-gram), where the previous word is considered. A trigram would consider the previous two words, and so on up to N-grams, which consider the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. A bigram model approximates each factor by conditioning only on the previous word. For example, take the sentence, "How long can this go on?" Under a bigram model, it is modelled as follows:<br />
<br />
P(“How long can this go on?”) = P(How) P(long | How) P(can | long) P(this | can) P(go | this) P(on | go) P(? | on)<br />
<br />
Going back to the chain rule, for the bigram case we can reduce the above equation to the following product of conditional probabilities:<br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case as:<br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
<br />
<br />
The '''weakness''' with N-gram is that many times local context does not provide any useful predictive clues. For example, if you want the model to learn plural usage of the following sentence:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
You would need an 11-gram to capture the agreement between the subject and the verb, which is infeasible on ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The following example compares the unigram (bag of words) and bigram representations of three sentences.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B have the same vector. This is the bag-of-words problem mentioned above: '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because a bigram takes one adjacent word of local context into account. However, A and C still share an element, namely "I love". If we were to increase N further, it would become easier to distinguish the two. The consequence of operating with higher-order N-grams, however, is that the run time increases.<br />
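<br />
The comparison of sentences A, B and C can be reproduced with a short Python sketch. The helper name <code>ngram_counts</code> is an assumption, and tokenization is simplified to splitting on whitespace.<br />

```python
from collections import Counter


def ngram_counts(sentence, n):
    """Count the word n-grams of a sentence (tokens are split on whitespace)."""
    tokens = sentence.split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


A = "I love apple"
B = "apple love I"
C = "I love sentence"

same_unigrams = ngram_counts(A, 1) == ngram_counts(B, 1)  # word order is lost
same_bigrams = ngram_counts(A, 2) == ngram_counts(B, 2)   # word order matters
shared_AC = set(ngram_counts(A, 2)) & set(ngram_counts(C, 2))

print(same_unigrams)  # True
print(same_bigrams)   # False
print(shared_AC)      # {'I love'}
```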
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification. It maps words to indices of a fixed-size array or matrix by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, which reduces the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps keys of arbitrary size to values in a fixed-size range. For example, consider a hash function <math> h </math> that maps each feature to the index listed for its key in the following dictionary.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(``\text{cats"}) = 3 </math>. Consider the sentence <math> `` \text{ I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have a list of all the words in the sentence, <math> x = [``I", ``love", ``cats", ``but", ``Mary", ``hate", ``cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
Applying the hash function to each word of this sentence gives the list of indices [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map is<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. For example, suppose that in the example above both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces a second hash function <math> \xi </math> that determines the sign (<math>+1</math> or <math>-1</math>) with which each feature is added at its index. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
If <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, with <math> \xi = 1 </math> for every other word, then our signed hash map becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, the contributions of colliding features "cancel out" in expectation, and the hashed representation therefore gives an unbiased estimate.<br />
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. However, another way to represent the features is TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to a document.<br />
<br />
'''Term Frequency (TF)''' generally measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency, so that a word is not deemed important if it is common across documents in general, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the number of documents that contain the word <math> t </math>.<br />
<br />
== Experiment ==<br />
In this experiment, fastText was compared with various other text classifiers on two classification problems.<br />
<br />
The first classification problem is sentiment analysis, where fastText is compared to existing text classifiers. The second evaluates how fastText scales to a larger output space, using a tag prediction dataset.<br />
<br />
The Vowpal Wabbit library, written in C++, can also be used to implement our model. However, compared to this library, our tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
*Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. To test it, the evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images, each with a caption, a title and tags. The focus was on using the fastText text classifier to predict the tags associated with each image without using the image itself, relying instead on the text associated with the image, namely its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur fewer than 100 times and split the data into a train, validation and test set. The train set consists of approximately 90% of the dataset, the validation set consists of approximately 1%, and the test set of 0.5%. Table 4 shows 5 example items from the validation set with their Inputs (title and caption), Prediction (the tag class they are classified to) and Tags (real image tags), highlighting when the predicted class is in fact one of the tags.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Input<br />
! Prediction<br />
! Tags<br />
|-<br />
| taiyoucon 2011 digitals: individuals digital photos from the anime convention taiyoucon 2011 in mesa, arizona. if you know the model and/or the character, please comment.<br />
| #cosplay<br />
| #24mm #anime #animeconvention #arizona #canon #con #convention #cos '''#cosplay''' #costume #mesa #play #taiyou #taiyoucon<br />
|-<br />
| 2012 twin cities pride 2012 twin cities pride parade<br />
| #minneapolis <br />
| #2012twincitiesprideparade '''#minneapolis''' #mn #usa<br />
|-<br />
| beagle enjoys the snowfall<br />
| #snow<br />
| #2007 #beagle #hillsboro #january #maddison #maddy #oregon '''#snow'''<br />
|-<br />
| christmas <br />
| #christmas <br />
| #cameraphone #mobile<br />
|-<br />
| euclid avenue <br />
| #newyorkcity <br />
| #cleveland #euclidavenue<br />
|}<br />
<br />
<br />
<br />
Removing infrequently occurring tags and words eliminates some noise and helps the model learn better. A [https://github.com/facebookresearch/fastText script] has been released that recreates this split of the data. After cleaning, the vocabulary size is 297,141 and there are 312,116 unique tags. Results on the test set are reported as precision at 1 (Prec@1), the fraction of items whose top-ranked predicted tag is one of their real tags (for the general definition of precision, see [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia]).<br />
<br />
==== Baselines====<br />
A frequency-based approach which simply predicts the most frequent tag is used as a baseline.<br />
<br />
The model fastText is compared to in tag prediction is [http://www.aclweb.org/anthology/D14-1194 Tagspace]. It is similar to fastText but is based on [http://www.thespermwhale.com/jaseweston/papers/wsabie-ijcai.pdf WSABIE], which predicts tags based on both images and their text annotations. The Tagspace model is a convolutional neural network that predicts hashtags from the semantic context of social network posts. For a faster and more comparable baseline, the experiment uses a linear version of this model. Tagspace can also predict multiple hashtags, so this experiment does not fully exercise the capabilities of the Tagspace model.<br />
<br />
==== Results ====<br />
<br />
{| class="wikitable"<br />
|-<br />
! Model <br />
! prec@1<br />
! Running time - Train<br />
! Running time - Test<br />
|- <br />
| Freq. baseline <br />
| 2.2 <br />
| - <br />
| -<br />
|- <br />
| Tagspace, h = 50 <br />
| 30.1 <br />
| 3h8 <br />
| 6h<br />
|- <br />
| Tagspace, h = 200 <br />
| 35.6 <br />
| 5h32 <br />
| 15h<br />
|- <br />
| fastText, h = 50 <br />
| 30.8 <br />
| 6m40 <br />
| 48s<br />
|- <br />
| fastText, h = 50, bigram <br />
| 35.6 <br />
| 7m47 <br />
| 50s<br />
|- <br />
| fastText, h = 200 <br />
| 40.7 <br />
| 10m34 <br />
| 1m29<br />
|-<br />
| fastText, h = 200, bigram <br />
| 45.1 <br />
| 13m38 <br />
| 1m37<br />
|}<br />
<br />
<br />
<br />
The above table compares fastText to the baselines. The "Precision-at-one" (Prec@1) metric reports the proportion of items for which the highest-ranking tag predicted by the model is in fact one of the item's real tags.<br />
<br />
<br />
The frequency baseline has the lowest accuracy. fastText is run for 5 epochs and compared to Tagspace, with and without bigrams, for hidden layers of size 50 and 200. With 50 hidden units, fastText (without bigrams) and Tagspace perform similarly, while fastText with bigrams at 50 hidden units reaches the same Prec@1 in the table above as Tagspace at 200. With 200 hidden units fastText performs better still, and adding bigrams improves accuracy further. <br />
<br />
<br />
Finally, both the train and test times of fastText are significantly shorter than those of Tagspace, which must compute the scores for each class. fastText, on the other hand, has fast inference, which gives it an advantage when the number of classes is large. The difference is largest at test time: seconds to minutes for fastText versus hours for Tagspace.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, and the averaged representation is then fed to a linear classifier. The authors test fastText against other models using an evaluation protocol similar to Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction data set. fastText achieved comparable accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
<br />
=== fastText and baseline models ===<br />
<br />
The performance of fastText was compared to several baselines related to text classification. While this paper focused on the ways fastText outperformed these classifiers, the baseline classifiers were not always evaluated at their optimal performance. <br />
<br />
* '''Tagspace''' was used as a comparison classifier in a tag prediction problem. However, the goal of this model is to predict hashtags as semantic embeddings rather than image tags as in the experiment.<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35277Bag of Tricks for Efficient Text Classification2018-03-22T20:27:40Z<p>Cs3yang: /* Results */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have more recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their use on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data sets while maintaining good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to the two tasks: tag predictions and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
<br />
=== Natural-Language Processing and Text Classification ===<br />
[https://en.wikipedia.org/wiki/Natural-language_processing Natural Language Processing] (NLP) is concerned with processing large amounts of natural language data, and involves speech recognition, natural language understanding, and natural-language generation. Text understanding involves being able to understand the explicit or implicit meaning of elements of text such as words, phrases, sentences, and paragraphs, and making inferences about these properties of texts (Norvig, 1987). One of the main topics in NLP is text classification, the assignment of predefined categories to free-text documents, with research in this field ranging from designing the best features to choosing the best machine learning classifiers (Zhang et al. 2016). Traditionally, techniques for text classification are based on simple word statistics, such as Bag of Words and N-gram features, combined with linear classifiers. <br />
<br />
With the advancement of deep learning and the availability of large data sets, deep learning approaches to text understanding have become popular in recent years. Several studies have shown these deep learning models to perform significantly better than the traditional models. The following are the deep learning models that are compared to fastText in the Experiment section:<br />
<br />
=== Char-CNN ===<br />
<br />
<br />
=== Char-CRNN ===<br />
<br />
A character-level convolutional recurrent neural network (Char-CRNN) is a convolutional neural network (CNN) with a recurrent layer on top. A recurrent layer can efficiently capture long-term dependencies, so only a very small number of convolutional layers is needed.<br />
<br />
The recurrent layer consists of a recursive function <math>f</math> which takes as input one input vector and the previous hidden state, and returns the new hidden state <math> h_t = f(x_t, h_{t-1})</math>, where <math> x_t \in \mathbb{R}^d</math> is one time step from the input sequence <math> (x_1, x_2, \ldots, x_T)</math>. <br />
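<br />
As a rough illustration of the recurrence above, the following sketch uses a plain tanh update for <math>f</math>. The choice of <math>f</math>, the weight names <code>W_x</code> and <code>W_h</code>, and the toy numbers are assumptions made for illustration only; the CRNN of Xiao and Cho may use a different recurrent unit.<br />

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rnn_step(x_t, h_prev, W_x, W_h):
    """One application of h_t = f(x_t, h_{t-1}), with f assumed to be a tanh layer."""
    pre = [a + b for a, b in zip(matvec(W_x, x_t), matvec(W_h, h_prev))]
    return [math.tanh(p) for p in pre]

def run_rnn(xs, h0, W_x, W_h):
    """Fold the recurrence over the whole input sequence (x_1, ..., x_T)."""
    h = h0
    for x_t in xs:
        h = rnn_step(x_t, h, W_x, W_h)
    return h

# Toy example: d = 2 input features and 2 hidden units (hypothetical weights).
W_x = [[0.5, -0.5], [0.25, 0.75]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_T = run_rnn(sequence, [0.0, 0.0], W_x, W_h)
print(h_T)  # final hidden state, one value per hidden unit
```

The final hidden state summarizes the whole sequence, which is why few convolutional layers are needed below it.<br />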
<br />
[[File:xiao and cho 2016 pic.png]]<br />
Graphical illustration of (a) a CNN and (b) proposed CRNN for character-level document classification <br />
<br />
Source: [Xiao and Cho2016] Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv: 1602.00367.<br />
<br />
=== VDCNN ===<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly when the output space is large and some classes have very few examples. The authors of this paper attempt to improve on the performance of linear classifiers with two key features: a rank constraint and a fast loss approximation. Before getting into that, it helps to understand how a linear classifier is trained.<br />
<br />
Consider each text and each label as a vector in space. The model learns the coordinates of the text vector so that the text vector is close to the vector of its associated label. The score of a text for a label is the dot product of the text vector and the label vector. The softmax function then normalizes this score against the scores of the same text with every other possible label. The result is the probability that the text has its associated label. Stochastic gradient descent is then used to keep updating the coordinates so as to maximize the probability of the correct label for every text. This is computationally expensive, as the score for every possible label in the training set must be computed for each text. <br />
<br />
With K labels in the training set, the softmax function gives the probability that a text is associated with label j as: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label <math>j</math>.<br />
<br />
Now let’s look at the improved method of linear classifiers with a rank constraint and fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the n-gram features of the input are first looked up to find word representations, which are then averaged into a hidden text representation. This representation is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log likelihood over the classes. The classifier trains on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
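<br />
The forward pass described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy embedding table and the hand-picked class weights are all hypothetical, and no training is performed.<br />

```python
import math

def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def softmax(scores):
    """Normalize scores into probabilities (max subtracted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(tokens, embeddings, class_weights):
    """Look up token vectors, average them, apply the linear classifier, then softmax."""
    hidden = average([embeddings[t] for t in tokens])
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in class_weights]
    return softmax(scores)

def negative_log_likelihood(tokens, label, embeddings, class_weights):
    """Loss contribution of one document with true class index `label`."""
    return -math.log(predict_proba(tokens, embeddings, class_weights)[label])

# Toy, hand-picked parameters (hypothetical, not learned).
embeddings = {"good": [1.0, 0.0], "bad": [0.0, 1.0], "movie": [0.5, 0.5]}
class_weights = [[2.0, 0.0],   # class 0: "positive"
                 [0.0, 2.0]]   # class 1: "negative"

probs = predict_proba(["good", "movie"], embeddings, class_weights)
loss = negative_log_likelihood(["good", "movie"], 0, embeddings, class_weights)
print(probs, loss)
```

Training would adjust both the embedding table and the class weights by stochastic gradient descent on the summed loss.<br />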
<br />
[[File:formula explained.png]]<br />
<br />
Two additions to this model architecture are the hierarchical softmax function, which improves running time when the number of classes is large, and the hashing trick, which maps n-grams (used to capture local word order) into a fixed number of buckets. Both are explained more thoroughly in the upcoming sections.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over predefined classes. It calculates the probability of <math>y_n</math> being each possible class and outputs a probability estimate for each class. The denominator serves to normalize the probabilities, so that the probabilities <math>y_n</math> receives over all labels sum to one. The label with the highest probability is then chosen as the label for <math>y_n</math>. <br />
<br />
However, the softmax function does have a computational complexity of <math> O(Kd) </math> where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is due to the nature of the softmax function since each function calculation requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the differences in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose we have a binary tree structure based on Huffman coding for the softmax function, where each node has at most two children or leaves. Huffman coding trees provide a means to optimize binary trees where the classes with lowest frequencies are placed deeper into the tree and the highest frequency classes are placed near the root of the tree, which minimizes the path for more frequently labelled classes. Huffman coding trees are efficient since computational runtime is reduced to <math> O(d \log_2(K)) </math>. <br />
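<br />
To make the effect of Huffman coding concrete, the sketch below builds a Huffman tree from class frequencies and reports the depth of each class. It is an illustrative sketch only: the function name and the tag counts are made up, and it uses the standard-library <code>heapq</code> module rather than anything from fastText.<br />

```python
import heapq
from itertools import count

def huffman_depths(class_counts):
    """Return {class: depth in the Huffman tree} for the given class frequencies."""
    tiebreak = count()  # unique counter so that dicts are never compared on ties
    heap = [(c, next(tiebreak), {label: 0}) for label, c in class_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every class in them moves one level down.
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        merged = {label: depth + 1 for label, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical tag frequencies.
depths = huffman_depths({"#snow": 50, "#cosplay": 30, "#mesa": 15, "#cos": 5})
print(depths)
```

In this toy example the most frequent tag sits one step below the root, while the two rarest tags sit three steps down.<br />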
<br />
A probability is calculated for each step along a path, depending on whether we travel right or left from a node. This is done by applying the sigmoid function to the product of the output vector <math>v_{n_i}</math> of each inner node <math> n_i </math> and the output of the hidden layer of the model, <math>h</math>. The idea of this method is to represent the output classes as the leaves of the tree; a random walk from the root then assigns probabilities to these classes based on the path taken. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> represents the leaf node that a class is located on, at depth <math> l+1 </math>, and <math> n_1, n_2, \ldots, n_l </math> represent the parent nodes of that leaf </i><br />
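<br />
The product above can be sketched as follows. The convention that the sigmoid gives the probability of going left (and one minus the sigmoid the probability of going right) is an assumption made for this illustration, as are the node vectors and the three-leaf tree.<br />

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def leaf_probability(path, hidden):
    """Probability of a leaf, given its path from the root.

    `path` is a list of (node_vector, go_left) pairs, one per inner node
    n_1, ..., n_l visited on the way to the leaf.
    """
    prob = 1.0
    for node_vector, go_left in path:
        p_left = sigmoid(sum(v * h for v, h in zip(node_vector, hidden)))
        prob *= p_left if go_left else 1.0 - p_left
    return prob

# Hypothetical tree: root n1 -> (leaf A | inner node n2), n2 -> (leaf B | leaf C).
v_n1, v_n2 = [1.0, 0.0], [0.0, 1.0]
hidden = [0.5, -0.5]
p_a = leaf_probability([(v_n1, True)], hidden)
p_b = leaf_probability([(v_n1, False), (v_n2, True)], hidden)
p_c = leaf_probability([(v_n1, False), (v_n2, False)], hidden)
print(p_a, p_b, p_c, p_a + p_b + p_c)
```

Because each inner node splits its probability mass between its two children, the leaf probabilities sum to one without any normalization over all classes.<br />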
<br />
In the below figure, we can see an example of a binary tree for the hierarchical softmax model. An example path from root node <math>n_1</math> to label 2 is highlighted in blue. In this case we can see that each path has an associated probability calculation and the total probability of label 2 is in line with the class probability calculation above.<br />
<br />
[[File:figure2.jpg|Binary Tree Example for the Hierarchical Softmax Model]]<br />
<br />
<br />
Note that during the training process, vector representations for inner nodes, <math>v_{n_i}</math>, are updated by introducing an error function, <math>Err</math>, and differentiating it with the chain rule to obtain the following:<br />
<div style="text-align: center;"> <math> \frac{\partial Err}{\partial v_{n_i}^{'}}=\frac{\partial Err}{\partial v_{n_i}^{'}h} \cdot \frac{\partial v_{n_i}^{'}h }{\partial v_{n_i}^{'}} </math> <br></div><br />
<br />
We can use this to update the vector values with the following:<br />
<br />
<div style="text-align: center;"> <math> v_{n_i}'(new) = v_{n_i}'(old) - \eta \frac{\partial Err}{\partial v_{n_i}'} </math><br></div><br />
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is an algorithm for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized such that all the elements of the vector add up to one (that is, each count is replaced by the word's relative frequency). If these frequencies exceed a certain level, they will activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information, since it considers single words only and is invariant to word order. This is demonstrated shortly. Bag of words will also have a high error rate if the training set does not cover the entire dictionary of the testing set. <br />
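<br />
A minimal sketch of this procedure is given below. The function names and the toy documents are hypothetical, and tokenization is simplified to lower-casing and splitting on whitespace.<br />

```python
from collections import Counter

def build_dictionary(train_docs, n):
    """Keep the n most frequent words of the training documents."""
    counts = Counter(word for doc in train_docs for word in doc.lower().split())
    return [word for word, _ in counts.most_common(n)]

def bow_vector(doc, dictionary):
    """Relative frequency of each dictionary word; other words are ignored."""
    counts = Counter(doc.lower().split())
    total = sum(counts[w] for w in dictionary)
    return [counts[w] / total if total else 0.0 for w in dictionary]

train = ["the cat sat", "the dog sat", "the cat ran"]
dictionary = build_dictionary(train, 3)
vec = bow_vector("the cat sat on the mat", dictionary)
print(dictionary, vec)
```

Words outside the dictionary ("on" and "mat" here) contribute nothing, which illustrates why a dictionary that misses test-set words hurts accuracy.<br />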
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation: it stores the n-1 words adjacent to each word (or characters, since N-grams can also be character based). Each word in the document is read one at a time, just like bag of words, but a certain range of its neighbors is scanned as well. This range is the n-gram. Any N greater than 1 contains more information than bag of words, which corresponds to N = 1 (the Unigram case). <br />
<br />
A Unigram (1-gram) is the simplest possible version of this model. It does not consider previous words and just chooses words at random based on how common they are in general. A Bigram (2-gram) considers the previous word, a Trigram considers the previous two words, and so on up to the N-gram, which considers the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) =P( \omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. A Bigram model approximates each factor by conditioning on the previous word only. For example, the sentence "How long can this go on?" is modeled as follows:<br />
<br />
P("How long can this go on?") = P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
In other words, in the Bigram case the product above reduces to a product of conditional probabilities that each depend only on the previous word: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
This generalizes to the N-gram case, where each word is conditioned on the previous N-1 words: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
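<br />
For the Bigram case, the conditional probabilities can be estimated from counts. The sketch below uses maximum-likelihood estimates on a tiny made-up corpus; the start symbol <code>&lt;s&gt;</code>, the function names and the corpus are assumptions for illustration, and no smoothing is applied.<br />

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and (previous word, word) pairs, with a start symbol."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams

def sentence_probability(sentence, contexts, bigrams):
    """Product over k of P(w_k | w_{k-1}), estimated as count(prev, w) / count(prev)."""
    tokens = ["<s>"] + sentence.lower().split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0  # unseen context: the unsmoothed estimate is zero
        prob *= bigrams[(prev, word)] / contexts[prev]
    return prob

corpus = ["how long can this go on", "how long is this", "how can this go on"]
contexts, bigrams = train_bigram(corpus)
p_seen = sentence_probability("how long can this go on", contexts, bigrams)
p_unseen = sentence_probability("how long on", contexts, bigrams)
print(p_seen, p_unseen)
```

Any sentence containing a bigram that never occurred in the corpus gets probability zero, which is one practical reason larger N needs far more data.<br />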
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide any useful predictive clues. For example, suppose you want the model to learn subject-verb agreement from the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
This requires an 11-gram, which is infeasible on ordinary machines. This leads to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
Consider the following example:<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice how A and B have the same vector. This is just like bag of words and the aforementioned problem: '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B now have distinct vectors, because the Bigram takes into account one adjacent word. However, A and C still share an element, namely "I love". Increasing N further would make it easier to tell the two apart. However, the cost of operating with a higher N is that the run time increases.<br />
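<br />
The comparison of A, B and C can be reproduced with a short helper. This is only a sketch: the function name is hypothetical and tokens are lower-cased before counting.<br />

```python
from collections import Counter

def ngram_counts(text, n):
    """Count the word n-grams of a text (lower-cased, split on whitespace)."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

A, B, C = "I love apple", "apple love I", "I love sentence"

same_unigrams = ngram_counts(A, 1) == ngram_counts(B, 1)
same_bigrams = ngram_counts(A, 2) == ngram_counts(B, 2)
shared_ac = set(ngram_counts(A, 2)) & set(ngram_counts(C, 2))
print(same_unigrams, same_bigrams, shared_ac)
```

With N = 1 the two sentences A and B are indistinguishable, while with N = 2 they differ and only "i love" is shared between A and C.<br />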
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification. It maps words to indices of an array or matrix of fixed size by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, reducing the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps keys of arbitrary size to values in a fixed range. For example, consider a hash function <math> h </math> that maps each feature to the index of the corresponding dictionary key.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(``\text{cats"}) = 3 </math>. Consider the sentence <math> `` \text{ I love cats, but Mary hate cats"} </math>, which we map to a hash table of length 7. After tokenizing it, we have a list of all words in that sentence, <math> x = [``I", ``love", ``cats", ``but", ``Mary", ``hate", ``cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
By applying hash function to each word of this sentence, we will get a list of returned indexes [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map will be <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
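<br />
The worked example above can be reproduced with the sketch below. The dictionary lookup plays the role of the toy hash function <math>h</math> from the table; <code>crc_bucket</code> is a hypothetical alternative based on the standard-library <code>zlib.crc32</code>, included only to show a hash that needs no dictionary.<br />

```python
import zlib

# The toy dictionary-based hash function h from the table above.
INDEX_OF = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 6}

def hashed_features(tokens, h, size):
    """phi_i(x) = number of tokens x_j with h(x_j) == i."""
    phi = [0] * size
    for token in tokens:
        phi[h(token)] += 1
    return phi

def crc_bucket(token, size=7):
    """Dictionary-free alternative: hash the token's bytes into one of `size` buckets."""
    return zlib.crc32(token.encode("utf-8")) % size

tokens = ["I", "love", "cats", "but", "Mary", "hate", "cats"]
phi = hashed_features(tokens, INDEX_OF.__getitem__, 7)
print(phi)  # [1, 1, 1, 2, 0, 1, 1]

phi_crc = hashed_features(tokens, crc_bucket, 7)
print(sum(phi_crc))  # 7: every token lands in exactly one bucket
```

With a real hash function the vector length is fixed in advance, so new words never enlarge the feature vector; they may, however, collide with existing ones.<br />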
<br />
A hash collision happens when two distinct keys are mapped to the same index. In the example above, suppose both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces another hash function <math> \xi </math> to determine the sign of each feature's contribution. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
Consider if <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, then our signed hash map now becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, collisions will "cancel out", and therefore, achieve an unbiased estimate.<br />
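<br />
The signed variant can be sketched in the same way. The colliding index table and the sign table below simply encode the assumptions of the example above ("Mary" and "I" both map to index 0, and only "Mary" has sign -1); the names are hypothetical.<br />

```python
# Toy hash with a collision: "Mary" and "I" both map to index 0.
COLLIDING_INDEX = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 0}
# Toy sign hash xi: -1 for "Mary", +1 for every other token.
SIGN = {"Mary": -1}

def h(token):
    return COLLIDING_INDEX[token]

def xi(token):
    return SIGN.get(token, 1)

def signed_hashed_features(tokens, h, xi, size):
    """phi_i(x) = sum of xi(x_j) over tokens x_j with h(x_j) == i."""
    phi = [0] * size
    for token in tokens:
        phi[h(token)] += xi(token)
    return phi

tokens = ["I", "love", "cats", "but", "Mary", "hate", "cats"]
phi_signed = signed_hashed_features(tokens, h, xi, 7)
print(phi_signed)  # [0, 1, 1, 2, 0, 1, 0]
```

The contributions of "I" and "Mary" cancel at index 0, matching the last table above.<br />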
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. Another way to represent the features is TFIDF, short for '''term frequency–inverse document frequency'''. It represents the importance of a word to a document.<br />
<br />
'''Term Frequency (TF)''' measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency, so that a word is not deemed important if it is common across documents in general, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the number of documents that contain the word <math> t </math>.<br />
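<br />
The two definitions above can be combined as in the following sketch. The function name and the three toy documents are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not specified here.<br />

```python
import math
from collections import Counter

def tfidf(term, doc_tokens, corpus):
    """tf(t, d) * idf(t, D), with tf the raw count and idf = log(N / document frequency)."""
    tf = Counter(doc_tokens)[term]
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0  # the term occurs nowhere, so idf is undefined; report 0
    return tf * math.log(len(corpus) / df)

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "and", "the", "dog"],
]
score_the = tfidf("the", corpus[2], corpus)
score_cat = tfidf("cat", corpus[0], corpus)
print(score_the, score_cat)
```

The word "the" appears in every document, so its idf (and hence its TFIDF score) is zero even though its raw count is the highest.<br />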
<br />
== Experiment ==<br />
In the experiments, fastText is compared with various other text classifiers on two classification problems. <br />
<br />
The first problem is sentiment analysis, where fastText is compared to existing text classifiers. The second is tag prediction, where fastText is evaluated on its ability to scale to a much larger output space.<br />
<br />
The Vowpal Wabbit library, written in C++, could also be used to implement the model. However, the authors report that their tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
* Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. To test it, the evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images, each with a caption, title and tags. The focus was on using the fastText classifier to predict the tags associated with each image without using the image itself, relying instead on the text associated with the image, namely its title and caption.<br />
<br />
The data was prepared by removing the words and tags that occur fewer than 100 times and splitting it into train, validation and test sets. The train set consists of approximately 90% of the dataset, the validation set of approximately 1%, and the test set of 0.5%. Table 4 shows 5 items from the validation set with their Input (title and caption), Prediction (the tag they are classified to) and Tags (real image tags), highlighting the cases where the predicted tag is in fact one of the real tags. <br />
<br />
{| class="wikitable"<br />
|-<br />
! Input<br />
! Prediction<br />
! Tags<br />
|-<br />
| taiyoucon 2011 digitals: individuals digital photos from the anime convention taiyoucon 2011 in mesa, arizona. if you know the model and/or the character, please comment.<br />
| #cosplay<br />
| #24mm #anime #animeconvention #arizona #canon #con #convention #cos '''#cosplay''' #costume #mesa #play #taiyou #taiyoucon<br />
|-<br />
| 2012 twin cities pride 2012 twin cities pride parade<br />
| #minneapolis <br />
| #2012twincitiesprideparade '''#minneapolis''' #mn #usa<br />
|-<br />
| beagle enjoys the snowfall<br />
| #snow<br />
| #2007 #beagle #hillsboro #january #maddison #maddy #oregon '''#snow'''<br />
|-<br />
| christmas <br />
| #christmas <br />
| #cameraphone #mobile<br />
|-<br />
| euclid avenue <br />
| #newyorkcity <br />
| #cleveland #euclidavenue<br />
|}<br />
<br />
<br />
<br />
Removing infrequently occurring tags and words eliminates some noise and helps the model learn better. A [https://github.com/facebookresearch/fastText script] has been released that recreates this split of the data. After cleaning, the vocabulary size is 297,141 and there are 312,116 unique tags. Results on the test set are reported as precision at 1 (Prec@1), the fraction of items whose top-ranked predicted tag is one of their real tags (for the general definition of precision, see [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia]).<br />
<br />
==== Baselines====<br />
A frequency-based approach which simply predicts the most frequent tag is used as a baseline.<br />
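<br />
Such a baseline takes only a few lines. The sketch below is illustrative: the function name and the tiny training set are made up, and ties between equally frequent tags are broken arbitrarily by <code>Counter.most_common</code>.<br />

```python
from collections import Counter

def most_frequent_tag(train_tag_sets):
    """Return the tag that occurs in the largest number of training items."""
    counts = Counter(tag for tags in train_tag_sets for tag in set(tags))
    return counts.most_common(1)[0][0]

# Hypothetical training items, each with its set of tags.
train_tag_sets = [
    ["#snow", "#2007"],
    ["#snow", "#beagle"],
    ["#cosplay", "#anime"],
]
baseline_tag = most_frequent_tag(train_tag_sets)
print(baseline_tag)  # "#snow": predicted for every test item, whatever its text
```

Because the prediction ignores the input entirely, this baseline sets a floor that any text-based model should exceed.<br />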
<br />
The model fastText is compared to in tag prediction is [http://www.aclweb.org/anthology/D14-1194 Tagspace]. It is similar to fastText but is based on [http://www.thespermwhale.com/jaseweston/papers/wsabie-ijcai.pdf WSABIE], which predicts tags based on both images and their text annotations. The Tagspace model is a convolutional neural network that predicts hashtags from the semantic context of social network posts. For a faster and more comparable baseline, the experiment uses a linear version of this model. Tagspace can also predict multiple hashtags, so this experiment does not fully exercise the capabilities of the Tagspace model.<br />
<br />
==== Results ====<br />
<br />
{| class="wikitable"<br />
|-<br />
! Model <br />
! prec@1<br />
! Running time - Train<br />
! Running time - Test<br />
|- <br />
| Freq. baseline <br />
| 2.2 <br />
| - <br />
| -<br />
|- <br />
| Tagspace, h = 50 <br />
| 30.1 <br />
| 3h8 <br />
| 6h<br />
|- <br />
| Tagspace, h = 200 <br />
| 35.6 <br />
| 5h32 <br />
| 15h<br />
|- <br />
| fastText, h = 50 <br />
| 30.8 <br />
| 6m40 <br />
| 48s<br />
|- <br />
| fastText, h = 50, bigram <br />
| 35.6 <br />
| 7m47 <br />
| 50s<br />
|- <br />
| fastText, h = 200 <br />
| 40.7 <br />
| 10m34 <br />
| 1m29<br />
|-<br />
| fastText, h = 200, bigram <br />
| 45.1 <br />
| 13m38 <br />
| 1m37<br />
|}<br />
<br />
<br />
<br />
The above table compares fastText to the baselines. The "Precision-at-one" (Prec@1) metric reports the proportion of items for which the highest-ranking tag predicted by the model is in fact one of the item's real tags.<br />
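<br />
As a small worked example of the metric, the sketch below computes Prec@1 for the five validation items shown in the Dataset section. The function name is hypothetical and the true tag sets are abbreviated to a few tags per item; the predictions are those listed in that table.<br />

```python
def precision_at_one(predictions, true_tag_sets):
    """Fraction of items whose top-ranked predicted tag is among the true tags."""
    hits = sum(1 for pred, tags in zip(predictions, true_tag_sets) if pred in tags)
    return hits / len(predictions)

# Top-ranked predictions for the five example items.
predictions = ["#cosplay", "#minneapolis", "#snow", "#christmas", "#newyorkcity"]
# True tags of the same items (abbreviated).
true_tag_sets = [
    {"#anime", "#cosplay", "#taiyoucon"},
    {"#2012twincitiesprideparade", "#minneapolis", "#mn", "#usa"},
    {"#beagle", "#oregon", "#snow"},
    {"#cameraphone", "#mobile"},
    {"#cleveland", "#euclidavenue"},
]
score = precision_at_one(predictions, true_tag_sets)
print(score)  # 0.6: three of the five predictions are real tags
```

The values in the results table are the same quantity, expressed as a percentage over the whole test set.<br />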
<br />
<br />
The frequency baseline has the lowest accuracy. fastText is run for 5 epochs and compared to Tagspace, with and without bigrams, for hidden layers of size 50 and 200. With 50 hidden units, fastText (without bigrams) and Tagspace perform similarly, while fastText with bigrams at 50 hidden units reaches the same Prec@1 in the table above as Tagspace at 200. With 200 hidden units fastText performs better still, and adding bigrams improves accuracy further. <br />
<br />
<br />
Finally, both the train and test times of fastText are significantly shorter than those of Tagspace, which must compute the scores for each class. fastText, on the other hand, has fast inference, which gives it an advantage when the number of classes is large. The difference is largest at test time: seconds to minutes for fastText versus hours for Tagspace.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, and the averaged representation is then fed to a linear classifier. The authors test fastText against other models using an evaluation protocol similar to Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction data set. fastText achieved comparable accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35274Bag of Tricks for Efficient Text Classification2018-03-22T20:18:05Z<p>Cs3yang: /* Baselines */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have been utilized more recently for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their usage for very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data sets while maintaining good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to the two tasks: tag predictions and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
<br />
=== Natural-Language Processing and Text Classification ===<br />
[https://en.wikipedia.org/wiki/Natural-language_processing Natural Language Processing] (NLP) is concerned with being able to process large amounts of natural language data, involving speech recognition, natural language understanding, and natural-language generation. Text understanding involves being able to understand the explicit or implicit meaning of elements of text such as words, phrases, sentences, and paragraphs, and making inferences about these properties of texts (Norvig, 1987). One of the main topics in NLP is text classification, which is assigning predefined categories to free-text documents, with research ranging in this field from designing the best features to choosing the best machine learning classifiers (Zhang et al. 2016).<br />
<br />
Traditionally, techniques for text classification are based on simple statistics on words, such as Bag of Words.<br />
<br />
With the advancement of deep learning and the availability of large data sets, methods of handling text understanding using deep learning techniques have become popular in recent years. These deep learning models have been shown in several studies to perform significantly better than traditional models such as N-gram models.<br />
<br />
The following are the deep learning models that are compared to fastText in the Experiment Section:<br />
<br />
=== Char-CNN ===<br />
<br />
<br />
=== Char-CRNN ===<br />
<br />
A character-level convolutional recurrent neural network (CRNN) is a convolutional neural network (CNN) with a recurrent layer on top. A recurrent layer can efficiently capture long-term dependencies, so only a very small number of convolutional layers is needed.<br />
<br />
The recurrent layer consists of a recursive function <math>f</math> which takes as input one input vector and the previous hidden state, and returns the new hidden state <math> h_t = f(x_t, h_{t-1})</math>, where <math> x_t \in \mathbb{R}^d</math> is one time step from the input sequence <math> (x_1, x_2, \ldots, x_T)</math>. <br />
<br />
[[File:xiao and cho 2016 pic.png]]<br />
Graphical illustration of (a) a CNN and (b) proposed CRNN for character-level document classification <br />
<br />
Source: [Xiao and Cho2016] Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv: 1602.00367.<br />
<br />
=== VDCNN ===<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly when the output space is large and some classes have very few examples. The authors attempt to improve on the performance of linear classifiers with two key features: a rank constraint and a fast loss approximation. Before getting into that, we must better understand how linear classifiers are trained.<br />
<br />
Consider each text and each label as a vector in space. The model trains the coordinates of the text vector so that it is close to the vector of its associated label. The text vector and a label vector are combined into a score, and the softmax function normalizes that score against the scores of the same text with every other possible label. The result is the probability that the text has its associated label. Stochastic gradient descent is then used to keep updating the coordinates until the probability of the correct label for every text is maximized. This is clearly computationally expensive, as the score for every possible label in the training set must be computed for each text. <br />
<br />
The softmax function gives the probability that a text is associated with label j, given K labels in the training set: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label j.<br />
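<br />
The softmax formula above can be sketched in a few lines of Python. This is only an illustration: the text vector, the label vectors, and the function name <code>softmax_probs</code> are hypothetical and not taken from the paper.<br />
<br />
```python
import math

def softmax_probs(x, label_vectors):
    """Return P(y = j | x) for every label j, given the text vector x."""
    # Score of each label: the dot product x^T w_j.
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in label_vectors]
    # Subtract the largest score before exponentiating for numerical stability;
    # this does not change the resulting probabilities.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-dimensional text vector and K = 3 label vectors.
x = [1.0, 2.0]
W = [[0.5, 0.1], [0.2, 0.4], [-0.3, 0.0]]
probs = softmax_probs(x, W)
```
<br />
The list comprehension visits all K label vectors, which is the cost that becomes a bottleneck when the number of labels is large.<br />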
<br />
Now let’s look at the improved method of linear classifiers with a rank constraint and fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the ngram features of the input are first looked up to find word representations, then averaged into hidden text representation. It is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log likelihood over the classes. The classifier trains on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
<br />
[[File:formula explained.png]]<br />
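<br />
The forward pass and loss described above can be sketched with NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the matrix names <code>A</code> and <code>B</code>, the sizes, and the toy documents are hypothetical, and it uses a full softmax with no hashing and no training loop.<br />
<br />
```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim, num_classes = 10, 4, 3    # hypothetical sizes

A = rng.normal(size=(vocab_size, hidden_dim))     # look-up table of word / n-gram vectors
B = rng.normal(size=(hidden_dim, num_classes))    # weights of the linear classifier

def predict_proba(feature_ids):
    """Average the feature vectors, apply the linear classifier, then softmax."""
    hidden = A[feature_ids].mean(axis=0)          # hidden text representation
    scores = hidden @ B
    scores = scores - scores.max()                # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

def negative_log_likelihood(docs, labels):
    """Mean negative log likelihood of the correct class over N documents."""
    return -np.mean([np.log(predict_proba(d)[y]) for d, y in zip(docs, labels)])

# Toy documents given as lists of feature ids, with their class labels.
docs = [[0, 3, 5], [1, 2], [4, 7, 8, 9]]
labels = [0, 2, 1]
loss = negative_log_likelihood(docs, labels)
```
<br />
Training would minimize <code>loss</code> with respect to <code>A</code> and <code>B</code> by stochastic gradient descent; that step is omitted here.<br />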
<br />
Two changes that were applied in this model architecture are the hierarchical softmax function, which improves performance with a large number of classes, and the hashing trick to manage mappings of n-grams to local word order. These two nuances will be more thoroughly explained in the upcoming sections.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over predefined classes. It calculates the probability of <math>y_n</math> being each possible class and outputs probability estimates for each class. The denominator of the softmax function serves to normalize the probabilities, so that a single <math>y_n</math> receives a probability for each label and these probabilities sum to one. This provides a means to choose the label with the highest probability as the label for <math>y_n</math>. <br />
<br />
However, the softmax function does have a computational complexity of <math> O(Kd) </math> where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is due to the nature of the softmax function since each function calculation requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the differences in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose we have a binary tree structure based on Huffman coding for the softmax function, where each node has at most two children or leaves. Huffman coding trees provide a means to optimize binary trees where the classes with lowest frequencies are placed deeper into the tree and the highest frequency classes are placed near the root of the tree, which minimizes the path for more frequently labelled classes. Huffman coding trees are efficient since computational runtime is reduced to <math> O(d \log_2(K)) </math>. <br />
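<br />
The effect of Huffman coding on tree depth can be illustrated with a short sketch. It builds a Huffman tree with Python's <code>heapq</code> and reports the depth of each label; the tag frequencies and the function name <code>huffman_depths</code> are hypothetical and chosen only for illustration.<br />
<br />
```python
import heapq
import itertools

def huffman_depths(frequencies):
    """Return {label: depth} for a Huffman tree built from label frequencies."""
    counter = itertools.count()   # unique tie-breaker so the heap never compares dicts
    heap = [(freq, next(counter), {label: 0}) for label, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every label inside them moves one level deeper.
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {label: depth + 1 for label, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]

# Hypothetical tag frequencies.
freqs = {"#snow": 50, "#christmas": 30, "#beagle": 15, "#mesa": 5}
depths = huffman_depths(freqs)
```
<br />
With these frequencies the most frequent tag ends up directly below the root, while the two rarest tags sit deepest in the tree.<br />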
<br />
A probability for each path, whether we are travelling right or left from a node, is calculated. This is done by applying the sigmoid function to the product of the output vector <math>v_{n_i}</math> of each inner node <math> n </math> and the output value of the hidden layer of the model, <math>h</math>. The idea of this method is to represent the output classes as the leaves on this tree and a random walk then assigns probabilities for these classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> represents the leaf node that a class is located on, with depth <math> l+1 </math>, and <math> n_1, n_2, \ldots, n_l </math> represent the parent nodes of that leaf </i><br />
<br />
In the below figure, we can see an example of a binary tree for the hierarchical softmax model. An example path from root node <math>n_1</math> to label 2 is highlighted in blue. In this case we can see that each path has an associated probability calculation and the total probability of label 2 is in line with the class probability calculation above.<br />
<br />
[[File:figure2.jpg|Binary Tree Example for the Hierarchical Softmax Model]]<br />
<br />
<br />
Note that during the training process, the vector representations of the inner nodes, <math>v_{n_i}</math>, are updated by introducing an error function, <math>Err</math>, and differentiating it to obtain the following:<br />
<div style="text-align: center;"> <math> \frac{\partial Err}{\partial v_{n_i}^{'}}=\frac{\partial Err}{\partial v_{n_i}^{'}h} \cdot \frac{\partial v_{n_i}^{'}h }{\partial v_{n_i}^{'}} </math> <br></div><br />
<br />
We can use this to update the vector values with the following:<br />
<br />
<div style="text-align: center;"> <math> v_{n_i}'(new) = v_{n_i}'(old) - \eta \frac{\partial Err}{\partial v_{n_i}'} </math><br></div><br />
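<br />
The path-product calculation of a class probability can be sketched as follows. This is an illustration under assumptions: it uses a small balanced tree with four labels rather than a Huffman tree, random inner-node vectors, and a hypothetical <code>paths</code> table that encodes a left turn as +1 and a right turn as -1.<br />
<br />
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 5
h = rng.normal(size=d)                     # output of the hidden layer

# Hypothetical tree: inner node 0 is the root, its children are inner nodes 1 and 2.
# Inner node 1 holds labels 0 and 1; inner node 2 holds labels 2 and 3.
inner_vectors = rng.normal(size=(3, d))    # one vector v_n per inner node
paths = {
    0: [(0, +1), (1, +1)],
    1: [(0, +1), (1, -1)],
    2: [(0, -1), (2, +1)],
    3: [(0, -1), (2, -1)],
}

def label_probability(label):
    """Multiply the branch probabilities along the path from the root to the leaf."""
    p = 1.0
    for node, direction in paths[label]:
        # sigmoid(-z) = 1 - sigmoid(z), so the two branches of a node sum to one.
        p *= sigmoid(direction * (inner_vectors[node] @ h))
    return p

total = sum(label_probability(k) for k in paths)
```
<br />
Each label needs only two sigmoid evaluations here, one per level of the tree, instead of one score per label as in the full softmax, and the leaf probabilities still sum to one.<br />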
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words ===<br />
<br />
'''Bag of words''' is an algorithm for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized such that all the elements of the vector add up to one (taking the relative frequency of each word). If these frequencies exceed a certain level, they will activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information because it treats each word in isolation and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the training set does not include the entire dictionary of the testing set. <br />
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation by storing the local words adjacent to the initial word (or character; N-grams can be character based). Each word in the document is read one at a time just like bag of words, however a certain range of its neighbors is scanned as well. This range is known as the n-gram. Compared to bag of words, any N over 1 (N = 1 is known as a unigram) will contain more information than bag of words. <br />
<br />
The picture above gives an example of a unigram (1-gram), which is the simplest possible version of this model. A unigram does not consider previous words and just chooses words based on how common they are in general. It also shows a bigram (2-gram), where the previous word is considered. A trigram considers the previous two words, and so on, up to an N-gram, which considers the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. A bigram model approximates each factor by conditioning on the previous word only. For example, take the sentence "How long can this go on?" A bigram model gives:<br />
<br />
P("How long can this go on?") ≈ P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
In general, for the bigram case the chain rule reduces to the following product of conditional probabilities: <br />
<br />
<math>P(\omega_1^n) \approx \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case, which conditions on the previous N-1 words: <br />
<br />
<math>P(\omega_1^n) \approx \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
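<br />
A bigram model of this kind can be estimated from counts. The sketch below uses maximum-likelihood estimates on a tiny hypothetical corpus; the start token <code>&lt;s&gt;</code>, the corpus, and the function name are assumptions made for illustration, and punctuation is dropped for simplicity.<br />
<br />
```python
from collections import Counter

# Hypothetical toy corpus of tokenized sentences.
corpus = [
    ["how", "long", "can", "this", "go", "on"],
    ["how", "long", "is", "this"],
    ["this", "can", "go", "on"],
]

START = "<s>"                     # assumed start-of-sentence token
context_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = [START] + sentence
    context_counts.update(tokens[:-1])             # how often each word is a context
    bigram_counts.update(zip(tokens, tokens[1:]))  # how often each word pair occurs

def bigram_probability(sentence):
    """Product of P(word | previous word), estimated by relative frequency."""
    tokens = [START] + sentence
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0                             # unseen context: no estimate available
        p *= bigram_counts[(prev, word)] / context_counts[prev]
    return p

p = bigram_probability(["how", "long", "can", "this", "go", "on"])
```
<br />
On this corpus the sentence receives probability 1/12, because three of its bigrams each have a conditional probability of 1/2 and the first word follows the start token in two of three sentences.<br />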
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide any useful predictive clues. For example, suppose you want the model to learn subject-verb number agreement from the following sentences:<br />
<br />
 The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
 The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
You would need an 11-gram, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The following example illustrates the difference between these representations.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice how A and B are the same vector. This is the same as bag of words, and it shows the aforementioned problem: '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because a bigram takes one neighboring word into consideration. However, A and C still share an element, namely "I love". If we were to further increase N, it would be easier to distinguish the two. However, the consequence of operating with higher-order N-grams is that the run time increases.<br />
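<br />
The unigram and bigram counts for sentences A, B and C can be reproduced with a short sketch. The helper name <code>ngram_counts</code> is hypothetical, and tokenization is a plain whitespace split.<br />
<br />
```python
from collections import Counter

def ngram_counts(sentence, n):
    """Count the n-grams of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

sentences = {"A": "I love apple", "B": "apple love I", "C": "I love sentence"}
unigrams = {name: ngram_counts(text, 1) for name, text in sentences.items()}
bigrams = {name: ngram_counts(text, 2) for name, text in sentences.items()}

same_unigrams = unigrams["A"] == unigrams["B"]    # word order is ignored
same_bigrams = bigrams["A"] == bigrams["B"]       # word order now matters
shared_a_c = set(bigrams["A"]) & set(bigrams["C"])
```
<br />
Here <code>same_unigrams</code> is true and <code>same_bigrams</code> is false, while <code>shared_a_c</code> contains only the bigram "I love".<br />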
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification: it maps words to indices of an array or matrix of fixed size by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, reducing the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps an arbitrary array of keys to a list of fixed size. For example, consider a hash function <math> h </math> that maps features to the value of corresponding dictionary key.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(``\text{cats"}) = 3 </math>. Consider the sentence <math> `` \text{ I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have a list of all words in that sentence <math> x = [``I", ``love", ``cats", ``but", ``Mary", ``hate", ``cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
By applying hash function to each word of this sentence, we will get a list of returned indexes [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map will be <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. For instance, suppose that in the example above both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces another hash function <math> \xi </math> to determine the sign of the value added at the returned index. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
Consider if <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, then our signed hash map now becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, collisions will "cancel out", and therefore, achieve an unbiased estimate.<br />
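<br />
The three hashed feature maps above can be reproduced with a small sketch. The dictionary stands in for the hash function <math>h</math>, as in the example, and the collision and sign functions are hard-coded assumptions rather than real hash functions.<br />
<br />
```python
index = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 6}

def hashed_features(tokens, h, size, sign=None):
    """Accumulate (optionally signed) counts at the index each token hashes to."""
    phi = [0] * size
    for token in tokens:
        phi[h(token)] += sign(token) if sign else 1
    return phi

x = ["I", "love", "cats", "but", "Mary", "hate", "cats"]

# No collision: every word has its own index.
plain = hashed_features(x, index.get, 7)

# Assumed collision: "Mary" is mapped to index 0, the same index as "I".
def colliding_h(token):
    return 0 if token == "Mary" else index[token]

collided = hashed_features(x, colliding_h, 7)

# Assumed sign function: xi("Mary") = -1, every other word gets +1.
def xi(token):
    return -1 if token == "Mary" else 1

signed = hashed_features(x, colliding_h, 7, sign=xi)
```
<br />
In practice <math>h</math> would be a real hash function reduced modulo the table size (for example <code>zlib.crc32</code> of the encoded token), so that no dictionary of the vocabulary has to be stored.<br />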
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. However, another way to represent the features is TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to the document.<br />
<br />
'''Term Frequency(TF)''' generally measures the times that a word occurs in a document. An '''Inverse Document Frequency(IDF)''' can be considered as an adjustment to the term frequency such that a word won't be deemed as important if that word is a generally common word, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the total number of documents that contain word <math> t </math>.<br />
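<br />
The two definitions above translate directly into code. The following sketch uses a hypothetical three-document corpus and assumes that every queried term occurs in at least one document, so the denominator is never zero.<br />
<br />
```python
import math
from collections import Counter

# Hypothetical toy corpus of tokenized documents.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "the"],
    ["the", "cat", "ran"],
]

N = len(docs)
# Number of documents that contain each term at least once.
document_frequency = Counter(term for doc in docs for term in set(doc))

def tfidf(term, doc):
    tf = doc.count(term)                            # raw count of the term in the document
    idf = math.log(N / document_frequency[term])    # log(N / number of documents containing it)
    return tf * idf

score_the = tfidf("the", docs[1])
score_dog = tfidf("dog", docs[1])
```
<br />
The word "the" occurs in every document, so its inverse document frequency is log(1) = 0 and its score is zero even though it appears twice; "dog" occurs in only one document and scores log(3).<br />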
<br />
== Experiment ==<br />
In this experiment fastText was compared on two classification problems with various other text classifiers. <br />
<br />
The first classification problem is sentiment analysis, where fastText is compared to existing text classifiers. The second evaluates how fastText scales to a larger output space on a tag prediction dataset.<br />
<br />
The Vowpal Wabbit library, written in C++, can also be used to implement the model. However, compared to this library, the authors' tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
*Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. In order to test that, evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images, each with captions, titles and tags. Here, the spotlight was on using the fastText text classifier to predict the tags associated with each image without actually using the image itself, but rather the information associated with the image, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur less than 100 times and split the data into a train, validation and test set. The train set consists of approximately 90% of the dataset, the validation set consists of approximately 1%, and the test set of 0.5%. Table 4 shows an example of 5 items in the validation set with their Inputs (title and caption), Prediction (the tag class they are classified to) and Tags (real image tags), highlighting when the predicted class is in fact one of the tags. <br />
<br />
{| class="wikitable"<br />
|-<br />
! Input<br />
! Prediction<br />
! Tags<br />
|-<br />
| taiyoucon 2011 digitals: individuals digital photos from the anime convention taiyoucon 2011 in mesa, arizona. if you know the model and/or the character, please comment.<br />
| #cosplay<br />
| #24mm #anime #animeconvention #arizona #canon #con #convention #cos '''#cosplay''' #costume #mesa #play #taiyou #taiyoucon<br />
|-<br />
| 2012 twin cities pride 2012 twin cities pride parade<br />
| #minneapolis <br />
| #2012twincitiesprideparade '''#minneapolis''' #mn #usa<br />
|-<br />
| beagle enjoys the snowfall<br />
| #snow<br />
| #2007 #beagle #hillsboro #january #maddison #maddy #oregon '''#snow'''<br />
|-<br />
| christmas <br />
| #christmas <br />
| #cameraphone #mobile<br />
|-<br />
| euclid avenue <br />
| #newyorkcity <br />
| #cleveland #euclidavenue<br />
|}<br />
<br />
<br />
<br />
Removing infrequently occurring tags and words eliminates some noise and helps the model learn better. A [https://github.com/facebookresearch/fastText script] has been released that documents how the data is split. After cleaning, the vocabulary size is 297,141 and there are 312,116 unique tags. Performance on the test set is reported as precision at one (Prec@1), the proportion of items for which the top-ranked predicted tag is in fact one of the item's real tags (see also [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia]).<br />
<br />
==== Baselines====<br />
A frequency-based approach which simply predicts the most frequent tag is used as a baseline.<br />
<br />
The model that fastText is compared to in tag prediction is [http://www.aclweb.org/anthology/D14-1194 Tagspace]. It is similar to fastText but is based on [http://www.thespermwhale.com/jaseweston/papers/wsabie-ijcai.pdf WSABIE], which predicts tags based on both images and their text annotations. The Tagspace model is a convolutional neural network that predicts hashtags from the semantic context of social network posts. For faster and more comparable performance, the experiment uses a linear version of this model. Tagspace also predicts multiple hashtags, so this experiment does not fully compare the capabilities of the Tagspace model.<br />
<br />
==== Results ====<br />
<br />
{| class="wikitable"<br />
|-<br />
! Model <br />
! prec@1<br />
! Running time - Train<br />
! Running time - Test<br />
|- <br />
| Freq. baseline <br />
| 2.2 <br />
| - <br />
| -<br />
|- <br />
| Tagspace, h = 50 <br />
| 30.1 <br />
| 3h8 <br />
| 6h<br />
|- <br />
| Tagspace, h = 200 <br />
| 35.6 <br />
| 5h32 <br />
| 15h<br />
|- <br />
| fastText, h = 50 <br />
| 30.8 <br />
| 6m40 <br />
| 48s<br />
|- <br />
| fastText, h = 50, bigram <br />
| 35.6 <br />
| 7m47 <br />
| 50s<br />
|- <br />
| fastText, h = 200 <br />
| 40.7 <br />
| 10m34 <br />
| 1m29<br />
|-<br />
| fastText, h = 200, bigram <br />
| 45.1 <br />
| 13m38 <br />
| 1m37<br />
|}<br />
<br />
<br />
<br />
The above table presents a comparison of the fastText model to other baselines. Precision-at-one (Prec@1) is the proportion of items for which the highest-ranking tag that the model predicts is in fact one of the real tags for the item. <br />
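<br />
As an illustration of the metric, the sketch below computes Prec@1 for the five validation items shown in the Dataset section. The tag sets of the first and third items are abbreviated, and the function name <code>precision_at_one</code> is hypothetical.<br />
<br />
```python
def precision_at_one(predictions, true_tags):
    """Fraction of items whose top predicted tag is among the item's real tags."""
    hits = sum(1 for pred, tags in zip(predictions, true_tags) if pred in tags)
    return hits / len(predictions)

# Top prediction and real tags for the five example items (some tag sets abbreviated).
predictions = ["#cosplay", "#minneapolis", "#snow", "#christmas", "#newyorkcity"]
true_tags = [
    {"#anime", "#cosplay", "#costume"},
    {"#2012twincitiesprideparade", "#minneapolis", "#mn", "#usa"},
    {"#2007", "#beagle", "#snow"},
    {"#cameraphone", "#mobile"},
    {"#cleveland", "#euclidavenue"},
]
score = precision_at_one(predictions, true_tags)
```
<br />
Three of the five predictions are among the real tags, so Prec@1 on this small sample is 0.6. The values in the table are the same quantity expressed as a percentage over the full test set.<br />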
<br />
<br />
We can see that the frequency baseline has the lowest accuracy. fastText is run for 5 epochs and compared to Tagspace for two sizes of the hidden layer, 50 and 200 hidden units, with and without bigrams. With 50 hidden units, fastText without bigrams and Tagspace perform similarly (30.8 vs. 30.1), while fastText with bigrams at 50 hidden units reaches the same Prec@1 as Tagspace at 200 hidden units (35.6 in the table above). With 200 hidden units, fastText performs better still (40.7), and adding bigrams improves accuracy further (45.1).<br />
<br />
<br />
<br />
Finally, both the training and test times of fastText are far shorter than those of Tagspace. The gap is largest at test time (about a minute versus several hours in the table above): Tagspace must compute the scores for each class, whereas fastText has fast inference, which gives it an advantage when the number of classes is large.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, which are then fed to a linear classifier. The authors test their model, fastText, against other models using an evaluation protocol similar to Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction data set. fastText achieved comparable results in terms of accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35273Bag of Tricks for Efficient Text Classification2018-03-22T20:17:38Z<p>Cs3yang: /* Baselines */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have been utilized more recently for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their usage for very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data sets while maintaining good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to the two tasks: tag predictions and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
<br />
=== Natural-Language Processing and Text Classification ===<br />
[https://en.wikipedia.org/wiki/Natural-language_processing Natural Language Processing] (NLP) is concerned with being able to process large amounts of natural language data, involving speech recognition, natural language understanding, and natural-language generation. Text understanding involves being able to understand the explicit or implicit meaning of elements of text such as words, phrases, sentences, and paragraphs, and making inferences about these properties of texts (Norvig, 1987). One of the main topics in NLP is text classification, which is assigning predefined categories to free-text documents, with research ranging in this field from designing the best features to choosing the best machine learning classifiers (Zhang et al. 2016).<br />
<br />
Traditionally, techniques for text classification are based on simple statistics on words, such as Bag of Words.<br />
<br />
With the advancement of deep learning and the availability of large data sets, methods of handling text understanding using deep learning techniques have become popular in recent years. These deep learning models have been shown in several studies to perform significantly better than traditional models such as N-gram models.<br />
<br />
The following are the deep learning models that are compared to fastText in the Experiment Section:<br />
<br />
=== Char-CNN ===<br />
<br />
<br />
=== Char-CRNN ===<br />
<br />
A character-level convolutional recurrent neural network (CRNN) is a convolutional neural network (CNN) with a recurrent layer on top. A recurrent layer can efficiently capture long-term dependencies, so only a very small number of convolutional layers is needed.<br />
<br />
The recurrent layer consists of a recursive function <math>f</math> which takes as input one input vector and the previous hidden state, and returns the new hidden state <math> h_t = f(x_t, h_{t-1})</math>, where <math> x_t \in \mathbb{R}^d</math> is one time step from the input sequence <math> (x_1, x_2, \ldots, x_T)</math>. <br />
<br />
[[File:xiao and cho 2016 pic.png]]<br />
Graphical illustration of (a) a CNN and (b) proposed CRNN for character-level document classification <br />
<br />
Source: [Xiao and Cho2016] Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv: 1602.00367.<br />
<br />
=== VDCNN ===<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly when the output space is large and some classes have very few examples. The authors attempt to improve on the performance of linear classifiers with two key features: a rank constraint and a fast loss approximation. Before getting into that, we must better understand how linear classifiers are trained.<br />
<br />
Consider each text and each label as a vector in space. The model trains the coordinates of the text vector so that the text vector is close to the vector of its associated label. The dot product of the text vector and a label vector gives a score, which the softmax function normalizes against the scores of that same text with every other possible label. The result is the probability that the text has its associated label. Stochastic gradient descent is then used to keep updating the coordinates until the probability of the correct label for every text is maximized. This is clearly computationally expensive, as the score for every possible label in the training set must be computed for each text. <br />
<br />
The softmax probability that a text is associated with label <math>j</math>, given <math>K</math> labels in the training set, is: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label <math>j</math>.<br />
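<br />
As a minimal sketch of the formula above, the following Python function computes the softmax probabilities for one text vector against all label vectors. The toy vectors are made up for illustration, and subtracting the maximum score is a standard numerical-stability step that does not change the result.<br />
<br />
```python
import math


def softmax_probs(x, label_vectors):
    """Return P(y = j | x) for every label j, given text vector x and label vectors w_j."""
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in label_vectors]
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical example with K = 3 labels in a 2-dimensional space.
x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
probs = softmax_probs(x, W)
predicted_label = probs.index(max(probs))
```

Note that every one of the <math>K</math> scores has to be computed to obtain a single probability, which is the cost that the hierarchical softmax discussed later avoids.<br />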
<br />
Now let’s look at the improved method: a linear classifier with a rank constraint and a fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the n-gram features of the input are first looked up to find word representations, which are then averaged into a hidden text representation. This representation is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log likelihood over the classes. The classifier trains on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
<br />
[[File:formula explained.png]]<br />
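<br />
The forward pass described above can be sketched as follows. This is an illustrative toy, not the fastText implementation: <code>A</code> is assumed to be the lookup table of feature embeddings and <code>B</code> the weight matrix of the linear classifier, and all values are invented.<br />
<br />
```python
import math


def average_embeddings(feature_ids, A):
    """Look up each n-gram feature in A and average the rows into a hidden vector."""
    dim = len(A[0])
    hidden = [0.0] * dim
    for fid in feature_ids:
        for d in range(dim):
            hidden[d] += A[fid][d]
    return [value / len(feature_ids) for value in hidden]


def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(feature_ids, A, B):
    """Hidden text representation -> linear classifier -> softmax over classes."""
    hidden = average_embeddings(feature_ids, A)
    scores = [sum(b * h for b, h in zip(row, hidden)) for row in B]
    return softmax(scores)


def negative_log_likelihood(docs, labels, A, B):
    """Average negative log likelihood of the correct classes over N documents."""
    total = 0.0
    for feature_ids, y in zip(docs, labels):
        total -= math.log(predict_proba(feature_ids, A, B)[y])
    return total / len(docs)


# Hypothetical toy model: 3 features, hidden dimension 2, 2 classes.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
docs, labels = [[0, 2], [1, 2]], [0, 1]
loss = negative_log_likelihood(docs, labels, A, B)
```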
<br />
Two changes applied in this model architecture are the hierarchical softmax function, which speeds up computation when the number of classes is large, and the hashing trick, which keeps the mapping of n-grams (used to capture local word order) fast and memory efficient. These two techniques are explained more thoroughly in the upcoming sections.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over the predefined classes. For each input <math>y_n</math>, it outputs a probability estimate for every possible class. The denominator of the softmax function normalizes these estimates so that, for a single <math>y_n</math>, the probabilities over all labels sum to one. The label with the highest probability can then be chosen as the label for <math>y_n</math>. <br />
<br />
However, the softmax function does have a computational complexity of <math> O(Kd) </math> where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is due to the nature of the softmax function since each function calculation requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the differences in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose we have a binary tree structure based on Huffman coding for the softmax function, where each node has at most two children or leaves. Huffman coding trees provide a means to optimize binary trees where the classes with lowest frequencies are placed deeper into the tree and the highest frequency classes are placed near the root of the tree, which minimizes the path for more frequently labelled classes. Huffman coding trees are efficient since computational runtime is reduced to <math> O(d \log_2(K)) </math>. <br />
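<br />
To illustrate how a Huffman tree places frequent classes near the root, here is a small sketch that computes the depth of each class from its frequency. The class names and counts are hypothetical, and the function is only an illustration of Huffman coding, not code from fastText.<br />
<br />
```python
import heapq
from itertools import count


def huffman_depths(class_counts):
    """Return {class: depth in the Huffman tree} built from class frequencies."""
    tiebreak = count()  # unique counter so that heap entries never compare dicts
    heap = [(freq, next(tiebreak), {label: 0}) for label, freq in class_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every leaf below them moves one level down.
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {label: depth + 1 for label, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


# Hypothetical tag frequencies.
depths = huffman_depths({"a": 50, "b": 25, "c": 15, "d": 10})
```

In this toy example the most frequent class ends up one step from the root, while the two rarest classes share the deepest level.<br />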
<br />
A probability for each path, whether we are travelling right or left from a node, is calculated. This is done by applying the sigmoid function to the product of the output vector <math>v_{n_i}</math> of each inner node <math> n </math> and the output value of the hidden layer of the model, <math>h</math>. The idea of this method is to represent the output classes as the leaves on this tree and a random walk then assigns probabilities for these classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> represents the leaf node that a class is located on, at depth <math> l+1 </math>, and <math> n_1, n_2, \ldots, n_l </math> represent the parent nodes of that leaf </i><br />
<br />
In the below figure, we can see an example of a binary tree for the hierarchical softmax model. An example path from root node <math>n_1</math> to label 2 is highlighted in blue. In this case we can see that each path has an associated probability calculation and the total probability of label 2 is in line with the class probability calculation above.<br />
<br />
[[File:figure2.jpg|Binary Tree Example for the Hierarchical Softmax Model]]<br />
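<br />
The path-probability calculation can be sketched as below. The tree layout, node names and vectors are hypothetical and are not taken from the figure. The convention that the sigmoid gives the probability of going left, and one minus the sigmoid the probability of going right, is also an assumption.<br />
<br />
```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def leaf_probability(path, node_vectors, h):
    """Multiply the branch probabilities along the path from the root to a leaf.

    path is a list of (inner_node_id, go_left) pairs.
    """
    prob = 1.0
    for node_id, go_left in path:
        p_left = sigmoid(sum(v * x for v, x in zip(node_vectors[node_id], h)))
        prob *= p_left if go_left else 1.0 - p_left
    return prob


# Hypothetical balanced tree with three inner nodes and four labels.
node_vectors = {"n1": [0.5, -0.2], "n2": [0.1, 0.3], "n3": [-0.4, 0.2]}
paths = {
    "label1": [("n1", True), ("n2", True)],
    "label2": [("n1", True), ("n2", False)],
    "label3": [("n1", False), ("n3", True)],
    "label4": [("n1", False), ("n3", False)],
}
h = [1.0, 2.0]  # output of the hidden layer
probs = {label: leaf_probability(path, node_vectors, h) for label, path in paths.items()}
```

Because the two branch probabilities at every inner node sum to one, the leaf probabilities sum to one without ever normalizing over all classes.<br />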
<br />
<br />
Note that during the training process, the vector representations of the inner nodes, <math>v_{n_i}</math>, are updated by introducing an error function, <math>Err</math>, and differentiating it to obtain the following:<br />
<div style="text-align: center;"> <math> \frac{\partial Err}{\partial v_{n_i}^{'}}=\frac{\partial Err}{\partial v_{n_i}^{'}h} \cdot \frac{\partial v_{n_i}^{'}h }{\partial v_{n_i}^{'}} </math> <br></div><br />
<br />
We can use this to update the vector values with the following:<br />
<br />
<div style="text-align: center;"> <math> v_{n_i}'(new) = v_{n_i}'(old) - \eta \frac{\partial Err}{\partial v_{n_i}'} </math><br></div><br />
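<br />
A minimal sketch of this update for a single inner node follows. It assumes that <math>Err</math> is the negative log likelihood of the branch decision at that node, so that the derivative with respect to the score <math>v_{n_i}'h</math> is the predicted probability minus the target. The source does not spell out the error function, so this choice is an assumption.<br />
<br />
```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def update_inner_node(v, h, target, eta):
    """One gradient step on an inner-node vector v.

    target is 1 if the true path goes left at this node and 0 otherwise
    (assumed convention); eta is the learning rate.
    """
    score = sum(v_d * h_d for v_d, h_d in zip(v, h))
    grad_wrt_score = sigmoid(score) - target  # dErr / d(v'h) for the assumed Err
    # Chain rule: d(v'h)/dv = h, so dErr/dv = grad_wrt_score * h.
    return [v_d - eta * grad_wrt_score * h_d for v_d, h_d in zip(v, h)]


v_old = [0.0, 0.0]
h = [1.0, 2.0]
v_new = update_inner_node(v_old, h, target=1, eta=0.1)
```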
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is an algorithm for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized such that all the elements of the vector add up to one (taking the relative frequency of each word). If these frequencies exceed a certain level, they will activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information because it considers single words only and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the training set does not include the entire dictionary of the testing set. <br />
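<br />
The procedure can be sketched as follows. Whitespace tokenization and lowercasing are simplifying assumptions, and the function names are made up for this illustration.<br />
<br />
```python
from collections import Counter


def build_dictionary(train_docs, n):
    """Keep the n most frequent words of the training documents."""
    counts = Counter(word for doc in train_docs for word in doc.lower().split())
    return [word for word, _ in counts.most_common(n)]


def bag_of_words(doc, dictionary):
    """Normalized word-count vector over the dictionary (entries sum to one)."""
    counts = Counter(doc.lower().split())
    total = sum(counts[word] for word in dictionary)
    if total == 0:
        return [0.0] * len(dictionary)  # no dictionary word occurs in the document
    return [counts[word] / total for word in dictionary]


dictionary = build_dictionary(["I love apple", "I love sentence"], n=4)
vec_a = bag_of_words("I love apple", dictionary)
vec_b = bag_of_words("apple love I", dictionary)
```

The two vectors are identical even though the word order differs, which is exactly the loss of information described above.<br />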
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation by storing the n local words adjacent to the initial word (or characters, since N-grams can also be character based). Each word in the document is read one at a time, just as in bag of words, but a certain range of its neighbors is scanned as well. This range is known as the n-gram. Compared to bag of words, any N over 1 (N = 1 is known as a unigram) will contain more information. <br />
<br />
A unigram (1-gram) is the simplest possible version of this model: it does not consider previous words and just chooses words at random based on how common they are in general. A bigram (2-gram) considers the previous word, a trigram considers the previous two words, and so on up to N-grams, which consider the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. A bigram model conditions each word only on the word before it. For example, take the sentence, "How long can this go on?" With a bigram model, we can model it as follows:<br />
<br />
P(“How long can this go on?”) = P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
In general, the bigram assumption reduces the chain rule above to the following product of conditional probabilities: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case as: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
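<br />
As an illustration of the bigram product, the sketch below estimates each conditional probability from counts in a tiny made-up corpus. The maximum-likelihood estimate (bigram count divided by context count) and the start token <code>&lt;s&gt;</code> are assumptions added here. The text above does not say how the probabilities are estimated.<br />
<br />
```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams, with an assumed start token '<s>'."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts


def sentence_probability(sentence, context_counts, bigram_counts):
    """Product over k of P(w_k | w_{k-1}), each estimated as count(w_{k-1} w_k) / count(w_{k-1})."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0  # unseen context
        prob *= bigram_counts[(prev, word)] / context_counts[prev]
    return prob


corpus = ["how long can this go on", "how long is it"]
contexts, bigrams = train_bigram(corpus)
p = sentence_probability("how long can this go on", contexts, bigrams)
```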
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide useful predictive clues. For example, suppose you want the model to learn subject-verb number agreement from the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
You would need an 11-gram, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The following example illustrates this.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B have the same vector. This is just like bag of words and the aforementioned problem that '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because a bigram takes into account one neighboring word. However, A and C still share an element, namely "I love". If we were to further increase N, it would be easier to distinguish the two. However, the consequence of operating with a higher N is that the run time will increase.<br />
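<br />
The count vectors for sentences A, B and C can be reproduced with a few lines of Python. The helper names are made up for this sketch, and tokenization is a plain whitespace split.<br />
<br />
```python
def ngrams(sentence, n):
    """All contiguous runs of n words in the sentence."""
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_vector(sentence, vocabulary, n):
    """Count how often each vocabulary n-gram occurs in the sentence."""
    grams = ngrams(sentence, n)
    return [grams.count(term) for term in vocabulary]


A, B, C = "I love apple", "apple love I", "I love sentence"
unigram_vocab = ["I", "love", "apple", "sentence"]
bigram_vocab = ["I love", "love apple", "apple love", "love I", "love sentence"]

unigram_vectors = {name: count_vector(s, unigram_vocab, 1) for name, s in [("A", A), ("B", B), ("C", C)]}
bigram_vectors = {name: count_vector(s, bigram_vocab, 2) for name, s in [("A", A), ("B", B), ("C", C)]}
```

With unigrams, A and B map to the same vector. With bigrams, all three sentences get different vectors.<br />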
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification. It maps words to indices of an array or matrix of fixed size by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, reducing the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps an arbitrary array of keys to a list of fixed size. For example, consider a hash function <math> h </math> that maps features to the value of corresponding dictionary key.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(``\text{cats"}) = 3 </math>. Consider the sentence <math> `` \text{ I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have a list of all the words in that sentence, <math> x = [``I", ``love", ``cats", ``but", ``Mary", ``hate", ``cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
By applying hash function to each word of this sentence, we will get a list of returned indexes [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map will be <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
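<br />
The worked example above can be sketched in Python as follows. The dictionary lookup stands in for a real hash function purely for illustration. A practical implementation would hash arbitrary strings instead.<br />
<br />
```python
# Toy "hash function": the key-to-index table from the example above.
INDEX = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 6}


def h(word):
    return INDEX[word]


def hashed_feature_map(words, table_size):
    """phi_i = number of words x_j in the sentence with h(x_j) = i."""
    phi = [0] * table_size
    for word in words:
        phi[h(word)] += 1
    return phi


x = ["I", "love", "cats", "but", "Mary", "hate", "cats"]
indices = [h(word) for word in x]
phi = hashed_feature_map(x, 7)
```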
<br />
A hash collision happens when two distinct keys are mapped to the same index. For example, suppose that in the example above both "Mary" and "I" are mapped to the same index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces another hash function <math> \xi </math> to determine the sign of each feature's contribution. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
Consider if <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, then our signed hash map now becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, collisions will "cancel out", and therefore, achieve an unbiased estimate.<br />
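<br />
A sketch of the signed variant on the same toy sentence is shown below, with "Mary" colliding with "I" at index 0. The lookup tables standing in for <math>h</math> and <math>\xi</math> are illustrative assumptions, not real hash functions.<br />
<br />
```python
# Toy hash tables: "Mary" collides with "I" at index 0.
H = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 0}
XI = {word: 1 for word in H}
XI["Mary"] = -1  # sign hash: "Mary" contributes with a negative sign


def signed_hashed_feature_map(words, table_size):
    """phi_i = sum of xi(x_j) over the words x_j with h(x_j) = i."""
    phi = [0] * table_size
    for word in words:
        phi[H[word]] += XI[word]
    return phi


x = ["I", "love", "cats", "but", "Mary", "hate", "cats"]
phi_signed = signed_hashed_feature_map(x, 7)
```

The contributions of "I" and "Mary" cancel at index 0 instead of adding up to 2.<br />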
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. However, the features can also be represented with TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to the document.<br />
<br />
'''Term Frequency (TF)''' generally measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency such that a word is not deemed important if it is common across documents in general, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the total number of documents that contain word <math> t </math>.<br />
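<br />
The two definitions above translate directly into code. The corpus below is invented for illustration, and returning 0 for a term that appears in no document is a convention chosen here, since the formula is undefined in that case.<br />
<br />
```python
import math


def tf(term, doc_tokens):
    """Raw count of the term in the document."""
    return doc_tokens.count(term)


def idf(term, corpus):
    """log(N / number of documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumed convention for unseen terms
    return math.log(len(corpus) / containing)


def tfidf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "the"],
]
score_the = tfidf("the", corpus[2], corpus)
score_cat = tfidf("cat", corpus[0], corpus)
```

A word such as "the" that occurs in every document receives a score of zero no matter how often it appears, which is the adjustment IDF is meant to provide.<br />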
<br />
== Experiment ==<br />
In the experiments, fastText was compared with various other text classifiers on two classification problems. <br />
<br />
The first classification problem is sentiment analysis, where fastText is compared to existing text classifiers. The second evaluates how fastText scales to a larger output space, using a tag prediction dataset.<br />
<br />
The Vowpal Wabbit library, written in C++, can also be used to implement the model. However, compared to this library, the authors' tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
* Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. To test it, the evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images with captions, titles and tags. Here, the focus was on using the fastText text classifier to predict the tags associated with each image without using the image itself, relying instead on the information associated with the image, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur fewer than 100 times and split the data into train, validation and test sets. The train set consists of approximately 90% of the dataset, the validation set of approximately 1%, and the test set of 0.5%. Table 4 shows an example of 5 items in the validation set with their Inputs (title and caption), Prediction (the tag class they are classified to) and Tags (real image tags), highlighting when the predicted class is in fact one of the tags. <br />
<br />
{| class="wikitable"<br />
|-<br />
! Input<br />
! Prediction<br />
! Tags<br />
|-<br />
| taiyoucon 2011 digitals: individuals digital photos from the anime convention taiyoucon 2011 in mesa, arizona. if you know the model and/or the character, please comment.<br />
| #cosplay<br />
| #24mm #anime #animeconvention #arizona #canon #con #convention #cos '''#cosplay''' #costume #mesa #play #taiyou #taiyoucon<br />
|-<br />
| 2012 twin cities pride 2012 twin cities pride parade<br />
| #minneapolis <br />
| #2012twincitiesprideparade '''#minneapolis''' #mn #usa<br />
|-<br />
| beagle enjoys the snowfall<br />
| #snow<br />
| #2007 #beagle #hillsboro #january #maddison #maddy #oregon '''#snow'''<br />
|-<br />
| christmas <br />
| #christmas <br />
| #cameraphone #mobile<br />
|-<br />
| euclid avenue <br />
| #newyorkcity <br />
| #cleveland #euclidavenue<br />
|}<br />
<br />
<br />
<br />
Removing infrequently occurring tags and words eliminates some noise and helps the model learn better. A [https://github.com/facebookresearch/fastText script] has been released showing how the data is split. After cleaning, the vocabulary size is 297,141 and there are 312,116 unique tags. Results on the test set are reported as precision at 1 (prec@1), the fraction of examples for which the top-ranked predicted tag is in fact one of the true tags (for the general definition of precision, see [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia]).<br />
<br />
==== Baselines====<br />
A frequency-based approach which simply predicts the most frequent tag is used as a baseline.<br />
<br />
The model that fastText is compared to in tag prediction is [http://www.aclweb.org/anthology/D14-1194 Tagspace]. It is similar to fastText but is based on [http://www.thespermwhale.com/jaseweston/papers/wsabie-ijcai.pdf WSABIE], which predicts tags based on both images and their text annotations.<br />
<br />
The Tagspace model is a convolutional neural network which predicts hashtags in the semantic context of social network posts. For faster and more comparable performance, the experiment uses a linear version of this model. Tagspace also predicts multiple hashtags, so this experiment does not fully compare the capabilities of the Tagspace model.<br />
<br />
==== Results ====<br />
<br />
{| class="wikitable"<br />
|-<br />
! Model <br />
! prec@1<br />
! Running time - Train<br />
! Running time - Test<br />
|- <br />
| Freq. baseline <br />
| 2.2 <br />
| - <br />
| -<br />
|- <br />
| Tagspace, h = 50 <br />
| 30.1 <br />
| 3h8 <br />
| 6h<br />
|- <br />
| Tagspace, h = 200 <br />
| 35.6 <br />
| 5h32 <br />
| 15h<br />
|- <br />
| fastText, h = 50 <br />
| 30.8 <br />
| 6m40 <br />
| 48s<br />
|- <br />
| fastText, h = 50, bigram <br />
| 35.6 <br />
| 7m47 <br />
| 50s<br />
|- <br />
| fastText, h = 200 <br />
| 40.7 <br />
| 10m34 <br />
| 1m29<br />
|-<br />
| fastText, h = 200, bigram <br />
| 45.1 <br />
| 13m38 <br />
| 1m37<br />
|}<br />
<br />
<br />
<br />
The above table presents a comparison of the fastText model to the baselines. Precision at one (prec@1) is the percentage of test examples for which the top-ranked predicted tag is one of the true tags. <br />
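<br />
As a small sketch of the metric, the function below computes prec@1 as a percentage. The data are the five validation examples shown earlier, with the tag lists shortened for readability. The function name is made up for this illustration.<br />
<br />
```python
def precision_at_1(predictions, true_tags):
    """Percentage of examples whose top-ranked predicted tag is among the true tags."""
    hits = sum(1 for pred, tags in zip(predictions, true_tags) if pred in tags)
    return 100.0 * hits / len(predictions)


# Top prediction and (shortened) true tags for the five validation examples.
predictions = ["#cosplay", "#minneapolis", "#snow", "#christmas", "#newyorkcity"]
true_tags = [
    {"#anime", "#cosplay", "#costume"},
    {"#2012twincitiesprideparade", "#minneapolis", "#mn", "#usa"},
    {"#beagle", "#oregon", "#snow"},
    {"#cameraphone", "#mobile"},
    {"#cleveland", "#euclidavenue"},
]
score = precision_at_1(predictions, true_tags)
```

On these five examples the score is 60, since three of the five top predictions are correct.<br />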
<br />
<br />
We can see that the frequency baseline has the lowest accuracy. fastText was run for 5 epochs and compared to the Tagspace results. fastText with bigrams is also used, and models are tested with 50 and 200 hidden units. With 50 hidden units, fastText (without bigrams) and Tagspace perform similarly, while fastText with bigrams at 50 hidden units matches Tagspace at 200. At 200 hidden units, fastText performs better still, and adding bigrams improves accuracy further.<br />
<br />
<br />
<br />
Finally, both the train and test times of fastText are significantly shorter than those of Tagspace, which must compute the scores for each class. fastText, on the other hand, has fast inference, which gives it an advantage when the number of classes is large. In particular, test times for fastText are significantly shorter than those of Tagspace.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, which are then fed to a linear classifier. The authors test their model, fastText, against other models using an evaluation protocol similar to Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction dataset. fastText achieved comparable accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35266Bag of Tricks for Efficient Text Classification2018-03-22T19:37:12Z<p>Cs3yang: /* Dataset */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have been utilized more recently for text classification and have demonstrated very good performance. However, they are slow at both training and testing time, which limits their usage on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data sets while maintaining good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to two tasks, tag prediction and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on a billion words within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
<br />
=== Natural-Language Processing and Text Classification ===<br />
[https://en.wikipedia.org/wiki/Natural-language_processing Natural Language Processing] (NLP) is concerned with being able to process large amounts of natural language data, involving speech recognition, natural language understanding, and natural-language generation. Text understanding involves being able to understand the explicit or implicit meaning of elements of text such as words, phrases, sentences, and paragraphs, and making inferences about these properties of texts (Norvig, 1987). One of the main topics in NLP is text classification, which is assigning predefined categories to free-text documents, with research ranging in this field from designing the best features to choosing the best machine learning classifiers (Zhang et al. 2016).<br />
<br />
Traditionally, techniques for text classification are based on simple statistics on words. (Talk about linear classifiers? typical models?)<br />
<br />
With the advancement of deep learning and the availability of large data sets, methods of handling text understanding using deep learning techniques have become available in recent years. (talk about deep learning models briefly)<br />
<br />
* Briefly describe the difference between NLP and text-mining. Maybe comment later about whether fastText accomplishes NLP.<br />
<br />
=== Char-CNN ===<br />
<br />
<br />
=== Char-CRNN ===<br />
<br />
A character-level convolutional recurrent neural network (CRNN) is a convolutional neural network (CNN) with a recurrent layer on top. Because a recurrent layer can efficiently capture long-term dependencies, the model needs only a very small number of convolutional layers.<br />
<br />
The recurrent layer consists of a recursive function <math>f</math> which takes as input one input vector and the previous hidden state, and returns the new hidden state <math> h_t = f(x_t, h_{t-1})</math>, where <math> x_t \in \mathbb{R}^d</math> is one time step from the input sequence <math> (x_1, x_2, \ldots, x_T)</math>. <br />
<br />
[[File:xiao and cho 2016 pic.png]]<br />
Graphical illustration of (a) a CNN and (b) proposed CRNN for character-level document classification <br />
<br />
Source: [Xiao and Cho, 2016] Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv:1602.00367.<br />
<br />
=== VDCNN ===<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly in a large output space where some classes have very few examples. The authors attempt to improve on the performance of linear classifiers with two key features: a rank constraint and a fast loss approximation. Before getting into that, we must better understand how a linear classifier is trained.<br />
<br />
Consider each text and each label as a vector in space. The model trains the coordinates of the text vector so that the text vector is close to the vector of its associated label. The dot product of the text vector and a label vector gives a score, which the softmax function normalizes against the scores of that same text with every other possible label. The result is the probability that the text has its associated label. Stochastic gradient descent is then used to keep updating the coordinates until the probability of the correct label for every text is maximized. This is clearly computationally expensive, as the score for every possible label in the training set must be computed for each text. <br />
<br />
The softmax probability that a text is associated with label <math>j</math>, given <math>K</math> labels in the training set, is: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label <math>j</math>.<br />
<br />
Now let’s look at the improved method: a linear classifier with a rank constraint and a fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the n-gram features of the input are first looked up to find word representations, which are then averaged into a hidden text representation. This representation is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log likelihood over the classes. The classifier trains on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
<br />
[[File:formula explained.png]]<br />
<br />
Two changes applied in this model architecture are the hierarchical softmax function, which speeds up computation when the number of classes is large, and the hashing trick, which keeps the mapping of n-grams (used to capture local word order) fast and memory efficient. These two techniques are explained more thoroughly in the upcoming sections.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over the predefined classes. For each input <math>y_n</math>, it outputs a probability estimate for every possible class. The denominator of the softmax function normalizes these estimates so that, for a single <math>y_n</math>, the probabilities over all labels sum to one. The label with the highest probability can then be chosen as the label for <math>y_n</math>. <br />
<br />
However, the softmax function does have a computational complexity of <math> O(Kd) </math> where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is due to the nature of the softmax function since each function calculation requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the differences in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose we have a binary tree structure based on Huffman coding for the softmax function, where each node has at most two children or leaves. Huffman coding trees provide a means to optimize binary trees where the classes with lowest frequencies are placed deeper into the tree and the highest frequency classes are placed near the root of the tree, which minimizes the path for more frequently labelled classes. Huffman coding trees are efficient since computational runtime is reduced to <math> O(d \log_2(K)) </math>. <br />
<br />
A probability for each path, whether we are travelling right or left from a node, is calculated. This is done by applying the sigmoid function to the product of the output vector <math>v_{n_i}</math> of each inner node <math> n </math> and the output value of the hidden layer of the model, <math>h</math>. The idea of this method is to represent the output classes as the leaves on this tree and a random walk then assigns probabilities for these classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> represents the leaf node that a class is located on, at depth <math> l+1 </math>, and <math> n_1, n_2, \ldots, n_l </math> represent the parent nodes of that leaf </i><br />
<br />
In the below figure, we can see an example of a binary tree for the hierarchical softmax model. An example path from root node <math>n_1</math> to label 2 is highlighted in blue. In this case we can see that each path has an associated probability calculation and the total probability of label 2 is in line with the class probability calculation above.<br />
<br />
[[File:figure2.jpg|Binary Tree Example for the Hierarchical Softmax Model]]<br />
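<br />
The class probability calculation can be sketched in a few lines of code. This is a minimal illustration under stated assumptions: the sign convention (+1 for a left branch, -1 for a right branch), the toy vectors and the helper name <code>leaf_probability</code> are all hypothetical and are not taken from the paper.<br />

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def leaf_probability(path, h):
    """Multiply the branch probabilities along the path from the root to a leaf.

    path: list of (inner-node vector, direction) pairs, where direction is
    +1 for a left branch and -1 for a right branch (assumed convention).
    """
    p = 1.0
    for v, direction in path:
        p *= sigmoid(direction * dot(v, h))
    return p


# Toy hidden representation and inner-node vectors (assumed values).
h = [0.5, -1.0]
v_root, v_left, v_right = [1.0, 0.2], [-0.3, 0.8], [0.6, 0.1]

# A balanced tree with three inner nodes and four leaves (labels).
leaves = {
    "label1": [(v_root, +1), (v_left, +1)],
    "label2": [(v_root, +1), (v_left, -1)],
    "label3": [(v_root, -1), (v_right, +1)],
    "label4": [(v_root, -1), (v_right, -1)],
}
probs = {name: leaf_probability(path, h) for name, path in leaves.items()}
total = sum(probs.values())
```

Because the two branch probabilities at every inner node sum to one, the leaf probabilities sum to one without ever normalizing over all classes, which is where the saving over the full softmax comes from.<br />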
<br />
<br />
Note that during the training process, the vector representations of the inner nodes, <math>v_{n_i}</math>, are updated by introducing an error function, <math>Err</math>, and differentiating it with the chain rule to obtain the following:<br />
<div style="text-align: center;"> <math> \frac{\partial Err}{\partial v_{n_i}^{'}}=\frac{\partial Err}{\partial v_{n_i}^{'}h} \cdot \frac{\partial v_{n_i}^{'}h }{\partial v_{n_i}^{'}} </math> <br></div><br />
<br />
We can use this to update the vector values with the following:<br />
<br />
<div style="text-align: center;"> <math> v_{n_i}'(new) = v_{n_i}'(old) - \eta \frac{\partial Err}{\partial v_{n_i}'} </math><br></div><br />
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is a method for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized so that its elements add up to one (that is, each count is replaced by the word's relative frequency). If these frequencies exceed a certain level, they will activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information because it treats each word in isolation and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the dictionary built from the training set does not cover the words that appear in the testing set. <br />
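<br />
A minimal sketch of the procedure described above follows. The helper names <code>build_vocabulary</code> and <code>bag_of_words</code> and the toy documents are assumptions for illustration, and whitespace tokenization stands in for whatever preprocessing the authors actually used.<br />

```python
from collections import Counter


def build_vocabulary(train_docs, n):
    """Keep the n most frequent words of the training documents."""
    counts = Counter(w for doc in train_docs for w in doc.lower().split())
    return [w for w, _ in counts.most_common(n)]


def bag_of_words(document, vocabulary):
    """Normalized word counts over a fixed vocabulary (entries sum to one)."""
    counts = Counter(document.lower().split())
    total = sum(counts[w] for w in vocabulary)
    if total == 0:  # no known word at all
        return [0.0] * len(vocabulary)
    return [counts[w] / total for w in vocabulary]


# Toy training documents (assumed data).
train_docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab = build_vocabulary(train_docs, 3)

# "bird" never appears in the dictionary, so it is silently dropped.
vec = bag_of_words("the bird sat", vocab)
```

The test document contains a word that is missing from the dictionary, which shows the second weakness mentioned above: information about unseen words is lost entirely.<br />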
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation by storing, for each word, the n-local words adjacent to it (or characters, since N-grams can also be character based). Each word in the document is read one at a time, just like bag of words, but a certain range of its neighbors is scanned as well. This range is known as the n-gram. Compared to bag of words, which corresponds to N = 1 (the unigram), any N greater than 1 retains more information. <br />
<br />
The picture above gives an example of a unigram (1-gram), which is the simplest possible version of this model. A unigram model does not consider previous words and just chooses words at random according to how common they are in general. It also shows a bigram (2-gram), where the previous word is considered. A trigram would consider the previous two words, and so on up to N-grams, which consider the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words from word 1 to word n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. A bigram model approximates each factor by conditioning only on the previous word. For example, take the sentence "How long can this go on?" Under the bigram model it is written as follows:<br />
<br />
P(“How long can this go on?”) = P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
Going back to the chain rule, the bigram assumption reduces the above equation to the following product of conditional probabilities: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case, which conditions on the previous N-1 words: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
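<br />
As a rough illustration of the bigram product, the sketch below estimates each conditional probability by a ratio of counts (maximum likelihood) on a toy corpus. The corpus, the start marker <code>&lt;s&gt;</code> and the function names are assumptions for this example and do not come from the paper.<br />

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams; "<s>" marks the start of a sentence."""
    contexts, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split()
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # (previous word, current word)
    return contexts, bigrams


def sentence_probability(sentence, contexts, bigrams):
    """Product of P(w_k | w_{k-1}), each estimated as count(prev, cur) / count(prev)."""
    tokens = ["<s>"] + sentence.split()
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:  # context never seen in training
            return 0.0
        p *= bigrams[(prev, cur)] / contexts[prev]
    return p


# Toy corpus (assumed data).
corpus = ["how long can this go on", "how long is it", "how can this be"]
contexts, bigrams = train_bigram(corpus)
p_seen = sentence_probability("how long can this go on", contexts, bigrams)
p_unseen = sentence_probability("long how", contexts, bigrams)
```

Any sentence containing a bigram that never occurred in training receives probability zero, which is one reason practical N-gram models add smoothing.<br />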
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide any useful predictive clues. For example, suppose you want the model to learn subject-verb agreement from the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
You would need an 11-gram to connect the subject to the verb, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The following example compares the three representations on three short sentences.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice how A and B are the same vector. This is just like bag of words and the aforementioned problem that '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because the bigram takes into consideration one neighboring word. However, A and C still share an element, namely "I love". If we were to further increase N, it would become easier to distinguish the two. However, the consequence of operating with a higher N is that the run time will increase.<br />
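<br />
The two tables can be reproduced with a short sketch. The helper name <code>ngrams</code> is an assumption for this example, and tokens are obtained by splitting on whitespace.<br />

```python
def ngrams(text, n):
    """Return the list of word n-grams of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


docs = {"A": "I love apple", "B": "apple love I", "C": "I love sentence"}

unigrams = {name: sorted(ngrams(text, 1)) for name, text in docs.items()}
bigrams = {name: ngrams(text, 2) for name, text in docs.items()}

# Unigram features cannot tell A and B apart; bigram features can.
same_unigrams = unigrams["A"] == unigrams["B"]
same_bigrams = set(bigrams["A"]) == set(bigrams["B"])
shared_a_c = set(bigrams["A"]) & set(bigrams["C"])
```

Here <code>shared_a_c</code> contains only "I love", the element that A and C still have in common at N = 2.<br />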
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification. It maps words to indices of an array or a matrix of fixed size by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, which reduces the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps keys of arbitrary size to a fixed range of values. For example, consider a hash function <math> h </math> that maps each feature to the index listed for it in the following dictionary.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(``\text{cats"}) = 3 </math>. Consider the sentence <math> `` \text{ I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have a list of all the words in that sentence, <math> x = [``I", ``love", ``cats", ``but", ``Mary", ``hate", ``cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
By applying hash function to each word of this sentence, we will get a list of returned indexes [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map will be <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. For example, suppose that in the above example both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces a second hash function <math> \xi </math> that determines the sign (+1 or -1) with which each feature is added at its index. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
Consider if <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, then our signed hash map now becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, collisions will "cancel out", and therefore, achieve an unbiased estimate.<br />
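<br />
The three hash tables in this section can be reproduced with the sketch below. The dictionary lookup stands in for a real hash function, and the names <code>hashed_features</code>, <code>colliding</code> and <code>signs</code> are assumptions made for this example.<br />

```python
def hashed_features(tokens, h, size, xi=None):
    """Hashed feature map: add xi(token) (or 1 if xi is None) at index h(token)."""
    phi = [0] * size
    for t in tokens:
        sign = xi(t) if xi is not None else 1
        phi[h(t)] += sign
    return phi


# The toy "hash function" from the table above, written as a dictionary lookup.
index = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 6}
x = ["I", "love", "cats", "but", "Mary", "hate", "cats"]

plain = hashed_features(x, index.__getitem__, 7)

# Collision: "Mary" now hashes to the same index as "I".
colliding = dict(index, Mary=0)
collided = hashed_features(x, colliding.__getitem__, 7)

# Signed hashing: xi("Mary") = -1, every other word gets +1.
signs = {"Mary": -1}
signed = hashed_features(x, colliding.__getitem__, 7, xi=lambda t: signs.get(t, 1))
```

In practice the lookup table would be replaced by a real hash of the token reduced modulo the table size, so that no dictionary has to be stored at all.<br />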
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. However, another way to represent the features is TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to the document.<br />
<br />
'''Term Frequency (TF)''' generally measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency, so that a word is not deemed important if it is a generally common word, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the total number of documents that contain word <math> t </math>.<br />
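<br />
The two definitions above can be combined into a short sketch. The function name <code>tfidf</code> and the toy documents are assumptions for illustration, and the natural logarithm is used because the source does not state a base.<br />

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: tf * idf} dictionary per document.

    tf is the raw count of the term in the document;
    idf is log(N / number of documents containing the term).
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: count each term at most once per document.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy documents (assumed data).
docs = ["the cat sat on the mat", "the dog sat", "the cat ran"]
weights = tfidf(docs)
```

The word "the" occurs in every document, so its IDF is log(1) = 0 and its weight vanishes regardless of how often it appears, which is exactly the adjustment described above.<br />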
<br />
== Experiment ==<br />
In this experiment, fastText was compared with various other text classifiers on two classification problems. <br />
<br />
The first classification problem is sentiment analysis, where fastText is compared to existing text classifiers. The second evaluates how fastText scales to a larger output space on a tag prediction dataset.<br />
<br />
The Vowpal Wabbit library, written in C++ can also be used to implement our model. However, compared to this library, our tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
* Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. To test it, the evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images, each with a caption, a title and tags. Here, the spotlight was on using the fastText text classifier to predict the tags associated with each image without using the image itself, but rather the information associated with the image, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur less than 100 times and split the data into a train, validation and test set. The train set consists of approximately 90% of the dataset, the validation set consists of approximately 1%, and the test set of 0.5%. Table 4 shows an example of 5 items in the validation set with their Inputs (title and caption), Prediction (the tag class they are classified to) and Tags (real image tags), highlighting when the predicted class is in fact one of the tags. <br />
<br />
{| class="wikitable"<br />
|-<br />
! Input<br />
! Prediction<br />
! Tags<br />
|-<br />
| taiyoucon 2011 digitals: individuals digital photos from the anime convention taiyoucon 2011 in mesa, arizona. if you know the model and/or the character, please comment.<br />
| #cosplay<br />
| #24mm #anime #animeconvention #arizona #canon #con #convention #cos '''#cosplay''' #costume #mesa #play #taiyou #taiyoucon<br />
|-<br />
| 2012 twin cities pride 2012 twin cities pride parade<br />
| #minneapolis <br />
| #2012twincitiesprideparade '''#minneapolis''' #mn #usa<br />
|-<br />
| beagle enjoys the snowfall<br />
| #snow<br />
| #2007 #beagle #hillsboro #january #maddison #maddy #oregon '''#snow'''<br />
|-<br />
| christmas <br />
| #christmas <br />
| #cameraphone #mobile<br />
|-<br />
| euclid avenue <br />
| #newyorkcity <br />
| #cleveland #euclidavenue<br />
|}<br />
<br />
<br />
<br />
Removing infrequently occurring tags and words eliminates some noise and helps our model learn better. A [https://github.com/facebookresearch/fastText script] has been released describing how the data is split. After cleaning, the vocabulary size is 297,141 and there are 312,116 unique tags. Performance on the test set is reported as precision at 1 (prec@1), that is, the fraction of test items whose single top-ranked predicted tag is in fact one of the item's true tags (for background on precision, see [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia]).<br />
<br />
==== Baselines====<br />
A frequency-based approach which simply predicts the most frequent tag is used as a baseline.<br />
<br />
The model we compare fastText to in tag prediction is [http://www.aclweb.org/anthology/D14-1194 Tagspace]. It is similar to our model but is based on the [http://www.thespermwhale.com/jaseweston/papers/wsabie-ijcai.pdf WSABIE] model, which predicts tags based on both images and their text annotations.<br />
<br />
Tagspace is chosen as a comparison model because, similar to fastText, it is designed to predict tags. However, Tagspace predicts multiple tags, whereas the authors only evaluate prec@1, i.e., whether the single top tag prediction is correct or not. Hence, the comparison does not fully reflect the capabilities of the Tagspace model.<br />
<br />
The Tagspace model is a convolutional neural network which predicts hashtags in semantic contexts of social network posts. For faster and more comparable performance, we consider the linear version of this model. (Linear version of Tagspace needs to be explained).<br />
<br />
==== Results ====<br />
<br />
{| class="wikitable"<br />
|-<br />
! Model <br />
! prec@1<br />
! Running time - Train<br />
! Running time - Test<br />
|- <br />
| Freq. baseline <br />
| 2.2 <br />
| - <br />
| -<br />
|- <br />
| Tagspace, h = 50 <br />
| 30.1 <br />
| 3h8 <br />
| 6h<br />
|- <br />
| Tagspace, h = 200 <br />
| 35.6 <br />
| 5h32 <br />
| 15h<br />
|- <br />
| fastText, h = 50 <br />
| 30.8 <br />
| 6m40 <br />
| 48s<br />
|- <br />
| fastText, h = 50, bigram <br />
| 35.6 <br />
| 7m47 <br />
| 50s<br />
|- <br />
| fastText, h = 200 <br />
| 40.7 <br />
| 10m34 <br />
| 1m29<br />
|-<br />
| fastText, h = 200, bigram <br />
| 45.1 <br />
| 13m38 <br />
| 1m37<br />
|}<br />
<br />
<br />
<br />
The above table presents a comparison of our fastText model to other baselines. Precision-at-one (prec@1), reported here as a percentage, is the fraction of test examples for which the model's top-ranked tag is one of the true tags. <br />
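<br />
As a small illustration of the metric, the sketch below computes prec@1 on the five validation examples listed in the Dataset section. The function name <code>precision_at_1</code> is an assumption for this example; the predictions and tag sets are copied from that table.<br />

```python
def precision_at_1(examples):
    """Fraction of examples whose single predicted tag is among the true tags."""
    hits = sum(1 for predicted, tags in examples if predicted in tags)
    return hits / len(examples)


# (prediction, true tags) pairs taken from the validation examples shown earlier.
examples = [
    ("#cosplay", {"#24mm", "#anime", "#animeconvention", "#arizona", "#canon",
                  "#con", "#convention", "#cos", "#cosplay", "#costume",
                  "#mesa", "#play", "#taiyou", "#taiyoucon"}),
    ("#minneapolis", {"#2012twincitiesprideparade", "#minneapolis", "#mn", "#usa"}),
    ("#snow", {"#2007", "#beagle", "#hillsboro", "#january", "#maddison",
               "#maddy", "#oregon", "#snow"}),
    ("#christmas", {"#cameraphone", "#mobile"}),
    ("#newyorkcity", {"#cleveland", "#euclidavenue"}),
]
score = precision_at_1(examples)
```

Three of the five predictions appear among the true tags, so the score on this tiny sample is 0.6; the figures in the results table are computed in the same way over the full test set and expressed as percentages.<br />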
<br />
<br />
We can see that the frequency baseline has the lowest accuracy. fastText is run for 5 epochs and compared to the Tagspace results. fastText with bigrams is also used, and models are tested with 50 and 200 hidden units. With 50 hidden units, fastText (without bigrams) and Tagspace performed similarly, while fastText with bigrams at 50 hidden units matched the prec@1 that Tagspace reached at 200. With 200 hidden units, fastText performs better still, and adding bigrams improves accuracy further. (Talk a little more about the accuracy, the units and what it implies at the beginning of the accuracy section).<br />
<br />
<br />
<br />
Finally, both the training and test times of our model are significantly shorter than those of Tagspace, which must compute the scores for each class. fastText, on the other hand, has fast inference, which gives it an advantage when the number of classes is large. In particular, the test times for fastText are significantly shorter than those of Tagspace.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, which are then fed to a linear classifier. The authors test their model, fastText, against other models using an evaluation protocol similar to Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction dataset. fastText achieved comparable results in terms of accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35136Bag of Tricks for Efficient Text Classification2018-03-22T04:24:45Z<p>Cs3yang: /* Results and training time. */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have been utilized more recently for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their usage on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data set while maintaining its good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to the two tasks: tag predictions and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
* PLACEHOLDER: we should look at when this <br />
<br />
=== Natural-Language Processing ===<br />
<br />
* Briefly describe the difference between NLP and text-mining. Maybe comment later about whether fastText accomplishes NLP.<br />
<br />
<br />
=== Char-CNN ===<br />
<br />
<br />
=== Char-CRNN ===<br />
<br />
A character-level convolutional recurrent neural network (CRNN) is a convolutional neural network (CNN) with a recurrent layer on top. A recurrent layer can efficiently capture long-term dependencies, so only a very small number of convolutional layers is needed.<br />
<br />
The recurrent layer consists of a recursive function <math>f</math> which takes as input one input vector and the previous hidden state, and returns the new hidden state <math> h_t = f(x_t, h_{t-1})</math>, where <math> x_t \in \mathbb{R}^d</math> is one time step from the input sequence <math> (x_1, x_2, \ldots, x_T)</math>. <br />
<br />
[[File:xiao and cho 2016 pic.png]]<br />
Graphical illustration of (a) a CNN and (b) proposed CRNN for character-level document classification <br />
<br />
Source: [Xiao and Cho2016] Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv: 1602.00367.<br />
<br />
=== VDCNN ===<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. This can limit generalization when the output space is large and some classes have very few examples. The authors of this paper attempt to improve on the performance of linear classifiers with the key features of a rank constraint and a fast loss approximation. Before we get into that, we must better understand how linear classifiers are trained.<br />
<br />
Consider each text and each label as a vector in space. The model trains the coordinates of the text vector so that the text vector is close to the vector of its associated label. The text vector and a label vector are combined into a score, and the softmax function normalizes that score across the scores of the same text with every other possible label. The result is the probability that the text has its associated label. The stochastic gradient descent algorithm is then used to keep updating the coordinates until the probability of the correct label for every text is maximized. This is clearly computationally expensive, as the score for every possible label in the training set must be computed for each text. <br />
<br />
The softmax function gives the probability that a text is associated with label j, with K labels in the training set, as: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label <math>j</math>.<br />
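<br />
The softmax formula can be written out directly, as in the sketch below. The function name <code>softmax_probabilities</code> and the toy vectors are assumptions for this example; subtracting the maximum score is a standard numerical-stability step that does not change the result.<br />

```python
import math


def softmax_probabilities(x, W):
    """P(y = j | x) for every label j, given one label vector per row of W."""
    # One dot product of length d per label: O(K * d) work in total.
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in W]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)    # the normalizer runs over all K labels
    return [e / z for e in exps]


# Toy text vector and three label vectors (assumed values).
x = [1.0, 2.0]
W = [[0.5, 0.5], [1.0, -1.0], [0.0, 1.0]]
probs = softmax_probabilities(x, W)
```

Every call touches all K label vectors, which is the cost that motivates the faster loss approximation discussed next.<br />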
<br />
Now let’s look at the improved method of linear classifiers with a rank constraint and a fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the n-gram features of the input are first looked up to find word representations, which are then averaged into a hidden text representation. This representation is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log likelihood over the classes. The classifier trains on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
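<br />
The forward pass described above can be sketched as follows. This is an illustration under stated assumptions rather than the fastText code: the lookup table <code>A</code>, the classifier matrix <code>B</code> and all numeric values are toy choices, and the loss is shown for a single document.<br />

```python
import math


def average(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def negative_log_likelihood(hidden, B, label):
    """-log softmax(B . hidden)[label], computed with a stable log-sum-exp."""
    scores = [sum(b * h for b, h in zip(row, hidden)) for row in B]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[label]


# Toy lookup table for word features (assumed values).
A = {"good": [1.0, 0.0], "bad": [-1.0, 0.0], "movie": [0.0, 0.5]}
# Toy linear classifier: row 0 = positive class, row 1 = negative class.
B = [[2.0, 0.0], [-2.0, 0.0]]

hidden = average([A[w] for w in ["good", "movie"]])  # hidden text representation
loss_correct = negative_log_likelihood(hidden, B, 0)
loss_wrong = negative_log_likelihood(hidden, B, 1)
```

The loss is smaller when the document is labelled with the class its averaged representation points towards; training would adjust both <code>A</code> and <code>B</code> by stochastic gradient descent to reduce this loss summed over all documents.<br />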
<br />
[[File:formula explained.png]]<br />
<br />
Two changes that were applied in this model architecture are the hierarchical softmax function, which improves running time when the number of classes is large, and the hashing trick, which keeps the mapping of the n-grams used to capture local word order fast and memory efficient. These two changes are explained more thoroughly in the upcoming sections.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over predefined classes. It calculates the probability of <math>y_n</math> belonging to each possible class and outputs a probability estimate for each class. Due to the nature of the softmax function, the denominator serves to normalize the probabilities, allowing a single <math>y_n</math> to receive probabilities for each label, where these probabilities sum to one. This provides a means to choose the label with the highest probability as the corresponding label for <math>y_n</math>. <br />
<br />
However, the softmax function does have a computational complexity of <math> O(Kd) </math> where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is due to the nature of the softmax function since each function calculation requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the differences in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose we have a binary tree structure based on Huffman coding for the softmax function, where each node has at most two children or leaves. Huffman coding trees provide a means to optimize binary trees where the classes with lowest frequencies are placed deeper into the tree and the highest frequency classes are placed near the root of the tree, which minimizes the path for more frequently labelled classes. Huffman coding trees are efficient since computational runtime is reduced to <math> O(d \log_2(K)) </math>. <br />
<br />
A probability for each path, whether we are travelling right or left from a node, is calculated. This is done by applying the sigmoid function to the product of the output vector <math>v_{n_i}</math> of each inner node <math> n </math> and the output value of the hidden layer of the model, <math>h</math>. The idea of this method is to represent the output classes as the leaves on this tree and a random walk then assigns probabilities for these classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> represents the leaf node that a class is located on, at depth <math> l+1 </math>, and <math> n_1, n_2, \ldots, n_l </math> represent the ancestor nodes on the path from the root to that leaf.</i><br />
<br />
In the below figure, we can see an example of a binary tree for the hierarchical softmax model. An example path from root node <math>n_1</math> to label 2 is highlighted in blue. In this case we can see that each path has an associated probability calculation and the total probability of label 2 is in line with the class probability calculation above.<br />
<br />
[[File:figure2.jpg|Binary Tree Example for the Hierarchical Softmax Model]]<br />
<br />
<br />
Note that during the training process, the vector representations of the inner nodes, <math>v_{n_i}</math>, are updated by introducing an error function, <math>Err</math>, and differentiating it with the chain rule to obtain the following:<br />
<div style="text-align: center;"> <math> \frac{\partial Err}{\partial v_{n_i}^{'}}=\frac{\partial Err}{\partial v_{n_i}^{'}h} \cdot \frac{\partial v_{n_i}^{'}h }{\partial v_{n_i}^{'}} </math> <br></div><br />
<br />
We can use this to update the vector values with the following:<br />
<br />
<div style="text-align: center;"> <math> v_{n_i}'(new) = v_{n_i}'(old) - \eta \frac{\partial Err}{\partial v_{n_i}'} </math><br></div><br />
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is a method for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized so that its elements add up to one (that is, each count is replaced by the word's relative frequency). If these frequencies exceed a certain level, they will activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information because it treats each word in isolation and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the dictionary built from the training set does not cover the words that appear in the testing set. <br />
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation by storing, for each word, the n-local words adjacent to it (or characters, since N-grams can also be character based). Each word in the document is read one at a time, just like bag of words, but a certain range of its neighbors is scanned as well. This range is known as the n-gram. Compared to bag of words, which corresponds to N = 1 (the unigram), any N greater than 1 retains more information. <br />
<br />
The picture above gives an example of a unigram (1-gram), which is the simplest possible version of this model. A unigram model does not consider previous words and just chooses words at random according to how common they are in general. It also shows a bigram (2-gram), where the previous word is considered. A trigram would consider the previous two words, and so on up to N-grams, which consider the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words from word 1 to word n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. A bigram model approximates each factor by conditioning only on the previous word. For example, take the sentence "How long can this go on?" Under the bigram model it is written as follows:<br />
<br />
P(“How long can this go on?”) = P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
Going back to the chain rule, the bigram assumption reduces the above equation to the following product of conditional probabilities: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case, which conditions on the previous N-1 words: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide any useful predictive clues. For example, suppose you want the model to learn subject-verb agreement from the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
You would need an 11-gram to connect the subject to the verb, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The following example compares the three representations on three short sentences.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice how A and B are the same vector. This is just like bag of words and the aforementioned problem that '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Now A and B have different vectors, because a bigram takes one neighbouring word of local context into account. However, A and C still share an element, namely "I love". Increasing N further makes such sentences easier to tell apart, at the cost of a higher-dimensional feature space and a longer run time.<br />
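<br />
The comparison above can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper <code>ngrams</code> and the whitespace tokenization are assumptions made for this example and are not taken from the paper or from the fastText code.<br />
<br />
```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# The three example sentences from the text.
sentences = {
    "A": "I love apple",
    "B": "apple love I",
    "C": "I love sentence",
}

# Count unigrams (bag of words) and bigrams for each sentence.
unigram_counts = {name: Counter(ngrams(text.split(), 1)) for name, text in sentences.items()}
bigram_counts = {name: Counter(ngrams(text.split(), 2)) for name, text in sentences.items()}

# Unigram counts cannot tell A and B apart; bigram counts can.
same_unigrams = unigram_counts["A"] == unigram_counts["B"]
same_bigrams = bigram_counts["A"] == bigram_counts["B"]
```
<br />
Here <code>same_unigrams</code> is true while <code>same_bigrams</code> is false, which is the order-sensitivity that bigrams add over a bag of words.<br />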
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification: it maps words to indices of a fixed-size array or matrix by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, which reduces the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps keys of arbitrary size to values in a fixed range. For example, consider a hash function <math> h </math> that maps each word to its index in the dictionary below.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(``\text{cats"}) = 3 </math>. Consider the sentence <math> `` \text{ I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have the list of all words in the sentence, <math> x = [``I", ``love", ``cats", ``but", ``Mary", ``hate", ``cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
Applying the hash function to each word of this sentence gives the list of indices [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map is <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. For example, suppose that in the example above both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which adds a second hash function <math> \xi </math> that determines the sign (+1 or -1) of each feature's contribution. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
If <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, and <math> \xi = 1 </math> for every other word, then the signed hash map becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, the contributions of colliding features "cancel out" in expectation, which makes the resulting estimate unbiased.<br />
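<br />
The three hashed feature maps above can be checked with a short Python sketch. The dictionary lookups below stand in for real hash functions, and the names <code>hashed_features</code>, <code>colliding</code> and <code>xi</code> are hypothetical helpers introduced only for this illustration; a real implementation would hash the word itself instead of storing a table.<br />
<br />
```python
def hashed_features(tokens, h, size, sign=None):
    """Accumulate each token's (optionally signed) unit contribution at index h(token)."""
    phi = [0] * size
    for tok in tokens:
        phi[h(tok)] += sign(tok) if sign else 1
    return phi


# Toy "hash function" from the text: a fixed word-to-index table (an assumption).
index = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 6}
tokens = ["I", "love", "cats", "but", "Mary", "hate", "cats"]

# 1. No collision: every word has its own slot.
no_collision = hashed_features(tokens, index.__getitem__, 7)

# 2. Collision: "Mary" is now hashed to the same slot as "I".
colliding = dict(index, Mary=0)
with_collision = hashed_features(tokens, colliding.__getitem__, 7)


# 3. Signed hashing: a second function xi chooses the sign of each contribution.
def xi(tok):
    return -1 if tok == "Mary" else 1


signed = hashed_features(tokens, colliding.__getitem__, 7, sign=xi)
```
<br />
With these inputs, <code>no_collision</code>, <code>with_collision</code> and <code>signed</code> correspond to the three tables in this section, in that order.<br />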
<br />
=== TF-IDF ===<br />
For plain N-grams, raw counts are used as features. Another way to represent the features is TFIDF, short for '''term frequency–inverse document frequency''', which represents the importance of a word to a document.<br />
<br />
'''Term Frequency (TF)''' measures the number of times a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency, so that a word is not deemed important if it is common across documents in general, for example "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log(\frac{N}{| \{d\in D:t\in d \} |}) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the number of documents that contain the word <math> t </math>.<br />
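<br />
As a sanity check of the two definitions above, the following Python sketch computes TFIDF weights with raw counts for tf and <math>\log(N / \mathrm{df})</math> for idf. The function name <code>tfidf</code>, the lower-casing, the whitespace tokenization and the three toy documents are assumptions made for this illustration only.<br />
<br />
```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, with tf = raw count and idf = log(N / df)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term appears at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


docs = ["the cat sat", "the dog sat down", "the bird flew"]
weights = tfidf(docs)
```
<br />
In this toy corpus "the" occurs in every document, so its idf is <math>\log(3/3) = 0</math> and its weight vanishes, while a word that occurs in only one document, such as "cat", receives the largest idf, <math>\log 3</math>.<br />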
<br />
== Experiment ==<br />
In the experiments, fastText was compared with various other text classifiers on two classification problems. <br />
<br />
The first problem is sentiment analysis, where fastText is compared to existing text classifiers. The second is tag prediction, which evaluates how well fastText scales to a larger output space.<br />
<br />
The Vowpal Wabbit library, written in C++, could also be used to implement the model. However, the authors report that their tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
*Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. To test it, the evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images with captions, titles and tags. Here the spotlight was on using the fastText text classifier to predict the tags associated with each image without using the image itself, relying instead on the information associated with the image, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur fewer than 100 times and to split the data into a train, validation and test set. The train set consists of approximately 90% of the dataset, the validation set of approximately 1%, and the test set of approximately 0.5%. Table 4 shows 5 example items from the validation set with their Inputs (title and caption), Prediction (the tag class they are classified to) and Tags (real image tags), highlighting when the predicted class is in fact one of the tags. <br />
<br />
*PLACEHOLDER FOR TABLE 4<br />
<br />
Removing infrequently occurring tags and words eliminates some noise and helps the model learn better. A [https://github.com/facebookresearch/fastText script] has been released that explains how the data is split. After cleaning, the vocabulary size is 297,141 and there are 312,116 unique tags. Performance on the test set is reported as precision at 1 (prec@1), that is, the fraction of test items whose single highest-scoring predicted tag is one of their true tags (for the general definition of precision, see [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia]).<br />
<br />
==== Baselines====<br />
A frequency-based approach which simply predicts the most frequent tag is used as a baseline.<br />
<br />
The model that fastText is compared to in tag prediction is [http://www.aclweb.org/anthology/D14-1194 Tagspace]. It is similar to fastText but is based on [http://www.thespermwhale.com/jaseweston/papers/wsabie-ijcai.pdf WSABIE], which predicts tags based on both images and their text annotations (how is it similar? is this NN or SVM or what?). The Tagspace model is a convolutional neural network that predicts hashtags in the semantic context of social network posts. Because it achieves comparable performance while being much faster, the linear version of this model is used for the comparison. (Linear version of Tagspace needs to be explained).<br />
<br />
==== Results ====<br />
<br />
* PLACEHOLDER Insert Table 5<br />
<br />
The above table presents a comparison of our fastText model to other baselines. (explain what prec@1 means)<br />
<br />
The frequency baseline has the lowest accuracy. fastText is run for 5 epochs and compared to the Tagspace results. fastText with bigrams is also evaluated, and the models are tested with 50 and 200 hidden units. With 50 hidden units, fastText (without bigrams) and Tagspace perform similarly, while fastText with bigrams performs better at 50 hidden units than Tagspace does at 200. At 200 hidden units, fastText performs better still, and adding bigrams improves accuracy further. (Talk a little more about the accuracy the units and what it implies at the beginning of accuracy section).<br />
<br />
<br />
<br />
Finally, both the training and test times of fastText are significantly lower than those of Tagspace, which must compute the scores for every class. fastText, on the other hand, has fast inference, which gives it an advantage when the number of classes is large. In particular, test times for fastText are significantly shorter than those of Tagspace.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, and this representation is then fed to a linear classifier. The authors test their model, fastText, against other models using an evaluation protocol similar to that of Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction dataset. fastText achieved comparable accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35135Bag of Tricks for Efficient Text Classification2018-03-22T04:22:00Z<p>Cs3yang: /* Baseline */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their use on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large datasets while maintaining good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to the two tasks: tag predictions and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
* PLACEHOLDER: we should look at when this <br />
<br />
=== Natural-Language Processing ===<br />
<br />
* Briefly describe the difference between NLP and text-mining. Maybe comment later about whether fastText accomplishes NLP.<br />
<br />
<br />
=== Char-CNN ===<br />
<br />
<br />
=== Char-CRNN ===<br />
<br />
A character-level convolutional recurrent neural network (CRNN) is a convolutional neural network (CNN) with a recurrent layer on top. A recurrent layer can efficiently capture long-term dependencies, so a model with only a very small number of convolutional layers is needed.<br />
<br />
The recurrent layer consists of a recursive function <math>f</math> which takes as input one input vector and the previous hidden state, and returns the new hidden state <math> h_t = f(x_t, h_{t-1})</math>, where <math> x_t \in \mathbb{R}^d</math> is one time step from the input sequence <math> (x_1, x_2, \ldots, x_T)</math>. <br />
<br />
[[File:xiao and cho 2016 pic.png]]<br />
Graphical illustration of (a) a CNN and (b) proposed CRNN for character-level document classification <br />
<br />
Source: [Xiao and Cho2016] Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv: 1602.00367.<br />
<br />
=== VDCNN ===<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly in a large output space where some classes have very few examples. The authors of this paper attempt to improve on the performance of linear classifiers with two key features: a rank constraint and a fast loss approximation. Before getting into that, we must better understand how linear classifiers are trained.<br />
<br />
Consider each text and each label as a vector in space. The model learns the coordinates of the text vector so that it lies close to the vector of its associated label. A score is computed from the text vector and each label vector, and the softmax function normalizes the score for the correct label against the scores of the same text with every other possible label. The result is the probability that the text has its associated label. Stochastic gradient descent is then used to keep updating the coordinates until the probability of the correct label for every text is maximized. This is clearly computationally expensive, as the score for every possible label in the training set must be computed for each text. <br />
<br />
With K labels in the training set, the softmax function gives the probability that a text is associated with label j as: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label j.<br />
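<br />
The softmax formula can be sketched in plain Python as follows. The function name <code>softmax_probs</code>, the two-dimensional text vector and the three label vectors are made-up values for illustration; subtracting the largest score before exponentiating is a standard numerical-stability step that does not change the result and is not something stated in the paper.<br />
<br />
```python
import math


def softmax_probs(x, label_vectors):
    """Return P(y = j | x) for every label j, given one vector w_j per label."""
    # Score of each label: the dot product x^T w_j.
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in label_vectors]
    # Shift by the maximum score so that exp() cannot overflow.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


x = [1.0, 2.0]                               # hypothetical text vector
W = [[0.5, 0.1], [0.0, 0.0], [-0.3, 0.4]]    # hypothetical label vectors, K = 3
probs = softmax_probs(x, W)
predicted = probs.index(max(probs))          # index of the most probable label
```
<br />
Every call computes one dot product per label, which is where the cost proportional to the number of labels comes from.<br />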
<br />
Now let’s look at the improved method of linear classifiers with a rank constraint and fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the ngram features of the input are first looked up to find word representations, then averaged into hidden text representation. It is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log likelihood over the classes. The classifier trains on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
<br />
[[File:formula explained.png]]<br />
<br />
Two changes applied in this model architecture are the hierarchical softmax function, which improves performance with a large number of classes, and the hashing trick, which keeps the mapping of n-grams fast and memory-efficient. These two techniques are explained more thoroughly in the upcoming sections.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over the predefined classes. It calculates the probability that <math>y_n</math> belongs to each possible class and outputs a probability estimate for each class. The denominator of the softmax function normalizes the scores, so that the probabilities assigned to the labels of a single <math>y_n</math> sum to one. The label with the highest probability can then be chosen as the label for <math>y_n</math>. <br />
<br />
However, the softmax function has a computational complexity of <math> O(Kd) </math>, where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network, because each evaluation requires normalizing over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. The difference in computational efficiency can be seen in the following set-up for hierarchical softmax. <br />
<br />
Suppose we have a binary tree based on Huffman coding for the softmax function, where each inner node has at most two children and the classes are the leaves. Huffman coding places the classes with the lowest frequencies deeper in the tree and the highest-frequency classes near the root, which minimizes the path length for the more frequent classes. With such a tree, the computational cost is reduced to <math> O(d \log_2(K)) </math>. <br />
<br />
At each inner node, a probability is calculated for each branch, that is, for travelling left or right from that node. This is done by applying the sigmoid function to the product of the output vector <math>v_{n_i}</math> of the inner node <math> n_i </math> and the output of the hidden layer of the model, <math>h</math>. The idea of this method is to represent the output classes as the leaves of this tree; a random walk from the root then assigns a probability to each class based on the path taken. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> is the leaf node, at depth <math> l+1 </math>, on which the class is located and <math> n_1, n_2, \ldots, n_l </math> are the nodes on the path from the root to that leaf </i><br />
<br />
In the figure below, we can see an example of a binary tree for the hierarchical softmax model. An example path from the root node <math>n_1</math> to label 2 is highlighted in blue. Each step along the path has an associated probability, and the total probability of label 2 is the product of these, in line with the class probability calculation above.<br />
<br />
[[File:figure2.jpg|Binary Tree Example for the Hierarchical Softmax Model]]<br />
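<br />
The path-product calculation can be sketched in Python on a small made-up tree. The tree below (a root <code>n1</code> with two inner children <code>n2</code> and <code>n3</code> and four labels as leaves), the node vectors and the hidden vector <code>h</code> are hypothetical values chosen for illustration; they are not the tree from the figure, and a real implementation would build the tree with Huffman coding.<br />
<br />
```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical output vectors of the inner nodes.
node_vectors = {"n1": [0.2, -0.1], "n2": [0.4, 0.3], "n3": [-0.5, 0.1]}

# Path of each label from the root: (inner node, whether the path turns left there).
paths = {
    "label1": [("n1", True), ("n2", True)],
    "label2": [("n1", True), ("n2", False)],
    "label3": [("n1", False), ("n3", True)],
    "label4": [("n1", False), ("n3", False)],
}


def label_probability(label, h):
    """Multiply the branch probabilities along the path from the root to the label."""
    p = 1.0
    for node, go_left in paths[label]:
        p_left = sigmoid(dot(node_vectors[node], h))
        p *= p_left if go_left else 1.0 - p_left
    return p


h = [1.0, 2.0]  # hypothetical hidden-layer output
probs = {label: label_probability(label, h) for label in paths}
total = sum(probs.values())
```
<br />
Because the two branch probabilities at every inner node add up to one, the leaf probabilities form a valid distribution (<code>total</code> equals 1 up to rounding), and scoring a single label touches only the inner nodes on its path rather than all labels.<br />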
<br />
<br />
Note that during training, the vector representations of the inner nodes, <math>v_{n_i}</math>, are updated by introducing an error function, <math>Err</math>, and differentiating it with the chain rule to obtain the following:<br />
<div style="text-align: center;"> <math> \frac{\partial Err}{\partial v_{n_i}^{'}}=\frac{\partial Err}{\partial v_{n_i}^{'}h} \cdot \frac{\partial v_{n_i}^{'}h }{\partial v_{n_i}^{'}} </math> <br></div><br />
<br />
We can use this to update the vector values with the following, where <math>\eta</math> is the learning rate:<br />
<br />
<div style="text-align: center;"> <math> v_{n_i}'(new) = v_{n_i}'(old) - \eta \frac{\partial Err}{\partial v_{n_i}'} </math><br></div><br />
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is a method for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized such that its elements add up to one (that is, each count is replaced by the word's relative frequency). If these frequencies exceed a certain level, they will activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information, because it considers single words and is invariant to word order. We will demonstrate this shortly. Bag of words will also have a high error rate if the training set does not cover the vocabulary of the testing set. <br />
<br />
=== N-Gram ===<br />
An '''N-gram''' model is another way of simplifying the representation of text, by storing each word together with its neighbouring words (or characters, since N-grams can also be character-based). Each word in the document is read one at a time, just as in bag of words, but a certain range of its neighbours is scanned as well. This range is the N of the N-gram. The unigram (N = 1) is equivalent to bag of words, and any N greater than 1 retains more information than bag of words. <br />
<br />
The picture above gives an example of a unigram (1-gram), which is the simplest possible version of this model. A unigram does not consider previous words and just chooses words based on how common they are in general. It also shows a bigram (2-gram), where the previous word is considered. A trigram would consider the previous two words, and so on up to N-grams, which consider the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math> . By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) =P( \omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math> . For example, take the sentence "How long can this go on?" Under a bigram model, it can be written as follows:<br />
<br />
P(“How long can this go on?”) = P(How) P(long | How) P(can | long) P(this | can) P(go | this) P(on | go) P(? | on)<br />
<br />
Returning to the chain rule, the bigram model approximates each factor by conditioning only on the immediately preceding word, so the sentence probability becomes a product of conditional probabilities: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case, where each word is conditioned on the N-1 preceding words: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
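<br />
To make the bigram product concrete, the sketch below estimates each conditional probability from counts in a tiny made-up corpus and multiplies them along a sentence. The count-based (maximum-likelihood) estimate, the start-of-sentence token <code>START</code>, the three-line corpus and the function name <code>bigram_prob</code> are all assumptions for this illustration; the text above does not specify how the conditional probabilities are obtained.<br />
<br />
```python
from collections import Counter

# Tiny hypothetical corpus, already lower-cased and stripped of punctuation.
corpus = [
    "how long can this go on",
    "how long is this",
    "can this go on",
]

START = "<s>"          # assumed start-of-sentence marker, playing the role of word 0
context_counts = Counter()
bigram_counts = Counter()
for line in corpus:
    tokens = [START] + line.split()
    context_counts.update(tokens[:-1])              # how often each word is a context
    bigram_counts.update(zip(tokens, tokens[1:]))   # how often each word pair occurs


def bigram_prob(sentence):
    """Product over k of count(w_{k-1}, w_k) / count(w_{k-1})."""
    tokens = [START] + sentence.split()
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0          # unseen context: no estimate available
        p *= bigram_counts[(prev, word)] / context_counts[prev]
    return p


p_seen = bigram_prob("how long can this go on")
p_unseen = bigram_prob("this is how")
```
<br />
A sentence containing a word pair that never occurs in the corpus, such as <code>p_unseen</code> here, gets probability zero under this unsmoothed estimate, which is one practical face of the sparsity problem discussed next.<br />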
<br />
<br />
The '''weakness''' of N-grams is that local context often provides no useful predictive clues. For example, suppose you want the model to learn subject-verb agreement from the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
Relating the verb to its subject here requires an 11-gram, which is infeasible on ordinary machines. This points to a further problem: as N increases, the predictive power of the model increases, but the number of parameters grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The following example compares the unigram and bigram representations of three short sentences:<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B have the same vector. The unigram representation is exactly a bag of words, so it has the aforementioned problem: '''word order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Now A and B have different vectors, because a bigram takes one neighbouring word of local context into account. However, A and C still share an element, namely "I love". Increasing N further makes such sentences easier to tell apart, at the cost of a higher-dimensional feature space and a longer run time.<br />
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification: it maps words to indices of a fixed-size array or matrix by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, which reduces the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps keys of arbitrary size to values in a fixed range. For example, consider a hash function <math> h </math> that maps each word to its index in the dictionary below.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(``\text{cats"}) = 3 </math>. Consider the sentence <math> `` \text{ I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have the list of all words in the sentence, <math> x = [``I", ``love", ``cats", ``but", ``Mary", ``hate", ``cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
Applying the hash function to each word of this sentence gives the list of indices [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map is <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. For example, suppose that in the example above both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which adds a second hash function <math> \xi </math> that determines the sign (+1 or -1) of each feature's contribution. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
If <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, and <math> \xi = 1 </math> for every other word, then the signed hash map becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, the contributions of colliding features "cancel out" in expectation, which makes the resulting estimate unbiased.<br />
<br />
=== TF-IDF ===<br />
For plain N-grams, raw counts are used as features. Another way to represent the features is TFIDF, short for '''term frequency–inverse document frequency''', which represents the importance of a word to a document.<br />
<br />
'''Term Frequency (TF)''' measures the number of times a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency, so that a word is not deemed important if it is common across documents in general, for example "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log(\frac{N}{| \{d\in D:t\in d \} |}) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the number of documents that contain the word <math> t </math>.<br />
<br />
== Experiment ==<br />
In the experiments, fastText was compared with various other text classifiers on two classification problems. <br />
<br />
The first problem is sentiment analysis, where fastText is compared to existing text classifiers. The second is tag prediction, which evaluates how well fastText scales to a larger output space.<br />
<br />
The Vowpal Wabbit library, written in C++, could also be used to implement the model. However, the authors report that their tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
*Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. To test it, the evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images with captions, titles and tags. Here the spotlight was on using the fastText text classifier to predict the tags associated with each image without using the image itself, relying instead on the information associated with the image, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur fewer than 100 times and to split the data into a train, validation and test set. The train set consists of approximately 90% of the dataset, the validation set of approximately 1%, and the test set of approximately 0.5%. Table 4 shows 5 example items from the validation set with their Inputs (title and caption), Prediction (the tag class they are classified to) and Tags (real image tags), highlighting when the predicted class is in fact one of the tags. <br />
<br />
*PLACEHOLDER FOR TABLE 4<br />
<br />
Removing infrequently occurring tags and words eliminates some noise and helps the model learn better. A [https://github.com/facebookresearch/fastText script] has been released that explains how the data is split. After cleaning, the vocabulary size is 297,141 and there are 312,116 unique tags. Performance on the test set is reported as precision at 1 (prec@1), that is, the fraction of test items whose single highest-scoring predicted tag is one of their true tags (for the general definition of precision, see [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia]).<br />
<br />
==== Baselines====<br />
A frequency-based approach which simply predicts the most frequent tag is used as a baseline.<br />
<br />
The model we compare fastText to in tag prediction is [http://www.aclweb.org/anthology/D14-1194 Tagspace]. It is similar to our model but is based on [http://www.thespermwhale.com/jaseweston/papers/wsabie-ijcai.pdf WSABIE], which predicts tags based on both images and their text annotations (how is it similar? is this NN or SVM or what?). The Tagspace model is a convolutional neural network that predicts hashtags from the semantic context of social network posts. For a faster and more comparable baseline, we consider the linear version of this model. (Linear version of Tagspace needs to be explained).<br />
<br />
====Results and training time. ====<br />
<br />
* PLACEHOLDER Insert Table 5<br />
<br />
The above table presents a comparison of our fastText model to other baselines. (explain what prec@1 means)<br />
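As a rough illustration of the prec@1 metric, the sketch below counts how often the single top-ranked predicted tag is among an item's true tags. The function name and the toy data are hypothetical and are not taken from the paper or from YFCC100M.<br />

```python
def precision_at_1(top_predictions, true_tag_sets):
    """Fraction of examples whose top-ranked tag is one of the true tags."""
    if len(top_predictions) != len(true_tag_sets):
        raise ValueError("inputs must have the same length")
    hits = sum(1 for pred, tags in zip(top_predictions, true_tag_sets) if pred in tags)
    return hits / len(top_predictions)


# Hypothetical toy data: one top prediction and one set of true tags per item.
predictions = ["beach", "cat", "city"]
true_tags = [{"beach", "sea"}, {"dog"}, {"city", "night"}]

score = precision_at_1(predictions, true_tags)
print(score)  # two of the three top predictions are true tags
```

A value of 1 would mean that every top prediction was a true tag; the frequency baseline scores low because it predicts the same tag for every item.<br />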
<br />
We can see that the frequency baseline has the lowest accuracy. fastText is run for 5 epochs and compared to Tagspace. fastText with bigrams is also evaluated, and the models are tested with hidden layers of 50 and 200 units. With 50 hidden units, fastText (without bigrams) and Tagspace perform similarly, while fastText with bigrams at 50 hidden units performs better than Tagspace at 200. At 200 hidden units, fastText performs better still, and adding bigrams improves accuracy further. (Talk a little more about the accuracy the units and what it implies at the beginning of accuracy section).<br />
<br />
Finally, both the training and test times of our model are significantly lower than those of Tagspace, which must compute the scores for every class at test time. fastText, on the other hand, has fast inference, which gives it an advantage when the number of classes is large. The speed-up is most pronounced at test time.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, and this representation is then fed to a linear classifier. The authors test their model, fastText, against other models using an evaluation protocol similar to that of Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction data set. fastText achieved comparable accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35133Bag of Tricks for Efficient Text Classification2018-03-22T04:16:54Z<p>Cs3yang: /* Results and training time. */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have more recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their use on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data sets while maintaining good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to two tasks, tag prediction and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on a billion words within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
* PLACEHOLDER: we should look at when this <br />
<br />
=== Natural-Language Processing ===<br />
<br />
* Briefly describe the difference between NLP and text-mining. Maybe comment later about whether fastText accomplishes NLP.<br />
<br />
<br />
=== Char-CNN ===<br />
<br />
<br />
=== Char-CRNN ===<br />
<br />
A character-level convolutional recurrent neural network (CRNN) is a convolutional neural network (CNN) with a recurrent layer on top.<br />
<br />
The recurrent layer consists of a recursive function <math>f</math> which takes as input one input vector and the previous hidden state, and returns the new hidden state <math> h_t = f(x_t, h_{t-1})</math>, where <math> x_t \in \mathbb{R}^d</math> is one time step from the input sequence <math> (x_1, x_2, \ldots, x_T)</math>. <br />
<br />
[[File:xiao and cho 2016 pic.png]]<br />
Graphical illustration of (a) a CNN and (b) proposed CRNN for character-level document classification <br />
<br />
Source: [Xiao and Cho2016] Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv: 1602.00367.<br />
<br />
=== VDCNN ===<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. This can hurt generalization in a large output space, where some classes have very few examples. The authors of this paper attempt to improve on the performance of linear classifiers with two key features: a rank constraint and a fast loss approximation. Before we get into that, we must better understand how linear classifiers are trained.<br />
<br />
Consider each text and each label as a vector in space. The model trains the coordinates of the text vector so that it is close to the vector of its associated label. A score is computed from the text vector and a label vector, and the softmax function normalizes this score across the scores of the same text with every other possible label. The result is the probability that the text has its associated label. The stochastic gradient descent algorithm then keeps updating the coordinates until the probability of the correct label is maximized for every text. This is clearly computationally expensive, as the score for every possible label in the training set must be computed for each text. <br />
<br />
The softmax function gives the probability that a text is associated with label j, given K labels in the training set: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label j.<br />
<br />
Now let’s look at the improved method of linear classifiers with a rank constraint and fast loss approximation. Refer to the image below.<br />
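To make the softmax formula concrete, here is a minimal sketch in plain Python. The text vector and the three label vectors are made-up numbers chosen only for illustration, and the function name is hypothetical.<br />

```python
import math


def softmax_probabilities(x, label_vectors):
    """Return P(y = j | x) for every label j, using dot-product scores."""
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in label_vectors]
    # Subtracting the maximum score avoids overflow and leaves the result unchanged.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2-dimensional text vector and K = 3 label vectors.
text_vector = [1.0, 2.0]
label_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

probs = softmax_probabilities(text_vector, label_vectors)
print(probs)
```

Every label's score appears in the denominator, which is why the cost of one evaluation grows linearly with the number of labels K.<br />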
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the n-gram features of the input are first looked up to find word representations, which are then averaged into a hidden text representation. This representation is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log likelihood over the classes. The classifier trains on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
<br />
[[File:formula explained.png]]<br />
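The pipeline described above (look up feature vectors, average them, apply a linear classifier, normalize with softmax, and score with the negative log likelihood) can be sketched as follows. This is an illustrative toy under stated assumptions, not the fastText implementation: the two matrices are tiny hand-picked examples, all names are hypothetical, and no training step is shown.<br />

```python
import math


def average_vectors(vectors):
    """Average a list of equal-length vectors component by component."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def forward(feature_ids, embeddings, output_weights):
    """Average the feature embeddings, then apply a linear layer and softmax."""
    hidden = average_vectors([embeddings[i] for i in feature_ids])
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in output_weights]
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(probs, label):
    """Loss for one document whose true class index is `label`."""
    return -math.log(probs[label])


# Hypothetical look-up table: 3 n-gram features, hidden dimension 2.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical classifier weights: 2 classes, hidden dimension 2.
output_weights = [[2.0, 0.0], [0.0, 2.0]]

doc_probs = forward([0, 2], embeddings, output_weights)
doc_loss = negative_log_likelihood(doc_probs, 0)
print(doc_probs, doc_loss)
```

Summing this loss over all N documents and dividing by N gives the objective that stochastic gradient descent minimizes.<br />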
<br />
Two changes that were applied in this model architecture are the hierarchical softmax function, which improves performance with a large number of classes, and the hashing trick, which keeps the mapping of the n-gram features (used to capture local word order) fast and memory efficient. These two additions are explained more thoroughly in the upcoming sections.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over predefined classes. It calculates the probability of <math>y_n</math> being each possible class and outputs a probability estimate for each class. The denominator of the softmax function serves to normalize the probabilities, so that a single <math>y_n</math> receives a probability for each label and these probabilities sum to one. This provides a means to choose the label with the highest probability as the label for <math>y_n</math>. <br />
<br />
However, the softmax function does have a computational complexity of <math> O(Kd) </math> where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is due to the nature of the softmax function since each function calculation requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the differences in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose we have a binary tree structure based on Huffman coding for the softmax function, where each node has at most two children or leaves. Huffman coding trees provide a means to optimize binary trees where the classes with lowest frequencies are placed deeper into the tree and the highest frequency classes are placed near the root of the tree, which minimizes the path for more frequently labelled classes. Huffman coding trees are efficient since computational runtime is reduced to <math> O(d \log_2(K)) </math>. <br />
<br />
A probability for each path, whether we are travelling right or left from a node, is calculated. This is done by applying the sigmoid function to the product of the output vector <math>v_{n_i}</math> of each inner node <math> n </math> and the output value of the hidden layer of the model, <math>h</math>. The idea of this method is to represent the output classes as the leaves on this tree and a random walk then assigns probabilities for these classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> represents the leaf node that a class is located on, at depth <math> l+1 </math>, and <math> n_1, n_2, \ldots, n_l </math> represent the parent nodes of that leaf </i><br />
<br />
In the figure below, we can see an example of a binary tree for the hierarchical softmax model. An example path from the root node <math>n_1</math> to label 2 is highlighted in blue. Each branch on the path has an associated probability, and the total probability of label 2 is the product of these, in line with the class probability calculation above.<br />
<br />
[[File:figure2.jpg|Binary Tree Example for the Hierarchical Softmax Model]]<br />
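A small sketch of the path-product calculation may help. The tree below (three inner nodes, four labels), the node vectors, and the convention that a left turn has probability sigmoid(v·h) and a right turn sigmoid(-v·h) are all assumptions made for illustration; they are not read off the figure.<br />

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def leaf_probability(path, node_vectors, hidden):
    """Multiply the branch probabilities along the path from the root to a leaf."""
    prob = 1.0
    for node, direction in path:
        dot = sum(v * h for v, h in zip(node_vectors[node], hidden))
        # direction is +1 for a left turn and -1 for a right turn (assumed convention).
        prob *= sigmoid(direction * dot)
    return prob


# Hypothetical output vectors for the inner nodes of a balanced tree.
node_vectors = {"n1": [0.5, -0.2], "n2": [0.1, 0.3], "n3": [-0.4, 0.2]}
# Each label is a leaf, described by the inner nodes and turns on its path.
paths = {
    "label1": [("n1", +1), ("n2", +1)],
    "label2": [("n1", +1), ("n2", -1)],
    "label3": [("n1", -1), ("n3", +1)],
    "label4": [("n1", -1), ("n3", -1)],
}
hidden = [1.0, 2.0]  # hypothetical hidden representation of a text

leaf_probs = {label: leaf_probability(p, node_vectors, hidden) for label, p in paths.items()}
print(leaf_probs)
```

Because sigmoid(z) + sigmoid(-z) = 1 at every inner node, the leaf probabilities sum to one without normalizing over all classes, and each label needs only as many dot products as its depth in the tree.<br />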
<br />
<br />
Note that during the training process, the vector representations of the inner nodes, <math>v_{n_i}</math>, are updated by introducing an error function, <math>Err</math>, and differentiating it to obtain the following:<br />
<div style="text-align: center;"> <math> \frac{\partial Err}{\partial v_{n_i}^{'}}=\frac{\partial Err}{\partial v_{n_i}^{'}h} \cdot \frac{\partial v_{n_i}^{'}h }{\partial v_{n_i}^{'}} </math> <br></div><br />
<br />
We can use this to update the vector values with the following:<br />
<br />
<div style="text-align: center;"> <math> v_{n_i}'(new) = v_{n_i}'(old) - \eta \frac{\partial Err}{\partial v_{n_i}'} </math><br></div><br />
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is a method for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized so that its elements add up to one (that is, each count is converted to a relative frequency). If these frequencies exceed a certain level, they activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information, because it considers single words only and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the training set does not include the entire dictionary of the testing set. <br />
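As a quick sketch of the normalized word-count vector described above, the following plain Python counts the words of a text against a fixed dictionary and divides by the total. The function name, the whitespace tokenization and the three-word dictionary are simplifying assumptions.<br />

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Relative frequency of each dictionary word in the text (order is ignored)."""
    counts = Counter(text.lower().split())
    total = sum(counts[word] for word in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[word] / total for word in vocabulary]


# Hypothetical dictionary built from a training set.
vocabulary = ["i", "love", "apple"]

vec_a = bag_of_words("I love apple", vocabulary)
vec_b = bag_of_words("apple love I", vocabulary)
print(vec_a, vec_b)
```

The two sentences produce the same vector, which is the loss of word order mentioned above.<br />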
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation by storing the local words adjacent to each word (or characters, since N-grams can also be character based). Each word in the document is read one at a time, just like bag of words, but a certain range of its neighbors is scanned as well. This range is known as the n-gram. Compared to bag of words, any N greater than 1 (N = 1 is the unigram case) retains more information. <br />
<br />
The picture above gives an example of a Unigram (1-gram), which is the simplest possible version of this model. A unigram model does not consider previous words and just chooses words based on how common they are in general. It also shows a Bigram (2-gram), where the previous word is considered. A trigram considers the previous two words, and so on up to N-grams, which consider the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. A bigram model approximates each factor by conditioning only on the previous word. For example, the sentence "How long can this go on?" is modelled as follows:<br />
<br />
P("How long can this go on?") = P(How) P(long | How) P(can | long) P(this | can) P(go | this) P(on | go) P(? | on)<br />
<br />
In general, the bigram case reduces the chain rule to a product of conditional probabilities, each conditioned on the previous word only: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case, which conditions on the previous N-1 words: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
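The bigram product can be estimated from counts. The sketch below uses maximum-likelihood estimates, count(previous, word) / count(previous), on a tiny made-up corpus; the start marker `<s>`, the corpus and the function names are assumptions for illustration, and no smoothing is applied.<br />

```python
from collections import Counter


def train_bigram(sentences):
    """Count how often each word is used as a context and each bigram occurs."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    return context_counts, bigram_counts


def sentence_probability(sentence, context_counts, bigram_counts):
    """Product over k of P(w_k | w_{k-1}), with unsmoothed count ratios."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0
        prob *= bigram_counts[(prev, word)] / context_counts[prev]
    return prob


# Hypothetical three-sentence training corpus.
corpus = ["how long can this go on", "how long is this", "how can this go on"]
contexts, bigrams = train_bigram(corpus)

p = sentence_probability("how long can this go on", contexts, bigrams)
print(p)
```

Any bigram that never occurred in the corpus drives the whole product to zero, which is one reason practical N-gram models add smoothing.<br />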
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide useful predictive clues. For example, suppose you want the model to learn subject-verb agreement in the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.
<br />
You would need an 11-gram to link the subject to the verb, which is infeasible on ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
An example is shown below, using the following three sentences:<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice how A and B are the same vector. This is exactly bag of words, with the aforementioned problem that '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because a bigram takes one neighboring word into account. However, A and C still share an element, namely "I love". If we were to further increase N, we would have an easier time distinguishing the two. However, the consequence of operating with a higher N is that the run time increases.<br />
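The unigram and bigram counts for the three sentences can be reproduced with a few lines of Python. The helper name and the whitespace tokenization are assumptions made for this sketch.<br />

```python
from collections import Counter


def ngram_counts(sentence, n):
    """Count the word n-grams of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


A = "I love apple"
B = "apple love I"
C = "I love sentence"

for name, sentence in (("A", A), ("B", B), ("C", C)):
    print(name, dict(ngram_counts(sentence, 1)), dict(ngram_counts(sentence, 2)))
```

With n = 1 the counts for A and B are identical, while with n = 2 they differ; A and C still share the bigram "I love".<br />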
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification. It maps words to indices of an array or matrix of fixed size by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, reducing the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps keys of arbitrary size to values in a fixed range. For example, consider a hash function <math> h </math> that maps each feature to the index given in the following dictionary.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(``\text{cats"}) = 3 </math>. Consider the sentence <math> `` \text{ I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have a list of all the words in that sentence, <math> x = [``I", ``love", ``cats", ``but", ``Mary", ``hate", ``cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
By applying the hash function to each word of this sentence, we get the list of returned indices [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map is <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. In the example above, suppose both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al. 2009], which introduces a second hash function <math> \xi </math> that determines the sign of each feature's contribution. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
If <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, then our signed hash map becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, collisions will "cancel out", and therefore, achieve an unbiased estimate.<br />
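The three feature maps in this section can be reproduced with the sketch below. For readability it uses the look-up table from the example in place of a real hash function, and the sign table stands in for <math> \xi </math>; both dictionaries and the function name are illustrative assumptions.<br />

```python
def hashed_features(tokens, index_of, sign_of=None, size=7):
    """Add +1 (or the token's sign) at the bucket that each token hashes to."""
    features = [0] * size
    for token in tokens:
        sign = 1 if sign_of is None else sign_of.get(token, 1)
        features[index_of[token]] += sign
    return features


tokens = ["I", "love", "cats", "but", "Mary", "hate", "cats"]

# Stand-in for the hash function h: the dictionary from the example above.
h = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 6}
plain = hashed_features(tokens, h)

# Collision: "Mary" now lands in the same bucket as "I".
h_collide = dict(h, Mary=0)
collided = hashed_features(tokens, h_collide)

# Stand-in for the sign hash xi; tokens not listed default to +1.
xi = {"I": 1, "Mary": -1}
signed = hashed_features(tokens, h_collide, xi)

print(plain)     # [1, 1, 1, 2, 0, 1, 1]
print(collided)  # [2, 1, 1, 2, 0, 1, 0]
print(signed)    # [0, 1, 1, 2, 0, 1, 0]
```

In practice the table would be replaced by a hash of the token taken modulo the table size, so that no dictionary of features has to be stored.<br />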
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. However, the features can also be represented with TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to a document.<br />
<br />
'''Term Frequency (TF)''' generally measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency, so that a word is not deemed important if it is common across documents in general, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the total number of documents that contain word <math> t </math>.<br />
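The two definitions above can be combined in a short sketch: the raw count for tf and the natural logarithm of N divided by the document frequency for idf. The corpus, the function name and the whitespace tokenization are assumptions made for illustration.<br />

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: tf * idf} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights


# Hypothetical three-document corpus.
docs = ["the cat sat", "the dog sat", "the cat ran"]
weights = tfidf(docs)
print(weights[0])
```

A word such as "the" that appears in every document gets idf = log(1) = 0, so it contributes nothing, whereas rarer words receive larger weights.<br />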
<br />
== Experiment ==<br />
In this experiment, fastText was compared with various other text classifiers on two classification problems. <br />
<br />
The first classification problem is sentiment analysis, where fastText is compared to existing text classifiers. The second evaluates how fastText scales to a larger output space, using a tag prediction dataset.<br />
<br />
The Vowpal Wabbit library, written in C++, can also be used to implement our model. However, compared to this library, our tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
*Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. To test it, the evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images, each with a caption, title and tags. Here, the spotlight was on using the fastText text classifier to predict the tags associated with each image without using the image itself, but rather the text associated with the image, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur less than 100 times and split the data into a train, validation and test set. The train set consists of approximately 90% of the dataset, the validation set consists of approximately 1%, and the test set of 0.5%. Table 4 shows an example of 5 items in the validation set with their Inputs (title and caption), Prediction (the tag class they are classified to) and Tags (real image tags), highlighting when the predicted class is in fact one of the tags. <br />
<br />
*PLACEHOLDER FOR TABLE 4<br />
<br />
Removing infrequently occurring tags and words eliminates some noise and helps the model learn better. A [https://github.com/facebookresearch/fastText script] has been released that recreates this split of the data. After cleaning, the vocabulary size is 297,141 and there are 312,116 unique tags. Performance on the test set is reported as precision at 1 (prec@1): the fraction of test items whose single top-ranked predicted tag is in fact one of the item's true tags (for the general definition of precision, see [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia]).<br />
<br />
==== Baseline ====<br />
A frequency-based approach which simply predicts the most frequent tag is used as a baseline.<br />
<br />
The model we compare fastText to in tag prediction is [http://www.aclweb.org/anthology/D14-1194 Tagspace]. It is similar to our model but is based on [http://www.thespermwhale.com/jaseweston/papers/wsabie-ijcai.pdf WSABIE], which predicts tags based on both images and their text annotations (how is it similar? is this NN or SVM or what?). The Tagspace model is a convolutional neural network that predicts hashtags from the semantic context of social network posts. For a faster and more comparable baseline, we consider the linear version of this model. (Linear version of Tagspace needs to be explained).<br />
<br />
====Results and training time. ====<br />
<br />
* PLACEHOLDER Insert Table 5<br />
<br />
The above table presents a comparison of our fastText model to other baselines. (explain what prec@1 means)<br />
<br />
We can see that the frequency baseline has the lowest accuracy. fastText is run for 5 epochs and compared to Tagspace. fastText with bigrams is also evaluated, and the models are tested with hidden layers of 50 and 200 units. With 50 hidden units, fastText (without bigrams) and Tagspace perform similarly, while fastText with bigrams at 50 hidden units performs better than Tagspace at 200. At 200 hidden units, fastText performs better still, and adding bigrams improves accuracy further. (Talk a little more about the accuracy the units and what it implies at the beginning of accuracy section).<br />
<br />
Finally, both the training and test times of our model are significantly lower than those of Tagspace, which must compute the scores for every class at test time. fastText, on the other hand, has fast inference, which gives it an advantage when the number of classes is large. The speed-up is most pronounced at test time.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, and this representation is then fed to a linear classifier. The authors test their model, fastText, against other models using an evaluation protocol similar to that of Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction data set. fastText achieved comparable accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35132Bag of Tricks for Efficient Text Classification2018-03-22T04:16:16Z<p>Cs3yang: /* Results and training time. */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have more recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their use on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data sets while maintaining good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to two tasks, tag prediction and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on a billion words within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
* PLACEHOLDER: we should look at when this <br />
<br />
=== Natural-Language Processing ===<br />
<br />
* Briefly describe the difference between NLP and text-mining. Maybe comment later about whether fastText accomplishes NLP.<br />
<br />
<br />
=== Char-CNN ===<br />
<br />
<br />
=== Char-CRNN ===<br />
<br />
A character-level convolutional recurrent neural network (CRNN) is a convolutional neural network (CNN) with a recurrent layer on top.<br />
<br />
The recurrent layer consists of a recursive function <math>f</math> which takes as input one input vector and the previous hidden state, and returns the new hidden state <math> h_t = f(x_t, h_{t-1})</math>, where <math> x_t \in \mathbb{R}^d</math> is one time step from the input sequence <math> (x_1, x_2, \ldots, x_T)</math>. <br />
<br />
[[File:xiao and cho 2016 pic.png]]<br />
Graphical illustration of (a) a CNN and (b) proposed CRNN for character-level document classification <br />
<br />
Source: [Xiao and Cho2016] Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv: 1602.00367.<br />
<br />
=== VDCNN ===<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. This can hurt generalization in a large output space, where some classes have very few examples. The authors of this paper attempt to improve on the performance of linear classifiers with two key features: a rank constraint and a fast loss approximation. Before we get into that, we must better understand how linear classifiers are trained.<br />
<br />
Consider each text and each label as a vector in space. The model trains the coordinates of the text vector so that it is close to the vector of its associated label. A score is computed from the text vector and a label vector, and the softmax function normalizes this score across the scores of the same text with every other possible label. The result is the probability that the text has its associated label. The stochastic gradient descent algorithm then keeps updating the coordinates until the probability of the correct label is maximized for every text. This is clearly computationally expensive, as the score for every possible label in the training set must be computed for each text. <br />
<br />
The softmax function gives the probability that a text is associated with label j, given K labels in the training set: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label j.<br />
<br />
Now let’s look at the improved method of linear classifiers with a rank constraint and fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the n-gram features of the input are first looked up to find word representations, which are then averaged into a hidden text representation. This representation is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log likelihood over the classes. The classifier trains on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
<br />
[[File:formula explained.png]]<br />
<br />
Two changes that were applied in this model architecture are the hierarchical softmax function, which improves performance with a large number of classes, and the hashing trick, which keeps the mapping of n-grams (used to capture local word order) fast and memory efficient. These two changes will be explained more thoroughly in the upcoming sections.<br />
<br />
=== Softmax and Hierarchy Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over the predefined classes. It calculates the probability that <math>y_n</math> belongs to each possible class and outputs a probability estimate for each class. The denominator of the softmax function serves to normalize the probabilities, so that a single <math>y_n</math> receives a probability for each label and these probabilities sum to one. This provides a means to choose the label with the highest probability as the label for <math>y_n</math>. <br />
<br />
However, the softmax function does have a computational complexity of <math> O(Kd) </math> where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is due to the nature of the softmax function since each function calculation requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the differences in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose we have a binary tree structure based on Huffman coding for the softmax function, where each node has at most two children or leaves. Huffman coding trees provide a means to optimize binary trees where the classes with lowest frequencies are placed deeper into the tree and the highest frequency classes are placed near the root of the tree, which minimizes the path for more frequently labelled classes. Huffman coding trees are efficient since computational runtime is reduced to <math> O(d \log_2(K)) </math>. <br />
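<br />
To illustrate how a Huffman tree places frequent classes near the root, the sketch below builds the tree from class counts and reports the depth of each leaf. The class names and counts are hypothetical, and the helper <code>huffman_depths</code> is our own.<br />

```python
import heapq
import itertools

def huffman_depths(class_counts):
    """Return {class: depth of its leaf} for a Huffman tree built from counts."""
    tie_breaker = itertools.count()   # keeps the heap from ever comparing dicts
    heap = [(count, next(tie_breaker), {label: 0})
            for label, count in class_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every leaf in them moves one level down.
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        merged = {label: depth + 1 for label, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, next(tie_breaker), merged))
    return heap[0][2]

counts = {"sports": 50, "politics": 30, "science": 15, "poetry": 5}  # hypothetical
depths = huffman_depths(counts)
print(depths)
```

The most frequent class ends up one step from the root, while the two rarest classes sit deepest, so the expected path length is short for commonly occurring labels.<br />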
<br />
A probability for each path, whether we are travelling right or left from a node, is calculated. This is done by applying the sigmoid function to the product of the output vector <math>v_{n_i}</math> of each inner node <math> n </math> and the output value of the hidden layer of the model, <math>h</math>. The idea of this method is to represent the output classes as the leaves on this tree and a random walk then assigns probabilities for these classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> represents the leaf node that a class is located on, with depth <math> l+1 </math>, and <math> n_1, n_2, …, n_l </math> represent the parent nodes of that leaf </i><br />
<br />
In the below figure, we can see an example of a binary tree for the hierarchical softmax model. An example path from root node <math>n_1</math> to label 2 is highlighted in blue. In this case we can see that each path has an associated probability calculation and the total probability of label 2 is in line with the class probability calculation above.<br />
<br />
[[File:figure2.jpg|Binary Tree Example for the Hierarchical Softmax Model]]<br />
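<br />
The class probability calculation can be sketched as below: each step on the path contributes a sigmoid of the dot product between the inner node's vector and <math>h</math>, and the leaf probability is the product over the path. The toy tree (a complete binary tree with four leaves, not the tree in the figure), the node vectors, and the convention that the sigmoid gives the probability of going left are all assumptions made for illustration.<br />

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def leaf_probability(path, node_vectors, h):
    """path is a list of (inner_node, direction); direction is +1 for left, -1 for right."""
    p = 1.0
    for node, direction in path:
        # sigmoid(z) is the probability of going left and sigmoid(-z) = 1 - sigmoid(z)
        # is the probability of going right.
        p *= sigmoid(direction * dot(node_vectors[node], h))
    return p

# Toy inner-node vectors and hidden layer output (assumptions).
node_vectors = {"n1": [0.5, -0.2], "n2": [0.1, 0.3], "n3": [-0.4, 0.2]}
h = [1.0, 2.0]

# Paths from the root n1 to each of the four leaves.
paths = {
    "label1": [("n1", +1), ("n2", +1)],
    "label2": [("n1", +1), ("n2", -1)],
    "label3": [("n1", -1), ("n3", +1)],
    "label4": [("n1", -1), ("n3", -1)],
}
leaf_probs = {label: leaf_probability(p, node_vectors, h) for label, p in paths.items()}
total = sum(leaf_probs.values())
print(leaf_probs, total)
```

Because the two branch probabilities at every inner node sum to one, the leaf probabilities sum to one without any explicit normalization over all classes.<br />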
<br />
<br />
Note that during the training process, vector representations for inner nodes, <math>v_{n_i}</math>, are updated by introducing an error function, <math>Err</math>, and differentiating it with the chain rule to obtain the following:<br />
<div style="text-align: center;"> <math> \frac{\partial Err}{\partial v_{n_i}^{'}}=\frac{\partial Err}{\partial v_{n_i}^{'}h} \cdot \frac{\partial v_{n_i}^{'}h }{\partial v_{n_i}^{'}} </math> <br></div><br />
<br />
We can use this to update the vector values with the following:<br />
<br />
<div style="text-align: center;"> <math> v_{n_i}'(new) = v_{n_i}'(old) - \eta \frac{\partial Err}{\partial v_{n_i}'} </math><br></div><br />
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is a method for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized so that its elements add up to one (that is, each count is replaced by the relative frequency of the word). If these frequencies exceed a certain level, they activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information because it treats words individually and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the dictionary built from the training set does not cover the words in the testing set. <br />
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation by storing the n adjacent words (or characters, since N-grams can be character based) that occur together. Each word in the document is read one at a time just like bag of words, but a certain range of its neighbors is scanned as well. This range is known as the n-gram. Compared to bag of words, which corresponds to N = 1 (the unigram), any N greater than 1 retains more information. <br />
<br />
The picture above gives an example of a unigram (1-gram), which is the simplest possible version of this model. A unigram model does not consider previous words and just chooses words at random based on how common they are in general. It also shows a bigram (2-gram), where the previous word is considered. A trigram considers the previous two words, and so on up to N-grams, which consider the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words from word 1 to word n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. A bigram model approximates each factor by conditioning only on the previous word. For example, the sentence "How long can this go on?" is modelled as follows:<br />
<br />
P(“How long can this go on?”) ≈ P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
In general, for the bigram case the chain rule reduces to the following product of conditional probabilities: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case, which conditions on the previous N-1 words: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide useful predictive clues. For example, suppose you want the model to learn subject-verb agreement from the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
You would need an 11-gram, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The following example compares the unigram and bigram representations of three short sentences.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice how A and B are the same vector. This is just like bag of words and the aforementioned problem that '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because the bigram takes one neighboring word into consideration. However, A and C still share an element, namely “I love”. If we were to further increase N in the N-gram, it would be easier to distinguish between the two. However, the consequence of operating with a higher N is that the run time will increase.<br />
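<br />
The unigram and bigram counts for the three sentences can be reproduced with the short sketch below. The helper <code>ngrams</code> is our own and simply splits on whitespace, which is an assumption about tokenization.<br />

```python
from collections import Counter

def ngrams(sentence, n):
    """Return the list of word n-grams of a whitespace-tokenized sentence."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentences = {"A": "I love apple", "B": "apple love I", "C": "I love sentence"}

unigrams = {name: Counter(ngrams(s, 1)) for name, s in sentences.items()}
bigrams = {name: Counter(ngrams(s, 2)) for name, s in sentences.items()}

print(unigrams["A"] == unigrams["B"])   # unigram counts ignore word order
print(bigrams["A"] == bigrams["B"])     # bigram counts do not
print(bigrams["A"] & bigrams["C"])      # bigrams shared by A and C
```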
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification. It maps words to indices of an array or matrix of fixed size by applying a hash function to the features. The general idea is to map sentences from a high dimensional feature space to a lower dimensional one, reducing the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps keys of arbitrary size to values in a fixed range, such as the indices of a fixed-size list. For example, consider a hash function <math> h </math> that maps each feature to the index of the corresponding dictionary key.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(``\text{cats"}) = 3 </math>. Consider the sentence <math> `` \text{ I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have a list of all the words in that sentence, <math> x = [``I", ``love", ``cats", ``but", ``Mary", ``hate", ``cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
By applying hash function to each word of this sentence, we will get a list of returned indexes [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map will be <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. For example, suppose that in the example above both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces another hash function <math> \xi </math> to determine the sign of each feature's contribution. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
If <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math> (and <math> \xi = 1 </math> for every other word), then our signed hash map becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, collisions will "cancel out", and therefore, achieve an unbiased estimate.<br />
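<br />
The three hash tables above can be reproduced with the sketch below. Here the hash function is just the dictionary lookup from the table, and the collision and the sign function are forced by hand to mirror the example; a real implementation would use an actual hash function, so treat the names and values as assumptions.<br />

```python
def hashed_features(words, h, size, sign=None):
    """Compute phi_i = sum over j with h(x_j) = i of sign(x_j) (sign defaults to +1)."""
    phi = [0] * size
    for w in words:
        s = 1 if sign is None else sign(w)
        phi[h(w)] += s
    return phi

index = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 6}
words = ["I", "love", "cats", "but", "Mary", "hate", "cats"]

# No collision: every word has its own index.
plain = hashed_features(words, lambda w: index[w], 7)

# Forced collision: "Mary" now hashes to the same index as "I".
colliding_index = dict(index, Mary=0)
collided = hashed_features(words, lambda w: colliding_index[w], 7)

# Signed hashing: xi("Mary") = -1 and xi = +1 for every other word.
xi = lambda w: -1 if w == "Mary" else 1
signed = hashed_features(words, lambda w: colliding_index[w], 7, sign=xi)

print(plain)     # [1, 1, 1, 2, 0, 1, 1]
print(collided)  # [2, 1, 1, 2, 0, 1, 0]
print(signed)    # [0, 1, 1, 2, 0, 1, 0]
```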
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. However, another way to represent the features is TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to a document.<br />
<br />
'''Term Frequency (TF)''' generally measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency, such that a word is not deemed important if it is a generally common word, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the total number of documents that contain word <math> t </math>.<br />
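<br />
The two definitions above translate directly into the sketch below, which uses a raw count for the term frequency and the natural logarithm for the inverse document frequency. The three toy documents are made up, and the sketch assumes every queried term occurs in at least one document (otherwise the denominator would be zero).<br />

```python
import math

def tf(term, doc):
    """Raw count of the term in a tokenized document."""
    return doc.count(term)

def idf(term, docs):
    """log(N / number of documents containing the term)."""
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "and", "the", "dog"],
]

print(tfidf("the", docs[2], docs))   # "the" is in every document, so its weight is 0
print(tfidf("sat", docs[0], docs))   # "sat" is in one of three documents
```

A word such as "the" that appears in every document receives zero weight regardless of how often it occurs, which is the adjustment described above.<br />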
<br />
== Experiment ==<br />
In this experiment fastText was compared on two classification problems with various other text classifiers. <br />
<br />
The first classification problem is sentiment analysis, where fastText is compared to existing text classifiers. The second evaluates how fastText scales to a larger output space on a tag prediction dataset.<br />
<br />
The model can also be implemented with the Vowpal Wabbit library, which is written in C++. However, the authors report that their tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
* Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. To test it, the evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images, each with a caption, title and tags. Here, the focus was on using the fastText text classifier to predict the tags associated with each image without using the image itself, relying instead on the information associated with the image, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur fewer than 100 times and split the data into train, validation and test sets. The train set consists of approximately 90% of the dataset, the validation set of approximately 1%, and the test set of 0.5%. Table 4 shows an example of 5 items in the validation set with their Inputs (title and caption), Prediction (the tag class they are classified to) and Tags (real image tags), highlighting when the predicted class is in fact one of the tags. <br />
<br />
*PLACEHOLDER FOR TABLE 4<br />
<br />
Removing infrequently occurring tags and words eliminates some noise and helps the model learn better. A [https://github.com/facebookresearch/fastText script] has been released explaining the breakup of the data. After cleaning, the vocabulary size is 297,141 and there are 312,116 unique tags. Performance on the test set is reported as precision at 1 (prec@1), that is, the fraction of test items whose top-ranked predicted tag is in fact one of the item's true tags (for background on precision, see [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia]).<br />
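<br />
As a minimal sketch, precision at 1 (the fraction of items whose top-ranked predicted tag is among the true tags) can be computed as follows. The predictions and tags are invented for illustration and are not taken from the paper's tables.<br />

```python
def precision_at_1(ranked_predictions, true_tags):
    """Fraction of items whose top-ranked predicted tag is one of the true tags."""
    hits = sum(
        1
        for preds, tags in zip(ranked_predictions, true_tags)
        if preds and preds[0] in tags
    )
    return hits / len(true_tags)

# Hypothetical ranked predictions and true tag sets for three items.
ranked_predictions = [["beach", "sea"], ["dog", "cat"], ["car", "road"]]
true_tags = [{"beach", "holiday"}, {"cat"}, {"car"}]

score = precision_at_1(ranked_predictions, true_tags)
print(score)
```

In this toy case the first and third items count as hits, while the second does not because the correct tag is only ranked second.<br />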
<br />
==== Baseline ====<br />
Two baselines are used. The first is a frequency-based model which always predicts the most frequent tag. The second is [http://www.aclweb.org/anthology/D14-1194 Tagspace], a tag prediction model that is similar to fastText but is based on the [http://www.thespermwhale.com/jaseweston/papers/wsabie-ijcai.pdf WSABIE] model, which predicts tags based on both images and their text annotations. The Tagspace model is a convolutional neural network which predicts hashtags from the semantic context of social network posts. For faster and comparable performance, the linear version of this model is considered. (Linear version of Tagspace needs to be explained).<br />
<br />
====Results and training time. ====<br />
<br />
* PLACEHOLDER Insert Table 5<br />
<br />
The above table presents a comparison of our fastText model to other baselines. (explain what prec@1 means)<br />
<br />
===== Accuracy =====<br />
We can see that the frequency baseline has the lowest accuracy. fastText is run for 5 epochs and compared to the Tagspace results. fastText with bigrams is also used, and the models are tested with 50 and 200 hidden units. With 50 hidden units, fastText (without bigrams) and Tagspace performed similarly, while fastText with bigrams at 50 hidden units performed better than Tagspace did at 200. At 200 hidden units, fastText performs better still, and adding bigrams improves accuracy further. (Talk a little more about the accuracy the units and what it implies at the begining of accuracy section).<br />
<br />
<br />
===== Running Time =====<br />
Finally, both the training and test times of fastText were significantly shorter than those of Tagspace, which must compute the scores for every class. fastText, on the other hand, has fast inference, which gives it an advantage when the number of classes is large. In particular, test times for fastText are significantly shorter than those of Tagspace.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, and this representation is then fed to a linear classifier. The authors test their model, fastText, against other models using an evaluation protocol similar to Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction dataset. fastText achieved comparable accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35121Bag of Tricks for Efficient Text Classification2018-03-22T03:51:08Z<p>Cs3yang: /* Tag prediction */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have been utilized more recently for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their usage for very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data set while maintaining its good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to the two tasks: tag predictions and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
* PLACEHOLDER: we should look at when this <br />
<br />
=== Natural-Language Processing ===<br />
<br />
* Briefly describe the difference between NLP and text-mining. Maybe comment later about whether fastText accomplishes NLP.<br />
<br />
<br />
=== Char-CNN ===<br />
<br />
<br />
=== Char-CRNN ===<br />
<br />
A paper by Xiao and Cho (2016) proposes a convolutional recurrent neural network (CRNN) for text classification after inspiration from the paper by Zhang et al., 2015 which proposes a CNN. Xiao and Cho propose a smaller model with fewer convolutional layers that can achieve similar classification performance with a recurrent layer on top of a CNN. <br />
<br />
The recurrent layer consists of a recursive function <math>f</math> which takes as input one input vector and the previous hidden state, and returns the new hidden state <math> h_t = f(x_t, h_{t-1})</math>, where <math> x_t \in \mathbb{R}^d</math> is one time step from the input sequence <math> (x_1, x_2, \ldots, x_T)</math>. A bidirectional recurrent layer can also be used which is composed of two recurrent layers working in opposite directions. This is to alleviate an imbalance in the amount of information seen by the hidden state at different time steps. This layer returns two sequences of hidden states from the forward and reverse recurrent layers.<br />
<br />
Brief Summary of their Model:<br />
<br />
A one-hot sequence input <math>(x_1, x_2, \ldots, x_T) </math> is turned into a sequence of dense, real valued vectors <math> E = (e_1, e_2, ..., e_T) </math> using the embedding layer. Multiple convolutional layers are then applied to <math> E </math> to get a shorter sequence of feature vectors: <math> F = (f_1, f_2, ...,f_{T'}) </math>. This sequence of feature vectors is then fed into a bidirectional recurrent layer, resulting in two sequences <math>H_{forward} = (\vec{h_1}, \vec{h_2},...,\vec{h_{T'}})</math>, <math>H_{reverse} = (\overleftarrow{h_1}, \overleftarrow{h_2},...,\overleftarrow{h_{T'}})</math>. The last hidden states of both directions are concatenated to form a fixed dimensional vector, <math> h = [\vec{h_{T'}}; \overleftarrow{h_1}]</math>, which is fed into the classification layer to compute the predictive probabilities <math> p(y = k|X) </math> of all k classes given the input sequence <math> X </math>.<br />
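<br />
The bidirectional step of the summary above can be sketched as follows. The recurrence <code>step</code> is a deliberately simple stand-in (an elementwise tanh with no learned weights), not the recurrent unit used by Xiao and Cho, and the feature vectors are toy values; the point is only to show how the last forward state and the last reverse state are concatenated.<br />

```python
import math

def step(x_t, h_prev):
    """Toy recurrence h_t = f(x_t, h_{t-1}); a real layer would use learned weights."""
    return [math.tanh(x + h) for x, h in zip(x_t, h_prev)]

def run(sequence):
    """Apply the recurrence left to right and return every hidden state."""
    h = [0.0] * len(sequence[0])
    states = []
    for x_t in sequence:
        h = step(x_t, h)
        states.append(h)
    return states

def bidirectional_summary(features):
    forward = run(features)                # forward states for positions 1..T'
    backward = run(features[::-1])[::-1]   # reverse pass, re-aligned to positions 1..T'
    # Concatenate the final forward state with the final reverse state,
    # which sits at position 1 after re-alignment.
    return forward[-1] + backward[0]

features = [[0.1, 0.2], [0.3, -0.1], [0.1, 0.2]]   # toy feature vectors f_1..f_3
h = bidirectional_summary(features)
print(h)
```

Each half of the concatenated vector has seen the whole sequence, once from each direction, which is what makes it a fixed dimensional summary suitable for the classification layer.<br />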
<br />
<br />
[[File:xiao and cho 2016 pic.png]]<br />
Graphical illustration of (a) a CNN and (b) proposed CRNN for character-level document classification <br />
<br />
Source: [Xiao and Cho2016] Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv: 1602.00367.<br />
<br />
=== VDCNN ===<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly in a large output space where some classes have very few examples. The authors of this paper attempt to improve on the performance of linear classifiers with two key features: a rank constraint and a fast loss approximation. Before we get into that, we must better understand how linear classifiers are trained.<br />
<br />
Consider each text and each label as a vector in space. The model learns the coordinates of the text vector so that it lies close to the vector of its associated label. The dot product of the text vector and a label vector gives a score, and the softmax function normalizes this score against the scores of the same text with every other possible label. The result is the probability that the text has its associated label. Stochastic gradient descent is then used to keep updating the coordinates until the probability of the correct label for every text is maximized. This is computationally expensive, since the score of every possible label in the training set must be computed for each text. <br />
<br />
The softmax function gives the probability that a text is associated with label j, where K is the number of labels in the training set: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label <math>j</math>.<br />
<br />
Now let’s look at the improved method of linear classifiers with a rank constraint and fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the n-gram features of the input are first looked up to find word representations, which are then averaged into a hidden text representation. This representation is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log likelihood over the classes. The classifier trains on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
<br />
[[File:formula explained.png]]<br />
<br />
Two changes that were applied in this model architecture are the hierarchical softmax function, which improves performance with a large number of classes, and the hashing trick, which keeps the mapping of n-grams (used to capture local word order) fast and memory efficient. These two changes will be explained more thoroughly in the upcoming sections.<br />
<br />
=== Softmax and Hierarchy Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over the predefined classes. It calculates the probability of a ____ as each possible label and outputs probability estimates for each label. The denominator of the softmax function serves to normalize the probabilities, allowing a single ____ to receive probabilities for each label, where these probabilities sum to one. This provides a means to choose the label with the highest probability as the label for the _____. <br />
<br />
However, the softmax function does have a computational complexity of <math> O(Kd) </math> where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is due to the nature of the softmax function since each function calculation requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the differences in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose we have a binary tree structure based on Huffman coding for the softmax function, where each node has at most two children or leaves. Huffman coding trees provide a means to optimize binary trees where the classes with lowest frequencies are placed in the lower leaves of the tree and the highest frequency classes are placed near the root of the tree, which minimizes the path of the random walk for more frequently labelled classes. Huffman coding trees are efficient since computational runtime is reduced to <math> O(d \log_2(K)) </math>. <br />
<br />
A probability for each path, whether we are travelling right or left from a node, is calculated. This is done by applying the sigmoid function to the product of the output vector <math>v_{n_i}</math> of each inner node <math> n </math> and the output value of the hidden layer of the model, <math>h</math>. The idea of this method is to represent the output classes as the leaves on this tree and a random walk then assigns probabilities for these classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> represents the leaf node that a class is located on, with depth <math> l+1 </math>, and <math> n_1, n_2, …, n_l </math> represent the parent nodes of that leaf. </i><br />
<br />
In the below figure, we can see an example of a binary tree for the hierarchical softmax model. An example path from root node <math>n_1</math> to label 2 is highlighted in blue. In this case we can see that each path has an associated probability calculation and the total probability of label 2 is in line with the class probability calculation above.<br />
<br />
[[File:figure2.jpg|Binary Tree Example for the Hierarchical Softmax Model]]<br />
<br />
During the training process, vector representations for inner nodes, <math>v_{n_i}</math>, are updated by introducing an error function <math>E</math> and differentiating it with the chain rule to obtain the following:<br />
<div style="text-align: center;"> <math> \frac{\mathrm{d}E}{\mathrm{d} v_{n_i}^{'}}=\frac{\mathrm{d}E}{\mathrm{d} v_{n_i}^{'}h} \cdot \frac{\mathrm{d} v_{n_i}^{'}h }{\mathrm{d} v_{n_i}^{'}} </math> <br></div><br />
We can use this to update the vector values with the following:<br />
<div style="text-align: center;"> <math> v_{n_i}^{'(new)} = v_{n_i}^{'(old)} - \eta \frac{\mathrm{d}E}{\mathrm{d} v_{n_i}^{'}} </math><br></div><br />
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is a method for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized so that its elements add up to one (that is, each count is replaced by the relative frequency of the word). If these frequencies exceed a certain level, they activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information because it treats words individually and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the dictionary built from the training set does not cover the words in the testing set. <br />
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation by storing the n adjacent words (or characters, since N-grams can be character based) that occur together. Each word in the document is read one at a time just like bag of words, but a certain range of its neighbors is scanned as well. This range is known as the n-gram. Compared to bag of words, which corresponds to N = 1 (the unigram), any N greater than 1 retains more information. <br />
<br />
The picture above gives an example of a unigram (1-gram), which is the simplest possible version of this model. A unigram model does not consider previous words and just chooses words at random based on how common they are in general. It also shows a bigram (2-gram), where the previous word is considered. A trigram considers the previous two words, and so on up to N-grams, which consider the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. A bigram model approximates each factor by conditioning only on the previous word. For example, take the sentence, "How long can this go on?" We can model it as follows:<br />
<br />
P("How long can this go on?") = P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
Going back to the chain rule, the bigram model reduces the above equation to a product of conditional probabilities, each conditioned only on the previous word: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case, which conditions on the previous N-1 words: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
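<br />
As a rough illustration of the bigram product, the sketch below estimates each conditional probability from counts in a toy corpus and multiplies them along a sentence. The count-based estimate, the start-of-sentence marker <code>&lt;s&gt;</code>, the function names and the corpus are all assumptions made for this example; the summary above does not say how the probabilities are estimated.<br />

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams; "<s>" is an assumed start-of-sentence marker."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent (previous, current) pairs
    return contexts, bigrams

def sentence_probability(sentence, contexts, bigrams):
    """Product over k of P(w_k | w_{k-1}), each estimated as count(prev, w) / count(prev)."""
    tokens = ["<s>"] + sentence.split()
    probability = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        probability *= bigrams[(prev, word)] / contexts[prev]
    return probability

# Toy corpus, assumed for illustration only.
corpus = ["how long can this go on", "how long is this", "can this go on"]
contexts, bigrams = train_bigram(corpus)
p_seen = sentence_probability("how long can this go on", contexts, bigrams)
p_unseen = sentence_probability("how is this", contexts, bigrams)
```

Without smoothing, any sentence containing an unseen bigram receives probability zero, which is one practical face of the sparsity problem that grows with N.<br />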
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide useful predictive clues. For example, suppose you want the model to learn subject-verb agreement from the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
You would need an 11-gram to link the subject to the verb, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The following example compares the unigram (bag of words) and bigram representations of three short sentences.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice how A and B are the same vector. This is exactly the bag of words representation, with the aforementioned problem that '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because the bigram takes one adjacent word of local context into account. However, A and C still share an element, namely "I love". If we were to further increase N, it would be easier to distinguish the two. However, the consequence of operating with a higher N is that the run time will increase.<br />
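<br />
The comparison in the two tables can be reproduced with a short Python sketch. The helper <code>ngrams</code> is a hypothetical name introduced for this illustration only.<br />

```python
def ngrams(tokens, n):
    """Return the list of n-grams (joined by spaces) found in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentences = {"A": "I love apple", "B": "apple love I", "C": "I love sentence"}

# Unigram (bag of words) and bigram feature sets for each sentence.
unigram_sets = {name: set(ngrams(text.split(), 1)) for name, text in sentences.items()}
bigram_sets = {name: set(ngrams(text.split(), 2)) for name, text in sentences.items()}

same_unigrams = unigram_sets["A"] == unigram_sets["B"]  # order is ignored
same_bigrams = bigram_sets["A"] == bigram_sets["B"]     # order now matters
shared_a_c = bigram_sets["A"] & bigram_sets["C"]        # the overlap between A and C
```

Here A and B share the same unigram set but differ in their bigrams, while A and C overlap only in the bigram "I love".<br />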
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification. It maps words to indices of an array or matrix of fixed size by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, which reduces the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps keys of arbitrary size to values of a fixed size, here the indices of a fixed-size table. For example, consider a hash function <math> h </math> that maps each feature to the index listed for it in the dictionary below.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(``\text{cats"}) = 3 </math>. Consider the sentence <math> `` \text{ I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have a list of all words in that sentence <math> x = [``I", ``love", ``cats", ``but", ``Mary", ``hate", ``cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
By applying the hash function to each word of this sentence, we get the list of indices [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map is <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. For example, suppose that in the example above both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces another hash function <math> \xi </math> to determine the sign of each feature's contribution. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
If <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, and <math> \xi </math> is 1 for every other word, then our signed hash map becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, the contributions of colliding features "cancel out" in expectation, which yields an unbiased estimate.<br />
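<br />
The three tables in this section can be reproduced with the following Python sketch. The lookup-table "hash function" mirrors the toy dictionary above; a real implementation would use an actual hash function. The names <code>hashed_features</code>, <code>INDEX</code> and <code>xi</code> are hypothetical and chosen for this illustration only.<br />

```python
# Toy "hash function": the key/index table from this section.
INDEX = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 6}

def hashed_features(tokens, h, size, sign=None):
    """Compute phi_i = sum over j with h(x_j) = i of sign(x_j), with sign = 1 if omitted."""
    phi = [0] * size
    for token in tokens:
        phi[h(token)] += sign(token) if sign is not None else 1
    return phi

tokens = ["I", "love", "cats", "but", "Mary", "hate", "cats"]

# No collision: every word has its own index.
plain = hashed_features(tokens, lambda t: INDEX[t], 7)

# Collision: "Mary" is mapped to index 0, the same index as "I".
colliding_index = dict(INDEX, Mary=0)
collided = hashed_features(tokens, lambda t: colliding_index[t], 7)

# Signed hashing: xi("Mary") = -1, and xi = +1 for every other word (assumed).
xi = lambda t: -1 if t == "Mary" else 1
signed = hashed_features(tokens, lambda t: colliding_index[t], 7, sign=xi)
```

With the sign function, the two colliding words contribute +1 and -1 to index 0, so the entry becomes 0 instead of 2.<br />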
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. However, another way to represent the features is TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to a document.<br />
<br />
'''Term Frequency (TF)''' generally measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency, so that a word is not deemed important if it is common across documents in general, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the total number of documents that contain word <math> t </math>.<br />
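<br />
The two definitions above translate directly into code. The following is a minimal sketch using the raw count for tf and the natural logarithm for idf; the function name <code>tf_idf</code> and the three toy documents are assumptions made for illustration.<br />

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: tf * idf} dictionary per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)  # raw count f_{t,d}
        scores.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return scores

# Toy documents, assumed for illustration only.
docs = [d.split() for d in ["the cat sat", "the dog sat", "the cat ran"]]
scores = tf_idf(docs)
```

A word such as "the" that occurs in every document gets idf = log(1) = 0, so its score is zero no matter how often it appears.<br />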
<br />
== Experiment ==<br />
In this experiment, fastText was compared with various other text classifiers on two classification problems. <br />
<br />
The first classification problem is sentiment analysis, where fastText is compared to existing text classifiers. The second evaluates how fastText scales to a larger output space on a tag prediction dataset.<br />
<br />
The Vowpal Wabbit library, written in C++, can also be used to implement our model. However, compared to this library, our tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
* Tag Prediction<br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. To test it, evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images with captions, titles and tags. Here, the spotlight was on using the fastText text classifier to predict the tags associated with each image without using the image itself, but rather the text associated with it, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur fewer than 100 times and split the data into train, validation and test sets. The train set consists of approximately 90% of the dataset, the validation set of approximately 1%, and the test set of 0.5%. Table 4 shows an example of 5 items in the validation set with their Inputs (title and caption), Prediction (the tag class they are classified to) and Tags (real image tags), highlighting when the predicted class is in fact one of the tags. <br />
<br />
*PLACEHOLDER FOR TABLE 4<br />
<br />
Removing infrequently occurring tags and words eliminates some noise and helps our model learn better. A [https://github.com/facebookresearch/fastText script] has been released explaining the breakup of the data. After cleaning, the vocabulary size is 297,141 and there are 312,116 unique tags. Performance on the test set is reported as precision at 1 (prec@1), that is, the fraction of test examples for which the single highest-scoring predicted tag is one of the true tags (for the general definition of precision, see [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia]).<br />
<br />
==== Baseline ====<br />
The paper uses two baselines. The first is a frequency-based model, which always predicts the most frequent tag. The second is [http://www.aclweb.org/anthology/D14-1194 Tagspace], a tag prediction model that is similar to our model but is based on [http://www.thespermwhale.com/jaseweston/papers/wsabie-ijcai.pdf WSABIE], which predicts tags based on both images and their text annotations (how is it similar? is this NN or SVM or what?). The Tagspace model is a convolutional neural network that predicts hashtags from the semantic context of social network posts. For faster and comparable performance, we consider the linear version of this model. (Linear version of Tagspace needs to be explained).<br />
<br />
==== Results and training time ====<br />
<br />
* PLACEHOLDER Insert Table 5<br />
<br />
The above table presents a comparison of our fastText model to other baselines. (This table needs to be explained, especially what prec@1 means)<br />
<br />
After running fastText for 5 epochs, we compare it to Tagspace results for two sizes of the hidden layer. Both models achieve similar performance with a small hidden layer, with fastText being slightly more accurate. However, the addition of bigrams boosts the accuracy by a significant amount (Talk a little more about the accuracy the units and what it implies).<br />
<br />
Finally, at test time, our model was significantly faster. The Tagspace model computes the scores for all the classes, which takes a significant amount of time. fastText, on the other hand, has fast inference when the number of classes is large (more than 300k in this dataset), which provides a significant speed-up at test time. Overall, fastText is more than an order of magnitude faster to train while obtaining a model of better quality, and the speed-up of the test phase is even larger (a 600× speedup). Table 4 shows some qualitative examples.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, and this representation is then fed to a linear classifier. The authors test their model, fastText, against other models using an evaluation protocol similar to Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction dataset. fastText achieved comparable results in terms of accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35116Bag of Tricks for Efficient Text Classification2018-03-22T03:33:33Z<p>Cs3yang: /* Baseline */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have been utilized more recently for text classification and have demonstrated very good performance. However, they are slow at both training and testing time, which limits their usage for very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large datasets while maintaining its good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to the two tasks: tag predictions and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
* PLACEHOLDER: we should look at when this <br />
<br />
=== Natural-Language Processing ===<br />
<br />
* Briefly describe the difference between NLP and text-mining. Maybe comment later about whether fastText accomplishes NLP.<br />
<br />
<br />
=== Char-CNN ===<br />
<br />
<br />
=== Char-CRNN ===<br />
<br />
A paper by Xiao and Cho (2016) proposes a convolutional recurrent neural network (CRNN) for text classification after inspiration from the paper by Zhang et al., 2015 which proposes a CNN. Xiao and Cho propose a smaller model with fewer convolutional layers that can achieve similar classification performance with a recurrent layer on top of a CNN. <br />
<br />
The recurrent layer can capture long-term dependencies. As a result, the network only needs a very small number of convolutional layers. However, the recurrent layer is computationally expensive.<br />
<br />
The recurrent layer consists of a recursive function <math>f</math> which takes as input one input vector and the previous hidden state, and returns the new hidden state <math> h_t = f(x_t, h_{t-1})</math>, where <math> x_t \in \mathbb{R}^d</math> is one time step from the input sequence <math> (x_1, x_2, \ldots, x_T)</math>. A bidirectional recurrent layer can also be used which is composed of two recurrent layers working in opposite directions. This is to alleviate an imbalance in the amount of information seen by the hidden state at different time steps. This layer returns two sequences of hidden states from the forward and reverse recurrent layers.<br />
<br />
Brief Summary of their Model:<br />
<br />
A one-hot sequence input <math>(x_1, x_2, \ldots, x_T) </math> is turned into a sequence of dense, real-valued vectors <math> E = (e_1, e_2, \ldots, e_T) </math> using the embedding layer. Multiple convolutional layers are then applied to <math> E </math> to get a shorter sequence of feature vectors: <math> F = (f_1, f_2, \ldots, f_{T'}) </math>. This sequence of feature vectors is then fed into a bidirectional recurrent layer, resulting in two sequences <math>H_{forward} = (\vec{h_1}, \vec{h_2},\ldots,\vec{h_{T'}})</math>, <math>H_{reverse} = (\overleftarrow{h_1}, \overleftarrow{h_2},\ldots,\overleftarrow{h_{T'}})</math>. The last hidden states of both directions are concatenated to form a fixed-dimensional vector, <math> h = [\vec{h_{T'}}; \overleftarrow{h_1}]</math>, which is fed into the classification layer to compute the predictive probabilities <math> p(y = k|X) </math> of all classes <math> k </math> given the input sequence <math> X </math>.<br />
<br />
<br />
[[File:xiao and cho 2016 pic.png]]<br />
Graphical illustration of (a) a CNN and (b) proposed CRNN for character-level document classification <br />
<br />
Source: [Xiao and Cho2016] Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv: 1602.00367.<br />
<br />
=== VDCNN ===<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly in a large output space where some classes have very few examples. The authors of this paper attempt to improve on the performance of linear classifiers with two key features: a rank constraint and a fast loss approximation. Before we get into that, we must better understand how linear classifiers are trained.<br />
<br />
Consider each text and each label as a vector in space. The model trains the coordinates of the text vector so that it is close to the vector of its associated label. The score of a text and label pair is computed from the text vector and the label vector, and the softmax function normalizes it against the scores of that same text with every other possible label. The result is the probability that the text has its associated label. The stochastic gradient descent algorithm is then used to keep updating the coordinates until the probability of the correct label for every text is maximized. This is clearly computationally expensive, as the score for every possible label in the training set must be computed for each text. <br />
<br />
The softmax function gives the probability that a text is associated with label j, with K labels in the training set: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label <math>j</math>.<br />
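<br />
The softmax formula above can be sketched in plain Python as follows. The function name <code>softmax_probabilities</code> and the toy vectors are assumptions made for illustration; subtracting the maximum score before exponentiating is a standard numerical-stability step that does not change the result.<br />

```python
import math

def softmax_probabilities(x, label_vectors):
    """Return P(y = j | x) for every label j, given text vector x and label vectors w_j."""
    # Score of each label: the dot product x^T w_j.
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in label_vectors]
    # Subtract the maximum score for numerical stability (result is unchanged).
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy text vector and K = 3 label vectors, assumed for illustration only.
x = [1.0, 2.0]
label_vectors = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
probabilities = softmax_probabilities(x, label_vectors)
predicted_label = probabilities.index(max(probabilities))
```

Note that every one of the K scores has to be computed to normalize a single prediction, which is the cost the hierarchical softmax is meant to reduce.<br />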
<br />
Now let’s look at the improved method of linear classifiers with a rank constraint and fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the n-gram features of the input are first looked up to find word representations, then averaged into a hidden text representation. It is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log likelihood over the classes. The classifier trains on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
<br />
[[File:formula explained.png]]<br />
<br />
Two changes that were applied in this model architecture are the hierarchical softmax function, which improves performance with a large number of classes, and the hashing trick to manage mappings of n-grams to local word order. These two nuances will be more thoroughly explained in the upcoming sections.<br />
<br />
=== Softmax and Hierarchy Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability density over predefined classes. It calculates the probability of a ____ as each possible label and outputs probability estimates for each label. Due to the nature of the softmax function, the denominator serves to normalize the probabilities, allowing a single ____ to receive probabilities for each label, where these probabilities sum to one. This provides a means to choose the highest probability as the corresponding label for the _____. <br />
<br />
However, the softmax function does have a computational complexity of <math> O(Kd) </math> where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is due to the nature of the softmax function since each function calculation requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the differences in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose we have a binary tree structure based on Huffman coding for the softmax function, where each node has at most two children or leaves. Huffman coding trees provide a means to optimize binary trees where the classes with lowest frequencies are placed in the lower leaves of the tree and the highest frequency classes are placed near the root of the tree, which minimizes the path of the random walk for more frequently labelled classes. Huffman coding trees are efficient since computational runtime is reduced to <math> O(d \log_2(K)) </math>. <br />
<br />
A probability for each path, whether we are travelling right or left from a node, is calculated. This is done by applying the sigmoid function to the product of the output vector <math>v_{n_i}</math> of each inner node <math> n </math> and the output value of the hidden layer of the model, <math>h</math>. The idea of this method is to represent the output classes as the leaves on this tree and a random walk then assigns probabilities for these classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n</math> represents the leaf node that a class is located on with depth <math> l+1 </math> and <math> n_1, n_2, …, n_l </math> represents the parent nodes of that leaf. </i><br />
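<br />
The path-product formula can be illustrated with a small Python sketch on a complete binary tree with four leaves. The convention that going left has probability sigmoid(v·h) and going right has one minus that value, along with all names and vectors below, are assumptions chosen for this illustration rather than details taken from the paper.<br />

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def path_probability(h, path):
    """Multiply the branch probabilities along a root-to-leaf path.

    `path` is a list of (node_vector, go_left) pairs, one per inner node.
    Assumed convention: P(left) = sigmoid(v . h), P(right) = 1 - P(left).
    """
    probability = 1.0
    for node_vector, go_left in path:
        p_left = sigmoid(sum(v * x for v, x in zip(node_vector, h)))
        probability *= p_left if go_left else 1.0 - p_left
    return probability

# Toy hidden representation and inner-node vectors, assumed for illustration only.
h = [0.5, -1.0]
root, left_child, right_child = [1.0, 0.2], [-0.3, 0.8], [0.7, 0.7]

# Each of the four labels is a leaf reached by two left/right decisions.
leaf_paths = {
    "label 1": [(root, True), (left_child, True)],
    "label 2": [(root, True), (left_child, False)],
    "label 3": [(root, False), (right_child, True)],
    "label 4": [(root, False), (right_child, False)],
}
leaf_probabilities = {label: path_probability(h, path) for label, path in leaf_paths.items()}
total_probability = sum(leaf_probabilities.values())
```

Each leaf needs only as many sigmoid evaluations as its depth, and the leaf probabilities still sum to one without an explicit normalization over all classes.<br />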
<br />
In the below figure, we can see an example of a binary tree for the hierarchical softmax model. An example path from root node <math>n_1</math> to label 2 is highlighted in blue. In this case we can see that each path has an associated probability calculation and the total probability of label 2 is in line with the class probability calculation above.<br />
<br />
[[File:figure2.jpg|Binary Tree Example for the Hierarchical Softmax Model]]<br />
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is a representation that simplifies a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized so that its elements add up to one (that is, each count is replaced by the relative frequency of the word). If these frequencies exceed a certain level, they will activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information, because it treats each word in isolation and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the dictionary built from the training set does not cover the words of the testing set. <br />
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation: it stores each word (or character, since N-grams can be character based) together with the words adjacent to it. Each word in the document is read one at a time, just like bag of words, but a certain range of its neighbors is scanned as well. This range is the N of the N-gram. Any N over 1 captures more information than bag of words, which corresponds to N = 1 (the unigram). <br />
<br />
The picture above gives an example of a unigram (1-gram), which is the simplest possible version of this model. A unigram model does not consider previous words and just chooses words at random based on how common they are in general. It also shows a bigram (2-gram), where the previous word is considered. A trigram would consider the previous two words, and so on up to N-grams, which consider the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. A bigram model approximates each factor by conditioning only on the previous word. For example, take the sentence, "How long can this go on?" We can model it as follows:<br />
<br />
P("How long can this go on?") = P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
Going back to the chain rule, the bigram model reduces the above equation to a product of conditional probabilities, each conditioned only on the previous word: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case, which conditions on the previous N-1 words: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide useful predictive clues. For example, suppose you want the model to learn subject-verb agreement from the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
You would need an 11-gram to link the subject to the verb, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The following example compares the unigram (bag of words) and bigram representations of three short sentences.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice how A and B are the same vector. This is exactly the bag of words representation, with the aforementioned problem that '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because the bigram takes one adjacent word of local context into account. However, A and C still share an element, namely "I love". If we were to further increase N, it would be easier to distinguish the two. However, the consequence of operating with a higher N is that the run time will increase.<br />
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification. It maps words to indices of an array or matrix of fixed size by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, which reduces the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps keys of arbitrary size to values of a fixed size, here the indices of a fixed-size table. For example, consider a hash function <math> h </math> that maps each feature to the index listed for it in the dictionary below.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(``\text{cats"}) = 3 </math>. Consider the sentence <math> `` \text{ I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have a list of all words in that sentence <math> x = [``I", ``love", ``cats", ``but", ``Mary", ``hate", ``cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
By applying the hash function to each word of this sentence, we get the list of indices [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map is <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. For example, suppose that in the example above both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces another hash function <math> \xi </math> to determine the sign of each feature's contribution. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
If <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, and <math> \xi </math> is 1 for every other word, then our signed hash map becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, the contributions of colliding features "cancel out" in expectation, which yields an unbiased estimate.<br />
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. However, another way to represent the features is TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to a document.<br />
<br />
'''Term Frequency (TF)''' generally measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency, such that a word is not deemed important if it is common across documents in general, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the total number of documents that contain word <math> t </math>.<br />
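<br />
As a minimal sketch of the two formulas above, the following computes TFIDF weights for a toy corpus. The function name <code>tfidf</code>, the three example documents, and the use of the natural logarithm are assumptions made for illustration; the source does not state the base of the logarithm.<br />

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw counts f_{t,d}
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus.
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
w = tfidf(docs)

print(w[0]["the"])  # 0.0, because "the" occurs in every document
print(w[1]["dog"])  # log(3), because "dog" occurs in only one of three documents
```

A word that appears in every document gets weight zero, which matches the intuition that a generally common word such as "the" carries little information.<br />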
<br />
== Experiment ==<br />
In this experiment fastText was compared on two classification problems with various other text classifiers. <br />
<br />
The first classification problem is sentiment analysis, where fastText is compared to existing text classifiers. The second evaluates how fastText scales to a larger output space on a tag prediction dataset.<br />
<br />
The Vowpal Wabbit library, written in C++ can also be used to implement our model. However, compared to this library, our tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
*Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. In order to test that, evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images, each with a caption, title and tags. Here, the spotlight was on using the fastText text classifier to predict the tags associated with each image without actually using the image itself, but rather the information associated with it, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur fewer than 100 times and split the data into a train, a validation and a test set. The train set consists of approximately 90% of the dataset, the validation set of approximately 1%, and the test set of 0.5%. Table 4 shows an example of 5 items in the validation set with their Inputs (title and caption), Prediction (the tag class they are classified to) and Tags (real image tags), highlighting when the predicted class is in fact one of the tags. <br />
<br />
*PLACEHOLDER FOR TABLE 4<br />
<br />
Removing infrequently occurring tags and words eliminates some noise and helps the model learn better. A [https://github.com/facebookresearch/fastText script] describing how the data was split has been released. After cleaning, the vocabulary size is 297,141 and there are 312,116 unique tags. Results are reported as precision at 1 (prec@1), that is, the fraction of examples for which the top-ranked predicted tag is in fact one of the true tags (Source: [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia]).<br />
<br />
==== Baseline ====<br />
Two baselines are used. The first is frequency based: it always predicts the most frequent tag, which gives a simple reference point for any learned model. The second is [http://www.aclweb.org/anthology/D14-1194 Tagspace], a tag prediction model that is similar to fastText but is based on the Wsabie model of Weston et al. (how is it similar?) Tagspace is a convolutional neural network that predicts hashtags from the semantic context of social network posts. For faster yet comparable performance, the linear version of this model is considered. (Tagspace needs to be explained properly.)<br />
<br />
====Results and training time. ====<br />
<br />
Insert Table 5<br />
<br />
The above table presents a comparison of the fastText model to other baselines. (This table needs to be explained, especially what prec@1 means)<br />
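<br />
To make the prec@1 metric concrete, here is a minimal sketch of the standard definition of precision at 1: the fraction of examples whose single top-ranked prediction is among the true tags. The function name and the three example items are hypothetical, and this is not the paper's evaluation script.<br />

```python
def precision_at_1(predictions, true_tags):
    """predictions: top-ranked predicted tag per example.
    true_tags  : set of correct tags per example."""
    hits = sum(1 for pred, tags in zip(predictions, true_tags) if pred in tags)
    return hits / len(predictions)

# Hypothetical predictions and ground-truth tags for three images.
predictions = ["beach", "dog", "food"]
true_tags = [{"beach", "sea"}, {"cat"}, {"food", "dinner"}]

score = precision_at_1(predictions, true_tags)
print(score)  # 2 of the 3 top predictions are correct
```

Because each image can carry several tags, a prediction counts as correct as soon as it matches any one of them.<br />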
<br />
After running fastText for 5 epochs, it is compared to Tagspace for two sizes of the hidden layer, 50 and 200. Both models achieve similar performance with a small hidden layer, with fastText being slightly more accurate. However, the addition of bigrams boosts the accuracy by a significant amount. (Talk a little more about the accuracy, the units and what it implies.)<br />
<br />
Finally, at test time, fastText performed significantly better. The Tagspace model calculates the scores for all the classes, which takes up a significant amount of time. FastText, on the other hand, has fast inference when the number of classes is large (more than 300k in this data set), providing a significant speed-up at test time. Overall, fastText obtains a better-quality model more than an order of magnitude faster, and the speed-up at test time is even larger (about 600×). Table 4 shows some qualitative examples.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, which are then fed to a linear classifier. The authors test their model, fastText, against other models with an evaluation protocol similar to Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction data set. fastText achieved comparable results in terms of accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35110Bag of Tricks for Efficient Text Classification2018-03-22T03:25:23Z<p>Cs3yang: /* Results and training time. */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have more recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their usage on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data set while maintaining its good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to the two tasks: tag predictions and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
* PLACEHOLDER: we should look at when this <br />
<br />
=== Natural-Language Processing ===<br />
<br />
* Briefly describe the difference between NLP and text-mining. Maybe comment later about whether fastText accomplishes NLP.<br />
<br />
<br />
=== Char-CNN ===<br />
<br />
<br />
=== Char-CRNN ===<br />
<br />
Another paper by Xiao and Cho (2016) proposes a convolutional recurrent neural network (CRNN) for text classification after inspiration from the paper by Zhang et al., 2015 which proposes a CNN. Their goal is to show that they can have a smaller model that can achieve similar text classification performance with a single recurrent layer on top of a CNN that can capture long term dependencies in a document more efficiently.<br />
<br />
The recurrent layer consists of either gated recurrent units or long short-term memory units, so it can capture long-term dependencies. As a result, the network only needs a very small number of convolutional layers. However, the recurrent layer is computationally expensive.<br />
<br />
The recurrent layer consists of a recursive function <math>f</math> which takes as input one input vector and the previous hidden state, and returns the new hidden state <math> h_t = f(x_t, h_{t-1})</math>, where <math> x_t \in \mathbb{R}^d</math> is one time step from the input sequence <math> (x_1, x_2, \ldots, x_T)</math>. <br />
<br />
A bidirectional recurrent layer can also be used, which is composed of two recurrent layers working in opposite directions. This alleviates an imbalance in the amount of information seen by the hidden state at different time steps. This layer returns two sequences of hidden states, one from the forward and one from the reverse recurrent layer.<br />
<br />
Brief Summary of their Model:<br />
<br />
A one-hot sequence input <math>(x_1, x_2, \ldots, x_T) </math> is turned into a sequence of dense, real-valued vectors <math> E = (e_1, e_2, \ldots, e_T) </math> using the embedding layer. Multiple convolutional layers are then applied to <math> E </math> to get a shorter sequence of feature vectors: <math> F = (f_1, f_2, \ldots, f_{T'}) </math>. This feature sequence is then fed into a bidirectional recurrent layer, resulting in two sequences <math>H_{forward} = (\vec{h_1}, \vec{h_2},\ldots,\vec{h_{T'}})</math>, <math>H_{reverse} = (\overleftarrow{h_1}, \overleftarrow{h_2},\ldots,\overleftarrow{h_{T'}})</math>. The last hidden states of both directions are concatenated to form a fixed-dimensional vector, <math> h = [\vec{h_{T'}}; \overleftarrow{h_1}]</math>, which is fed into the classification layer to compute the predictive probability <math> p(y = k|X) </math> of each class <math> k </math> given the input sequence <math> X </math>.<br />
<br />
<br />
[[File:xiao and cho 2016 pic.png]]<br />
Graphical illustration of (a) a CNN and (b) proposed CRNN for character-level document classification <br />
<br />
Source: [Xiao and Cho2016] Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv: 1602.00367.<br />
<br />
=== VDCNN ===<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly when the output space is large and some classes have very few examples. The authors attempt to improve on the performance of linear classifiers with two key features: a rank constraint and a fast loss approximation. Before getting into that, we must better understand how linear classifiers are trained.<br />
<br />
Consider each text and each label as a vector in space. The model learns the coordinates of the text vector so that it is close to the vector of its associated label. The score of a text vector with a label vector is passed to the softmax function, which normalizes it against the scores of that same text with every other possible label. The result is the probability that the text has its associated label. Stochastic gradient descent is then used to keep updating the coordinates until the probability of the correct label for every text is maximized. This is clearly computationally expensive, as the score for every possible label in the training set must be computed for each text. <br />
<br />
The softmax function gives the probability that a text is associated with label <math>j</math>, given <math>K</math> labels in the training set: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label <math>j</math>.<br />
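<br />
The softmax formula above can be sketched in a few lines. The vectors below are made-up values chosen for illustration, and subtracting the maximum score before exponentiating is a standard numerical-stability step that does not change the result; neither is taken from the paper.<br />

```python
import math

def softmax_probs(x, W):
    """x: text vector (list of floats); W: list of K label vectors.
    Returns P(y = j | x) for every label j."""
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in W]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)    # normalizer: one term per label, hence the O(K) cost
    return [e / z for e in exps]

# Hypothetical 2-dimensional text vector and K = 3 label vectors.
x = [1.0, 2.0]
W = [[0.5, 0.1], [0.2, 0.4], [-0.3, 0.0]]
probs = softmax_probs(x, W)

print(probs)                    # three probabilities that sum to 1
print(probs.index(max(probs)))  # 1, the label with the largest score
```

The loop over every label in the normalizer is exactly the cost that becomes a problem when the number of labels is large.<br />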
<br />
Now let’s look at the improved method of linear classifiers with a rank constraint and fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the ngram features of the input are first looked up to find word representations, then averaged into hidden text representation. It is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log likelihood over the classes. The classifier trains on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
<br />
[[File:formula explained.png]]<br />
<br />
Two changes that were applied in this model architecture are the hierarchical softmax function, which improves performance with a large number of classes, and the hashing trick to manage mappings of n-grams to local word order. These two nuances will be more thoroughly explained in the upcoming sections.<br />
<br />
=== Softmax and Hierarchy Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability density over predefined classes. It calculates the probability of a ____ as each possible label and outputs probability estimates for each label. Due to the nature of the softmax function, the denominator serves to normalize the probabilities, allowing a single ____ to receive probabilities for each label, where these probabilities sum to one. This provides a means to choose the highest probability as the corresponding label for the _____. <br />
<br />
However, the softmax function does have a computational complexity of <math> O(Kd) </math> where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is due to the nature of the softmax function since each function calculation requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the differences in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose we have a binary tree structure based on Huffman coding for the softmax function, where each node has at most two children or leaves. Huffman coding trees provide a means to optimize binary trees where the classes with lowest frequencies are placed in the lower leaves of the tree and the highest frequency classes are placed near the root of the tree, which minimizes the path of the random walk for more frequently labelled classes. Huffman coding trees are efficient since computational runtime is reduced to <math> O(d \log_2(K)) </math>. <br />
<br />
A probability for each path, whether we are travelling right or left from a node, is calculated. This is done by applying the sigmoid function to the product of the output vector <math>v_{n_i}</math> of each inner node <math> n </math> and the output value of the hidden layer of the model, <math>h</math>. The idea of this method is to represent the output classes as the leaves on this tree and a random walk then assigns probabilities for these classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> represents the leaf node that a class is located on, at depth <math> l+1 </math>, and <math> n_1, n_2, \ldots, n_l </math> represent the parent nodes of that leaf. </i><br />
<br />
In the below figure, we can see an example of a binary tree for the hierarchical softmax model. An example path from root node <math>n_1</math> to label 2 is highlighted in blue. In this case we can see that each path has an associated probability calculation and the total probability of label 2 is in line with the class probability calculation above.<br />
<br />
[[File:figure2.jpg|Binary Tree Example for the Hierarchical Softmax Model]]<br />
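<br />
The path-probability calculation can be sketched as follows. This is an illustration under stated assumptions: the small balanced tree, the node vectors, the hidden vector <code>h</code>, and the convention that "left" uses <math>\sigma(v \cdot h)</math> and "right" uses <math>\sigma(-v \cdot h)</math> are all hypothetical, and a Huffman tree would generally not be balanced.<br />

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def leaf_probability(path, node_vectors, h):
    """path: list of (inner_node_id, direction) pairs from the root to a leaf.
    direction is +1 for 'left' and -1 for 'right' (assumed convention)."""
    p = 1.0
    for node, direction in path:
        p *= sigmoid(direction * dot(node_vectors[node], h))
    return p

# Hypothetical hidden representation and inner-node vectors.
h = [0.2, -0.1]
node_vectors = {"n1": [1.0, 0.5], "n2": [-0.3, 0.8], "n3": [0.7, 0.7]}

# Hypothetical tree with root n1, inner nodes n2 and n3, and four leaves.
paths = {
    "label1": [("n1", +1), ("n2", +1)],
    "label2": [("n1", +1), ("n2", -1)],
    "label3": [("n1", -1), ("n3", +1)],
    "label4": [("n1", -1), ("n3", -1)],
}
probs = {label: leaf_probability(p, node_vectors, h) for label, p in paths.items()}
total = sum(probs.values())

print(probs)
print(total)  # the leaf probabilities sum to 1 (up to rounding)
```

Since <math>\sigma(z) + \sigma(-z) = 1</math>, the two branches at every inner node split that node's probability mass, so the leaves form a valid distribution while each leaf only needs one dot product per node on its path.<br />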
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is a technique for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized such that all the elements of the vector add up to one (taking the relative frequency of each word). If these frequencies exceed a certain level, they will activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information, because it treats words individually and is invariant to their order. We will demonstrate that shortly. Bag of words will also have a high error rate if the training set does not include the entire dictionary of the testing set. <br />
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation by storing the n local words adjacent to the initial word (or characters, since an N-gram can also be character based). Each word in the document is read one at a time just like bag of words, but a certain range of its neighbors is scanned as well. This range is known as the n-gram. Compared to bag of words, any N greater than 1 (N = 1 being the unigram) captures more information. <br />
<br />
The picture above gives an example of a unigram (1-gram), which is the simplest possible version of this model. A unigram model does not consider previous words and just chooses words based on how common they are in general. It also shows a bigram (2-gram), where the previous word is considered. A trigram considers the previous two words, and so on up to N-grams, which consider the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. For example, take the sentence "How long can this go on?". Under a bigram model, where each word depends only on the previous word, it is approximated as follows:<br />
<br />
P(“How long can this go on?”) ≈ P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
Going back to the chain rule, the bigram case approximates each factor by conditioning only on the previous word, which gives the following product of conditional probabilities: <br />
<br />
<math>P(\omega_1^n) \approx \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case, which conditions on the previous <math>N-1</math> words: <br />
<br />
<math>P(\omega_1^n) \approx \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
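<br />
The bigram product above can be sketched with probabilities estimated from counts. Everything here is an assumption for illustration: the three-sentence corpus is made up, the conditional probabilities are simple maximum-likelihood count ratios, and a start token <code>&lt;s&gt;</code> is added so that the first word also has a "previous word".<br />

```python
from collections import Counter

START = "<s>"  # assumed start-of-sentence token

def train_bigram(sentences):
    """Return a function p(prev, word) estimated from bigram counts."""
    context_counts = Counter()
    bigram_counts = Counter()
    for s in sentences:
        tokens = [START] + s
        context_counts.update(tokens[:-1])           # words used as context
        bigram_counts.update(zip(tokens, tokens[1:]))
    def p(prev, word):
        if context_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, word)] / context_counts[prev]
    return p

def sentence_probability(p, s):
    """Product of P(w_k | w_{k-1}) over the sentence."""
    tokens = [START] + s
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= p(prev, word)
    return prob

# Hypothetical toy corpus.
corpus = [["I", "love", "cats"], ["I", "love", "dogs"], ["I", "hate", "dogs"]]
p = train_bigram(corpus)
prob = sentence_probability(p, ["I", "love", "cats"])

print(prob)  # P(I|<s>) * P(love|I) * P(cats|love) = 1 * 2/3 * 1/2
```

An unseen bigram gets probability zero under this estimate, which is one reason practical models add smoothing; that refinement is left out of the sketch.<br />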
<br />
<br />
The '''weakness''' with N-gram is that many times local context does not provide any useful predictive clues. For example, if you want the model to learn plural usage of the following sentence:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
Relating the verb to its subject here requires an 11-gram, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
An example is given below.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice how A and B are the same vector. This is just like bag of words and the aforementioned problem that '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because the bigram takes into consideration one neighboring word. However, A and C still share an element, namely "I love". If we were to further increase N, it would be easier to distinguish between the two. However, the consequence of operating with a higher N is that the run time will increase.<br />
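<br />
The comparison of sentences A, B and C can be checked with a short sketch. The helper <code>ngrams</code> is a hypothetical name, and splitting on whitespace is a simplifying assumption about tokenization.<br />

```python
def ngrams(sentence, n):
    """Return the list of word n-grams of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

A = "I love apple"
B = "apple love I"
C = "I love sentence"

# Unigrams ignore order, so A and B look identical.
same_unigrams = sorted(ngrams(A, 1)) == sorted(ngrams(B, 1))

# Bigrams separate A from B, but A and C still share "I love".
shared_ab = set(ngrams(A, 2)) & set(ngrams(B, 2))
shared_ac = set(ngrams(A, 2)) & set(ngrams(C, 2))

print(same_unigrams)  # True
print(shared_ab)      # set()
print(shared_ac)      # {'I love'}
```

A sentence of <math>m</math> words yields <math>m - n + 1</math> n-grams, but the number of distinct n-grams that can occur across a corpus grows quickly with <math>n</math>, which is the cost mentioned above.<br />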
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification. It maps words to indices of an array or a matrix of fixed size by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, reducing the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps keys of arbitrary size to a fixed range of values. For example, consider a hash function <math> h </math> that maps each feature to the index listed for the corresponding key in the table below.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(``\text{cats"}) = 3 </math>. Consider the sentence <math> `` \text{ I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have a list of all words in that sentence, <math> x = [``I", ``love", ``cats", ``but", ``Mary", ``hate", ``cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
Applying the hash function to each word of this sentence gives the list of indices [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map is:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. For example, suppose that in the example above both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces a second hash function <math> \xi </math> that determines the sign with which each feature is added to its index. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
Consider if <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, then our signed hash map now becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, colliding features with opposite signs "cancel out" on average, so that the hashed representation gives an unbiased estimate.<br />
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. However, the features can also be represented with TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to the document.<br />
<br />
'''Term Frequency (TF)''' generally measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency, such that a word is not deemed important if it is common across documents in general, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the total number of documents that contain word <math> t </math>.<br />
<br />
== Experiment ==<br />
In this experiment fastText was compared on two classification problems with various other text classifiers. <br />
<br />
The first classification problem is sentiment analysis, where fastText is compared to existing text classifiers. The second evaluates how fastText scales to a larger output space on a tag prediction dataset.<br />
<br />
The Vowpal Wabbit library, written in C++ can also be used to implement our model. However, compared to this library, our tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
*Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. In order to test that, evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images, each with a caption, title and tags. Here, the spotlight was on using the fastText text classifier to predict the tags associated with each image without actually using the image itself, but rather the information associated with it, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur less than 100 times and split the data into a train, validation and test set. The train set consists of approximately 90% of the dataset, the validation set consists of approximately 1%, and the test set of 0.5%. Table 4 shows an example of 5 items in the validation set with their Inputs (title and caption), Prediction (the tag class they are classified to) and Tags (real image tags), highlighting when the predicted class is in fact one of the tags. <br />
<br />
*PLACEHOLDER FOR TABLE 4<br />
<br />
Removing infrequently occurring tags and words eliminates some noise and helps the model learn better. A [https://github.com/facebookresearch/fastText script] describing how the data was split has been released. After cleaning, the vocabulary size is 297,141 and there are 312,116 unique tags. Results are reported as precision at 1 (prec@1), that is, the fraction of examples for which the top-ranked predicted tag is in fact one of the true tags (Source: [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia]).<br />
<br />
==== Baseline ====<br />
The baseline used is frequency based, predicting the most frequent tag.(What does this do and how does it help?)<br />
<br />
For comparison purposes we looked into another tag prediction model, Tagspace. It is similar to our model but is based on the Wsabie model of Weston et al. The Tagspace model is described using convolutions. For faster yet comparable performance we consider the linear version of this model. (Tagspace needs to be explained properly).<br />
<br />
====Results and training time. ====<br />
<br />
Insert Table 5<br />
<br />
The above table presents a comparison of the fastText model to other baselines. (This table needs to be explained, especially what prec@1 means)<br />
<br />
After running fastText for 5 epochs, it is compared to Tagspace for two sizes of the hidden layer, 50 and 200. Both models achieve similar performance with a small hidden layer, with fastText being slightly more accurate. However, the addition of bigrams boosts the accuracy by a significant amount. (Talk a little more about the accuracy, the units and what it implies.)<br />
<br />
Finally, at test time, fastText performed significantly better. The Tagspace model calculates the scores for all the classes, which takes up a significant amount of time. FastText, on the other hand, has fast inference when the number of classes is large (more than 300k in this data set), providing a significant speed-up at test time. Overall, fastText obtains a better-quality model more than an order of magnitude faster, and the speed-up at test time is even larger (about 600×). Table 4 shows some qualitative examples.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, which are then fed to a linear classifier. The authors test their model, fastText, against other models with an evaluation protocol similar to Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction data set. fastText achieved comparable results in terms of accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35096Bag of Tricks for Efficient Text Classification2018-03-22T02:58:07Z<p>Cs3yang: /* Dataset */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have more recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their usage on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data set while maintaining its good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to the two tasks: tag predictions and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
* PLACEHOLDER: we should look at when this <br />
<br />
=== Natural-Language Processing ===<br />
<br />
* Briefly describe the difference between NLP and text-mining. Maybe comment later about whether fastText accomplishes NLP.<br />
<br />
<br />
=== Char-CNN ===<br />
<br />
<br />
=== Character-Level Convolution Recurrent Neural Networks for Text Classification ===<br />
<br />
Another paper, by Xiao and Cho (2016), proposes a convolutional recurrent neural network (CRNN), inspired by the CNN proposed by Zhang et al. (2015). Their goal is to show that a smaller model, with a single recurrent layer on top of a CNN, can achieve similar text classification performance while capturing long-term dependencies in a document more efficiently.<br />
<br />
The recurrent layer consists of either gated recurrent units or long short-term memory units, so it can capture long-term dependencies. As a result, the network only needs a very small number of convolutional layers. However, the recurrent layer is computationally expensive.<br />
<br />
The recurrent layer applies a recursive function <math>f</math>, which takes as input one input vector and the previous hidden state and returns the new hidden state <math> h_t = f(x_t, h_{t-1})</math>, where <math> x_t \in \mathbb{R}^d</math> is one time step from the input sequence <math> (x_1, x_2, \ldots, x_T)</math>. <br />
<br />
A bidirectional recurrent layer, which is composed of two recurrent layers working in opposite directions, can also be used. This alleviates an imbalance in the amount of information seen by the hidden state at different time steps. This layer returns two sequences of hidden states, one from the forward and one from the reverse recurrent layer.<br />
<br />
Brief Summary of their Model:<br />
<br />
A one-hot sequence input <math>(x_1, x_2, \ldots, x_T) </math> is turned into a sequence of dense, real-valued vectors <math> E = (e_1, e_2, \ldots, e_T) </math> using the embedding layer. Multiple convolutional layers are then applied to <math> E </math> to get a shorter sequence of feature vectors: <math> F = (f_1, f_2, \ldots, f_{T'}) </math>. This sequence of feature vectors is then fed into a bidirectional recurrent layer, resulting in two sequences <math>H_{forward} = (\vec{h_1}, \vec{h_2},\ldots,\vec{h_{T'}})</math>, <math>H_{reverse} = (\overleftarrow{h_1}, \overleftarrow{h_2},\ldots,\overleftarrow{h_{T'}})</math>. The last hidden states of both directions are concatenated to form a fixed-dimensional vector, <math> h = [\vec{h_{T'}}; \overleftarrow{h_1}]</math>, which is fed into the classification layer to compute the predictive probabilities <math> p(y = k|X) </math> of all k classes.<br />
<br />
=== VDCNN ===<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. As a result, in a large output space, classes with very few examples are often classified poorly. The authors of this paper attempt to improve on the performance of linear classifiers with two key features: a rank constraint and a fast loss approximation. Before we get into that, we must better understand how linear classifiers are trained.<br />
<br />
Consider each text and each label as a vector in space. The model learns the coordinates of the text vector so that the text vector is close to the vector of its associated label. The text vector and its label vector are fed into the softmax function, which returns a score. The score is normalized over the scores of that same text with every other possible label. The result is the probability that the text has its associated label. Stochastic gradient descent is then used to keep updating the coordinates until the probability of the correct label for every text is maximized. This is clearly computationally expensive, since the score of every possible label in the training set must be computed for each text. <br />
<br />
The softmax function gives the probability that a text is associated with label j, given K labels in the training set: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label <math>j</math>.<br />
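<br />
The softmax formula above can be sketched in a few lines of Python. This is only an illustration under assumed toy values: the function name <code>softmax_probabilities</code>, the vector <code>x</code> and the label vectors in <code>W</code> are hypothetical and do not come from the paper.<br />

```python
import math


def softmax_probabilities(text_vector, label_vectors):
    """Return P(y=j | x) for every label j, given x and the label vectors w_j."""
    # Score of each label: the inner product x^T w_j.
    scores = [sum(x_i * w_i for x_i, w_i in zip(text_vector, w))
              for w in label_vectors]
    # Subtracting the maximum score leaves the result unchanged
    # but avoids overflow in exp().
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)  # the normalizing denominator over all K labels
    return [e / total for e in exps]


# Hypothetical toy values: one text vector and K = 3 label vectors.
x = [1.0, 2.0]
W = [[0.5, 0.1], [0.2, 0.4], [-0.3, 0.0]]
probs = softmax_probabilities(x, W)
```

Note that every call touches all K label vectors, which is the cost that the hierarchical softmax is meant to avoid.<br />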
<br />
Now let’s look at the improved method of linear classifiers with a rank constraint and a fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the n-gram features of the input are first looked up to obtain their representations, which are then averaged into a hidden text representation. This representation is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log-likelihood over the classes. The classifier is trained on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
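<br />
The forward pass described above (look up, average, linear classifier, softmax, negative log-likelihood) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names <code>embeddings</code> and <code>classifier_weights</code>, and all the numbers, are assumptions made for the example, and no training step is shown.<br />

```python
import math


def average_embedding(feature_ids, embeddings):
    """Look up each feature's vector and average them into one hidden vector."""
    dim = len(embeddings[0])
    hidden = [0.0] * dim
    for fid in feature_ids:
        for k in range(dim):
            hidden[k] += embeddings[fid][k]
    return [h / len(feature_ids) for h in hidden]


def predict_proba(feature_ids, embeddings, classifier_weights):
    """Linear classifier on the averaged representation, followed by softmax."""
    hidden = average_embedding(feature_ids, embeddings)
    scores = [sum(w_k * h_k for w_k, h_k in zip(w, hidden))
              for w in classifier_weights]
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(documents, labels, embeddings, classifier_weights):
    """Average negative log-probability of the correct class over N documents."""
    total = 0.0
    for feature_ids, label in zip(documents, labels):
        probs = predict_proba(feature_ids, embeddings, classifier_weights)
        total -= math.log(probs[label])
    return total / len(documents)


# Hypothetical toy model: 3 features with 2-dimensional vectors, 2 classes.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
classifier_weights = [[2.0, 0.0], [0.0, 2.0]]
docs = [[0, 2], [1, 2]]   # each document is a list of feature indices
labels = [0, 1]
loss = negative_log_likelihood(docs, labels, embeddings, classifier_weights)
```

Training would adjust both matrices to reduce <code>loss</code>, for example with stochastic gradient descent as the text describes.<br />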
<br />
[[File:formula explained.png]]<br />
<br />
Two changes applied in this model architecture are the hierarchical softmax function, which improves performance with a large number of classes, and the hashing trick, which gives a memory-efficient mapping of the n-grams used to capture local word order. These two changes are explained more thoroughly in the upcoming sections.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over the predefined classes. It calculates the probability of a text belonging to each possible label and outputs a probability estimate for each label. The denominator of the softmax function serves to normalize the probabilities, so that a single text receives a probability for each label and these probabilities sum to one. The label with the highest probability can then be chosen as the label for the text. <br />
<br />
However, the softmax function has a computational complexity of <math> O(Kd) </math>, where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is because each evaluation of the function requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the difference in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose we have a binary tree structure based on Huffman coding for the softmax function, where each node has at most two children or leaves. Huffman coding trees provide a means to optimize binary trees: the classes with the lowest frequencies are placed deeper in the tree and the classes with the highest frequencies are placed near the root, which shortens the path of the random walk for more frequent classes. A probability is calculated for each step of the path, depending on whether we travel right or left from a node. This is done by applying the sigmoid function to the inner product of the vector <math>v_{n_i}</math> of each inner node <math> n_i </math> and the output of the hidden layer of the model, <math>h</math>. <br />
<br />
The idea of this method is to represent the output classes as the leaves of this tree; a random walk then assigns probabilities to these classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> represents the leaf node on which a class is located, at depth <math> l+1 </math>, and <math> n_1, n_2, \ldots, n_l </math> represent the parent nodes of that leaf. </i><br />
<br />
[[File:figure2.jpg|thumb|center|Figure 2: Binary Tree Example for the Hierarchical Softmax Model]]<br />
<br />
In Figure 2, we can see an example of a binary tree for the hierarchical softmax model. An example path from the root node <math>n_1</math> to label 2 is highlighted in blue. Each step along the path has an associated probability, and the total probability of label 2 is the product of these, in line with the class probability calculation above.<br />
<br />
Huffman coding trees are efficient since computational runtime is reduced to <math> O(d \log_2(K)) </math>.<br />
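<br />
The construction above can be sketched as follows: build a Huffman tree over class frequencies, then obtain each class probability as a product of sigmoid decisions along its root-to-leaf path. This is only an illustrative sketch. The class counts, the random node vectors and the helper names (<code>build_huffman_paths</code>, <code>class_probability</code>) are assumptions for the example, and class names are assumed to be strings.<br />

```python
import heapq
import itertools
import math
import random


def build_huffman_paths(class_counts):
    """Return {class: [(inner_node_id, go_left), ...]}, listed from root to leaf."""
    tie_breaker = itertools.count()
    # Heap entries are (frequency, tie_breaker, tree); a tree is either a
    # class name (leaf) or a tuple (node_id, left_subtree, right_subtree).
    heap = [(count, next(tie_breaker), name) for name, count in class_counts.items()]
    heapq.heapify(heap)
    node_ids = itertools.count()
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # the two least frequent subtrees
        f2, _, t2 = heapq.heappop(heap)
        merged = (next(node_ids), t1, t2)
        heapq.heappush(heap, (f1 + f2, next(tie_breaker), merged))
    paths = {}

    def walk(tree, path):
        if isinstance(tree, tuple):
            node_id, left, right = tree
            walk(left, path + [(node_id, True)])
            walk(right, path + [(node_id, False)])
        else:
            paths[tree] = path

    walk(heap[0][2], [])
    return paths


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def class_probability(path, node_vectors, hidden):
    """Multiply the left/right probabilities of every inner node on the path."""
    prob = 1.0
    for node_id, go_left in path:
        score = sum(v * h for v, h in zip(node_vectors[node_id], hidden))
        p_left = sigmoid(score)
        prob *= p_left if go_left else 1.0 - p_left
    return prob


# Hypothetical class frequencies and randomly initialized vectors.
counts = {"cat": 50, "dog": 30, "bird": 15, "fish": 5}
paths = build_huffman_paths(counts)
rng = random.Random(0)
hidden = [rng.uniform(-1, 1) for _ in range(3)]
node_vectors = {i: [rng.uniform(-1, 1) for _ in range(3)]
                for i in range(len(counts) - 1)}
probs = {c: class_probability(p, node_vectors, hidden) for c, p in paths.items()}
```

In this toy tree the most frequent class sits one step from the root while the rarest classes sit three steps away, and the leaf probabilities sum to one without ever normalizing over all classes.<br />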
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is a method for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized so that its elements add up to one (that is, each count is converted to the relative frequency of the word). If these frequencies exceed a certain level, they will activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information, because it considers single words and is invariant to word order. We will demonstrate this shortly. Bag of words will also have a high error rate if the training set does not cover the entire vocabulary of the testing set. <br />
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation by storing the n words adjacent to the initial word (or characters, since n-grams can also be character-based). Each word in the document is read one at a time, just as in bag of words, but a certain range of its neighbors is scanned as well. This range is known as the n-gram. Any N greater than 1 contains more information than bag of words, which corresponds to N = 1 (a unigram). <br />
<br />
The picture above gives an example of a unigram (1-gram), which is the simplest possible version of this model. A unigram does not consider previous words and chooses words based only on how common they are in general. It also shows a bigram (2-gram), where the previous word is considered. A trigram would consider the previous two words, and so on, up to an N-gram, which considers the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math>\omega_1^n = \omega_1 \cdots \omega_n</math>. By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) = P(\omega_1) P(\omega_2 | \omega_1) P(\omega_3 | \omega_1^2) \cdots P(\omega_n | \omega_1^{n-1}) </math>. A bigram model approximates each factor by conditioning only on the previous word. For example, take the sentence, "How long can this go on?" We can model it as follows:<br />
<br />
P("How long can this go on?") ≈ P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
In general, the bigram case reduces the chain rule to the following product of conditional probabilities, where <math>P(\omega_1 | \omega_0)</math> is read as <math>P(\omega_1)</math>: <br />
<br />
<math>P(\omega_1^n) \approx \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case as: <br />
<br />
<math>P(\omega_1^n) \approx \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
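<br />
As a rough illustration of the bigram product above, the sketch below estimates each conditional probability from raw counts in a tiny corpus. The count-based (maximum likelihood) estimates, the toy corpus and the function names are assumptions made for the example; the text above does not say how the probabilities are estimated.<br />

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigram_sentence_probability(sentence, corpus):
    """Approximate P(w_1..w_n) as P(w_1) * prod_k P(w_k | w_{k-1})."""
    unigram_counts = Counter(tok for sent in corpus for tok in sent)
    bigram_counts = Counter(bg for sent in corpus for bg in ngrams(sent, 2))
    # How often each word appears as the first element of a bigram.
    context_counts = Counter(prev for sent in corpus for prev, _ in ngrams(sent, 2))
    total_tokens = sum(unigram_counts.values())

    prob = unigram_counts[sentence[0]] / total_tokens  # P(w_1)
    for prev, word in ngrams(sentence, 2):
        if context_counts[prev] == 0:
            return 0.0  # context never seen: no estimate is available
        prob *= bigram_counts[(prev, word)] / context_counts[prev]
    return prob


# Hypothetical two-sentence corpus.
corpus = [
    ["how", "long", "can", "this", "go", "on"],
    ["how", "long", "is", "this"],
]
p = bigram_sentence_probability(["how", "long", "can", "this"], corpus)
```

Here P(how) = 2/10, P(long | how) = 2/2, P(can | long) = 1/2 and P(this | can) = 1/1, so <code>p</code> is 0.1.<br />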
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide any useful predictive clues. For example, suppose you want the model to learn subject-verb agreement from the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.
<br />
You would need an 11-gram, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The following example illustrates this.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B have the same vector. This is exactly the bag-of-words problem mentioned above: '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because the bigram takes one neighboring word into consideration. However, A and C still share an element, namely "I love". If we were to further increase N, we would have an easier time distinguishing the two. However, the consequence of operating with higher-order N-grams is that the run time increases.<br />
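<br />
The comparison in the tables above can be reproduced with a short sketch. The helper name <code>ngram_counts</code> is hypothetical, and splitting on whitespace is an assumed, simplistic tokenization.<br />

```python
from collections import Counter


def ngram_counts(sentence, n):
    """Count the n-grams of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


A, B, C = "I love apple", "apple love I", "I love sentence"

# Unigram counts ignore order, so A and B become indistinguishable.
same_unigrams = ngram_counts(A, 1) == ngram_counts(B, 1)

# Bigram counts keep one word of local order, so A and B now differ.
same_bigrams = ngram_counts(A, 2) == ngram_counts(B, 2)

# A and C still share the bigram "I love".
shared = set(ngram_counts(A, 2)) & set(ngram_counts(C, 2))
```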
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification. It maps words to indices of an array or matrix of fixed size by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, which reduces the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps an arbitrary set of keys to a range of fixed size. For example, consider a hash function <math> h </math> that maps each feature to the index given in the following dictionary.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(``\text{cats"}) = 3 </math>. Consider the sentence <math> `` \text{ I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have a list of all the words in that sentence, <math> x = [``I", ``love", ``cats", ``but", ``Mary", ``hate", ``cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
By applying the hash function to each word of this sentence, we get the list of indices [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map is <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. For example, suppose that in the example above both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces a second hash function <math> \xi </math> to determine the sign of each word's contribution at its index. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
If <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, and <math> \xi = 1 </math> for all other words, then our signed hash map becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, collisions will "cancel out" in expectation, which yields an unbiased estimate.<br />
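<br />
The three feature maps in this section can be reproduced with the sketch below. For clarity it uses the dictionary from the table as a stand-in for the hash function; a real implementation would apply an actual hash function to the strings. The function name <code>hashed_feature_map</code> and its parameters are hypothetical.<br />

```python
def hashed_feature_map(tokens, index_of, size, sign_of=None):
    """Accumulate each token at index_of(token), optionally with a sign in {+1, -1}."""
    phi = [0] * size
    for token in tokens:
        sign = 1 if sign_of is None else sign_of(token)
        phi[index_of(token)] += sign
    return phi


# Stand-in for the hash function h: the dictionary from the table above.
dictionary = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 6}
tokens = ["I", "love", "cats", "but", "Mary", "hate", "cats"]

# No collisions: every word has its own index.
plain = hashed_feature_map(tokens, dictionary.__getitem__, 7)

# Collision: "Mary" is now mapped to index 0, the same index as "I".
colliding = dict(dictionary, Mary=0)
collided = hashed_feature_map(tokens, colliding.__getitem__, 7)

# Signed variant: xi("Mary") = -1 and xi = +1 for every other word.
signed = hashed_feature_map(
    tokens, colliding.__getitem__, 7,
    sign_of=lambda token: -1 if token == "Mary" else 1,
)
```

With the signs chosen this way, the contributions of "I" and "Mary" at index 0 cancel, which matches the last table.<br />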
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. However, another way to represent the features is TF-IDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to the document.<br />
<br />
'''Term Frequency (TF)''' generally measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency, such that a word is not deemed important if it is a generally common word, for example, "the".<br />
<br />
TF-IDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TF-IDF is calculated in the same way as in [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> in document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the total number of documents that contain the word <math> t </math>.<br />
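<br />
The two definitions above translate directly into code. The following is a minimal sketch using the raw count for tf and the natural logarithm for idf; the function name <code>tfidf</code> and the three toy documents are assumptions made for the example.<br />

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: tf * idf} dictionary per tokenized document."""
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        term_counts = Counter(doc)  # tf(t, d) = raw count of t in d
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights


# Hypothetical toy corpus of three tokenized documents.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "sat"],
    ["the", "bird"],
]
w = tfidf(docs)
```

Because "the" appears in every document, its idf is log(1) = 0, so it receives zero weight no matter how often it occurs.<br />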
<br />
== Experiment ==<br />
In this experiment, fastText was compared with various other text classifiers on two classification problems. <br />
<br />
The first classification problem is sentiment analysis, where fastText is compared with existing text classifiers. In the second, fastText is evaluated on a larger output space using a tag prediction dataset.<br />
<br />
The Vowpal Wabbit library, written in C++, can also be used to implement our model. However, compared to this library, our tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
* Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. To test it, evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images, each with captions, titles and tags. Here, the focus was on using the fastText text classifier to predict the tags associated with each image without using the image itself, relying instead on the information associated with the image, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur fewer than 100 times and split the data into train, validation and test sets. The train set consists of approximately 90% of the dataset, the validation set of approximately 1%, and the test set of 0.5%. Table 4 shows an example of 5 items in the validation set with their Inputs (title and caption), Prediction (the tag class they are classified to) and Tags (real image tags), highlighting when the predicted class is in fact one of the tags. <br />
<br />
*PLACEHOLDER FOR TABLE 4<br />
<br />
Removing infrequently occurring tags and words eliminates some noise and helps the model learn better. A [https://github.com/facebookresearch/fastText script] has been released explaining how the data is split. After cleaning, the vocabulary size is 297,141 and there are 312,116 unique tags. Performance is reported as precision at 1, that is, the fraction of examples for which the top-ranked predicted tag is in fact one of the true tags (Source: [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia]).<br />
<br />
==== Baseline ====<br />
The baseline used is frequency-based: it always predicts the most frequent tag. (What does this do and how does it help?)<br />
<br />
For comparison purposes, we looked at another tag prediction model, Tagspace. It is similar to our model but is based on the Wsabie model of Weston et al. The Tagspace model is described using convolutions; for faster yet comparable performance, we consider the linear version of this model. (Tagspace needs to be explained properly).<br />
<br />
==== Results and training time ====<br />
<br />
Insert Table 5<br />
<br />
The above table presents a comparison of our fastText model to other baselines. <br />
<br />
After running fastText for 5 epochs, we compare it to Tagspace for two sizes of the hidden layer, 50 and 200. Both models achieve similar performance with a small hidden layer, with fastText being slightly more accurate. However, the addition of bigrams boosts the accuracy by a significant amount. (Talk a little more about the accuracy the units and what it implies).<br />
<br />
Finally, at test time, fastText is significantly faster. The Tagspace algorithm computes scores for all the classes, which takes a significant amount of time. fastText, on the other hand, offers fast inference even with a large number of classes (more than 300k in this data set), which provides a significant speed-up at test time. Overall, fastText obtains a better-quality model more than an order of magnitude faster (a 600× speedup at test time). Table 4 shows some qualitative examples.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, and this representation is then fed to a linear classifier. The authors test their model fastText against other models, using an evaluation protocol similar to that of Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction data set. fastText achieved comparable accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35086Bag of Tricks for Efficient Text Classification2018-03-22T02:48:47Z<p>Cs3yang: /* Baseline */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text classification is used by millions of web users on a daily basis. Example applications of text classification are web search and content ranking. When a user searches for a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their use on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data sets while maintaining good performance.<br />
<br />
The analysis in this paper applies the classifier fastText to two tasks, tag prediction and sentiment analysis, and compares its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
* PLACEHOLDER: we should look at when this <br />
<br />
=== Natural-Language Processing ===<br />
<br />
* Briefly describe the difference between NLP and text-mining. Maybe comment later about whether fastText accomplishes NLP.<br />
<br />
<br />
=== Char-CNN ===<br />
<br />
<br />
=== Character-Level Convolution Recurrent Neural Networks for Text Classification ===<br />
<br />
Another paper, by Xiao and Cho (2016), proposes a convolutional recurrent neural network (CRNN), inspired by the CNN proposed by Zhang et al. (2015). Their goal is to show that a smaller model, with a single recurrent layer on top of a CNN, can achieve similar text classification performance while capturing long-term dependencies in a document more efficiently.<br />
<br />
The recurrent layer consists of either gated recurrent units or long short-term memory units, so it can capture long-term dependencies. As a result, the network only needs a very small number of convolutional layers. However, the recurrent layer is computationally expensive.<br />
<br />
The recurrent layer consists of a recursive function <math>f</math><br />
<br />
<br />
A one-hot sequence input is turned into a sequence of dense, real-valued vectors using the embedding layer. Multiple convolutional layers are then applied to get a shorter sequence of feature vectors. This sequence of feature vectors is then fed into a bidirectional recurrent layer, resulting in two sequences. The last hidden states of both directions are concatenated to form a fixed-dimensional vector, which is fed into the classification layer to compute the predictive probabilities of all classes.<br />
<br />
=== VDCNN ===<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. As a result, in a large output space, classes with very few examples are often classified poorly. The authors of this paper attempt to improve on the performance of linear classifiers with two key features: a rank constraint and a fast loss approximation. Before we get into that, we must better understand how linear classifiers are trained.<br />
<br />
Consider each text and each label as a vector in space. The model learns the coordinates of the text vector so that the text vector is close to the vector of its associated label. The text vector and its label vector are fed into the softmax function, which returns a score. The score is normalized over the scores of that same text with every other possible label. The result is the probability that the text has its associated label. Stochastic gradient descent is then used to keep updating the coordinates until the probability of the correct label for every text is maximized. This is clearly computationally expensive, since the score of every possible label in the training set must be computed for each text. <br />
<br />
The softmax function gives the probability that a text is associated with label j, given K labels in the training set: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label <math>j</math>.<br />
<br />
Now let’s look at the improved method of linear classifiers with a rank constraint and a fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the n-gram features of the input are first looked up to obtain their representations, which are then averaged into a hidden text representation. This representation is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log-likelihood over the classes. The classifier is trained on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
<br />
[[File:formula explained.png]]<br />
<br />
Two changes applied in this model architecture are the hierarchical softmax function, which improves performance with a large number of classes, and the hashing trick, which gives a memory-efficient mapping of the n-grams used to capture local word order. These two changes are explained more thoroughly in the upcoming sections.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over the predefined classes. It calculates the probability of a text belonging to each possible label and outputs a probability estimate for each label. The denominator of the softmax function serves to normalize the probabilities, so that a single text receives a probability for each label and these probabilities sum to one. The label with the highest probability can then be chosen as the label for the text. <br />
<br />
However, the softmax function has a computational complexity of <math> O(Kd) </math>, where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is because each evaluation of the function requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the difference in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose we have a binary tree structure based on Huffman coding for the softmax function, where each node has at most two children or leaves. Huffman coding trees provide a means to optimize binary trees: the classes with the lowest frequencies are placed deeper in the tree and the classes with the highest frequencies are placed near the root, which shortens the path of the random walk for more frequent classes. A probability is calculated with the sigmoid function for each step of the path, depending on whether we travel right or left from a node.<br />
The idea of this method is to represent the output classes as the leaves of this tree; a random walk then assigns probabilities to these classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> represents the leaf node that a class is located on, at depth <math> l+1 </math>, and <math> n_1, n_2, \ldots, n_l </math> represent the parent nodes of that leaf. </i><br />
<br />
Huffman coding trees are efficient since computational runtime is reduced to <math> O(d \log_2(K)) </math>.<br />
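<br />
The path product above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy tree, the node vectors, and the names <code>leaf_probability</code>, <code>root</code>, <code>left</code> and <code>right</code> are hypothetical and not taken from the paper, and the tree is balanced rather than a true Huffman tree.<br />
<br />
```python
import math


def sigmoid(z):
    """Logistic function used for the left/right decision at each node."""
    return 1.0 / (1.0 + math.exp(-z))


def leaf_probability(hidden, path):
    """Multiply the branch probabilities along the path from the root to a leaf.

    `path` is a list of (node_vector, go_left) pairs, one per parent node.
    """
    prob = 1.0
    for node_vector, go_left in path:
        score = sum(h * w for h, w in zip(hidden, node_vector))
        p_left = sigmoid(score)
        prob *= p_left if go_left else 1.0 - p_left
    return prob


# Hypothetical toy tree with four classes (leaves) and three internal nodes.
hidden = [0.5, -1.0]
root, left, right = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
leaves = {
    "a": [(root, True), (left, True)],
    "b": [(root, True), (left, False)],
    "c": [(root, False), (right, True)],
    "d": [(root, False), (right, False)],
}
probs = {label: leaf_probability(hidden, path) for label, path in leaves.items()}
```
<br />
Each leaf needs only as many dot products as its depth, and because the two branch probabilities at every node sum to one, the leaf probabilities still form a valid distribution.<br />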
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is a method for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset and used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. It is one of the methods the authors used to prepare text for input. Each vector of word counts is normalized so that its elements add up to one (i.e., each element is the relative frequency of the word). If these frequencies exceed a certain level, they activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information because it considers single words only and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the training set does not cover the vocabulary of the testing set. <br />
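<br />
A minimal sketch of the dictionary and normalization steps described above, assuming whitespace tokenization and lowercasing (both are assumptions made for the illustration, as are the function names <code>build_dictionary</code> and <code>bow_vector</code>):<br />
<br />
```python
from collections import Counter


def build_dictionary(train_docs, n):
    """Return the n most frequent words of the training documents."""
    counts = Counter(word for doc in train_docs for word in doc.lower().split())
    return [word for word, _ in counts.most_common(n)]


def bow_vector(doc, dictionary):
    """Count dictionary words in `doc` and normalize the counts to sum to one."""
    counts = Counter(doc.lower().split())
    raw = [counts[word] for word in dictionary]
    total = sum(raw)
    # A document with no dictionary words stays an all-zero vector.
    return [c / total for c in raw] if total else raw


train = ["I love apple", "apple love I", "I love sentence"]
dictionary = build_dictionary(train, 4)
vec_a = bow_vector("I love apple", dictionary)
vec_b = bow_vector("apple love I", dictionary)
```
<br />
The two vectors come out identical even though the word order differs, which is the loss of information mentioned above.<br />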
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation by storing the n-local words adjacent to the initial word (or characters, since N-grams can also be character based). Each word in the document is read one at a time just like bag of words, but a certain range of its neighbors is scanned as well. This range is known as the n-gram. Compared to bag of words, which corresponds to N = 1 (the Unigram), any N greater than 1 captures more information. <br />
<br />
The picture above gives an example of a Unigram (1-gram), which is the simplest possible version of this model. A Unigram does not consider previous words and just chooses random words based on how common they are in general. It also shows a Bigram (2-gram), where the previous word is considered. A Trigram would consider the previous two words, and so on up to N-grams, which consider the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math> . By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math> . A Bigram model approximates each factor by conditioning only on the previous word. For example, take the sentence, "How long can this go on?" We can model it as follows:<br />
<br />
P(“How long can this go on?”) ≈ P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
Returning to the chain rule, the Bigram model reduces the above equation to a product of conditional probabilities, each depending only on the previous word: <br />
<br />
<math>P(\omega_1^n) \approx \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case as: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
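<br />
To make the Bigram product concrete, here is a small sketch that estimates each conditional probability by relative frequency (count of the word pair divided by count of the previous word). The toy corpus, the start token <code>&lt;s&gt;</code> and the function names are assumptions made for the illustration; no smoothing is applied, so unseen pairs get probability zero.<br />
<br />
```python
from collections import Counter


def train_bigram(sentences):
    """Count context words and word pairs, with a start token per sentence."""
    contexts, pairs = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        contexts.update(tokens[:-1])            # every token that has a successor
        pairs.update(zip(tokens, tokens[1:]))   # (previous word, word)
    return contexts, pairs


def sentence_probability(sentence, contexts, pairs):
    """Product of P(word | previous word) estimated by relative frequency."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0
        prob *= pairs[(prev, word)] / contexts[prev]
    return prob


corpus = ["how long can this go on", "how long is this", "how can this go on"]
contexts, pairs = train_bigram(corpus)
p_seen = sentence_probability("how long can this go on", contexts, pairs)
p_unseen = sentence_probability("how is on", contexts, pairs)
```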
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide any useful predictive clues. For example, suppose you want the model to learn subject-verb agreement from the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
You would need an 11-gram to link the subject to the verb, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The following example illustrates the difference between these representations.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice how A and B are the same vector. This is exactly the bag of words representation, with the aforementioned problem that '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct because the bigram takes one word of local context into consideration. However, A and C still share an element, namely "I love". If we were to further increase N, it would be easier to distinguish between the two. However, the consequence of using a higher-order N-gram is that the run time will increase.<br />
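<br />
The comparison of A, B and C can be reproduced with a short sketch. The helper name <code>ngram_counts</code> and the whitespace tokenization are assumptions for the illustration.<br />
<br />
```python
from collections import Counter


def ngram_counts(sentence, n):
    """Count the word n-grams of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


A, B, C = "I love apple", "apple love I", "I love sentence"

# Unigram counts ignore order, so A and B cannot be told apart.
same_unigrams = ngram_counts(A, 1) == ngram_counts(B, 1)

# Bigram counts keep one word of context, so A and B differ.
same_bigrams = ngram_counts(A, 2) == ngram_counts(B, 2)

# A and C still overlap on the bigram they have in common.
shared_bigrams = set(ngram_counts(A, 2)) & set(ngram_counts(C, 2))
```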
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification: it maps words to indices of an array or matrix of fixed size by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, reducing the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps keys of arbitrary size to indices in a table of fixed size. For example, consider a hash function <math> h </math> that maps each feature to the index listed for its key in the following dictionary.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(``\text{cats"}) = 3 </math>. Consider the sentence <math> `` \text{ I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have the list of all words in that sentence <math> x = [``I", ``love", ``cats", ``but", ``Mary", ``hate", ``cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
By applying hash function to each word of this sentence, we will get a list of returned indexes [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map will be <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. For example, suppose that in the example above both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces another hash function <math> \xi </math> that determines the sign of each feature's contribution. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
Consider if <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, then our signed hash map now becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, collisions will "cancel out" in expectation, which yields an unbiased estimate.<br />
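<br />
The three hashed feature maps above can be reproduced with the following sketch. It is an illustration only: the lookup table <code>INDEX</code> stands in for a real hash function, and the sign function that returns -1 for "Mary" is a hypothetical stand-in for <math> \xi </math>.<br />
<br />
```python
# Lookup table standing in for the hash function h of the example.
INDEX = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 6}


def hashed_features(tokens, h, size, sign=lambda token: 1):
    """Add sign(token) to the bucket h(token) for every token."""
    features = [0] * size
    for token in tokens:
        features[h(token)] += sign(token)
    return features


tokens = ["I", "love", "cats", "but", "Mary", "hate", "cats"]

# No collision: every word has its own bucket.
no_collision = hashed_features(tokens, INDEX.__getitem__, 7)

# Collision: "Mary" is mapped to the same bucket as "I".
colliding = dict(INDEX, Mary=0)
unsigned = hashed_features(tokens, colliding.__getitem__, 7)

# Signed variant: a second function picks +1 or -1 per token.
signed = hashed_features(
    tokens, colliding.__getitem__, 7, sign=lambda t: -1 if t == "Mary" else 1
)
```
<br />
In practice the index would come from a real hash function taken modulo the table size, so the vocabulary never has to be stored explicitly.<br />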
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. However, another way to represent the features is TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to the document.<br />
<br />
'''Term Frequency (TF)''' measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency, such that a word is not deemed important if it is common across documents in general, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the total number of documents that contain word <math> t </math>.<br />
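<br />
The two definitions above combine into a few lines of Python. This is a sketch under simple assumptions (lowercasing, whitespace tokenization, natural logarithm, a three-document toy corpus); the function name <code>tfidf</code> is hypothetical.<br />
<br />
```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: tf * idf} dictionary per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights


docs = ["the cat sat on the mat", "the dog sat", "the cat ran"]
weights = tfidf(docs)
```
<br />
A word such as "the" that appears in every document receives weight zero, while a word that appears in a single document receives the largest weight.<br />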
<br />
== Experiment ==<br />
In this experiment, fastText was compared with various other text classifiers on two classification problems. <br />
<br />
The first classification problem is Sentiment Analysis, where fastText is compared to existing text classifiers. Second, we evaluate how fastText scales to a larger output space on a tag prediction dataset.<br />
<br />
The Vowpal Wabbit library, written in C++ can also be used to implement our model. However, compared to this library, our tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
*Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. To test it, the evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images with captions, titles and tags. The focus here was on using the fastText text classifier to predict the tags associated with each image without using the image itself, relying instead on the information associated with the image, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur fewer than 100 times and to split the data into train, validation and test sets. The train set consists of approximately 90% of the dataset, the validation set of approximately 1%, and the test set of approximately 0.5%. Removing infrequently occurring tags and words eliminates some noise and helps the model learn better. A [https://github.com/facebookresearch/fastText script] has been released explaining the breakup of the data. The vocabulary size is 297,141 and there are 312,116 unique tags. Performance is measured with precision at 1 (P@1), i.e., the fraction of examples for which the top-ranked predicted tag is one of the true tags (see [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia] for the definition of precision). <br />
<br />
==== Baseline ====<br />
The baseline used is frequency based, predicting the most frequent tag.(What does this do and how does it help?)<br />
<br />
For comparison purposes we looked into another tag prediction model, Tagspace. It is similar to our model but is based on the Wsabie model of Weston et al. The Tagspace model is described using convolutions. For faster yet comparable performance we consider the linear version of this model. (Tagspace needs to be explained properly).<br />
<br />
====Results and training time. ====<br />
<br />
Insert Table 5<br />
<br />
The above table presents a comparison of our fastText model to other baselines. <br />
<br />
On running fastText for 5 epochs, we compare it to Tagspace results for two sizes of the hidden layer, i.e., 50 and 200. Both models achieve similar performance with a small hidden layer, with fastText being slightly more accurate. However, the addition of bigrams boosts the accuracy by a significant amount (Talk a little more about the accuracy the units and what it implies).<br />
<br />
Finally, at test time, our model was significantly faster. The Tagspace algorithm calculates the scores for all the classes, which takes up a significant amount of time. fastText, on the other hand, has fast inference even for a large number of classes (more than 300k in this data set), providing a significant speed-up at test time. Overall, fastText is more than an order of magnitude faster to train while obtaining a model of better quality, and the speed-up at test time is larger still (a 600× speedup). Table 4 shows some qualitative examples.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, which are then fed to a linear classifier. The authors test fastText against other models, using an evaluation protocol similar to Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction data set. fastText achieved comparable results in terms of accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35085Bag of Tricks for Efficient Text Classification2018-03-22T02:46:24Z<p>Cs3yang: /* Dataset */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have more recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and testing time, which limits their usage on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data set while maintaining its good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to the two tasks: tag predictions and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
* PLACEHOLDER: we should look at when this <br />
<br />
=== Natural-Language Processing ===<br />
<br />
* Briefly describe the difference between NLP and text-mining. Maybe comment later about whether fastText accomplishes NLP.<br />
<br />
<br />
=== Char-CNN ===<br />
<br />
<br />
=== Character-Level Convolution Recurrent Neural Networks for Text Classification ===<br />
<br />
Another paper by Xiao and Cho (2016) proposes a convolutional recurrent neural network (CRNN), inspired by the CNN proposed by Zhang et al. (2015). Their goal is to show that a smaller model, with a single recurrent layer on top of a CNN, can achieve similar text classification performance while capturing long-term dependencies in a document more efficiently.<br />
<br />
The recurrent layer consists of either gated recurrent units or long short-term memory units, so it can capture long-term dependencies. As a result, the network only needs a very small number of convolutional layers. However, the recurrent layer is computationally expensive.<br />
<br />
The recurrent layer consists of a recursive function <math>f</math><br />
<br />
<br />
A one-hot sequence input is turned into a sequence of dense, real-valued vectors using the embedding layer. Multiple convolutional layers are then applied to get a shorter sequence of feature vectors. This sequence is fed into a bidirectional recurrent layer, resulting in two sequences. The last hidden states of both directions are concatenated to form a fixed-dimensional vector, which is fed into the classification layer to compute the predictive probabilities of all classes.<br />
<br />
=== VDCNN ===<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly when the output space is large and some classes have very few examples. The authors of this paper attempt to improve on the performance of linear classifiers with two key features: a rank constraint and a fast loss approximation. Before we get into that, we must better understand how linear classifiers are trained.<br />
<br />
Consider each text and each label as a vector in space. The model learns the coordinates of the text vector so that it is close to the vector of its associated label. The text vector and a label vector are passed to the softmax function, which produces a score for that pair. The score is normalized over the scores of the same text with every other possible label, and the result is the probability that the text has the given label. The stochastic gradient descent algorithm is then used to keep updating the coordinates until the probability of the correct label for every text is maximized. This is clearly computationally expensive, as the score for every possible label in the training set must be computed for each text. <br />
<br />
The softmax function gives the probability that a text is associated with label j, given K labels in the training set: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label <math>j</math>.<br />
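<br />
As an illustration of the formula, the following sketch computes the softmax probabilities for a toy text vector and three label vectors. The numbers and the function name <code>softmax_probabilities</code> are made up for the example; subtracting the largest score before exponentiating is a standard numerical safeguard that does not change the result.<br />
<br />
```python
import math


def softmax_probabilities(x, label_vectors):
    """Return P(y = j | x) for every label vector w_j."""
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in label_vectors]
    top = max(scores)  # shift scores so that exp() cannot overflow
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)  # the normalizing denominator over all K labels
    return [e / total for e in exps]


x = [1.0, 2.0]                                  # toy text vector
W = [[0.5, 0.1], [0.2, 0.9], [-1.0, 0.3]]       # one toy vector per label
probs = softmax_probabilities(x, W)
predicted_label = probs.index(max(probs))
```
<br />
Note that every one of the K label vectors takes part in the denominator, which is why the cost grows linearly with the number of labels.<br />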
<br />
Now let’s look at the improved method of linear classifiers with a rank constraint and fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the ngram features of the input are first looked up to find word representations, then averaged into hidden text representation. It is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log likelihood over the classes. The classifier trains on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
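<br />
The forward pass described above can be sketched as follows. This is a toy illustration, not the released implementation: <code>A</code> (one row of embeddings per feature) and <code>B</code> (one row of weights per class) are small hand-written matrices, the function names are hypothetical, and the full softmax is used instead of the hierarchical one.<br />
<br />
```python
import math


def average(vectors):
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def forward(feature_ids, A, B):
    """Look up feature embeddings, average them, classify, apply softmax."""
    hidden = average([A[i] for i in feature_ids])
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in B]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(batch, A, B):
    """Average negative log probability of the correct class over documents."""
    return -sum(math.log(forward(ids, A, B)[label]) for ids, label in batch) / len(batch)


A = [[0.1, 0.3], [0.4, -0.2], [-0.5, 0.2]]  # 3 features, hidden size 2
B = [[1.0, 0.0], [0.0, 1.0]]                # 2 classes
probs = forward([0, 1], A, B)
loss = negative_log_likelihood([([0, 1], 0), ([2], 1)], A, B)
```
<br />
Training would then adjust the entries of both matrices with stochastic gradient descent so that this loss decreases.<br />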
<br />
[[File:formula explained.png]]<br />
<br />
Two changes that were applied in this model architecture are the hierarchical softmax function, which improves performance with a large number of classes, and the hashing trick, which keeps the mapping of the n-grams used to capture local word order fast and memory efficient. These two nuances will be more thoroughly explained in the upcoming sections.<br />
<br />
=== Softmax and Hierarchy Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over the predefined classes. It estimates, for a given text, the probability of each possible label. The denominator of the softmax function normalizes these values so that the probabilities assigned to a single text sum to one. The label with the highest probability can then be chosen as the predicted label for that text. <br />
<br />
However, the softmax function does have a computational complexity of <math> O(Kd) </math> where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is due to the nature of the softmax function since each function calculation requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the differences in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose the classes are arranged in a binary tree built with Huffman coding, where each internal node has at most two children. A Huffman tree places the least frequent classes at the deepest leaves and the most frequent classes near the root, which minimizes the path length of the random walk for frequently occurring classes. At each node, the probability of moving left or right is calculated using the sigmoid function.<br />
The output classes are the leaves of this tree, and a random walk from the root assigns each class a probability based on the path taken to reach its leaf. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> represents the leaf node that a class is located on, at depth <math> l+1 </math>, and <math> n_1, n_2, \ldots, n_l </math> represent the parent nodes of that leaf. </i><br />
<br />
Huffman coding trees are efficient since computational runtime is reduced to <math> O(d \log_2(K)) </math>.<br />
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is a method for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset and used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. It is one of the methods the authors used to prepare text for input. Each vector of word counts is normalized so that its elements add up to one (i.e., each element is the relative frequency of the word). If these frequencies exceed a certain level, they activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information because it considers single words only and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the training set does not cover the vocabulary of the testing set. <br />
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation by storing the n-local words adjacent to the initial word (or characters, since N-grams can also be character based). Each word in the document is read one at a time just like bag of words, but a certain range of its neighbors is scanned as well. This range is known as the n-gram. Compared to bag of words, which corresponds to N = 1 (the Unigram), any N greater than 1 captures more information. <br />
<br />
The picture above gives an example of a Unigram (1-gram), which is the simplest possible version of this model. A Unigram does not consider previous words and just chooses random words based on how common they are in general. It also shows a Bigram (2-gram), where the previous word is considered. A Trigram would consider the previous two words, and so on up to N-grams, which consider the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math> . By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math> . A Bigram model approximates each factor by conditioning only on the previous word. For example, take the sentence, "How long can this go on?" We can model it as follows:<br />
<br />
P(“How long can this go on?”) ≈ P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
Returning to the chain rule, the Bigram model reduces the above equation to a product of conditional probabilities, each depending only on the previous word: <br />
<br />
<math>P(\omega_1^n) \approx \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case as: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide any useful predictive clues. For example, suppose you want the model to learn subject-verb agreement from the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
You would need an 11-gram to link the subject to the verb, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The following example illustrates the difference between these representations.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice how A and B are the same vector. This is exactly the bag of words representation, with the aforementioned problem that '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct because the bigram takes one word of local context into consideration. However, A and C still share an element, namely "I love". If we were to further increase N, it would be easier to distinguish between the two. However, the consequence of using a higher-order N-gram is that the run time will increase.<br />
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification: it maps words to indices of an array or matrix of fixed size by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, reducing the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps keys of arbitrary size to indices in a table of fixed size. For example, consider a hash function <math> h </math> that maps each feature to the index listed for its key in the following dictionary.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(``\text{cats"}) = 3 </math>. Consider the sentence <math> `` \text{ I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have the list of all words in that sentence <math> x = [``I", ``love", ``cats", ``but", ``Mary", ``hate", ``cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
By applying hash function to each word of this sentence, we will get a list of returned indexes [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map will be <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. For example, suppose that in the example above both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces another hash function <math> \xi </math> that determines the sign of each feature's contribution. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
Consider if <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, then our signed hash map now becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, collisions will "cancel out" in expectation, which yields an unbiased estimate.<br />
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. However, another way to represent the features is TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to the document.<br />
<br />
'''Term Frequency (TF)''' generally measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency, so that a word is not deemed important if it is common across documents in general, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the number of documents that contain word <math> t </math>.<br />
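<br />
A minimal sketch of these two formulas, assuming whitespace-tokenized documents and a natural logarithm (the base of the logarithm is not stated above). The function name <code>tfidf</code> and the three toy documents are made up for illustration.<br />

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw counts f_{t,d}
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the cat ran".split(),
]
w = tfidf(docs)
print(w[0]["the"])  # 0.0, because "the" occurs in every document
```

A word that occurs in every document gets an IDF of log(1) = 0, so its TFIDF weight is zero regardless of how often it appears.<br />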
<br />
== Experiment ==<br />
In this experiment fastText was compared on two classification problems with various other text classifiers. <br />
<br />
The first classification problem is sentiment analysis, where fastText is compared to existing text classifiers. The second is tag prediction, where fastText is evaluated on a dataset with a much larger output space.<br />
<br />
The Vowpal Wabbit library, written in C++, can also be used to implement our model. However, compared to this library, our tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
*Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. To test it, the evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images, each with captions, titles and tags. Here, the spotlight was on using the fastText text classifier to predict the tags associated with each image without using the image itself, but rather the information associated with it, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur fewer than 100 times and split the data into a train, validation and test set. The train set consists of approximately 90% of the dataset, the validation set of approximately 1%, and the test set of approximately 0.5%. Removing infrequently occurring tags and words eliminates some noise and helps our model learn better. A [https://github.com/facebookresearch/fastText script] has been released explaining the breakup of the data. The vocabulary size is 297,141 and there are 312,116 unique tags. Performance is reported as precision at 1 (P@1), that is, the fraction of test examples for which the top-ranked predicted tag is one of the true tags (for the definition of precision, see [https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(classification_context) Wikipedia]). <br />
<br />
==== Baseline ====<br />
The baseline used is frequency based, predicting the most frequent tag.(What does this do and how does it help?)<br />
<br />
For comparison purposes we looked into another tag prediction model, Tagspace. It is similar to our model but is based on the Wsabie model of Weston et al. The Tagspace model is described using convolutions. For faster yet comparable performance we consider the linear version of this model.<br />
<br />
====Results and training time. ====<br />
<br />
Insert Table 5<br />
<br />
The above table presents a comparison of our fastText model to other baselines. <br />
<br />
After running fastText for 5 epochs, we compare it to Tagspace for two sizes of the hidden layer, i.e., 50 and 200. Both models achieve similar performance with a small hidden layer, with fastText being slightly more accurate. However, the addition of bigrams boosts the accuracy by a significant amount (Talk a little more about the accuracy the units and what it implies).<br />
<br />
Finally, at test time, our model performed significantly better. The Tagspace algorithm computes the scores for all the classes, which takes a significant amount of time. fastText, on the other hand, has fast inference even when the number of classes is large (more than 300k in this dataset), providing a significant speed-up at test time. Overall, fastText obtains a model of better quality more than an order of magnitude faster, and the speed-up at test time is even larger (a 600× speedup). Table 4 shows some qualitative examples.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, which are then fed to a linear classifier. The authors test their model, fastText, against other models using an evaluation protocol similar to that of Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction dataset. fastText achieved comparable accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35076Bag of Tricks for Efficient Text Classification2018-03-22T01:54:23Z<p>Cs3yang: /* Feature Hashing */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have more recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their use on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large datasets while maintaining good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to the two tasks: tag predictions and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
* PLACEHOLDER: we should look at when this <br />
<br />
=== Natural-Language Processing ===<br />
<br />
* Briefly describe the difference between NLP and text-mining. Maybe comment later about whether fastText accomplishes NLP.<br />
<br />
<br />
=== Char-CNN ===<br />
<br />
<br />
=== Char-CRNN ===<br />
<br />
<br />
=== VDCNN ===<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
Traditionally, text classification methods were centered around linear classifiers. A linear classifier is limited by its inability to share parameters among features and classes. As a result, generalization can suffer when the output space is large and some classes have very few examples. The authors of this paper attempt to improve on the performance of linear classifiers with two key features: a rank constraint and a fast loss approximation. Before we get into that, we must better understand how linear classifiers are trained.<br />
<br />
Consider each text and each label as a vector in space. The model learns the coordinates of these vectors so that each text vector is close to the vector of its associated label. The score of a text and a label is the dot product of their vectors. The softmax function normalizes this score against the scores of the same text with every other possible label, and the result is the probability that the text has that label. Stochastic gradient descent is then used to keep updating the coordinates so as to maximize the probability of the correct label for every text. This is clearly computationally expensive, as the score for every possible label in the training set must be computed for each text. <br />
<br />
The softmax function gives the probability that a text is associated with label <math>j</math>, given <math>K</math> labels in the training set: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
where <math>\mathbf{x}</math> represents the text vector and <math>\mathbf{w}_j</math> represents the vector of label <math>j</math>.<br />
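<br />
As a rough illustration of the formula, the sketch below computes the softmax probabilities for one text vector and three label vectors. The vectors are made-up numbers, and subtracting the maximum score is a standard numerical-stability step that does not change the result.<br />

```python
import math


def softmax_probs(x, W):
    """x: text vector; W: list of label vectors w_1..w_K. Returns P(y=j|x) for each j."""
    # Score of each label: dot product between the text vector and the label vector.
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in W]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)  # normalizer: requires the score of every label, hence O(Kd)
    return [e / total for e in exps]


x = [1.0, 2.0]
W = [[0.5, 0.1], [0.2, 0.9], [-0.3, 0.4]]
probs = softmax_probs(x, W)
print(probs.index(max(probs)))  # 1, the label with the largest dot product
```

The normalizer in the denominator is what makes the full softmax expensive: every label has to be scored before a single probability can be returned.<br />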
<br />
Now let’s look at the improved method of linear classifiers with a rank constraint and fast loss approximation. Refer to the image below.<br />
<br />
[[File:model image.png]]<br />
<br />
Using the weight matrices, the ngram features of the input are first looked up to find word representations, then averaged into hidden text representation. It is then fed to a linear classifier. Finally, the softmax function is used to compute the probability distribution over the predefined classes. For a set of N documents, the model minimizes the negative log likelihood over the classes. The classifier trains on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate. <br />
<br />
[[File:formula explained.png]]<br />
<br />
Two changes applied in this model architecture are the hierarchical softmax function, which improves speed when the number of classes is large, and the hashing trick, which keeps the mapping of n-grams (used to capture local word order) fast and memory-efficient. These two changes are explained more thoroughly in the upcoming sections.<br />
<br />
=== Softmax and Hierarchy Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over predefined classes. It calculates the probability of a text belonging to each possible label and outputs a probability estimate for each label. The denominator of the softmax function serves to normalize the probabilities, so that a single text receives a probability for each label and these probabilities sum to one. The label with the highest probability can then be chosen as the label for the text. <br />
<br />
However, the softmax function does have a computational complexity of <math> O(Kd) </math> where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is due to the nature of the softmax function since each function calculation requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the differences in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose we have a binary tree structure based on Huffman coding for the softmax function, where each node has at most two children or leaves. Huffman coding trees provide a means to optimize binary trees where the classes with lowest frequencies are placed in the lower leaves of the tree and the highest frequency classes are placed near the root of the tree, which minimizes the path of the random walk for more frequently labelled classes. A probability for each path, whether we are travelling right or left from a node, is calculated using the sigmoid function.<br />
The idea of this method is to represent the output classes as the leaves on this tree and a random walk then assigns probabilities for these classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> represents the leaf node on which a class is located, at depth <math> l+1 </math>, and <math> n_1, n_2, \ldots, n_l </math> represent the parent nodes of that leaf. </i><br />
<br />
Huffman coding trees are efficient since computational runtime is reduced to <math> O(d \log_2(K)) </math>.<br />
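<br />
The set-up above can be sketched as follows. This is an illustrative toy, not the fastText implementation: the class frequencies, the random node vectors, and the helper names <code>build_huffman</code> and <code>class_prob</code> are all assumptions.<br />

```python
import heapq
import itertools
import math
import random


def build_huffman(freqs):
    """freqs: {label: count}. Returns {label: [(node_id, bit), ...]} root-to-leaf paths."""
    tie = itertools.count()  # tie-breaker so the heap never compares subtrees
    heap = [(f, next(tie), label) for label, f in freqs.items()]
    heapq.heapify(heap)
    node_ids = itertools.count()
    while len(heap) > 1:
        # Merge the two least frequent subtrees under a new internal node.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (next(node_ids), left, right)))
    paths = {}

    def walk(tree, path):
        if isinstance(tree, tuple):  # internal node: (node_id, left, right)
            node, left, right = tree
            walk(left, path + [(node, 0)])
            walk(right, path + [(node, 1)])
        else:  # leaf: a class label
            paths[tree] = path

    walk(heap[0][2], [])
    return paths


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def class_prob(label, hidden, paths, node_vecs):
    """Product of the left/right branch probabilities along the path to `label`."""
    p = 1.0
    for node, bit in paths[label]:
        z = sum(a * b for a, b in zip(node_vecs[node], hidden))
        p *= sigmoid(z) if bit == 1 else 1.0 - sigmoid(z)
    return p


freqs = {"cat": 50, "dog": 30, "bird": 15, "fish": 5}
paths = build_huffman(freqs)
rng = random.Random(0)
# One (random, untrained) vector per internal node; a tree with K leaves has K - 1 of them.
node_vecs = {n: [rng.uniform(-1, 1) for _ in range(3)] for n in range(len(freqs) - 1)}
hidden = [0.2, -0.1, 0.4]
probs = {c: class_prob(c, hidden, paths, node_vecs) for c in freqs}
print({c: len(p) for c, p in paths.items()})  # frequent classes get shorter paths
```

Because the two branch probabilities at each node sum to one, the class probabilities sum to one without ever scoring all classes, and computing one class probability only touches the nodes on its path.<br />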
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is a method for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized such that all its elements add up to one (that is, each count is converted to the relative frequency of the word). If these frequencies exceed a certain level, they will activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information, since it considers single words only and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the training set does not cover the vocabulary of the testing set. <br />
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation by storing the n-local words adjacent to the initial word (or character, since N-grams can also be character based). Each word in the document is read one at a time, just like bag of words; however, a certain range of its neighbors is scanned as well. This range is known as the n-gram. Compared to bag of words, any N greater than 1 (N = 1 is the unigram) will contain more information. <br />
<br />
In the picture above, it gives an example of a Unigram (1-gram), which is the simplest possible version of this model. A unigram does not consider previous words and just chooses random words based on how common they are in general. It also shows a Bigram (2-gram), where the previous word is considered. A trigram considers the previous two words, and so on up to N-grams, which consider the previous N-1 words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. A bigram model approximates each factor by conditioning only on the previous word. For example, take the sentence, "How long can this go on?" With a bigram model, it is modeled as follows:<br />
<br />
P(“How long can this go on?”) = P(How) P(long | How) P(can | long) P(this | can) P(go | this) P(on | go) P(? | on)<br />
<br />
Returning to the chain rule, the bigram case reduces the equation above to a product of conditional probabilities: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case as: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide useful predictive clues. For example, suppose you want the model to learn subject-verb agreement from the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
You would need an 11-gram, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The example below illustrates this.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice how A and B are the same vector. As with bag of words, this is the aforementioned problem: '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because a bigram takes one neighboring word into account. However, A and C still share an element, namely "I love". If we were to further increase N, it would be easier to distinguish the two. However, the consequence of operating with a higher N is that the run time will increase.<br />
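<br />
The two tables can be reproduced with a few lines of Python. This is a minimal sketch that assumes whitespace tokenization; the helper name <code>ngrams</code> is made up for illustration.<br />

```python
from collections import Counter


def ngrams(text, n):
    """Return the list of word n-grams of a whitespace-tokenized sentence."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentences = {"A": "I love apple", "B": "apple love I", "C": "I love sentence"}

unigrams = {k: Counter(ngrams(s, 1)) for k, s in sentences.items()}
bigrams = {k: Counter(ngrams(s, 2)) for k, s in sentences.items()}

print(unigrams["A"] == unigrams["B"])       # True: word order is lost
print(bigrams["A"] == bigrams["B"])         # False: bigrams keep local order
print(sorted(bigrams["A"] & bigrams["C"]))  # ['I love']
```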
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification. It maps words to indices of an array or matrix of fixed size by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, which reduces the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps keys of arbitrary size to a fixed range of values. For example, consider a hash function <math> h </math> that maps each word to its index in the following dictionary.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(``\text{cats"}) = 3 </math>. Consider the sentence <math> `` \text{ I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have a list of all words in that sentence <math> x = [``I", ``love", ``cats", ``but", ``Mary", ``hate", ``cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
By applying hash function to each word of this sentence, we will get a list of returned indexes [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map will be <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
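<br />
In practice the dictionary lookup above is replaced by a real hash function taken modulo the table size, so that no vocabulary needs to be stored. The sketch below uses CRC32 from the Python standard library purely as an example of a deterministic hash; the choice of hash, the bucket count of 16, and the helper names are assumptions, not what fastText uses.<br />

```python
import zlib


def bucket(token, n_buckets):
    """Map a token to a bucket index with a deterministic hash (CRC32 here)."""
    return zlib.crc32(token.encode("utf-8")) % n_buckets


def hash_vector(tokens, n_buckets):
    """Count how many tokens fall into each bucket."""
    vec = [0] * n_buckets
    for t in tokens:
        vec[bucket(t, n_buckets)] += 1
    return vec


tokens = "I love cats but Mary hate cats".split()
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
# Words and bigrams share one fixed-size table, whatever the vocabulary size.
vec = hash_vector(tokens + bigrams, n_buckets=16)
print(len(vec), sum(vec))  # 16 13
```

The vector length stays fixed at the number of buckets, and distinct tokens may share a bucket, which is exactly the collision case.<br />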
<br />
A hash collision happens when two distinct keys are mapped to the same index. In the example above, if both "Mary" and "I" are mapped to index 0, the hashed feature map becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces another hash function <math> \xi </math> to determine the sign of each word's contribution at the returned index. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
Suppose <math> \xi(``I") = 1 \text{ and } \xi(``Mary") = -1 </math>, with <math> \xi = 1 </math> for every other word. With "Mary" and "I" colliding at index 0 as before, the signed hashed feature map becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, the contributions of colliding words "cancel out" on average, which yields an unbiased estimate.<br />
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. Another way to represent the features is TFIDF, short for '''term frequency–inverse document frequency''', which represents the importance of a word to a document.<br />
<br />
'''Term Frequency (TF)''' generally measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency, so that a word is not deemed important if it is common across documents in general, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the number of documents that contain word <math> t </math>.<br />
<br />
== Experiment ==<br />
In this experiment fastText was compared on two classification problems with various other text classifiers. <br />
<br />
The first classification problem is sentiment analysis, where fastText is compared to existing text classifiers. The second is tag prediction, where fastText is evaluated on a dataset with a much larger output space.<br />
<br />
The Vowpal Wabbit library, written in C++, can also be used to implement our model. However, compared to this library, our tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
*Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. To test it, the evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images, each with captions, titles and tags. Here, the spotlight was on using the fastText text classifier to predict the tags associated with each image without using the image itself, but rather the information associated with it, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur fewer than 100 times and split the data into a train, validation and test set. The train set consists of approximately 90% of the dataset, the validation set of approximately 1%, and the test set of approximately 0.5%. Removing infrequently occurring tags and words eliminates some noise and helps our model learn better. A [https://github.com/facebookresearch/fastText script] has been released explaining the breakup of the data. The vocabulary size is 297,141 and there are 312,116 unique tags. Performance is reported as precision at 1 (P@1), that is, the fraction of test examples for which the top-ranked predicted tag is one of the true tags. The baseline used is frequency-based, predicting the most frequent tag. (What does this do and how does it help?)<br />
<br />
For comparison purposes we looked into another tag prediction model, Tagspace. It is similar to our model but is based on the Wsabie model of Weston et al. The Tagspace model is described using convolutions. For faster yet comparable performance we consider the linear version of this model.<br />
<br />
<br />
====Results and training time. ====<br />
<br />
Insert Table 5<br />
<br />
The above table presents a comparison of our fastText model to other baselines. <br />
<br />
After running fastText for 5 epochs, we compare it to Tagspace for two sizes of the hidden layer, i.e., 50 and 200. Both models achieve similar performance with a small hidden layer, with fastText being slightly more accurate. However, the addition of bigrams boosts the accuracy by a significant amount (Talk a little more about the accuracy the units and what it implies).<br />
<br />
Finally, at test time, our model performed significantly better. The Tagspace algorithm computes the scores for all the classes, which takes a significant amount of time. fastText, on the other hand, has fast inference even when the number of classes is large (more than 300k in this dataset), providing a significant speed-up at test time. Overall, fastText obtains a model of better quality more than an order of magnitude faster, and the speed-up at test time is even larger (a 600× speedup). Table 4 shows some qualitative examples.<br />
<br />
== Conclusion ==<br />
The authors propose a simple baseline method for text classification that performs well on large-scale datasets. Word features are averaged together to represent sentences, which are then fed to a linear classifier. The authors test their model, fastText, against other models using an evaluation protocol similar to that of Zhang et al. (2015) for sentiment analysis, and then evaluate its ability to scale to a large output space on a tag prediction dataset. fastText achieved comparable accuracy and was found to train significantly faster.<br />
<br />
== Commentary and Criticism ==<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35059Bag of Tricks for Efficient Text Classification2018-03-21T22:50:03Z<p>Cs3yang: /* Experiment */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have more recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their use on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large datasets while maintaining good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to the two tasks: tag predictions and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
* PLACEHOLDER: we should look at when this <br />
<br />
=== Natural-Language Processing ===<br />
<br />
* Briefly describe the difference between NLP and text-mining. Maybe comment later about whether fastText accomplishes NLP.<br />
<br />
<br />
* might want to briefly mention types of models that are used in the experiment for comparison<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
[[File:model image.png]]<br />
<br />
An efficient baseline for sentence classification is to represent sentences as bag of words (BoW) and train a linear classifier, such as logistic regression or a support vector machine (SVM).<br />
<br />
A linear classifier is limited by its inability to share parameters among features and classes. As a result, generalization can suffer when the output space is large and some classes have very few examples (low frequency). The model in the paper is built on top of a linear model with a rank constraint and a fast loss approximation. The image above illustrates a simple linear model with a rank constraint.<br />
<br />
Some of the most common solutions to this problem are to either factorize the linear classifier into low-rank matrices or to use multi-layer neural networks.<br />
<br />
To better understand how a linear classifier is trained, we explain it more thoroughly here. <br />
Consider each text and each label as a vector in space. The model learns the coordinates of these vectors so that each text vector is close to the vector of its associated label. The score of a text and a label is the dot product of their vectors. The softmax function normalizes this score against the scores of the same text with every other possible label, and the result is the probability that the text has that label. Stochastic gradient descent is then used to keep updating the coordinates so as to maximize the probability of the correct label for every text. This is clearly computationally expensive, as the score for every possible label in the training set must be computed for each text. <br />
<br />
The probability that the softmax function returns for a text with vector <math>\mathbf{x}</math> and label <math>j</math>, with <math>K</math> labels in the training set, is: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
<br />
<br />
However, fastText uses hierarchical softmax in order to significantly reduce computational complexity, which in turn reduces the running time as well. This will be more thoroughly explained in the next section.<br />
<br />
=== Softmax and Hierarchy Softmax ===<br />
As mentioned above, the softmax function is used to compute the probability distribution over predefined classes. It calculates the probability of a text belonging to each possible label and outputs a probability estimate for each label. The denominator of the softmax function serves to normalize the probabilities, so that a single text receives a probability for each label and these probabilities sum to one. The label with the highest probability can then be chosen as the label for the text. <br />
<br />
However, the softmax function does have a computational complexity of <math> O(Kd) </math> where <math>K</math> is the number of classes and <math>d</math> is the number of dimensions in the hidden layer of the neural network. This is due to the nature of the softmax function since each function calculation requires normalizing the probabilities over all potential classes. This runtime is not ideal when the number of classes is large, and for this reason a hierarchical softmax function is used. We can see the differences in computational efficiency in the following set-up for hierarchical softmax. <br />
<br />
Suppose we have a binary tree structure based on Huffman coding for the softmax function, where each node has at most two children or leaves. Huffman coding trees provide a means to optimize binary trees where the classes with lowest frequencies are placed in the lower leaves of the tree and the highest frequency classes are placed near the root of the tree, which minimizes the path of the random walk for more frequently labelled classes. A probability for each path, whether we are travelling right or left from a node, is calculated using the sigmoid function.<br />
The idea of this method is to represent the output classes as the leaves on this tree and a random walk then assigns probabilities for these classes based on the path taken from the root of the tree. The probability of a certain class is then calculated as: <br />
<br />
<div style="text-align: center;"> <math> P(n_{l+1}) = \displaystyle\prod_{i=1}^{l} P(n_i) </math> <br> </div><br />
<br />
<i>where <math>n_{l+1}</math> represents the leaf node on which a class is located, at depth <math> l+1 </math>, and <math> n_1, n_2, \ldots, n_l </math> are the parent nodes of that leaf. </i><br />
<br />
Huffman coding trees are efficient since computational runtime is reduced to <math> O(d \log_2(K)) </math>.<br />
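<br />
The set-up above can be sketched as follows. This is a minimal illustration under stated assumptions, not the fastText implementation: the tag counts are made up, and <code>node_score</code> is a hypothetical stand-in for the inner product between the text vector and the vector stored at an internal node.<br />

```python
import heapq
import math

def huffman_codes(freqs):
    """Return {label: list of 0/1 branch choices from the root to its leaf}."""
    # Heap entries are (frequency, tie_breaker, tree); a tree is either a
    # label (leaf) or a (left, right) pair (internal node).
    heap = [(f, i, label) for i, (label, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # the two least frequent subtrees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))
        counter += 1
    codes = {}
    def walk(tree, path):
        if isinstance(tree, tuple):
            walk(tree[0], path + [0])
            walk(tree[1], path + [1])
        else:
            codes[tree] = path
    walk(heap[0][2], [])
    return codes

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def leaf_probability(code, node_score):
    """Multiply the branch probabilities along the path from the root to a leaf."""
    p = 1.0
    for depth, branch in enumerate(code):
        prefix = tuple(code[:depth])          # identifies the current internal node
        p_right = sigmoid(node_score(prefix))
        p *= p_right if branch == 1 else 1.0 - p_right
    return p

freqs = {"cat": 50, "dog": 30, "bird": 15, "fish": 5}   # hypothetical tag counts
codes = huffman_codes(freqs)
node_score = lambda prefix: 0.3 * len(prefix) - 0.2     # hypothetical node scores
probs = {label: leaf_probability(code, node_score) for label, code in codes.items()}
print(codes)
print(probs)
```

Only the nodes on one root-to-leaf path are evaluated for a given class, and frequent classes get shorter paths, which is where the logarithmic cost comes from.<br />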
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is a method for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized so that its elements add up to one (i.e., each count is converted to the word's relative frequency). If these frequencies exceed a certain level, they will activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information, because it treats words individually and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the training set does not cover the entire dictionary of the testing set. <br />
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation by storing the n-local words adjacent to the initial word (or characters, since N-grams can also be character based). Each word in the document is read one at a time, just as in bag of words, but a certain range of its neighbors is scanned as well. This range is known as the n-gram. Compared to bag of words, any N greater than 1 (N = 1 being the unigram) will contain more information. <br />
<br />
The picture above gives an example of a unigram (1-gram), which is the simplest possible version of this model. A unigram model does not consider previous words and just chooses words at random based on how common they are in general. It also shows a bigram (2-gram), where the previous word is considered. A trigram would consider the previous two words, and so on up to N-grams, which consider the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. A bigram model approximates each factor by conditioning only on the previous word. For example, take the sentence "How long can this go on?" We can model it as follows:<br />
<br />
P(“How long can this go on?”) = P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
In general, the chain rule reduces to the following product of conditional probabilities in the bigram case: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case as: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
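<br />
The bigram product can be illustrated with a small relative-frequency estimate. This is a sketch under assumptions that are not in the source: the two-sentence corpus is made up, and a sentence-start token <code>&lt;s&gt;</code> is introduced so that the first word also has a "previous word".<br />

```python
from collections import Counter

def bigram_model(sentences):
    """Estimate P(cur | prev) by relative frequency from tokenized sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens             # "<s>" marks the sentence start (assumption)
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1         # how often prev is followed by any word
            bigram_counts[(prev, cur)] += 1   # how often prev is followed by cur
    def prob(cur, prev):
        if context_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, cur)] / context_counts[prev]
    return prob

def sentence_probability(tokens, prob):
    """Product of P(w_k | w_{k-1}) over the sentence."""
    p = 1.0
    padded = ["<s>"] + tokens
    for prev, cur in zip(padded, padded[1:]):
        p *= prob(cur, prev)
    return p

# Hypothetical toy corpus.
corpus = [
    "how long can this go on ?".split(),
    "how long is this ?".split(),
]
prob = bigram_model(corpus)
p = sentence_probability("how long can this go on ?".split(), prob)
print(p)
```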
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide useful predictive clues. For example, suppose you want the model to learn number agreement in the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
You would need an 11-gram to capture this dependency, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The following example compares these representations on three short sentences.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B have the same vector. The unigram representation is the same as bag of words and suffers from the aforementioned problem: '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because a bigram takes into account one word of local context. However, A and C still share an element, namely "I love". If we were to further increase N, it would become easier to distinguish between the two. However, the consequence of operating with a higher N is that the run time will increase.<br />
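<br />
The counts for sentences A, B and C can be reproduced with a short Python sketch. The helper name <code>ngram_counts</code> is hypothetical, and whitespace tokenization is an assumption made for this toy example.<br />

```python
from collections import Counter

def ngram_counts(sentence, n):
    """Count the word n-grams of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)

A, B, C = "I love apple", "apple love I", "I love sentence"

unigram_A, unigram_B = ngram_counts(A, 1), ngram_counts(B, 1)
bigram_A, bigram_B, bigram_C = (ngram_counts(s, 2) for s in (A, B, C))

print(unigram_A == unigram_B)   # unigrams ignore word order
print(bigram_A == bigram_B)     # bigrams do not
print(bigram_A & bigram_C)      # bigrams shared by A and C
```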
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification. It maps words to indices of a fixed-size array or matrix by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, reducing the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps an arbitrary set of keys to a fixed-size range of indices. For example, consider a hash function <math> h </math> that maps each feature to the index listed for its key in the table below.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(\text{"cats"}) = 3 </math>. Consider the sentence <math> \text{"I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have a list of all words in the sentence, <math> x = ["I", "love", "cats", "but", "Mary", "hate", "cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
By applying the hash function to each word of this sentence, we get the list of indices [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map is <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. In the example above, if both "Mary" and "I" are mapped to index 0, the output hash table becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces another hash function <math> \xi </math> to determine the sign of the contribution at the returned index. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
If <math> \xi("I") = 1 \text{ and } \xi("Mary") = -1 </math>, then our signed hash map becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, collisions will "cancel out", and therefore, achieve an unbiased estimate.<br />
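<br />
The three tables above can be reproduced with the sketch below. To keep the example deterministic, the "hash function" is the lookup table from this section rather than a real hash, and the sign function <code>xi</code> is fixed by hand; both are illustrative assumptions.<br />

```python
def hashed_features(tokens, h, size, sign=None):
    """phi[i] = sum of sign(x_j) over tokens x_j with h(x_j) == i (sign is 1 if unsigned)."""
    phi = [0] * size
    for tok in tokens:
        phi[h(tok)] += sign(tok) if sign else 1
    return phi

# Lookup table standing in for the hash function h.
index = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 6}
tokens = ["I", "love", "cats", "but", "Mary", "hate", "cats"]

no_collision = hashed_features(tokens, index.__getitem__, 7)

# Force a collision: "Mary" now lands on index 0, the same slot as "I".
colliding = dict(index, Mary=0)
unsigned = hashed_features(tokens, colliding.__getitem__, 7)

# Signed variant: xi("Mary") = -1 and xi = +1 for every other word.
xi = lambda tok: -1 if tok == "Mary" else 1
signed = hashed_features(tokens, colliding.__getitem__, 7, sign=xi)

print(no_collision, unsigned, signed)
```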
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. Another way to represent the features is TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to a document.<br />
<br />
'''Term Frequency (TF)''' generally measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency, so that a word is not deemed important if it is common across documents in general, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log \left( \frac{N}{| \{d\in D:t\in d \} |} \right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the total number of documents that contain word <math> t </math>.<br />
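<br />
The two definitions above combine into the following sketch. The three toy documents are made up, and the natural logarithm is assumed because the text does not state a base.<br />

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: tf-idf weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))          # count each term once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)                  # raw count f_{t,d}
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the cat ran".split(),
]
weights = tfidf(docs)
print(weights[0])
```

A word such as "the" that occurs in every document receives weight zero, whatever its raw count.<br />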
<br />
== Experiment ==<br />
In this experiment, fastText was compared with various other text classifiers on two classification problems. <br />
<br />
The first classification problem is sentiment analysis, where fastText is compared to existing text classifiers. The second is tag prediction, where fastText is evaluated on a dataset with a larger output space.<br />
<br />
The Vowpal Wabbit library, written in C++, can also be used to implement our model. However, compared to this library, our tailored implementation is at least 2-5× faster in practice.<br />
<br />
<br />
<br />
* Sentiment Analysis<br />
*Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset====<br />
<br />
Scalability to large datasets is an important feature of a model. To test this, evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images, each with captions, titles and tags. Here, the spotlight was on using the fastText text classifier to predict the tags associated with each image without using the image itself, but rather the information associated with it, such as its title and caption.<br />
<br />
The methodology behind this classification problem was to remove the words and tags that occur fewer than 100 times and split the data into train, validation and test sets. The train set consists of approximately 90% of the dataset, the validation set consists of approximately 1%, and the test set of 0.5%. Removing infrequently occurring tags and words eliminates some noise and helps our model learn better. A [https://github.com/facebookresearch/fastText script] has been released explaining the breakup of the data. The vocabulary size is 297,141 and there are 312,116 unique tags. Performance is reported as precision at 1 (P@1). The baseline used is frequency-based: it always predicts the most frequent tag.<br />
<br />
For comparison purposes we looked into another tag prediction model, Tagspace. It is similar to our model but is based on the Wsabie model of Weston et al. The Tagspace model is described using convolutions. For faster yet comparable performance we consider the linear version of this model.<br />
<br />
<br />
====Results and training time. ====<br />
<br />
Insert Table 5<br />
<br />
The above table presents a comparison of our fastText model to other baselines. <br />
<br />
After running fastText for 5 epochs, we compare it to Tagspace for two sizes of the hidden layer, 50 and 200. Both models achieve similar performance with a small hidden layer, with fastText being slightly more accurate. However, the addition of bigrams boosts the accuracy by a significant amount.<br />
<br />
Finally, at test time, our model was significantly faster. The Tagspace algorithm computes the scores for all the classes, which takes a significant amount of time. FastText, on the other hand, has fast inference even for a large number of classes (more than 300k in this dataset), providing a significant speed-up at test time. Overall, fastText obtains a better-quality model more than an order of magnitude faster, and the speed-up at test time is larger still (a 600× speedup). Table 4 shows some qualitative examples.<br />
<br />
== Conclusion ==<br />
<br />
== Commentary and Criticism ==<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35033Bag of Tricks for Efficient Text Classification2018-03-21T20:27:35Z<p>Cs3yang: </p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have more recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their use on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large datasets while maintaining good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to two tasks, tag prediction and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that fastText can be trained on more than one billion words in less than ten minutes while performing on par with the state of the art.<br />
<br />
== Background ==<br />
<br />
* PLACEHOLDER: we should look at when this <br />
<br />
=== Natural-Language Processing ===<br />
<br />
* Briefly describe the difference between NLP and text-mining. Maybe comment later about whether fastText accomplishes NLP.<br />
<br />
<br />
* might want to briefly mention types of models that are used in the experiment for comparison<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
[[File:model image.png]]<br />
<br />
A simple and efficient baseline for sentence classification is to represent sentences as a bag of words (BoW) and train a linear classifier, such as logistic regression or a support vector machine (SVM).<br />
<br />
A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly when the output space is large and some classes have very few examples. The model in the paper is built on top of a linear model with a rank constraint and a fast loss approximation. The image above illustrates a simple linear model with a rank constraint.<br />
<br />
Some of the most common solutions to this problem are to either factorize the linear classifier into low-rank matrices or to use multi-layer neural networks.<br />
<br />
To better understand how a linear classifier is trained, we explain it more thoroughly. <br />
Consider each text and each label as a vector in space. The model learns the coordinates of the text vector so that it lies close to the vector of its associated label. A score is computed from the text vector and its label vector, and the softmax function normalizes this score against the scores of the same text with every other possible label. The result is the probability that the text has its associated label. Stochastic gradient descent is then used to keep updating the coordinates until the probability of the correct label for every text is maximized. This is clearly computationally expensive, as the score for every possible label in the training set must be computed for each text. <br />
<br />
The probability that the softmax function returns for a text with K labels in the training set is: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
<br />
<br />
However, fastText uses hierarchical softmax in order to significantly reduce computational complexity, which in turn reduces the running time as well. This will be more thoroughly explained in the next section.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
The softmax function ''f'' is used to compute the probability distribution over the predefined classes. The paper minimizes the negative log-likelihood of the softmax output over the classes:<br />
<br />
* <math> - \frac{1}{N} \sum_{n=1}^N y_n \log ( f(BAx_n))</math><br />
<br />
In this formula, <math>A</math> and <math>B</math> are weight matrices learned from the training set, <math> x_n </math> is the normalized bag of features of the <math>n</math>-th document, and <math> y_n </math> is its label.<br />
<br />
<br />
<br />
Remark: The negative log-likelihood is a multiclass cross-entropy. For a two-class problem (dog or cat), softmax outputs two values in [0,1] that sum to 1 (e.g., Dog = 0.6, Cat = 0.4). This extends directly to more classes. In contrast, a sigmoid outputs one value p, and in the binary case the other value can be derived as 1 - p.<br />
<br />
Softmax has a complexity of <math>O(kh)</math>, where k is the number of classes and h is the dimension of the text representation. The function that the authors used for their model is a variation of the softmax function known as '''hierarchical softmax'''. Hierarchical softmax is based on the Huffman coding tree and reduces the complexity to <math>O(h \log_2(k))</math>.<br />
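<br />
The negative log-likelihood above can be sketched in plain Python. The matrix sizes and values are hypothetical, and labels are given as class indices, which is equivalent to multiplying the log-probabilities by a one-hot label vector.<br />

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def softmax(z):
    m = max(z)                                # subtract the max for numerical stability
    e = [math.exp(s - m) for s in z]
    total = sum(e)
    return [s / total for s in e]

def nll(docs, labels, A, B):
    """-(1/N) * sum over n of log f(B A x_n)[y_n], where f is the softmax."""
    total = 0.0
    for x, y in zip(docs, labels):
        hidden = matvec(A, x)                 # text representation A x_n
        probs = softmax(matvec(B, hidden))    # f(B A x_n)
        total -= math.log(probs[y])
    return total / len(docs)

# Hypothetical sizes: 3 features, hidden dimension 2, 2 classes.
A = [[0.1, 0.4, -0.2], [0.3, -0.1, 0.5]]
B = [[1.0, -1.0], [-1.0, 1.0]]
docs = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]     # normalized bags of features
labels = [0, 1]
loss = nll(docs, labels, A, B)
print(loss)
```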
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is a method for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training subset to be used as the “dictionary” for the testing set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized so that its elements add up to one (i.e., each count is converted to the word's relative frequency). If these frequencies exceed a certain level, they will activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information, because it treats words individually and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the training set does not cover the entire dictionary of the testing set. <br />
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation by storing the n-local words adjacent to the initial word (or characters, since N-grams can also be character based). Each word in the document is read one at a time, just as in bag of words, but a certain range of its neighbors is scanned as well. This range is known as the n-gram. Compared to bag of words, any N greater than 1 (N = 1 being the unigram) will contain more information. <br />
<br />
The picture above gives an example of a unigram (1-gram), which is the simplest possible version of this model. A unigram model does not consider previous words and just chooses words at random based on how common they are in general. It also shows a bigram (2-gram), where the previous word is considered. A trigram would consider the previous two words, and so on up to N-grams, which consider the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words 1 to n, <math> \omega_1^n = \omega_1 \cdots \omega_n </math>. By the chain rule of probability, the probability of the word sequence is <math> P(\omega_1^n) = P(\omega_1) P( \omega_2 | \omega_1) P( \omega_3 | \omega_1^2) \cdots P( \omega_n | \omega_1^{n-1}) </math>. A bigram model approximates each factor by conditioning only on the previous word. For example, take the sentence "How long can this go on?" We can model it as follows:<br />
<br />
P(“How long can this go on?”) = P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
In general, the chain rule reduces to the following product of conditional probabilities in the bigram case: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-1} )</math>.<br />
<br />
We can generalize this to the N-gram case as: <br />
<br />
<math>P(\omega_1^n) = \prod_{k=1}^n P(\omega_k | \omega_{k-(N-1)}^{k-1} )</math>.<br />
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide useful predictive clues. For example, suppose you want the model to learn number agreement in the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
You would need an 11-gram to capture this dependency, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The following example compares these representations on three short sentences.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B have the same vector. The unigram representation is the same as bag of words and suffers from the aforementioned problem: '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because a bigram takes into account one word of local context. However, A and C still share an element, namely "I love". If we were to further increase N, it would become easier to distinguish between the two. However, the consequence of operating with a higher N is that the run time will increase.<br />
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification. It maps words to indices of a fixed-size array or matrix by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, reducing the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps an arbitrary set of keys to a fixed-size range of indices. For example, consider a hash function <math> h </math> that maps each feature to the index listed for its key in the table below.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(\text{"cats"}) = 3 </math>. Consider the sentence <math> \text{"I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have a list of all words in the sentence, <math> x = ["I", "love", "cats", "but", "Mary", "hate", "cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
By applying the hash function to each word of this sentence, we get the list of indices [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map is <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. In the example above, if both "Mary" and "I" are mapped to index 0, the output hash table becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces another hash function <math> \xi </math> to determine the sign of the contribution at the returned index. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
If <math> \xi("I") = 1 \text{ and } \xi("Mary") = -1 </math>, then our signed hash map becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, collisions will "cancel out", and therefore, achieve an unbiased estimate.<br />
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. Another way to represent the features is TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to a document.<br />
<br />
'''Term Frequency (TF)''' generally measures the number of times that a word occurs in a document. '''Inverse Document Frequency (IDF)''' can be considered an adjustment to the term frequency, so that a word is not deemed important if it is common across documents in general, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log \left( \frac{N}{| \{d\in D:t\in d \} |} \right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the total number of documents that contain word <math> t </math>.<br />
<br />
== Experiment ==<br />
In this experiment, fastText was compared with various other text classifiers on two classification problems. The first classification problem is sentiment analysis, where fastText is compared to existing text classifiers. The second is tag prediction, where fastText is evaluated on a dataset with a larger output space.<br />
<br />
The Vowpal Wabbit library, written in C++, can also be used to implement our model. However, compared to this library, our tailored implementation is at least 2-5× faster in practice.<br />
<br />
#fastText was compared with various other text classifiers in two classification problems:<br />
* Sentiment Analysis<br />
*Tag Prediction <br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset and baseline====<br />
<br />
Scalability is an important feature of a model. To test this, evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images with captions, titles and tags. Here, the spotlight was on using the fastText text classifier to predict the tags associated with each image without using the image itself, but rather the information associated with it, such as its title and caption.<br />
<br />
The methodology behind this classifier was to remove the words and tags that occur fewer than 100 times and split the data into train, validation and test sets. Removing these rare words and tags helps our model learn better, thereby reducing the error rate. A script has been released explaining the breakup of the data. The train set contains 91,188,648 examples (1.5B tokens). The validation set has 930,497 examples and the test set 543,424. The vocabulary size is 297,141 and there are 312,116 unique tags, which are the output classes. Performance is reported as precision at 1 (P@1). The baseline used is frequency-based: it always predicts the most frequent tag.<br />
<br />
For comparison purposes we looked into another tag prediction model, Tagspace. It is similar to our model but is based on the Wsabie model of Weston et al. The Tagspace model is described using convolutions. For faster yet comparable performance we consider the linear version of this model.<br />
<br />
<br />
====Results and training time====<br />
<br />
Insert Table 5<br />
<br />
Table 5 presents a comparison of fastText with the other baselines. <br />
<br />
fastText was run for 5 epochs and compared with Tagspace for two sizes of the hidden layer, 50 and 200. Both models achieve similar performance with the small hidden layer, with fastText being slightly more accurate. However, the addition of bigrams boosts the accuracy by a significant amount.<br />
<br />
Finally, fastText is significantly faster at test time. Tagspace must compute the scores for all classes, which makes it relatively slow, whereas fastText's inference stays fast when the number of classes is large (more than 300k in this data set). Overall, fastText obtains a better-quality model more than an order of magnitude faster, and the speed-up at test time is even larger (about 600×). Table 4 shows some qualitative examples.<br />
<br />
== Conclusion ==<br />
<br />
== Commentary and Criticism ==<br />
<br />
== Sources ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35031Bag of Tricks for Efficient Text Classification2018-03-21T20:14:09Z<p>Cs3yang: </p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have more recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their use on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data set while maintaining its good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to the two tasks: tag predictions and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
* PLACEHOLDER: we should look at when this <br />
<br />
=== Natural-Language Processing ===<br />
<br />
* Briefly describe the difference between NLP and text-mining. Maybe comment later about whether fastText accomplishes NLP.<br />
<br />
<br />
* might want to briefly mention types of models that are used in the experiment for comparison<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
[[File:model image.png]]<br />
<br />
An efficient baseline for sentence classification can be created by representing sentences as bags of words (BoW) and training a linear classifier, such as logistic regression or a support vector machine (SVM).<br />
<br />
A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly when the output space is large and some classes have very few examples (low frequency). The model in the paper is built on top of a linear model with a rank constraint and a fast loss approximation. The image above illustrates a simple linear model with a rank constraint.<br />
<br />
Some of the most common solutions to this problem are either to factorize the linear classifier into low-rank matrices or to use multi-layer neural networks.<br />
<br />
The training of the classifier can be understood as follows. <br />
Consider each text and each label as a vector in space. Training adjusts the coordinates of these vectors so that each text vector lies close to the vector of its associated label. A score is computed for the text vector with each label vector, and the softmax function normalizes the score for the correct label against the scores for that same text with every other possible label. The result is the probability that the text has its associated label. Stochastic gradient descent then repeatedly updates the coordinates to maximize the probability of the correct label over all texts. This is computationally expensive, because the score for every possible label in the training set must be computed for each text. <br />
<br />
The probability that the softmax function assigns to label <math> j </math> for a text <math> \mathbf{x} </math>, with K labels in the training set, is: <br />
<br />
<math> P(y=j | \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
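<br />
As a rough illustration of the softmax probability, the sketch below scores one text vector against a few label vectors and normalizes the scores. The function name <code>softmax_probabilities</code> and the toy vectors are assumptions made for this example and do not come from the paper.<br />

```python
import math


def softmax_probabilities(x, label_vectors):
    """Return P(y = j | x) for every label j, given a text vector x."""
    # Score of the text against each label: the dot product x^T w_j.
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in label_vectors]
    # Subtracting the maximum score avoids overflow and leaves the result unchanged.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data (hypothetical): one 2-dimensional text vector and K = 3 label vectors.
x = [1.0, 2.0]
W = [[0.5, 0.1], [0.2, 0.4], [-0.3, 0.0]]
probs = softmax_probabilities(x, W)
```

Every call touches all K label vectors, which is the cost that grows with the number of labels.<br />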
<br />
<br />
However, fastText uses hierarchical softmax in order to significantly reduce computational complexity, which in turn reduces the running time as well. This will be more thoroughly explained in the next section.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
The softmax function ''f'' is used to compute the probability distribution over the predefined classes. The model is trained by minimizing the negative log-likelihood over the classes, given in the article as:<br />
<br />
* <math> - \frac{1}{N} \sum_{n=1}^N y_n \log ( f(BAx_n))</math><br />
<br />
In this formula, <math> A </math> and <math> B </math> are weight matrices that are learned from the training set, <math> x_n </math> is the normalized bag of features of the <math> n </math>-th document, and <math> y_n </math> is its label.<br />
<br />
<br />
<br />
Remark: The negative log-likelihood is the multiclass cross-entropy. For a binary problem (dog or cat), softmax outputs two values in [0,1] that sum to 1 (for example, Dog = 0.6, Cat = 0.4), and this extends directly to more classes. In contrast, a sigmoid outputs a single value p, and in the binary case the other value is derived as 1 - p.<br />
<br />
Softmax has a complexity of O(kh), where k is the number of classes and h is the dimension of the text representation. The authors instead use a variation of the softmax function known as '''hierarchical softmax'''. It is based on a Huffman coding tree and reduces the complexity to O(h log<sub>2</sub>(k)).<br />
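<br />
To give a feel for why a Huffman coding tree helps, the sketch below builds the tree from class frequencies and reports how deep each class sits. Each level corresponds to one binary decision, so the depth approximates the number of node scores needed for that class. The function name <code>huffman_depths</code> and the class counts are hypothetical, and this is not the fastText implementation.<br />

```python
import heapq
import itertools


def huffman_depths(class_counts):
    """Return the depth (code length) of each class in a Huffman tree."""
    tie_breaker = itertools.count()  # keeps heap entries comparable on equal counts
    heap = [(count, next(tie_breaker), [label]) for label, count in class_counts.items()]
    heapq.heapify(heap)
    depths = {label: 0 for label in class_counts}
    while len(heap) > 1:
        # Merge the two least frequent subtrees under a new parent node.
        count_a, _, labels_a = heapq.heappop(heap)
        count_b, _, labels_b = heapq.heappop(heap)
        merged = labels_a + labels_b
        for label in merged:
            depths[label] += 1  # every class below the new parent moves one level down
        heapq.heappush(heap, (count_a + count_b, next(tie_breaker), merged))
    return depths


# Hypothetical class frequencies.
counts = {"cat": 50, "dog": 30, "bird": 15, "fish": 5}
depths = huffman_depths(counts)
```

Frequent classes end up near the root, so the classes that occur most often are also the cheapest to score.<br />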
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is a method for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training set to be used as the “dictionary” for the test set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized so that its elements add up to one (that is, each count is replaced by the word's relative frequency). If these frequencies exceed a certain level, they activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information, because it treats words individually and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the dictionary built from the training set does not cover the words in the test set. <br />
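<br />
A minimal sketch of a normalized bag-of-words vector follows. The helper <code>bag_of_words</code>, the whitespace tokenization and the small vocabulary are assumptions made for illustration only.<br />

```python
from collections import Counter


def bag_of_words(document, vocabulary):
    """Return word frequencies over the vocabulary, normalized to sum to one."""
    counts = Counter(document.lower().split())
    total = sum(counts[word] for word in vocabulary)
    if total == 0:
        # No vocabulary word occurs in the document.
        return [0.0] * len(vocabulary)
    return [counts[word] / total for word in vocabulary]


# Hypothetical dictionary and two documents that differ only in word order.
vocabulary = ["i", "love", "apple", "sentence"]
vec_a = bag_of_words("I love apple", vocabulary)
vec_b = bag_of_words("apple love I", vocabulary)
```

Both documents produce the same vector, which is the order-invariance weakness described above.<br />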
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation: it stores each word together with the n-1 words adjacent to it (or characters, since N-grams can also be character based). Each word in the document is read one at a time, just as with bag of words, but a certain range of its neighbors is scanned as well. This range is the n-gram. A unigram (N = 1) carries the same information as bag of words, while any N greater than 1 carries more. <br />
<br />
The picture above gives an example of a unigram (1-gram), which is the simplest possible version of this model. A unigram does not consider previous words and just chooses words at random based on how common they are in general. The picture also shows a bigram (2-gram), where the previous word is considered. A trigram considers the previous two words, and so on up to the N-gram, which considers the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words, word 1 to word n: [[File:1aaa.png ]]. By the properties of probability, we can model the probability of the word sequence with a bigram model as [[File:2aaa.png ]]. For example, take the sentence "How long can this go on?" We can model it as follows:<br />
<br />
P("How long can this go on?") = P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
Returning to the chain rule of probability, the above equation can be written as a product of conditional probabilities for the bigram case: [[File:3aaa.png ]].<br />
<br />
This generalizes to the N-gram case as [[File:4aaa.png ]].<br />
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide any useful predictive clues. For example, suppose you want the model to learn the singular and plural verb agreement in the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
The subject and its verb are ten words apart, so you would need an 11-gram, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The sentences below illustrate the difference between the representations.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B have the same vector. The unigram representation is identical to bag of words and has the aforementioned problem: '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because the bigram takes one neighboring word into account. However, A and C still share an element, namely "I love". If we were to increase N further, it would become easier to distinguish the two. However, the consequence of operating with higher-order N-grams is that the run time increases.<br />
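<br />
The unigram and bigram features of sentences A, B and C can be reproduced with a short sketch. The helper <code>ngrams</code> and the whitespace tokenization are assumptions made for this example.<br />

```python
def ngrams(sentence, n):
    """Return the word n-grams of a sentence, in order of appearance."""
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentences = {"A": "I love apple", "B": "apple love I", "C": "I love sentence"}

# Unigrams ignore order, so A and B contain exactly the same features.
unigrams = {name: sorted(ngrams(text, 1)) for name, text in sentences.items()}

# Bigrams keep one word of context, so A and B become different.
bigrams = {name: ngrams(text, 2) for name, text in sentences.items()}

# A and C still share the bigram "I love".
shared_a_c = set(bigrams["A"]) & set(bigrams["C"])
```

Calling <code>ngrams(text, 3)</code> on A and C yields no shared features, at the cost of a larger feature space.<br />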
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification. It maps words to indices of an array or matrix of fixed size by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, reducing the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps keys of arbitrary size to values in a fixed range. For example, consider a hash function <math> h </math> that maps each feature to the index listed for the corresponding key below.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(\text{"cats"}) = 3 </math>. Consider the sentence <math> \text{"I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have the list of all words in the sentence, <math> x = ["I", "love", "cats", "but", "Mary", "hate", "cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
By applying the hash function to each word of this sentence, we get the list of indices [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map is <br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash function, but the general idea is to pick one that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. For instance, suppose that in the example above both "Mary" and "I" are mapped to index 0. The output hash table then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces a second hash function <math> \xi </math> that determines the sign of each feature's contribution. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
If <math> \xi("I") = 1 \text{ and } \xi("Mary") = -1 </math>, and <math> \xi = 1 </math> for every other word, then our signed hash map becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, the contributions of colliding features cancel out in expectation, which yields an unbiased estimate.<br />
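<br />
The three hashed feature maps above can be reproduced with the sketch below. For clarity it uses explicit lookup tables in place of real hash functions, which is an assumption of this example. The names <code>hashed_features</code>, <code>index_of</code> and <code>sign_of</code> are hypothetical.<br />

```python
def hashed_features(tokens, index_of, sign_of, size):
    """Accumulate signed counts into a fixed-size feature vector."""
    features = [0] * size
    for token in tokens:
        # sign_of plays the role of xi; tokens not listed default to +1.
        features[index_of[token]] += sign_of.get(token, 1)
    return features


# Lookup table standing in for the hash function h from the example.
index_of = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 6}
tokens = ["I", "love", "cats", "but", "Mary", "hate", "cats"]

# No collision: every word has its own index.
plain = hashed_features(tokens, index_of, {}, 7)

# Collision: "Mary" now shares index 0 with "I".
colliding = dict(index_of, Mary=0)
unsigned_collision = hashed_features(tokens, colliding, {}, 7)

# Signed hashing: "Mary" contributes -1, so the collision at index 0 cancels.
signed_collision = hashed_features(tokens, colliding, {"Mary": -1}, 7)
```

In practice the indices and signs come from hash functions applied to the token itself, so no dictionary needs to be stored.<br />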
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. Another way to represent the features is TFIDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to a document.<br />
<br />
'''Term frequency (TF)''' measures the number of times that a word occurs in a document. '''Inverse document frequency (IDF)''' can be considered an adjustment to the term frequency, so that a word is not deemed important if it is common across documents in general, for example, "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math>.<br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the number of documents that contain the word <math> t </math>.<br />
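<br />
The two definitions above can be combined in a few lines. The sketch below uses the raw count for TF and the natural logarithm for IDF. The function name <code>tf_idf</code> and the three-document corpus are assumptions made for illustration.<br />

```python
import math


def tf_idf(term, document, corpus):
    """Raw-count term frequency multiplied by log inverse document frequency."""
    tf = document.count(term)
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        # The term appears nowhere in the corpus, so it carries no weight here.
        return 0.0
    return tf * math.log(len(corpus) / containing)


# Hypothetical corpus of tokenized documents.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "and", "the", "dog"],
]
weight_the = tf_idf("the", corpus[2], corpus)
weight_sat = tf_idf("sat", corpus[0], corpus)
```

The word "the" occurs in every document, so its weight is zero even though it appears twice in the third document, while "sat" occurs in only one document and gets a positive weight.<br />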
<br />
== Experiment ==<br />
In this experiment, fastText was compared with various other text classifiers on two classification problems:<br />
* Sentiment analysis, where it is compared against existing text classifiers.<br />
* Tag prediction, where its capacity to scale to a much larger output space is evaluated.<br />
<br />
The model could also be implemented with the Vowpal Wabbit library, which is written in C++. In practice, however, the authors report that their tailored implementation is at least 2-5× faster.<br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset and baseline====<br />
<br />
Scalability is an important property of a model. To test it, the evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images with captions, titles and tags. The focus here is on using the fastText text classifier to predict the tags associated with each image without using the image itself, relying only on the text associated with it, such as its title and caption.<br />
<br />
The data was prepared by removing the words and tags that occur fewer than 100 times and splitting the data into train, validation and test sets. (Removing such rare items may take out outliers and help the model learn, thereby reducing the error rate.) A script that recreates this split has been released. The train set contains 91,188,648 examples (1.5B tokens), the validation set 930,497 examples, and the test set 543,424 examples. The vocabulary size is 297,141 and there are 312,116 unique tags, each of which is a class that the classifier can predict. Performance is reported as precision at 1, that is, the fraction of test examples for which the single highest-scoring predicted tag is one of the true tags. The baseline is frequency-based: it always predicts the most frequent tag, which gives a reference score that any useful model should beat.<br />
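<br />
Reading "precision at 1" in its usual sense (the share of examples whose top predicted tag is among the true tags, which is assumed here), the metric and the frequency-based baseline can be sketched as follows. The function name <code>precision_at_1</code> and the four toy examples are hypothetical and are not taken from YFCC100M.<br />

```python
def precision_at_1(top_predictions, true_tag_sets):
    """Fraction of examples whose single top prediction is one of the true tags."""
    hits = sum(1 for pred, tags in zip(top_predictions, true_tag_sets) if pred in tags)
    return hits / len(top_predictions)


# Hypothetical test examples, each with a set of true tags.
true_tags = [{"beach", "sunset"}, {"dog"}, {"city", "night"}, {"beach"}]

# Hypothetical top-1 predictions of some model.
model_predictions = ["sunset", "cat", "night", "beach"]

# Frequency-based baseline: always predict the most frequent tag ("beach" here).
baseline_predictions = ["beach"] * len(true_tags)

model_score = precision_at_1(model_predictions, true_tags)
baseline_score = precision_at_1(baseline_predictions, true_tags)
```

A model is only informative if its score exceeds that of the baseline, which requires no learning at all.<br />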
<br />
For comparison, the authors also consider another tag prediction model, Tagspace. It is similar to fastText but is based on the Wsabie model of Weston et al. Although the Tagspace model is described using convolutions, the linear version is used here, since it achieves comparable performance while being much faster.<br />
<br />
<br />
====Results and training time====<br />
<br />
Insert Table 5<br />
<br />
Table 5 presents a comparison of fastText with the other baselines. <br />
<br />
fastText was run for 5 epochs and compared with Tagspace for two sizes of the hidden layer, 50 and 200. Both models achieve similar performance with the small hidden layer, with fastText being slightly more accurate. However, the addition of bigrams boosts the accuracy by a significant amount.<br />
<br />
Finally, fastText is significantly faster at test time. Tagspace must compute the scores for all classes, which makes it relatively slow, whereas fastText's inference stays fast when the number of classes is large (more than 300k in this data set). Overall, fastText obtains a better-quality model more than an order of magnitude faster, and the speed-up at test time is even larger (about 600×). Table 4 shows some qualitative examples.<br />
<br />
== Conclusion ==<br />
<br />
== Commentary and Criticism ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=35030Bag of Tricks for Efficient Text Classification2018-03-21T20:10:58Z<p>Cs3yang: </p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Text Classification is utilized by millions of web users on a daily basis. An example of an application of text classification is web search and content ranking. When a user searches a specific word that best describes the content they are looking for, text classification helps with categorizing the appropriate content. <br />
<br />
Neural networks have more recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their use on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks. <br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large data set while maintaining its good performance.<br />
<br />
The basis of the analysis for this paper was applying the classifier fastText to the two tasks: tag predictions and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
* PLACEHOLDER: we should look at when this <br />
<br />
=== Natural-Language Processing ===<br />
<br />
* Briefly describe the difference between NLP and text-mining. Maybe comment later about whether fastText accomplishes NLP.<br />
<br />
<br />
* might want to briefly mention types of models that are used in the experiment for comparison<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
[[File:model image.png]]<br />
<br />
An efficient baseline for sentence classification can be created by representing sentences as bags of words (BoW) and training a linear classifier, such as logistic regression or a support vector machine (SVM).<br />
<br />
A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly when the output space is large and some classes have very few examples (low frequency). The model in the paper is built on top of a linear model with a rank constraint and a fast loss approximation. The image above illustrates a simple linear model with a rank constraint.<br />
<br />
Some of the most common solutions to this problem are either to factorize the linear classifier into low-rank matrices or to use multi-layer neural networks.<br />
<br />
The training of the classifier can be understood as follows. <br />
Consider each text and each label as a vector in space. Training adjusts the coordinates of these vectors so that each text vector lies close to the vector of its associated label. A score is computed for the text vector with each label vector, and the softmax function normalizes the score for the correct label against the scores for that same text with every other possible label. The result is the probability that the text has its associated label. Stochastic gradient descent then repeatedly updates the coordinates to maximize the probability of the correct label over all texts. This is computationally expensive, because the score for every possible label in the training set must be computed for each text. <br />
<br />
The probability that the softmax function assigns to label <math> j </math> for a text <math> \mathbf{x} </math>, with K labels in the training set, is: <br />
* <math> P(y=j | \mathbf{x}) = \frac{e^{\mathbf{x}^T \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^T \mathbf{w}_k}} </math><br />
<br />
[[File:softmax function.png]]<br />
<br />
However, fastText uses hierarchical softmax in order to significantly reduce computational complexity, which in turn reduces the running time as well. This will be more thoroughly explained in the next section.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
The softmax function ''f'' is used to compute the probability distribution over the predefined classes. The model is trained by minimizing the negative log-likelihood over the classes, given in the article as:<br />
<br />
* <math> - \frac{1}{N} \sum_{n=1}^N y_n \log ( f(BAx_n))</math><br />
<br />
In this formula, <math> A </math> and <math> B </math> are weight matrices that are learned from the training set, <math> x_n </math> is the normalized bag of features of the <math> n </math>-th document, and <math> y_n </math> is its label.<br />
<br />
<br />
<br />
Remark: The negative log-likelihood is the multiclass cross-entropy. For a binary problem (dog or cat), softmax outputs two values in [0,1] that sum to 1 (for example, Dog = 0.6, Cat = 0.4), and this extends directly to more classes. In contrast, a sigmoid outputs a single value p, and in the binary case the other value is derived as 1 - p.<br />
<br />
Softmax has a complexity of O(kh), where k is the number of classes and h is the dimension of the text representation. The authors instead use a variation of the softmax function known as '''hierarchical softmax'''. It is based on a Huffman coding tree and reduces the complexity to O(h log<sub>2</sub>(k)).<br />
<br />
== N-Gram, Bag of Words, and TFIDF ==<br />
=== Bag of Words===<br />
<br />
'''Bag of words''' is a method for simplifying a text dataset by counting how many times each word appears in a document. The n most frequent words are extracted from the training set to be used as the “dictionary” for the test set. This dictionary allows us to compare documents for document classification and topic modeling. This is one of the methods that the authors used for preparing text for input. Each vector of word counts is normalized so that its elements add up to one (that is, each count is replaced by the word's relative frequency). If these frequencies exceed a certain level, they activate nodes in the neural network and influence classification.<br />
<br />
The main '''weakness''' of bag of words is that it loses information, because it treats words individually and is invariant to word order. We will demonstrate that shortly. Bag of words will also have a high error rate if the dictionary built from the training set does not cover the words in the test set. <br />
<br />
=== N-Gram ===<br />
'''N-gram''' is another model for simplifying text representation: it stores each word together with the n-1 words adjacent to it (or characters, since N-grams can also be character based). Each word in the document is read one at a time, just as with bag of words, but a certain range of its neighbors is scanned as well. This range is the n-gram. A unigram (N = 1) carries the same information as bag of words, while any N greater than 1 carries more. <br />
<br />
The picture above gives an example of a unigram (1-gram), which is the simplest possible version of this model. A unigram does not consider previous words and just chooses words at random based on how common they are in general. The picture also shows a bigram (2-gram), where the previous word is considered. A trigram considers the previous two words, and so on up to the N-gram, which considers the N-1 previous words.<br />
<br />
Let a sentence be denoted as a sequence of words, word 1 to word n: [[File:1aaa.png ]]. By the properties of probability, we can model the probability of the word sequence with a bigram model as [[File:2aaa.png ]]. For example, take the sentence "How long can this go on?" We can model it as follows:<br />
<br />
P("How long can this go on?") = P(How)P(long | How)P(can | long)P(this | can)P(go | this)P(on | go)P(? | on)<br />
<br />
Returning to the chain rule of probability, the above equation can be written as a product of conditional probabilities for the bigram case: [[File:3aaa.png ]].<br />
<br />
This generalizes to the N-gram case as [[File:4aaa.png ]].<br />
<br />
<br />
The '''weakness''' of N-grams is that local context often does not provide any useful predictive clues. For example, suppose you want the model to learn the singular and plural verb agreement in the following sentences:<br />
<br />
The '''woman''' who lives on the fifth floor of the apartment '''is''' pretty.<br />
The '''women''' who live on the fifth floor of the apartment '''are''' pretty.<br />
<br />
The subject and its verb are ten words apart, so you would need an 11-gram, which is infeasible for ordinary machines. This brings us to the next problem: as N increases, the predictive power of the model increases, but the number of parameters required grows exponentially with the number of words of prior context.<br />
<br />
<br />
=== BoW, Unigram, Bigram Example ===<br />
The sentences below illustrate the difference between the representations.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B have the same vector. The unigram representation is identical to bag of words and has the aforementioned problem: '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because the bigram takes one neighboring word into account. However, A and C still share an element, namely "I love". If we were to increase N further, it would become easier to distinguish the two. However, the consequence of operating with higher-order N-grams is that the run time increases.<br />
<br />
=== Feature Hashing ===<br />
Feature hashing, also known as the hashing trick, can be used in sentence classification. It maps words to indices of an array or matrix of fixed size by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, reducing the dimension of the input vector and therefore the cost of storing large vectors.<br />
<br />
A hash function is any function that maps keys of arbitrary size to values in a fixed range. For example, consider a hash function <math> h </math> that maps each feature to the index listed for the corresponding key below.<br />
{| class="wikitable"<br />
|-<br />
! Key !! Index<br />
|-<br />
| I<br />
| 0<br />
|-<br />
| love<br />
| 1<br />
|-<br />
| hate<br />
| 2<br />
|-<br />
| cats<br />
| 3<br />
|-<br />
| dogs<br />
| 4<br />
|-<br />
| but<br />
| 5<br />
|-<br />
| Mary<br />
| 6<br />
|-<br />
|}<br />
In this case, <math> h(\text{"cats"}) = 3 </math>. Consider the sentence <math> \text{"I love cats, but Mary hate cats"} </math>, which we will map to a hash table of length 7. After tokenizing it, we have the list of all words in the sentence, <math> x = ["I", "love", "cats", "but", "Mary", "hate", "cats"] </math>. The hashed feature map <math> \phi </math> is calculated by<br />
<br />
<math> \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 </math>, where <math> i </math> is the corresponding index of the hashed feature map.<br />
<br />
By applying the hash function to each word of this sentence, we get the list of indices [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map is<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 1<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 1<br />
|-<br />
|}<br />
<br />
There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.<br />
<br />
A hash collision happens when two distinct keys are mapped to the same index. For example, suppose that in the example above both "Mary" and "I" are mapped to index 0. The hashed feature map then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 2<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
In order to get an unbiased estimate, the paper uses a signed hash kernel as introduced in [https://alex.smola.org/papers/2009/Weinbergeretal09.pdf Weinberger et al., 2009], which introduces a second hash function <math> \xi </math> that determines the sign (+1 or -1) of each feature's contribution. The hashed feature map <math> \phi </math> now becomes<br />
<br />
<math> \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 </math><br />
<br />
Suppose <math> \xi("I") = 1 \text{ and } \xi("Mary") = -1 </math>, with <math> \xi = 1 </math> for all other words. Our signed hashed feature map then becomes:<br />
{| class="wikitable"<br />
|-<br />
! 0 !! 1 !! 2 !! 3 !! 4 !! 5 !! 6<br />
|-<br />
| 0<br />
| 1<br />
| 1<br />
| 2<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
|}<br />
<br />
Ideally, the contributions of colliding features "cancel out" in expectation, which yields an unbiased estimate of the inner product between feature vectors.<br />
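The three feature maps above can be reproduced with the following sketch. The dictionary lookup stands in for a real hash function, and the colliding hash <code>h_collide</code> and the sign function <code>xi</code> are hypothetical choices made only to match the worked example; they are not the hash functions used by fastText.<br />

```python
# Toy dictionary-based "hash" from the table above (illustration only).
INDEX = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 6}


def hashed_features(tokens, h, size, sign=None):
    """Sum (optionally signed) token counts into a vector of fixed size."""
    phi = [0] * size
    for token in tokens:
        phi[h(token)] += 1 if sign is None else sign(token)
    return phi


def h_collide(token):
    """Hypothetical hash in which "Mary" collides with "I" at index 0."""
    return 0 if token == "Mary" else INDEX[token]


def xi(token):
    """Hypothetical sign hash: -1 for "Mary", +1 for every other word."""
    return -1 if token == "Mary" else 1


tokens = ["I", "love", "cats", "but", "Mary", "hate", "cats"]

plain = hashed_features(tokens, INDEX.__getitem__, 7)
collided = hashed_features(tokens, h_collide, 7)
signed = hashed_features(tokens, h_collide, 7, sign=xi)
print(plain, collided, signed)
```

With the signed version, the colliding entries for "I" and "Mary" at index 0 sum to zero instead of inflating the count.<br />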
<br />
=== TF-IDF ===<br />
For normal N-grams, word counts are used as features. However, another way to represent the features is TF-IDF, which is short for '''term frequency–inverse document frequency'''. It represents the importance of a word to a document.<br />
<br />
'''Term frequency (TF)''' generally measures the number of times that a word occurs in a document. '''Inverse document frequency (IDF)''' can be considered an adjustment to the term frequency, such that a word is not deemed important if it is common across documents in general, for example "the".<br />
<br />
TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as <math>\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)</math><br />
<br />
In this paper, TFIDF is calculated in the same way as [https://arxiv.org/pdf/1509.01626.pdf Zhang et al., 2015], with<br />
* <math> \mathrm{tf}(t,d) = f_{t,d} </math>, where <math> f_{t,d} </math> is the raw count of <math> t </math> for document <math> d </math>.<br />
* <math> \mathrm{idf}(t, D) = \log\left(\frac{N}{| \{d\in D:t\in d \} |}\right) </math>, where <math> N </math> is the total number of documents and <math> | \{d\in D:t\in d \} | </math> is the number of documents that contain the word <math> t </math>.<br />
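The two definitions above can be combined into a short computation. The following is a minimal sketch with a made-up three-document corpus; the function name <code>tfidf</code>, the whitespace tokenization and the use of the natural logarithm are assumptions for illustration.<br />

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: tf * idf} dictionary per document (raw-count tf)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    result = []
    for tokens in tokenized:
        tf = Counter(tokens)
        result.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return result


# Hypothetical toy corpus.
docs = ["the cat sat", "the dog sat", "the cat ran"]
weights = tfidf(docs)
print(weights[0])
```

In this toy corpus, "the" occurs in every document, so its IDF and therefore its TF-IDF weight is zero, while the rarer word "dog" receives the largest weight.<br />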
<br />
== Experiment ==<br />
In this experiment, fastText was compared with various other text classifiers on two classification problems. The first is sentiment analysis, where fastText is compared to existing text classifiers. The second is tag prediction, where fastText is evaluated on a dataset with a much larger output space.<br />
<br />
The Vowpal Wabbit library, written in C++, can also be used to implement the model. However, the authors report that in practice their tailored implementation is at least 2-5× faster.<br />
<br />
fastText was compared with various other text classifiers in two classification problems:<br />
* Sentiment Analysis<br />
* Tag Prediction<br />
<br />
=== Tag prediction ===<br />
<br />
====Dataset and baseline====<br />
<br />
Scalability is an important feature of a model. To test it, the evaluation was carried out on the YFCC100M dataset, which consists of approximately 100 million images with captions, titles and tags. Here the focus was on using the fastText classifier to predict the tags associated with each image without using the image itself, relying instead on the text associated with the image, such as its title and caption.<br />
<br />
The methodology was to remove the words and tags that occur fewer than 100 times and to split the data into train, validation and test sets. (Removing rare words and tags presumably reduces noise and helps the model learn.) A script has been released that recreates this split. The train set contains 91,188,648 examples (1.5B tokens), the validation set 930,497 examples and the test set 543,424 examples. The vocabulary size is 297,141 and there are 312,116 unique tags; each unique tag is one output class, so the model has to choose among more than 300,000 classes. Performance is reported as precision at 1, that is, the fraction of test examples for which the top-ranked predicted tag is one of the true tags. The baseline is frequency-based: it always predicts the most frequent tag, which gives a lower bound that any useful model should beat.<br />
<br />
For comparison, the authors also consider another tag prediction model, Tagspace, which is similar to fastText but is based on the Wsabie model of Weston et al. Although the Tagspace model is described using convolutions, the linear version is used here, since it achieves comparable performance and is much faster.<br />
<br />
<br />
====Results and training time====<br />
<br />
Insert Table 5<br />
<br />
The above table presents a comparison of our fastText model to other baselines. <br />
<br />
After running fastText for 5 epochs, it is compared to Tagspace for two sizes of the hidden layer, 50 and 200. Both models achieve similar performance with a small hidden layer, with fastText being slightly more accurate. However, the addition of bigrams boosts the accuracy by a significant amount.<br />
<br />
Finally, fastText is much faster at test time. Tagspace needs to compute the scores for all the classes, which takes a significant amount of time. fastText, on the other hand, has fast inference when the number of classes is large (more than 300K in this dataset), which gives a significant speed-up at test time. Overall, fastText obtains a better-quality model more than an order of magnitude faster, and the speed-up of the test phase is even larger (a 600× speedup). Table 4 shows some qualitative examples.<br />
<br />
== Conclusion ==<br />
<br />
== Commentary and Criticism ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=34370Bag of Tricks for Efficient Text Classification2018-03-15T18:42:54Z<p>Cs3yang: </p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Neural networks have recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their use on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks.<br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to scale to very large datasets while maintaining good performance. The basis of the analysis for this paper was applying the classifier fastText to two tasks, tag prediction and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Background ==<br />
<br />
* PLACEHOLDER: we should look at when this <br />
<br />
=== Natural-Language Processing ===<br />
<br />
* Briefly describe the difference between NLP and text-mining. Maybe comment later about whether fastText accomplishes NLP. <br />
<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
*PLACE HOLDER FOR IMAGE FROM ARTICLE*<br />
A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly in a large output space where some classes have very few examples. The model in the paper is built on top of a linear model with a rank constraint and a fast loss approximation.<br />
<br />
<br />
Each of the <math> N </math> inputs represents a separate n-gram feature of the sentence. These features are explained in a later section.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
The softmax function ''f'' is used to compute the probability distribution over the predefined classes. For a set of <math> N </math> documents, the model is trained by minimizing the negative log-likelihood, given in the article as:<br />
<br />
* <math> - \frac{1}{N} \sum_{n=1}^N y_n \log ( f(BAx_n))</math><br />
<br />
In this formula, <math> A </math> and <math> B </math> are weight matrices that are learned from the training set, <math> x_n </math> is the normalized bag of features of the <math> n </math>-th document, and <math> y_n </math> is its label.<br />
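The loss can be evaluated directly for small matrices. The sketch below is a plain-Python illustration under stated assumptions: the helper names, the toy dimensions (3 features, hidden size 2, 2 classes) and the all-zero weights are hypothetical, and labels are given as class indices rather than one-hot vectors.<br />

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def nll(A, B, xs, labels):
    """Average negative log-likelihood of softmax(B A x_n) at the true class."""
    loss = 0.0
    for x, y in zip(xs, labels):
        probs = softmax(matvec(B, matvec(A, x)))
        loss -= math.log(probs[y])
    return loss / len(xs)


# Hypothetical toy setup with untrained (all-zero) weights.
A = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # hidden size 2 x 3 features
B = [[0.0, 0.0], [0.0, 0.0]]             # 2 classes x hidden size 2
xs = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]  # normalized bags of features
labels = [0, 1]

loss = nll(A, B, xs, labels)
print(loss)
```

With all-zero weights the softmax is uniform over the two classes, so the loss equals log 2, which is a convenient sanity check before training.<br />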
<br />
<br />
<br />
Remark: The negative log-likelihood is a multiclass cross-entropy. For a binary problem (dog or cat), the softmax outputs two values in [0,1] that sum to 1 (for example, Dog = 0.6, Cat = 0.4), and this extends directly to more classes. In contrast, a sigmoid outputs a single value p, and in the binary case the other value is obtained as 1 - p.<br />
<br />
The softmax has a computational complexity of <math> O(kh) </math>, where <math> k </math> is the number of classes and <math> h </math> is the dimension of the text representation. The authors instead use a variation of the softmax function known as the '''hierarchical softmax'''. The hierarchical softmax is based on the Huffman coding tree and reduces the complexity to <math> O(h \log_2 k) </math>.<br />
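To see why a Huffman tree helps, the sketch below builds one from class frequencies and reports the depth of each class, which is the number of binary decisions needed to reach it. The class names and frequencies are made up, and the function <code>huffman_depths</code> is a hypothetical helper rather than fastText's implementation.<br />

```python
import heapq
import itertools


def huffman_depths(freqs):
    """Return {label: depth} for a Huffman tree built from class frequencies."""
    counter = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(f, next(counter), {label: 0}) for label, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every leaf moves one level down.
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {label: depth + 1 for label, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]


# Hypothetical class frequencies.
freqs = {"cat": 50, "dog": 30, "bird": 15, "fish": 5}
depths = huffman_depths(freqs)
print(depths)
```

Frequent classes end up close to the root, so for skewed class distributions the average number of decisions per example is even smaller than the worst-case depth.<br />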
<br />
=== N-gram and Bag of Words ===<br />
The '''bag-of-words model''' simplifies text representation by storing the set of words in a document together with their frequency counts. '''Bag of words is invariant to word order (single word and dictionary based)'''. An example of the model can be found on this Wikipedia page [https://en.wikipedia.org/wiki/Bag-of-words_model].<br />
<br />
(1) John likes to watch movies. Mary likes movies too. => BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};<br />
(2) John also likes to watch football games. => BoW2 = {"John":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1};<br />
<br />
If a document is the concatenation of (1) and (2), its bag of words is the sum of the two:<br />
(3) BoW3 = {"John":2,"likes":3,"to":2,"watch":2,"movies":2,"Mary":1,"too":1,"also":1,"football":1,"games":1};<br />
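The counts above can be reproduced with Python's <code>collections.Counter</code>. This is a minimal sketch that assumes the two sentences of the Wikipedia example ("John likes to watch movies. Mary likes movies too." and "John also likes to watch football games.") and a simple tokenization that strips periods and splits on whitespace; the helper name <code>bag_of_words</code> is hypothetical.<br />

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences after removing periods (simplistic tokenizer)."""
    return Counter(text.replace(".", "").split())


bow1 = bag_of_words("John likes to watch movies. Mary likes movies too.")
bow2 = bag_of_words("John also likes to watch football games.")

# The bag of words of the concatenated document is the sum of the two bags.
bow3 = bow1 + bow2
print(dict(bow3))
```

Because only counts are stored, any reordering of the words in a sentence produces exactly the same bag.<br />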
<br />
<br />
The problem with the bag-of-words model is that it requires an extensive dictionary of words, which can be expensive to store and search. Additionally, '''bag of words loses information because it works on single words and is invariant to order.''' Lastly, words in the test set that never appear in the training set cannot be represented by the learned dictionary.<br />
<br />
The '''N-gram model (word based)''' simplifies text representation by storing sequences of N adjacent words. The case N = 1 (a unigram) is equivalent to bag of words, while any N over 1 contains more information than bag of words because it captures local word order. How N-grams store and identify order is demonstrated in a later section. An example of a document stored as bag of words, unigrams (N = 1) and bigrams (N = 2) can be found below:<br />
<br />
<br />
*PLACE HOLDER FOR PICTURE*<br />
Source: https://qph.fs.quoracdn.net/main-qimg-c47060d2f02439a44795e2fbcf2ca347-c<br />
<br />
<br />
In the article, N gram model was used instead of Bag of Words because the authors wanted to capture the information about the local order.<br />
<br />
=== Feature Hashing ===<br />
The authors use feature hashing to map N-grams more efficiently (source: https://en.wikipedia.org/wiki/Feature_hashing). Feature hashing is a way to vectorize the n-grams of a document, and it is an effective tool when the n-gram feature space is high-dimensional. The algorithm uses a '''hash table''', a data structure that combines a hash function with a fixed-size array; the hash function maps each n-gram to a position in the array. The following example sentences illustrate the unigram and bigram features that would be hashed:<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B map to the same vector. This is the same problem as with bag of words: '''word order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B now have different representations, because a bigram captures the order of each pair of adjacent words. However, A and C still share one feature, the bigram "I love". Increasing N further makes it easier to distinguish such sentences, but the consequence of using a higher-order N-gram is a larger feature space and a longer run time.<br />
<br />
The authors implement this hashing trick with the same hashing function as in Mikolov et al. (2011), using 10M bins when only bigrams are used and 100M bins otherwise. The number of bins is a fixed memory budget chosen in advance, so it does not by itself determine how many distinct words occur in the training set.<br />
<br />
== Experiment ==<br />
<br />
fastText was compared with various other text classifiers in two classification problems:<br />
* Sentiment Analysis<br />
* Tag prediction<br />
<br />
<br />
<br />
== Conclusion ==<br />
<br />
== Commentary and Criticism ==<br />
<br />
== Further Reading ==<br />
<br />
* List of previous paper presentations in chronological order relating to text classification/fastText</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=34365Bag of Tricks for Efficient Text Classification2018-03-15T18:34:51Z<p>Cs3yang: /* Softmax and Hierarchy Softmax */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Neural networks have recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their use on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks.<br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to scale to very large datasets while maintaining good performance. The basis of the analysis for this paper was applying the classifier fastText to two tasks, tag prediction and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
*PLACE HOLDER FOR IMAGE FROM ARTICLE*<br />
A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly in a large output space where some classes have very few examples. The model in the paper is built on top of a linear model with a rank constraint and a fast loss approximation.<br />
<br />
<br />
Each of the <math> N </math> inputs represents a separate n-gram feature of the sentence. These features are explained in a later section.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
The softmax function ''f'' is used to compute the probability distribution over the predefined classes. For a set of <math> N </math> documents, the model is trained by minimizing the negative log-likelihood, given in the article as:<br />
<br />
* <math> - \frac{1}{N} \sum_{n=1}^N y_n \log ( f(BAx_n))</math><br />
<br />
In this formula, <math> A </math> and <math> B </math> are weight matrices that are learned from the training set, <math> x_n </math> is the normalized bag of features of the <math> n </math>-th document, and <math> y_n </math> is its label.<br />
<br />
<br />
<br />
Remark: The negative log-likelihood is a multiclass cross-entropy. For a binary problem (dog or cat), the softmax outputs two values in [0,1] that sum to 1 (for example, Dog = 0.6, Cat = 0.4), and this extends directly to more classes. In contrast, a sigmoid outputs a single value p, and in the binary case the other value is obtained as 1 - p.<br />
<br />
The softmax has a computational complexity of <math> O(kh) </math>, where <math> k </math> is the number of classes and <math> h </math> is the dimension of the text representation. The authors instead use a variation of the softmax function known as the '''hierarchical softmax'''. The hierarchical softmax is based on the Huffman coding tree and reduces the complexity to <math> O(h \log_2 k) </math>.<br />
<br />
=== N-gram and Bag of Words ===<br />
The '''bag-of-words model''' simplifies text representation by storing the set of words in a document together with their frequency counts. '''Bag of words is invariant to word order (single word and dictionary based)'''. An example of the model can be found on this Wikipedia page [https://en.wikipedia.org/wiki/Bag-of-words_model].<br />
<br />
(1) John likes to watch movies. Mary likes movies too. => BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};<br />
(2) John also likes to watch football games. => BoW2 = {"John":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1};<br />
<br />
If a document is the concatenation of (1) and (2), its bag of words is the sum of the two:<br />
(3) BoW3 = {"John":2,"likes":3,"to":2,"watch":2,"movies":2,"Mary":1,"too":1,"also":1,"football":1,"games":1};<br />
<br />
<br />
The problem with the bag-of-words model is that it requires an extensive dictionary of words, which can be expensive to store and search. Additionally, '''bag of words loses information because it works on single words and is invariant to order.''' Lastly, words in the test set that never appear in the training set cannot be represented by the learned dictionary.<br />
<br />
The '''N-gram model (word based)''' simplifies text representation by storing sequences of N adjacent words. The case N = 1 (a unigram) is equivalent to bag of words, while any N over 1 contains more information than bag of words because it captures local word order. How N-grams store and identify order is demonstrated in a later section. An example of a document stored as bag of words, unigrams (N = 1) and bigrams (N = 2) can be found below:<br />
<br />
<br />
*PLACE HOLDER FOR PICTURE*<br />
Source: https://qph.fs.quoracdn.net/main-qimg-c47060d2f02439a44795e2fbcf2ca347-c<br />
<br />
<br />
In the article, N gram model was used instead of Bag of Word because the authors wanted to capture the information about the local order.<br />
<br />
=== Feature Hashing ===<br />
The authors use feature hashing to map N-grams more efficiently (source: https://en.wikipedia.org/wiki/Feature_hashing). Feature hashing is a way to vectorize the n-grams of a document, and it is an effective tool when the n-gram feature space is high-dimensional. The algorithm uses a '''hash table''', a data structure that combines a hash function with a fixed-size array; the hash function maps each n-gram to a position in the array. The following example sentences illustrate the unigram and bigram features that would be hashed:<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B map to the same vector. This is the same problem as with bag of words: '''word order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B now have different representations, because a bigram captures the order of each pair of adjacent words. However, A and C still share one feature, the bigram "I love". Increasing N further makes it easier to distinguish such sentences, but the consequence of using a higher-order N-gram is a larger feature space and a longer run time.<br />
<br />
The authors implement this hashing trick with the same hashing function as in Mikolov et al. (2011), using 10M bins when only bigrams are used and 100M bins otherwise. The number of bins is a fixed memory budget chosen in advance, so it does not by itself determine how many distinct words occur in the training set.<br />
<br />
== Experiment ==<br />
<br />
fastText was compared with various other text classifiers in two classification problems:<br />
* Sentiment Analysis<br />
* Tag prediction<br />
<br />
<br />
<br />
== Conclusion ==<br />
<br />
== Commentary and Criticism ==</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=34364Bag of Tricks for Efficient Text Classification2018-03-15T18:19:24Z<p>Cs3yang: /* Softmax and Hierarchy Softmax */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Neural networks have recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their use on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks.<br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to scale to very large datasets while maintaining good performance. The basis of the analysis for this paper was applying the classifier fastText to two tasks, tag prediction and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
*PLACE HOLDER FOR IMAGE FROM ARTICLE*<br />
A linear classifier is limited by its inability to share parameters among features and classes. As a result, it may generalize poorly in a large output space where some classes have very few examples. The model in the paper is built on top of a linear model with a rank constraint and a fast loss approximation.<br />
<br />
<br />
Each of the <math> N </math> inputs represents a separate n-gram feature of the sentence. These features are explained in a later section.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
The softmax function ''f'' is used to compute the probability distribution over the predefined classes. For a set of <math> N </math> documents, the model is trained by minimizing the negative log-likelihood, given in the article as:<br />
<br />
* <math> - \frac{1}{N} \sum_{n=1}^N y_n \log ( f(BAx_n))</math><br />
<br />
In this formula, <math> A </math> and <math> B </math> are weight matrices that are learned from the training set, <math> x_n </math> is the normalized bag of features of the <math> n </math>-th document, and <math> y_n </math> is its label.<br />
<br />
<br />
<br />
Remark: The negative log-likelihood is a multiclass cross-entropy. For a binary problem (dog or cat), the softmax outputs two values in [0,1] that sum to 1 (for example, Dog = 0.6, Cat = 0.4), and this extends directly to more classes. In contrast, a sigmoid outputs a single value p, and in the binary case the other value is obtained as 1 - p.<br />
<br />
The softmax has a computational complexity of <math> O(kh) </math>, where <math> k </math> is the number of classes and <math> h </math> is the dimension of the text representation. The authors instead use a variation of the softmax function known as the '''hierarchical softmax'''. The hierarchical softmax is based on the Huffman coding tree and reduces the complexity to <math> O(h \log_2 k) </math>.<br />
<br />
=== N-gram and Bag of Words ===<br />
The '''bag-of-words model''' simplifies text representation by storing the set of words in a document together with their frequency counts. '''Bag of words is invariant to word order (single word and dictionary based)'''. An example of the model can be found on this Wikipedia page [https://en.wikipedia.org/wiki/Bag-of-words_model].<br />
<br />
(1) John likes to watch movies. Mary likes movies too. => BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};<br />
(2) John also likes to watch football games. => BoW2 = {"John":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1};<br />
<br />
If a document is the concatenation of (1) and (2), its bag of words is the sum of the two:<br />
(3) BoW3 = {"John":2,"likes":3,"to":2,"watch":2,"movies":2,"Mary":1,"too":1,"also":1,"football":1,"games":1};<br />
<br />
<br />
The problem with the bag-of-words model is that it requires an extensive dictionary of words, which can be expensive to store and search. Additionally, '''bag of words loses information because it works on single words and is invariant to order.''' Lastly, words in the test set that never appear in the training set cannot be represented by the learned dictionary.<br />
<br />
The '''N-gram model (word based)''' simplifies text representation by storing sequences of N adjacent words. The case N = 1 (a unigram) is equivalent to bag of words, while any N over 1 contains more information than bag of words because it captures local word order. How N-grams store and identify order is demonstrated in a later section. An example of a document stored as bag of words, unigrams (N = 1) and bigrams (N = 2) can be found below:<br />
<br />
<br />
*PLACE HOLDER FOR PICTURE*<br />
Source: https://qph.fs.quoracdn.net/main-qimg-c47060d2f02439a44795e2fbcf2ca347-c<br />
<br />
<br />
In the article, N gram model was used instead of Bag of Word because the authors wanted to capture the information about the local order.<br />
<br />
=== Feature Hashing ===<br />
The authors use feature hashing to map N-grams more efficiently (source: https://en.wikipedia.org/wiki/Feature_hashing). Feature hashing is a way to vectorize the n-grams of a document, and it is an effective tool when the n-gram feature space is high-dimensional. The algorithm uses a '''hash table''', a data structure that combines a hash function with a fixed-size array; the hash function maps each n-gram to a position in the array. The following example sentences illustrate the unigram and bigram features that would be hashed:<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B map to the same vector. This is the same problem as with bag of words: '''word order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B now have different representations, because a bigram captures the order of each pair of adjacent words. However, A and C still share one feature, the bigram "I love". Increasing N further makes it easier to distinguish such sentences, but the consequence of using a higher-order N-gram is a larger feature space and a longer run time.<br />
<br />
The authors implement this hashing trick with the same hashing function as in Mikolov et al. (2011), using 10M bins when only bigrams are used and 100M bins otherwise. The number of bins is a fixed memory budget chosen in advance, so it does not by itself determine how many distinct words occur in the training set.<br />
<br />
== Experiment ==<br />
<br />
fastText was compared with various other text classifiers in two classification problems:<br />
* Sentiment Analysis<br />
* Tag prediction<br />
<br />
<br />
<br />
== Conclusion ==<br />
<br />
== Commentary and Criticism ==</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=34363Bag of Tricks for Efficient Text Classification2018-03-15T18:10:09Z<p>Cs3yang: /* Introduction and Motivation */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Neural networks have recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their use on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks.<br />
<br />
The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to scale to very large datasets while maintaining good performance. The basis of the analysis for this paper was applying the classifier fastText to two tasks, tag prediction and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Model ==<br />
<br />
=== Model Architecture of fastText ===<br />
*PLACE HOLDER FOR IMAGE FROM ARTICLE*<br />
Linear classifier is limited by its inability to share parameters among features and classes. As a result, classes with very few examples (low frequency) will often get classified in a large output field. The model in the paper is built on top of a linear model with a rank constraint and a fast loss approximation. <br />
<br />
<br />
Each of the <math> N </math> inputs in the figure represents a separate n-gram feature of the sentence. These features are explained in a later section.<br />
<br />
=== Softmax and Hierarchical Softmax ===<br />
The softmax function ''f'' is used to compute the probability distribution over the predefined classes. The softmax output layer with log-likelihood is given in the article as:<br />
<br />
<br />
*PLACE HOLDER for log-likelihood error function found in article*<br />
In this formula, <math> A </math> and <math> B </math> are weight matrices that are learned from the training set, <math> X_n </math> is the normalized bag of features of the <math> n </math>-th document, and <math> Y_n </math> is its label.<br />
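For reference, the objective that the placeholder above stands for has the following form. It is transcribed here in this page's notation (capital <math> X_n </math> and <math> Y_n </math>, with <math> N </math> the number of documents) and should be checked against the paper:

<math> -\frac{1}{N} \sum_{n=1}^{N} Y_n \log\left( f(B A X_n) \right) </math>

Minimizing this quantity over <math> A </math> and <math> B </math> is the same as maximizing the average log-probability that the model assigns to the correct label.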
<br />
<br />
<br />
Remark: The negative log-likelihood is the multiclass cross-entropy. For a binary problem (dog or not dog), the softmax outputs two values in [0,1] that sum to 1 (e.g., Dog = 0.6, Not dog = 0.4). This extends directly to more than two classes. In contrast, a sigmoid outputs a single value p, and in the binary case the other probability is obtained as 1 - p.<br />
<br />
Softmax has a complexity of O(kh), where k is the number of classes and h is the dimension of the text representation. The function that the authors used for their model was a variation of the softmax function, known as '''hierarchical softmax'''. The hierarchical softmax is based on the Huffman coding tree and reduces the complexity to O(h*log2(k)).<br />
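The remark and the complexity claim above can be checked numerically. The sketch below is illustrative only: the function names, the value of <code>z</code>, the class frequencies, and the sizes <code>k</code> and <code>h</code> are assumptions chosen for the example, not values taken from the paper's experiments.

```python
import heapq
import math

def softmax(scores):
    """Convert scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def negative_log_likelihood(scores, true_class):
    """Cross-entropy of one example whose correct class index is true_class."""
    return -math.log(softmax(scores)[true_class])

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def huffman_depths(freqs):
    """Depth of each class in a Huffman tree built from class frequencies."""
    # Heap items: (subtree frequency, unique tie-breaker, class indices in subtree).
    heap = [(f, i, [i]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    depths = [0] * len(freqs)
    next_id = len(freqs)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        for i in left + right:
            depths[i] += 1  # every class in a merged subtree moves one level down
        heapq.heappush(heap, (f1 + f2, next_id, left + right))
        next_id += 1
    return depths

# Binary case: softmax over the scores (z, 0) equals sigmoid(z) and 1 - sigmoid(z).
z = 0.4
p_dog, p_not_dog = softmax([z, 0.0])

# Frequent classes sit closer to the root, so they need fewer binary decisions.
depths = huffman_depths([50, 25, 15, 10])

# Rough per-example cost: flat softmax O(k*h) versus a tree of depth about log2(k).
k, h = 100_000, 10  # hypothetical sizes
flat_cost = k * h
tree_cost = h * math.ceil(math.log2(k))
```

With a Huffman tree the path to a class is short when the class is frequent, so the average cost can be lower than the balanced-tree figure of h*log2(k).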
<br />
<br />
=== N-gram and Bag of Words ===<br />
The '''Bag of Words Model''' simplifies text representation by storing the set of words in a document together with their frequency counts. '''Bag of words is invariant to word order (single word and dictionary based)'''. An example of the model can be found on this Wikipedia page: [https://en.wikipedia.org/wiki/Bag-of-words_model].<br />
<br />
(1) John likes to watch movies. Mary likes movies too. => BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}<br />
(2) John also likes to watch football games. => BoW2 = {"John":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1}<br />
<br />
If a document (3) is the concatenation of (1) and (2), then its bag of words is the sum of the two:<br />
(3) => BoW3 = <br />
{"John":2,"likes":3,"to":2,"watch":2,"movies":2,"Mary":1,"too":1,"also":1,"football":1,"games":1}<br />
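The counts above can be reproduced with a few lines of Python. This is a minimal sketch that uses the two sentences from the Wikipedia example cited above; the helper name <code>bag_of_words</code> and the simple letters-only tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each word occurs, ignoring word order and punctuation."""
    return Counter(re.findall(r"[A-Za-z]+", text))

doc1 = "John likes to watch movies. Mary likes movies too."
doc2 = "John also likes to watch football games."

bow1 = bag_of_words(doc1)
bow2 = bag_of_words(doc2)

# The bag of words of the concatenated document is the sum of the two counters.
bow3 = bow1 + bow2

# Shuffling the words does not change the representation.
order_invariant = bag_of_words("movies likes John") == bag_of_words("John likes movies")
```

Adding the counters gives the same result as counting the concatenated text, which is why the representation cannot distinguish documents that differ only in word order.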
<br />
<br />
The problem with the bag of words model is that it requires an extensive dictionary of words to be kept on file, which takes a long time to search through. Additionally, '''bag of words loses information because it looks at single words and is invariant to order.''' Lastly, it cannot represent words in the test set that never appeared in the training set.<br />
<br />
The '''N-Gram Model (Word Based)''' simplifies text representation by storing sequences of n adjacent words. The unigram model (N = 1) carries the same information as bag of words, while any N greater than 1 retains more information because it records local word order. An example of how n-grams store and identify order is given in a later section. An example of a document stored as Bag of Words, Unigram (N = 1) and Bigram (N = 2) can be found below:<br />
<br />
<br />
*PLACE HOLDER FOR PICTURE*<br />
Source: https://qph.fs.quoracdn.net/main-qimg-c47060d2f02439a44795e2fbcf2ca347-c<br />
<br />
<br />
In the article, the n-gram model was used instead of a plain bag of words because the authors wanted to capture information about local word order.<br />
<br />
=== Feature Hashing ===<br />
The authors used feature hashing to map n-grams to features more efficiently (source: https://en.wikipedia.org/wiki/Feature_hashing). Feature hashing is a way to vectorize the n-grams of a document, and it is an effective tool when the n-gram space is high-dimensional. The algorithm works like a '''hash table''': a hash function maps each n-gram to an index into a fixed-size array (here, a row of the feature matrix), so no explicit dictionary of n-grams has to be stored. The example below shows the unigram and bigram features that would be hashed.<br />
<br />
A = “I love apple”<br />
<br />
B = “apple love I”<br />
<br />
C = “I love sentence”<br />
<br />
{| class="wikitable"<br />
|+ Caption: Unigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| love<br />
| 1<br />
| 1<br />
| 1<br />
|-<br />
| apple<br />
| 1<br />
| 1<br />
| 0<br />
|-<br />
| sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B have the same vector. This is the same problem as with bag of words mentioned earlier: '''order does not matter!'''<br />
<br />
{| class="wikitable"<br />
|+ Caption: Bigram.<br />
|-<br />
| <br />
| A<br />
| B<br />
| C<br />
|-<br />
| I love<br />
| 1<br />
| 0<br />
| 1<br />
|-<br />
| love apple<br />
| 1<br />
| 0<br />
| 0<br />
|-<br />
| apple love<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love I<br />
| 0<br />
| 1<br />
| 0<br />
|-<br />
| love sentence<br />
| 0<br />
| 0<br />
| 1<br />
|}<br />
<br />
Notice that A and B are now distinct, because bigrams take one neighbouring word into account. However, A and C still share an element, namely "I love". If we were to increase N further, it would become easier to distinguish between the two. However, the consequence of working with higher-order n-grams is that the run time increases.<br />
<br />
The authors used the hashing trick with the same hashing function as Mikolov et al. (2011), with 10M bins when only bigrams are used and 100M bins otherwise. The number of bins is a fixed memory budget rather than a count of distinct n-grams: 10M bins would cover every possible bigram only for a vocabulary of about sqrt(10 million) ≈ 3,162 words, so with a larger vocabulary some bigrams necessarily share a bin.<br />
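The two tables can be reproduced with a small helper. This is a sketch for illustration only: the function name <code>word_ngrams</code> and the whitespace tokenization are assumptions, not part of the paper.

```python
def word_ngrams(text, n):
    """Return the list of word n-grams of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence_a = "I love apple"
sentence_b = "apple love I"
sentence_c = "I love sentence"

# Unigrams ignore order, so A and B contain exactly the same features.
same_unigrams = sorted(word_ngrams(sentence_a, 1)) == sorted(word_ngrams(sentence_b, 1))

# Bigrams keep local order, so A and B no longer share any feature.
shared_ab = set(word_ngrams(sentence_a, 2)) & set(word_ngrams(sentence_b, 2))

# A and C still overlap on one bigram.
shared_ac = set(word_ngrams(sentence_a, 2)) & set(word_ngrams(sentence_c, 2))
```

A sentence of m words has m - n + 1 n-grams, while the number of distinct n-grams that may appear across a corpus grows quickly with n, which is the cost referred to above.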
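The hashing trick itself can be sketched as follows. This is an illustration under stated assumptions: CRC32 from the Python standard library stands in for the hashing function of Mikolov et al. (2011), which is not reproduced here, and the function name <code>hashed_bigram_vector</code> and the bin count of 16 are hypothetical.

```python
import zlib

def hashed_bigram_vector(text, num_bins):
    """Count word bigrams into a fixed-size vector indexed by a hash of the bigram."""
    tokens = text.split()
    vector = [0] * num_bins
    for first, second in zip(tokens, tokens[1:]):
        bigram = f"{first} {second}"
        # CRC32 is an assumption used for illustration, not the paper's hash function.
        index = zlib.crc32(bigram.encode("utf-8")) % num_bins
        vector[index] += 1  # distinct bigrams may collide in the same bin
    return vector

num_bins = 16  # toy value; the paper uses 10M or 100M bins
vec_a = hashed_bigram_vector("I love apple", num_bins)
vec_c = hashed_bigram_vector("I love sentence", num_bins)
```

The vector length depends only on <code>num_bins</code>, not on how many distinct bigrams occur in the corpus, so memory stays bounded at the price of occasional collisions.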
<br />
== Experiment ==<br />
<br />
fastText was compared with various other text classifiers in two classification problems:<br />
* Sentiment Analysis<br />
* Tag prediction<br />
<br />
<br />
<br />
== Conclusion ==<br />
<br />
== Commentary and Criticism ==</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=34362Bag of Tricks for Efficient Text Classification2018-03-15T17:58:46Z<p>Cs3yang: </p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
== Introduction and Motivation ==<br />
<br />
Neural networks have recently been used for text classification and have demonstrated very good performance. However, they are slow at both training and test time, which limits their use on very large datasets. The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to be scaled to very large datasets while maintaining good performance. The basis of the analysis for this paper was the application of fastText to two tasks: tag prediction and sentiment analysis. The paper claims that this method “can train on billion word within ten minutes, while achieving performance on par with the state of the art.”<br />
<br />
== Conclusion ==<br />
<br />
== Commentary and Criticism ==</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Bag_of_Tricks_for_Efficient_Text_Classification&diff=34354Bag of Tricks for Efficient Text Classification2018-03-15T17:50:10Z<p>Cs3yang: /* Introduction */</p>
<hr />
<div>*WORK IN PROGRESS*<br />
<br />
</div>Cs3yanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441w18&diff=31838stat441w182018-02-13T17:38:54Z<p>Cs3yang: /* Paper presentation */</p>
<hr />
<div><br />
<br />
[https://docs.google.com/forms/d/1HrpW_lnn4jpFmoYKJBRAkm-GYa8djv9iZXcESeVB7Ts/prefill Your feedback on presentations]<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Mar 8 || || 1|| || || <br />
|-<br />
|Mar 8 || || 2|| || || <br />
|-<br />
|Mar 13 || || 3|| || || <br />
|-<br />
|Mar 13 || || 4|| || || <br />
|-<br />
|Mar 15 || || 5|| || || <br />
|-<br />
|Mar 15 || || 6|| || || <br />
|-<br />
|Mar 20 || || 7|| || || <br />
|-<br />
|Mar 20 || Wenling Zhang, Cong Jiang, Ziwei Song, Zhaoshan Ye || 8|| XGBoost: A Scalable Tree Boosting System. || [https://arxiv.org/pdf/1603.02754.pdf Paper] || <br />
|-<br />
|Mar 22 || Alice Wang, Robert Huang, Roger Wang, Renato Ferreira || 9|| || || <br />
|-<br />
|Mar 22 || || 10|| || || <br />
|-<br />
|Mar 27 || || 11|| || || <br />
|-<br />
|Mar 27 || || 12|| || || <br />
|-<br />
|Mar 29 || || 13|| || || <br />
|-<br />
|Mar 29 || Ammar Mehvee, Chen Yuan, Angelica Amores, Cheryl Yang, Alaric Chow, Brian Shin, Karan Mehta, Melody Tam || 14|| Bag of Tricks for Efficient Text Classification || [https://arxiv.org/pdf/1607.01759.pdf Paper] || <br />
|-<br />
|Apr 3 || Qici Tan, Qi Mai, Ziheng Chu, Minghao Lu || 15|| ImageNet Classification with Deep Convolutional Neural Networks || || <br />
|-<br />
|Apr 3 || Haochen, Yang, Patrick, Selina, Sigeng, Angel || 16|| || || <br />
|-</div>Cs3yang