Deep Learning for Extreme Multi-label Text Classification

From statwiki
Jump to: navigation, search

Presented by

Lingshan Wang, Yifan Li, Ziyi Liu


This paper presents a thorough walkthrough of deep learning in multi-label text classification. A great number of algorithms have been proposed since the idea of XMTC came out. However, the shortcomings of the existing methods are inevitable due to data sparsity and scalability. With deep learning and Convolutional Neural Networks kicking into play, this paper has demonstrated that a newly proposed CNN approach consistently outperforms almost all methods.

Previous Work

Target-Embedding methods: Given the case that a document can be classified using multiple labels, implementing deep learning in a set of high-dimensional embeddings could be computationally intensive and also hard to interpret. Therefore, the concept of compressing label space is introduced to effectively create lower-dimensional label vectors using either linear or non-linear projection. Those transformed vectors will then be fed into standard classifying algorithms for modeling purposes. Eventually, decompressing is required for the final decision-making.

Tree-based ensemble methods: Similar to traditional tree modeling, tree-based ensemble methods use the same concepts of partitioning features and predicting based on a smaller number of variables. Unlike normal binary trees, tree-based ensemble methods split the features fed using a hyperplane. Not only does this adjustment significantly reduce the prediction time, but it also serves as a more robust sidekick when implementing the XMTC modeling.


The ultimate goal of XMTC is to efficiently and accurately classify documents using labels. A single document can be classified into multiple categories, which leads to tons of potential barriers. Having a newly-designed deep learning model to enhance prediction accuracy and interpretability would be essential.

Model Architecture

Existing Competitive methods:

SLEEC: It is composed of both learning embeddings and K-Nearest-Neighbors (KNN). The concept behind this is to perform clustering analysis to the training data at first. This would reduce the runtime and computational power required for KNN search. Within each cluster, SLEEC compresses the high-dimensional label vectors to a low-dimensional label vector using a pre-defined distance function between the closest label vectors. Then, the normal classification method can be applied for the final decision-making.

FastXML: It is one of the tree-based ensemble methods. In addition to partitioning the set of documents at each node, this method splits all documents in the current node into two subsets aiming to learn the label rankings. NDCG score is the threshold used for all ranked label lists, which would be further used for the characterization of the distribution.

FastText: This modeling technique reads in the input document without considering the order of contents and applies a softmax layer directly to assign contents into different labels. This is the simplest and the most inaccurate approach among all 6 methods introduced.

CNN-Kim: One of the first CNN approaches used for multi-label-text classification has three essential layers embedded. The first layer serves as a filter to construct multiple feature maps using the raw document vector fed into the model. Then, the max-over-time pooling layer and the last layer would generate outputs based on the document representation.

Bow-CNN: This method uses the commonly known bag-of-words technique to construct a binary vector for each text region. After all binary vectors are fed to a convolutional layer, the subsequent dynamic pooling layer and a softmax layer would be utilized for the final output.

PD-Sparse: As a newly proposed method for extreme multiclass and multilabel classification, it first sets each label to a margin-maximizing loss with penalties, yielding an extremely sparse solution both in primal and in dual without sacrificing the expressive power of predictor. Then a Fully-Corrective Block-Coordinate Frank-Wolfe (FC-BCFW) algorithm is proposed to exploit both primal and dual sparsity to achieve a complexity sublinear to the number of primal and dual variables. The method is distinguished with its high predictive accuracy with competitive runtime.

Proposed Method:

As indicated in the architecture of our proposed method, 5 steps are being implemented. The input of this XML-CNN method is a multi-dimensional word embedding with a convolutional filter applied to each text region producing new feature mappings. The new feature mappings will then be used in dynamic max pooling. Max-over-time pooling was commonly used in the traditional CNN models. However, because only the largest value in the mapping will be carried forward in the flow, the idea of dynamic max-pooling plays a significant role in the architecture to capture a lot more information. Moving forward, the hidden bottleneck layer is for a compact representation. It avoids the direct connection between the dynamic max-pooling layer and the output layer, which could potentially lead to a gigantic model size that can not be handled by a common GPU memory. Eventually, final results will be generated through the output layer.


To evaluate the performance of the XML-CNN model architecture, six benchmark datasets and a system-predicted score vector are used for testing.

The XML-CNN model is compared against the 6 existing competitive methods. The results are as shown in the tables below:

It is obvious from the summary above that the XML-CNN has either the best or the second-best performance in all 18 lines in the table with RCV1, Amazon-12K, and Wiki-500K having a particularly strong performance. This is due to a higher number of training data in the 3 datasets mentioned above, which further proves that better modeling is based on more data points.


Googlenet outperformed the other previous deep learning networks, and it became a proof of concept that approximating the expected optimal sparse structure by readily available dense building blocks (or the inception modules) is a viable method for improving the neural networks in computer vision. The significant quality gain is at a modest increase for the computational requirement is the main advantage for this method. Even without performing any bounding box operations to detect objects, this architecture gained a significant amount of quality with a modest amount of computational resources.