Describtion of Text Mining

From statwiki
Revision as of 19:37, 22 November 2020 by H28ren (talk | contribs) (Information Extraction)
Jump to: navigation, search

Presented by

Yawen Wang, Danmeng Cui, Zijie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid


This paper focuses on the different text mining tasks and the existence of text mining in healthcare and biomedical domains. The text mining field has been popular as a result of the amount of text data that is available in different forms. The text data is bound to grow even more in 2020, indicating a 50 times growth since 2010. To further explore the text mining field, the related text mining approaches can be considered. The different text mining approaches relate to two main methods: knowledge delivery and traditional data mining methods.

The authors note that knowledge delivery methods involve the application of different steps to a specific data set to create specific patterns. Research in knowledge delivery methods has evolved over the years due to advances in hardware and software technology. On the other hand, data mining has experienced substantial development through the intersection of three fields: databases, machine learning, and statistics. As brought out by the authors, text mining approaches focus on the exploration of information from a specific text. The information explored is in the form of structured, semi-structured, and unstructured text. It is important to note that text mining covers different sets of algorithms and topics that include information retrieval. The topics and algorithms are used for analyzing different text forms.

Text Representation and Encoding

In this section of the paper, the authors explore the different ways in which the text can be represented on a large collection of documents. One common way of representing the documents is in the form of a bag of words. The bag of words considers the occurrences of different terms. In different text mining applications, documents are ranked and represented as vectors so as to display the significance of any word. The authors note that the three basic models used are vector space, inference network, and the probabilistic models. The vector space model is used to represent documents by converting them into vectors. In the model, a variable is used to represent each model to indicate the importance of the word in the document. The words are weighted using the TF-IDF scheme computed as

$$ q(w)=f_d(w)*log{\frac{|D|}{f_D(w)}} $$

In many text mining algorithms, one of the key components is preprocessing. Preprocessing consists of different tasks that include filtering, tokenization, stemming, and lemmatization. The first step is tokenization, where a character sequence is broken down into different words or phrases. After the breakdown, filtering is carried out to remove some words. The various word inflected forms are grouped together through lemmatization, and later, the derived roots of the derived words are obtained through stemming.


Classification in Text Mining aims to assigned predefined classes to text documents. For a set [math]\mathcal{D} = {d_1, d_2, ... d_n}[/math] of documents, such that each [math]d_i[/math] is mapped to a label [math]l_i[/math] from the set [math]\mathcal{L} = {l_1, l_2, ... l_k}[/math]. The goal is to find a classification model [math]f[/math] such that: [math]\\[/math] $$ f: \mathcal{D} \rightarrow \mathcal{L} $$ The author illustrates 4 different classifiers that are commonly used in text mining.

1. Naive Bayes Classifier

Bayes rule is used to classify new examples and select the class that is most has the generated result. Naive Bayes Classifier models the distribution of documents in each class using a probabilistic model assuming that the distribution of different terms is independent of each other. The models commonly used in this classifier tried to find the posterior probability of a class based on the distribution and assumes that the documents generated are based on a mixture model parameterized by [math]\theta[/math] and compute the likelihood of a document using the sum of probabilities over all mixture component.

2. Nearest Neighbour Classifier

Nearest Neighbour Classifier uses distance-based measures to perform the classification. The documents which belong to the same class are more likely "similar" or close to each other based on the similarity measure. The classification of the test documents is inferred from the class labels of similar documents in the training set.

3. Decision Tree Classifier

A hierarchical tree of the training instances, in which a condition on the attribute value is used to divide the data hierarchically. The decision tree recursively partitions the training data set into smaller subdivisions based on a set of tests defined at each node or branch. Each node of the tree is a test of some attribute of the training instance, and each branch descending from the node corresponds to one of the values of this attribute. The conditions on the nodes are commonly defined by the terms in the text documents.

4. Support Vector Machines

SVM is a form of Linear Classifiers which are models that makes a classification decision based on the value of the linear combinations of the documents features. The output of a linear predictor is defined to the [math] y=\vec{a} \cdot \vec{x} + b[/math] where [math]\vec{x}[/math] is the normalized document word frequency vector, [math]\vec{a}[/math] is a vector of coeddifients and [math]b[/math] is a scalar. Support Vector Machines attempts to fins a linear separators between various classes. An advantage of the SVM method is it is robust to high dimensionality.


Clustering have been extensively studied in the context of the text as it has a wide range of applications such as visualization and document organization.

Information Extraction

Information Extraction (IE) is the process of extracting useful, structured information from unstructured or semi-structured text. It automatically extracts based on our command.

For example, consider the following sentence, “XYZ company was founded by Peter in the year of 1950” We can identify the following information:

Founderof(Peter, XYZ) Foundedin(1950, XYZ)

The author mentioned 4 parts that are important for Information Extraction

1. Namely Entity Recognition(NER) This is the process of identifying real world entity from free text, such as "Apple Inc.", "Donald Trump", "PlayStation 5" etc. Moreover, the task is to identify the category of these entities, such as "Apple Inc." is in the category of the company, "Donald Trump" is in the category of the USA president, and "PlayStation 5" is in the category of the entertainment system.

2. Hidden Markov Model Since traditional probabilistic classification does not consider the predicted labels of neighbor words, we use the Hidden Markov Model when doing Information Extraction. This model is different because it considers the label of one word depends on the previous words that appeared.

3. Conditional Random Fields This is a technique that are widely used in Information Extraction. The definition of it is related to graph theory. let G = (V, E) be a graph and Yv stands for the index of the vertices in G. Then (X, Y) is a conditionalrandom field, when the random variables Yv , conditioned on X, obey Markov property with respect to graph, and: p(Yv |X, Yw ,w , v) = p(Yv |X, Yw ,w ∼ v), where w ∼ v means w and v are neighbors in G.

4. Relation Extraction This is a task of finding semantic relationships between word entities in text documents. Such as "Seth Curry" is the brother of "Stephen Curry", if there is a document including these two names, the task is to identify the relationship of these two entities.

Biomedical Application



Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., & Kochut, K. (2017). A brief survey of text mining: Classification, clustering, and extraction techniques. arXiv preprint arXiv:1707.02919.