A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques: Difference between revisions

From statwiki
Jump to navigation Jump to search
No edit summary
No edit summary
Line 12: Line 12:


== Presented by ==
== Presented by ==
 
Qi Chu, Gloria Huang, Di Sang, Amanda Lam, Yan Jiao, Shuyue Wang, Yutong Wu
== Background ==  
== Background ==  
There is a tremendous amount of text data in various forms created from social networks, patient records, healthcare insurance data, news outlets and more. Unstructured text is easily processed and analyzed by humans, but it is significantly harder for machines to understand. However, the large volume of text produced is still an invaluable source of information, so there is a desperate need for text mining algorithms in a wide variety of applications and domains that can effectively automate the processing of large amounts of text.  
There is a tremendous amount of text data in various forms created from social networks, patient records, healthcare insurance data, news outlets and more. Unstructured text is easily processed and analyzed by humans, but it is significantly harder for machines to understand. However, the large volume of text produced is still an invaluable source of information, so there is a desperate need for text mining algorithms in a wide variety of applications and domains that can effectively automate the processing of large amounts of text.  

Revision as of 22:11, 18 November 2018

Text mining is the process of extracting meaningful information from text that is either structured (ie. databases), semi-structured (ie. XML and JSON files) or unstructured (ie. word documents, videos and images). This paper discusses the various methods essential for the text mining field, from preprocessing and classification to clustering and extraction techniques, and also touches on applications of text mining in the biomedical and health domains.

Text preprocessing is a key component in various text mining algorithms and can affect the resulting accuracy of classification models. It encodes the text in a numerical way so that various classification models and clustering methods can be used on the data. The most common method of text representation is the vector space model, and while this is a simple model, it enables efficient analysis of large collections of documents.

The classification models used in the context of text mining aims to assign predefined classes to text documents, and some of the various models used include Naive Bayes, Nearest Neighbour, Decision Tree and Support Vector Machines (SVM). Clustering also has a wide range of applications in the context of text mining, including classification, visualization and document organization. Naive clustering methods usually do not work well for text data because it has distinct characteristics which require algorithms designed specifically for text data. The most popular text clustering algorithms used are hierarchical clustering, k-means clustering and probabilistic clustering (ie. topic modelling), but there are always tradeoffs between effectiveness and efficiency.

Information extraction (IE) is another critical aspect of text mining as it automatically extracts structured information from unstructured or semi-structured text. It is essentially a kind of "supervised" form of natural language processing where the information we are looking for is known beforehand. The first part of information extraction is named entity recognition (NER) which locates and classifies named entities in free text into predefined categories, and the second part is relation extraction which seeks and locates the semantic relations between entities in text documents. Common models used for NER include Hidden Markov model (HMM) and Conditional random fields (CRF).

One of the domains where text mining is frequently used is biomedical sciences. Due to the exponential growth in biomedical literature, it is difficult for biomedical scientists to keep up with relevant publications in their own research area. Therefore, text mining methods and machine learning algorithms are widely used to overcome the information overload.


Presented by

Qi Chu, Gloria Huang, Di Sang, Amanda Lam, Yan Jiao, Shuyue Wang, Yutong Wu

Background

There is a tremendous amount of text data in various forms created from social networks, patient records, healthcare insurance data, news outlets and more. Unstructured text is easily processed and analyzed by humans, but it is significantly harder for machines to understand. However, the large volume of text produced is still an invaluable source of information, so there is a desperate need for text mining algorithms in a wide variety of applications and domains that can effectively automate the processing of large amounts of text.

[ insert example ]

[ insert history ]

Text Preprocessing

Vector Space Model

The most common way to represent documents is to convert them into numeric vectors. This representation is called "Vector Space Model"(VSM). VSM is broadly used in various text mining algorithms and IR systems and enables efficient analysis of large collection of documents. In order to allow for more formal descriptions of the algorithms, we first define some terms and variables that will be frequently used in the following: Given a collection of documents

Classification

Clustering

Information Extraction

Biomedical Ontologies

Conclusion

References