A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques

Text mining is the process of extracting meaningful information from text that is structured (e.g., databases), semi-structured (e.g., XML and JSON files), or unstructured (e.g., word documents, videos, and images). This paper discusses the methods essential to the text mining field, from preprocessing and classification to clustering and extraction techniques, and also touches on applications of text mining in the biomedical and health domains.
 
Text preprocessing is a key component of most text mining algorithms and can affect the accuracy of the resulting classification models. It encodes text numerically so that classification models and clustering methods can be applied to the data. The most common text representation is the vector space model; although it is a simple model, it enables efficient analysis of large collections of documents.
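
As a concrete illustration of this representation (not something prescribed by the paper), the short sketch below builds a TF-IDF weighted vector space model over a toy document collection; scikit-learn is an assumed library choice and the documents are hypothetical.

<pre>
# Minimal sketch of the vector space model, assuming scikit-learn as the
# implementation; the survey itself does not prescribe a particular library.
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy document collection.
docs = [
    "Text mining extracts meaningful information from text.",
    "Clustering groups similar documents together.",
    "Classification assigns predefined classes to documents.",
]

# Preprocessing + vector space model: lowercase, tokenize, remove English
# stop words, and weight terms by TF-IDF.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(docs)          # sparse term-document matrix

print(X.shape)                              # (3 documents, |vocabulary| terms)
print(vectorizer.get_feature_names_out())   # learned vocabulary (scikit-learn >= 1.0)
</pre>

Each document becomes one row of a sparse term-document matrix, which is exactly the kind of representation consumed by the classification and clustering methods summarized next.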
 
The classification models used in the context of text mining aim to assign predefined classes to text documents; commonly used models include Naive Bayes, Nearest Neighbour, Decision Trees, and Support Vector Machines (SVMs). Clustering also has a wide range of applications in text mining, including classification, visualization, and document organization. Naive clustering methods usually do not work well for text data, because text has distinct characteristics that require algorithms designed specifically for it. The most popular text clustering algorithms are hierarchical clustering, k-means clustering, and probabilistic clustering (e.g., topic modelling), but there are always trade-offs between effectiveness and efficiency.
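
To make the classification and clustering steps concrete, the sketch below trains a Naive Bayes classifier and runs k-means clustering on TF-IDF vectors. Again, scikit-learn and the toy labelled documents are assumptions made for illustration, not part of the survey.

<pre>
# Hedged sketch: Naive Bayes classification and k-means clustering on
# TF-IDF vectors, using scikit-learn (assumed library) and hypothetical data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cluster import KMeans

docs = [
    "stock prices fell sharply on monday",            # finance
    "the market rallied after the earnings report",   # finance
    "the team won the championship game",             # sports
    "the striker scored twice in the final",          # sports
]
labels = ["finance", "finance", "sports", "sports"]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)

# Supervised classification: assign predefined classes to documents.
clf = MultinomialNB()
clf.fit(X, labels)
print(clf.predict(vec.transform(["goal scored in the second half"])))

# Unsupervised clustering: partition the same documents into k groups.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
print(km.fit_predict(X))   # cluster index assigned to each document
</pre>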
 
Information extraction (IE) is another critical aspect of text mining, as it automatically extracts structured information from unstructured or semi-structured text. It is essentially a "supervised" form of natural language processing in which the kind of information being sought is known beforehand. The first part of information extraction is named entity recognition (NER), which locates named entities in free text and classifies them into predefined categories; the second part is relation extraction, which identifies the semantic relations between entities in text documents. Common models used for NER include hidden Markov models (HMMs) and conditional random fields (CRFs).
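
The NER discussion in the paper centres on sequence models such as HMMs and CRFs. As a simplified stand-in for those models, the sketch below trains a per-token classifier on hand-crafted word and context features; it ignores the dependencies between neighbouring labels that a true HMM or CRF would capture, and the tiny BIO-tagged corpus is purely hypothetical.

<pre>
# Simplified NER sketch: a per-token classifier over hand-crafted features.
# This is a stand-in for the HMM/CRF sequence models discussed in the paper,
# not a faithful implementation of them. Toy data is hypothetical.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def token_features(tokens, i):
    """Features for the i-th token: the word itself, its shape, and context."""
    return {
        "word": tokens[i].lower(),
        "is_capitalized": tokens[i][0].isupper(),
        "prev": tokens[i - 1].lower() if i > 0 else "<start>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<end>",
    }

# Tiny labelled corpus: tokens paired with BIO-style entity tags.
sentences = [
    (["John", "works", "at", "Google", "."], ["B-PER", "O", "O", "B-ORG", "O"]),
    (["Mary", "visited", "Toronto", "yesterday"], ["B-PER", "O", "B-LOC", "O"]),
]

X_feats, y = [], []
for tokens, tags in sentences:
    for i in range(len(tokens)):
        X_feats.append(token_features(tokens, i))
        y.append(tags[i])

vec = DictVectorizer()
X = vec.fit_transform(X_feats)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

test = ["Alice", "works", "at", "Amazon", "."]
feats = [token_features(test, i) for i in range(len(test))]
print(list(zip(test, clf.predict(vec.transform(feats)))))
</pre>

In practice, CRFs are preferred over such independent per-token classifiers precisely because entity labels in a sentence are strongly correlated, and modelling those label-to-label dependencies improves NER accuracy.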
 
One of the domains where text mining is frequently used is the biomedical sciences. Due to the exponential growth of the biomedical literature, it is difficult for biomedical scientists to keep up with the relevant publications in their own research area. Therefore, text mining methods and machine learning algorithms are widely used to cope with this information overload.
 


== Background ==

== Text Preprocessing ==

== Vector Space Model ==

== Classification ==

== Clustering ==

== Information Extraction ==


== Biomedical Ontologies ==


== Conclusion ==


== References ==
