= A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques =<br />
<br />
Text mining is the process of extracting meaningful information from text that is either structured (i.e. databases), semi-structured (i.e. XML and JSON files) or unstructured (i.e. Word documents, videos and images). <br />
This paper discusses the various methods essential for the text mining field, from preprocessing and classification to clustering and extraction techniques, and also touches on applications of text mining in the biomedical and health domains. <br />
<br />
Text preprocessing is a key component in various text mining algorithms and can affect the resulting accuracy of classification models. It encodes the text in a numerical way so that various classification models and clustering methods can be used on the data. The most common method of text representation is the vector space model, and while this is a simple model, it enables efficient analysis of large collections of documents. <br />
<br />
The classification models used in the context of text mining aim to assign predefined classes to text documents; some of the models used include Naive Bayes, Nearest Neighbour, Decision Trees and Support Vector Machines (SVMs). Clustering also has a wide range of applications in the context of text mining, including classification, visualization and document organization. Naive clustering methods usually do not work well for text data because text has distinct characteristics that require algorithms designed specifically for it. The most popular text clustering algorithms are hierarchical clustering, k-means clustering and probabilistic clustering (i.e. topic modelling), but there are always tradeoffs between effectiveness and efficiency. <br />
<br />
Information extraction (IE) is another critical aspect of text mining, as it automatically extracts structured information from unstructured or semi-structured text. It is essentially a kind of "supervised" form of natural language processing where the information we are looking for is known beforehand. The first part of information extraction is named entity recognition (NER), which locates and classifies named entities in free text into predefined categories, and the second part is relation extraction, which seeks out and locates the semantic relations between entities in text documents. Common models used for NER include hidden Markov models (HMMs) and conditional random fields (CRFs).<br />
<br />
One of the domains where text mining is frequently used is biomedical sciences. Due to the exponential growth in biomedical literature, it is difficult for biomedical scientists to keep up with relevant publications in their own research area. Therefore, text mining methods and machine learning algorithms are widely used to overcome the information overload.<br />
<br />
<br />
== Presented by ==<br />
Qi Chu, Xiaoran Huang, Di Sang, Amanda Lam, Yan Jiao, Shuyue Wang, Yutong Wu<br />
== Background == <br />
There is a tremendous amount of text data in various forms created from social networks, patient records, healthcare insurance data, news outlets and more. Unstructured text is easily processed and analyzed by humans, but it is significantly harder for machines to understand. However, the large volume of text produced is still an invaluable source of information, so there is a desperate need for text mining algorithms that can effectively automate the processing of large amounts of text across a wide variety of applications and domains. Academia is a key domain where text mining has become increasingly important - the National Centre for Text Mining, operated by the University of Manchester, is the first publicly-funded centre that provides text mining services for the UK academic community [http://www.nactem.ac.uk/ (NaCTeM)]. Some of the earliest papers applying existing or new methods to textual data for text mining include research on a cluster-based approach to browsing large document collections (1992) [https://doi.org/10.1145/133160.133214], indexing by latent semantic analysis (1990) [https://doi.org/10.1002/(SICI)1097-4571(199009)41:6&#60;391::AID-ASI1&#62;3.0.CO;2-9], and knowledge discovery in textual databases (1995) [https://dl.acm.org/citation.cfm?id=3001354]. While text mining uses techniques similar to traditional data mining, the characteristics and nature of textual data require some text-specific adaptations.<br />
<br />
== Text Representation And Encoding == <br />
<br />
=== Vector Space Model ===<br />
The most common way to represent documents is to convert them into numeric vectors. This representation is called the "Vector Space Model" (VSM). VSM is broadly used in various text mining algorithms and IR systems and enables efficient analysis of large collections of documents. In order to allow for more formal descriptions of the algorithms, we first define some terms and variables that will be used frequently in the following: given a collection of documents <math>D = \{ d_1, d_2, \dotsc, d_{|D|} \}</math>, let <math>V = \{ w_1, w_2, \dotsc, w_v \}</math> be the set of distinct words/terms in the collection. Then <math>V</math> is called the ''vocabulary''. The ''frequency'' of the term <math>w \in V</math> in document <math>d \in D</math> is denoted by <math>f_d(w)</math>. The term vector for document <math>d</math> is denoted by <math>\vec{t}_d = (f_d(w_1), f_d(w_2), \dotsc, f_d(w_v))</math>.<br />
<br />
<br />
In VSM each word is represented by a variable having a numeric value indicating the ''weight'' (importance) of the word in the document. There are two main term weighting models. ''1) Boolean model'': a weight <math>w_{ij} > 0</math> is assigned to each term <math>w_i \in d_j</math>; for any term that does not appear in <math>d_j</math>, <math>w_{ij} = 0</math>. ''2) Term frequency-inverse document frequency (TF-IDF)'': the most popular term weighting scheme. Let <math>q</math> be this term weighting scheme; then the weight of each word <math>w \in d</math> is computed as follows:<br />
<br />
<math>q(w) = f_d(w) \cdot \log \frac{|D|}{f_D(w)}</math><br />
<br />
where <math>|D|</math> is the number of documents in the collection <math>D</math>, and <math>f_D(w)</math> denotes the number of documents in which the term <math>w</math> appears.<br />
<br />
<br />
In TF-IDF the term frequency is normalized by the ''inverse document frequency'' (IDF). This normalization decreases the weight of terms that occur frequently across the document collection, ensuring that the matching of documents is affected more by distinctive words which have relatively low frequencies in the collection.<br />
<br />
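To make the formula concrete, here is a minimal sketch of this TF-IDF weighting in plain Python; the toy corpus is hypothetical.<br />
<pre>
import math
from collections import Counter

# Toy corpus: each document is a pre-tokenized list of terms (hypothetical data).
docs = [["text", "mining", "is", "fun"],
        ["text", "classification", "and", "text", "clustering"],
        ["mining", "gold", "is", "hard"]]

# f_D(w): the number of documents in which term w appears.
doc_freq = Counter(w for d in docs for w in set(d))

def tfidf(doc):
    # q(w) = f_d(w) * log(|D| / f_D(w))
    tf = Counter(doc)
    return {w: tf[w] * math.log(len(docs) / doc_freq[w]) for w in tf}

# "fun" (rare) gets a higher weight than "text" or "is" (common).
print(tfidf(docs[0]))
</pre>
<br />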
Based on the term weighting scheme, each document is represented by a vector of term weights <math>w(d) = (w(d,w_1), w(d,w_2), \dotsc, w(d,w_v))</math>. We can then compute the similarity between two documents <math>d_1</math> and <math>d_2</math>. One of the most widely used similarity measures is cosine similarity, which is computed as follows:<br />
<br />
<math>S(d_1,d_2) = \cos \theta = \frac{\vec{d_1} \cdot \vec{d_2}}{\sqrt{\sum_{i=1}^v w_{1i}^2} \cdot \sqrt{\sum_{i=1}^v w_{2i}^2}}</math><br />
<br />
<br />
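A minimal sketch of cosine similarity over two term-weight vectors (the vectors are hypothetical TF-IDF weights over a shared vocabulary):<br />
<pre>
import math

def cosine(u, v):
    # S(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

d1 = [0.41, 0.0, 1.10, 0.41]
d2 = [0.41, 0.69, 0.0, 0.0]
print(cosine(d1, d2))  # 1.0 = same direction, 0.0 = no terms in common
</pre>
<br />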
Preprocessing is one of the key components in many text mining algorithms. It usually consists of tasks such as tokenization, filtering, lemmatization and stemming. These four techniques are widely used in NLP.<br />
<br />
Tokenization: Tokenization is the task of breaking a character sequence up into pieces (words/phrases) called tokens, perhaps at the same time throwing away certain characters such as punctuation marks. The list of tokens is then used for further processing.<br />
<br />
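For example, a crude regular-expression tokenizer might look like the following sketch; production systems typically rely on a library such as NLTK or spaCy.<br />
<pre>
import re

sentence = "Text mining, at its core, extracts knowledge from raw text!"
# Keep alphanumeric runs as tokens and discard punctuation.
tokens = re.findall(r"[A-Za-z0-9]+", sentence)
print(tokens)  # ['Text', 'mining', 'at', 'its', 'core', 'extracts', ...]
</pre>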
<br />
Filtering: Filtering is usually done on documents to remove some of the words. A common form of filtering is stop-word removal. Stop words are words that appear frequently in the text without carrying much content information (e.g. prepositions, conjunctions, etc.). Similarly, words occurring very often in the text are said to carry little information for distinguishing between documents, and words occurring very rarely are also possibly of no significant relevance; both can be removed from the documents.<br />
<br />
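For example, stop-word removal can be sketched as a simple set lookup; the tiny stop-word list here is illustrative, and libraries such as NLTK ship much larger ones.<br />
<pre>
tokens = ["text", "mining", "is", "the", "process", "of", "extracting",
          "meaningful", "information", "from", "text"]
# A tiny hand-picked stop-word list (illustrative only).
stop_words = {"is", "the", "of", "from", "a", "an", "and"}
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['text', 'mining', 'process', 'extracting', ...]
</pre>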
<br />
Lemmatization: Lemmatization is the task that considers the morphological analysis of words, i.e. grouping together the various inflected forms of a word so they can be analyzed as a single item. In other words, lemmatization methods try to map verb forms to the infinitive and nouns to a single form. In order to lemmatize the documents we first must specify the part of speech (POS) of each word, and because POS tagging is tedious and error prone, in practice stemming methods are often preferred.<br />
<br />
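For example, a sketch using NLTK's WordNet lemmatizer (this assumes NLTK is installed and the WordNet corpus has been downloaded); note that the POS tag must be supplied.<br />
<pre>
from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
# The POS tag is hard-coded here for illustration; a real pipeline
# would obtain it from a POS tagger.
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("geese", pos="n"))    # 'goose'
</pre>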
<br />
Stemming: Stemming methods aim at obtaining the stem (root) of derived words. Stemming algorithms are language dependent. The first stemming algorithm, the Lovins stemmer, was introduced in 1968; the most widely used stemming method for English, the Porter stemmer, was introduced in 1980.<br />
<br />
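Briefly, the Lovins stemmer strips the longest matching suffix from a large suffix list in a single pass, while the Porter stemmer applies an ordered sequence of suffix-rewriting rules over several steps. A minimal sketch using NLTK's Porter implementation (assuming NLTK is installed):<br />
<pre>
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["studies", "studying", "computation", "computing"]:
    print(word, "->", stemmer.stem(word))
# 'studies' and 'studying' both map to 'studi'; note that a stem
# need not itself be a dictionary word.
</pre>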
<br />
== Classification ==<br />
=== Problem Statement ===<br />
The problem of classification is defined as follows. <br />
We have a training set <math>D = \{ d_1, d_2, \dotsc, d_n \}</math> of documents, such that each document <math>d_i</math> is labeled with a label <math>\ell_i</math> from the set <math>L = \{ \ell_1, \ell_2, \dotsc, \ell_k \}</math>. <br />
The task is to find a classification model (classifier) <math>f</math> where <math>f : D \rightarrow L</math> and <math>f(d) = \ell</math>.<br />
<br />
To evaluate the performance of the classification model, we set aside a test set. After training the classifier on the training set, we classify the test set; the proportion of correctly classified documents out of the total number of documents is called the accuracy. <br />
<br />
The common evaluation metrics for text classification are precision, recall and the F-1 score. Charu C. Aggarwal and others define these metrics as follows:<br />
<br />
'''Precision:''' The fraction of the correct instances among the identified positive instances. <br />
<br />
'''Recall:''' The percentage of correct instances among all the positive instances.<br />
<br />
'''F-1 score:''' The harmonic mean of precision and recall.<br />
<math>F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}</math><br />
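<br />
As a quick worked example of these metrics (the counts are hypothetical):<br />
<pre>
# Hypothetical counts from a binary text classifier.
tp, fp, fn = 80, 20, 40   # true positives, false positives, false negatives

precision = tp / (tp + fp)   # 0.8
recall = tp / (tp + fn)      # ~0.667
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))          # 0.727: the harmonic mean of 0.8 and 0.667
</pre>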
<br />
=== Naive Bayes Classifier ===<br />
<br />
'''Definition:'''<br />
The Naive Bayes classifier is a simple yet widely used classifier. It makes assumptions about how the data (in our case, words in documents) are generated and proposes a probabilistic model based on these assumptions. A set of training examples is then used to estimate the parameters of the model. Bayes' rule is used to classify new examples by selecting the class that is most likely to have generated the example. There are two main models commonly used for naive Bayes classification: the multi-variate Bernoulli model and the multinomial model. Both models try to find the posterior probability of a class based on the distribution of the words in the document; however, the multinomial model takes into account the frequency of the words whereas the multi-variate Bernoulli model does not.<br />
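<br />
A minimal sketch of the multinomial model using scikit-learn (the tiny training corpus and labels are hypothetical):<br />
<pre>
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["cheap pills buy now", "meeting agenda attached",
              "win money now", "project status report"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_docs)     # word-frequency vectors

model = MultinomialNB().fit(X, labels)       # estimates class/word probabilities
print(model.predict(vectorizer.transform(["buy cheap now"])))  # ['spam']
</pre>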
<br />
=== Nearest Neighbor Classifier ===<br />
The main idea is that documents belonging to the same class are more likely to be "similar", or close to each other, under the similarity measure. The class of a test document is inferred from the class labels of the similar documents in the training set. If we consider the k nearest neighbours in the training data set, the approach is called k-nearest neighbour classification, and the most common class among these k neighbours is reported as the class label.<br />
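<br />
A sketch of k-nearest-neighbour classification on TF-IDF vectors with cosine distance, using scikit-learn (the corpus is hypothetical):<br />
<pre>
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_docs = ["stock markets fell sharply", "team wins championship game",
              "shares rally on strong earnings", "player scores winning goal"]
labels = ["finance", "sports", "finance", "sports"]

vec = TfidfVectorizer()
X = vec.fit_transform(train_docs)
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine").fit(X, labels)
print(knn.predict(vec.transform(["markets rally as shares rise"])))  # ['finance']
</pre>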
<br />
=== Decision Tree Classifiers ===<br />
A decision tree is essentially a hierarchical tree over the training instances, in which a condition on an attribute value is used to divide the data hierarchically.<br />
<br />
In the case of text data, the conditions on the decision tree nodes are commonly defined in terms of the terms appearing in the text documents. For instance, a node may be subdivided into its children based on the presence or absence of a particular term in the document.<br />
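<br />
A sketch of a decision tree over binary presence/absence features, using scikit-learn (the documents and labels are hypothetical):<br />
<pre>
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

train_docs = ["free offer inside", "offer budget review",
              "free budget notes", "vacation meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer(binary=True)      # 1 if a term is present, else 0
X = vec.fit_transform(train_docs)
tree = DecisionTreeClassifier().fit(X, labels)
# Here 'free' perfectly separates the classes, so the root node tests it.
print(tree.predict(vec.transform(["free quarterly report"])))  # ['spam']
</pre>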
<br />
=== Support Vector Machines ===<br />
Support Vector Machines (SVMs) are a form of linear classifier. In the context of text documents, linear classifiers are models that make a classification decision based on the value of a linear combination of the document's features. The output of a linear predictor is defined to be <math>y = \vec{a} \cdot \vec{x} + b</math>, where <math>\vec{x} = (x_1, x_2, \dotsc, x_n)</math> is the normalized document word frequency vector, <math>\vec{a} = (a_1, a_2, \dotsc, a_n)</math> is a vector of coefficients and <math>b</math> is a scalar. For categorical class labels, the predictor <math>y = \vec{a} \cdot \vec{x} + b</math> can be interpreted as a separating hyperplane between the different classes.<br />
<br />
One advantage of the SVM method is that it is quite robust to high dimensionality, i.e. learning is almost independent of the dimensionality of the feature space. It rarely needs feature selection since it selects the data points (support vectors) required for the classification. According to Joachims and others, text data is an ideal choice for SVM classification due to the sparse, high-dimensional nature of text with few irrelevant features. SVM methods have been widely used in many application domains such as pattern recognition, face detection and spam filtering.<br />
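<br />
A sketch of a linear SVM on TF-IDF features, using scikit-learn (the corpus is hypothetical):<br />
<pre>
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_docs = ["the patch fixes a kernel bug", "senate passes new bill",
              "compiler emits faster code", "election results announced"]
labels = ["tech", "politics", "tech", "politics"]

vec = TfidfVectorizer()
X = vec.fit_transform(train_docs)
svm = LinearSVC().fit(X, labels)   # learns the hyperplane a.x + b
print(svm.predict(vec.transform(["kernel compiler code"])))  # ['tech']
</pre>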
<br />
== Clustering == <br />
<br />
Clustering is the task of grouping objects in a collection using a similarity function. Clustering in text mining can be performed at different levels, such as documents, paragraphs, sentences or terms. Clustering helps enhance retrieval and browsing, and is applied in many areas including classification, visualization and document organization. For example, clustering can be used to produce a table of contents or to construct a context-based retrieval system. Various software packages such as Lemur and BOW have implementations of common clustering algorithms.<br />
<br />
A simple clustering approach is to represent text documents as binary vectors based on the presence or absence of words. However, approaches like this are insufficient for representing text because documents have the following distinctive properties: text representations usually have large dimensionality but sparse underlying data; word correlations within each piece of text are generally very strong and should be taken into account by the clustering algorithm; and normalization should be included in the clustering step since documents often differ greatly in size.<br />
<br />
Common clustering algorithms include:<br />
=== Hierarchical Clustering Algorithms ===<br />
<br />
Top-down (divisive) hierarchical clustering begins with one cluster and recursively splits it into sub-clusters. Bottom-up (agglomerative) hierarchical clustering starts with each data point as its own cluster and successively merges clusters until all data points are in a single cluster.<br />
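<br />
A sketch of bottom-up clustering with SciPy on a few hypothetical 2-D document embeddings:<br />
<pre>
from scipy.cluster.hierarchy import linkage, fcluster

points = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
Z = linkage(points, method="average")          # merge the closest clusters first
print(fcluster(Z, t=2, criterion="maxclust"))  # [1 1 2 2]: two flat clusters
</pre>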
<br />
=== K-means Clustering ===<br />
<br />
Given a document set <math>D</math> and a similarity measure <math>S</math>, first choose <math>k</math> clusters and randomly select their initial centroids; then iteratively assign each document to the cluster with the closest centroid and recalculate each centroid as the mean of all documents in that cluster.<br />
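<br />
A minimal NumPy sketch of these steps (hypothetical 2-D vectors stand in for document representations; real pipelines often use sklearn.cluster.KMeans on TF-IDF vectors instead):<br />
<pre>
import numpy as np

def kmeans(X, k, iters=10, seed=1):
    rng = np.random.default_rng(seed)
    # Randomly pick k documents as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each document to its closest centroid.
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        # Recompute each centroid as the mean of its assigned documents
        # (this simple sketch assumes no cluster ends up empty).
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels

X = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])
print(kmeans(X, k=2))  # e.g. [0 0 1 1]; the result depends on the initialization
</pre>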
<br />
=== Probabilistic Clustering and Topic Models ===<br />
Topic modeling constructs a probabilistic generative model over text corpora. Latent Dirichlet Allocation (LDA) is a state-of-the-art unsupervised model: a three-level hierarchical Bayesian model in which documents are represented as random mixtures over latent topics and each topic is characterized by a distribution over words.<br />
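<br />
A sketch of LDA with scikit-learn (the corpus is hypothetical and far too small for meaningful topic inference; it only illustrates the shape of the approach):<br />
<pre>
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["genes dna protein sequence", "match team score win",
        "protein folding structure", "season playoff team coach"]

vec = CountVectorizer()
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Each row of components_ scores every vocabulary word for one topic.
words = vec.get_feature_names_out()
for topic in lda.components_:
    print([words[i] for i in topic.argsort()[-3:][::-1]])
</pre>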
<br />
== Information Extraction == <br />
<br />
== Text Mining in Biomedical Domain ==<br />
The amount of biomedical literature is growing dramatically, which makes it difficult for biomedical scientists to keep up with new discoveries and to deal with biomedical experimental data. For example, there is a vast number of technical terms in the biomedical domain, and recognizing these terms and linking them to real entities is necessary when using machine learning methods to facilitate biomedical research. Applications of text mining in the biomedical domain have therefore become indispensable. Recently, text mining methods have been utilized in a variety of biomedical areas such as protein structure prediction, gene clustering, biomedical hypothesis generation and clinical diagnosis, to name a few.<br />
<br />
Information extraction, as mentioned above, is a useful preprocessing step for extracting structured information from scientific articles in the biomedical literature. For instance, named entity recognition (NER) is often used to link entities to formal terms in the biomedical domain, and relation extraction makes it possible to identify complex but well-defined relationships among entities. NER methods are usually grouped into several approaches: dictionary-based, rule-based and statistical approaches.<br />
<br />
Summarization is a common biomedical text mining task that uses information extraction methods. It aims at automatically identifying the significant aspects of one or more documents and representing them in a coherent fashion. Question answering is another biomedical text mining task that heavily exploits information extraction methods; it is defined as the process of producing accurate answers to questions posed by humans in a natural language.<br />
<br />
== Conclusion ==<br />
The paper gives a good overall introduction to the vast field of text mining. Each method has its own tradeoffs between effectiveness and efficiency, and choosing a particular method very much depends on the textual data being dealt with. The paper also discusses some of the important domains and applications of text mining, particularly in the biomedical domain, including information extraction, summarization and question answering. Information will continue to grow exponentially as data has become an integral part of our world. For scientists, having a growing amount of resources at their fingertips is very useful; however, the number of papers being published online is growing significantly, which makes it difficult for them to keep up. Text mining is also inevitably important for social media and data companies like Facebook and Google due to the large amounts of data they possess. In addition, it is useful for other industries that deal with a lot of textual data, such as law and media.<br />
<br />
== References ==<br />
* [https://doi.org/10.1145/133160.133214] Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. 1992. Scatter/Gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '92), Nicholas Belkin, Peter Ingwersen, and Annelise Mark Pejtersen (Eds.). ACM, New York, NY, USA, 318-329. DOI: https://doi.org/10.1145/133160.133214<br />
* [https://doi.org/10.1002/(SICI)1097-4571(199009)41:6&#60;391::AID-ASI1&#62;3.0.CO;2-9] Deerwester, S. , Dumais, S. T., Furnas, G. W., Landauer, T. K. and Harshman, R. (1990), Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci., 41: 391-407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9<br />
* [https://dl.acm.org/citation.cfm?id=3001354] Ronen Feldman and Ido Dagan. 1995. Knowledge discovery in Textual Databases (KDT). In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD'95), Usama Fayyad and Ramasamy Uthurusamy (Eds.). AAAI Press 112-117.<br />
* [https://arxiv.org/abs/1707.02919] Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., & Kochut, K. (2017). A brief survey of text mining: Classification, clustering and extraction techniques. arXiv preprint arXiv:1707.02919.
Ajylam
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Brief_Survey_of_Text_Mining:_Classification,_Clustering_and_Extraction_Techniques&diff=40383
A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques
2018-11-20T18:12:18Z
<p>Ajylam: /* Conclusion */</p>
<hr />
<div>Text mining is the process of extracting meaningful information from text that is either structured (ie. databases), semi-structured (ie. XML and JSON files) or unstructured (ie. word documents, videos and images). <br />
This paper discusses the various methods essential for the text mining field, from preprocessing and classification to clustering and extraction techniques, and also touches on applications of text mining in the biomedical and health domains. <br />
<br />
Text preprocessing is a key component in various text mining algorithms and can affect the resulting accuracy of classification models. It encodes the text in a numerical way so that various classification models and clustering methods can be used on the data. The most common method of text representation is the vector space model, and while this is a simple model, it enables efficient analysis of large collections of documents. <br />
<br />
The classification models used in the context of text mining aims to assign predefined classes to text documents, and some of the various models used include Naive Bayes, Nearest Neighbour, Decision Tree and Support Vector Machines (SVM). Clustering also has a wide range of applications in the context of text mining, including classification, visualization and document organization. Naive clustering methods usually do not work well for text data because it has distinct characteristics which require algorithms designed specifically for text data. The most popular text clustering algorithms used are hierarchical clustering, k-means clustering and probabilistic clustering (ie. topic modelling), but there are always tradeoffs between effectiveness and efficiency. <br />
<br />
Information extraction (IE) is another critical aspect of text mining as it automatically extracts structured information from unstructured or semi-structured text. It is essentially a kind of "supervised" form of natural language processing where the information we are looking for is known beforehand. The first part of information extraction is named entity recognition (NER) which locates and classifies named entities in free text into predefined categories, and the second part is relation extraction which seeks and locates the semantic relations between entities in text documents. Common models used for NER include Hidden Markov model (HMM) and Conditional random fields (CRF).<br />
<br />
One of the domains where text mining is frequently used is biomedical sciences. Due to the exponential growth in biomedical literature, it is difficult for biomedical scientists to keep up with relevant publications in their own research area. Therefore, text mining methods and machine learning algorithms are widely used to overcome the information overload.<br />
<br />
<br />
== Presented by ==<br />
Qi Chu, Xiaoran Huang, Di Sang, Amanda Lam, Yan Jiao, Shuyue Wang, Yutong Wu<br />
== Background == <br />
There is a tremendous amount of text data in various forms created from social networks, patient records, healthcare insurance data, news outlets and more. Unstructured text is easily processed and analyzed by humans, but it is significantly harder for machines to understand. However, the large volume of text produced is still an invaluable source of information, so there is a desperate need for text mining algorithms in a wide variety of applications and domains that can effectively automate the processing of large amounts of text. Academics is a key domain where text mining has becoming increasingly important - the National Centre for Text Mining, operated by the University of Manchester, is the first publicly-funded centre that provides text mining services for the UK academic community [http://www.nactem.ac.uk/ (NaCTeM)]. Some of the earliest known papers of applying already known or new methods to textual data for text mining include research on a cluster-based approach to browsing large document collections (1992)[https://doi.org/10.1145/133160.133214], indexing by latent semantic analysis (1990) [https://doi.org/10.1002/(SICI)1097-4571(199009)41:6&#60;391::AID-ASI1&#62;3.0.CO;2-9], and knowledge discovery in textual databases (1995)[https://dl.acm.org/citation.cfm?id=3001354]. While text mining uses similar techniques as traditional data mining, the characteristics and nature of textual data requires some specific differences.<br />
<br />
== Text Representation And Encoding == <br />
<br />
=== Vector Space Model ===<br />
The most common way to represent documents is to convert them into numeric vectors. This representation is called "Vector Space Model" (VSM). VSM is broadly used in various text mining algorithms and IR systems and enables efficient analysis of large collection of documents. In order to allow for more formal descriptions of the algorithms, we first define some terms and variables that will be frequently used in the following: Given a collection of documents <math>D = ( d_1 ,d_2 , \dotsc ,d_D )</math>, let <math>V = ( w_1 ,w_2 , \dotsc ,w_v )</math> be the set of distinct words/terms in the collection. Then <math>V</math> is called the ''vocabulary''. The ''frequency'' of the term <math>w &isin;V</math> in document <math>d &isin;D</math> is shown by <math>f_D(w)</math>. The term vector for document <math>d</math> is denoted by <math> \vec t_d\ = (f_d(w_1), f_d(w_2), \dotsc, f_d(w_v))</math>.<br />
<br />
<br />
In VSM each word is represented by a variable having a numeric value indicating the ''weight'' (importance) of the word in the document. There are two main term weight models: ''1)Boolean model'' : In this model a weight <math>w_{ij}>0</math> is assigned to each term <math>w_i &isin;d_j</math>. For any term that does not appear in <math>d_j, w_{ij} = 0</math>. ''2)Term frequency-inverse document frequency'' (TF-IDF):The most popular term weighting schemes is TF-IDF. Let <math>q</math> be this term weighting scheme, then the weight of each word <math>w_i &isin;d</math> is computed as follows:<br />
<br />
<math>q(w) = f_d(w)*log\frac{\left\vert D \right\vert}{f_D(w)}</math><br />
<br />
where <math>\left\vert D \right\vert</math> is the number of documents in the collection <math>D</math>.<br />
<br />
<br />
In TF-IDF the term frequency is normalized by ''inverse document frequency'', IDF. This normalization decreases theweight of the terms occurring more frequently in the document collection, Making sure that the matching of documents be more effected by distinctive words which have relatively low frequencies in the collection.<br />
<br />
Based on the term weighting scheme, each document is represented by a vector of term weights <math>w(d) = (w(d,w_1), w(d,w_2), \dotsc, w(d,w_v))</math>. We can compute the similarity between two documents <math>d_1</math> and <math>d_2</math>. One of the most widely used similarity measures is cosine similarity and is computed as follows:<br />
<br />
<math>S(d_1,d_2) = \cos \theta = \frac{\displaystyle d_1 · d_2}{{\displaystyle \sqrt {\sum_{i=1}^v w_{1i}^2}} · {\displaystyle \sqrt {\sum_{i=1}^v w_{2i}^2}}}</math><br />
<br />
<br />
Preprocessing is one of the key components in many text mining algorithms. It usually consists of the tasks such as tokenization, filtering, lemmatization and stemming. These four techniques are widely used in NLP.<br />
<br />
Tokenization: Tokenization is the task of breaking a character sequence up into pieces (words/phrases) called tokens, and perhaps at the same time throw away certain characters such as punctuation marks. The list of tokens then is used to further processing.<br />
<br />
[insert example] <br />
<br />
Filtering: Filtering is usually done on documents to remove some of the words. A common filtering is stop-words removal. Stop words are the words frequently appear in the text without having much content information (e.g. prepositions, conjunctions, etc). Similarly words occurring quite often in the text said to have little information to distinguish different documents and also words occurring very rarely are also possibly of no significant relevance and can be removed from the documents.<br />
<br />
[insert example]<br />
<br />
Lemmatization: Lemmatization is the task that considers the morphological analysis of the words, i.e. grouping together the various inflected forms of a word so they can be analyzed as a single item. In other words lemmatization methods try to map verb forms to infinite tense and nouns to a single form. In order to lemmatize the documents we first must specify the POS of each word of the documents and because POS is tedious and error prone, in practice stemming methods are preferred.<br />
<br />
[insert example]<br />
<br />
Stemming: Stemming methods aim at obtaining stem (root) of derived words. Stemming algorithms are indeed language dependent. The first stemming algorithm introduced in 1968. The most widely stemming method used in English is introduced in 1980.<br />
<br />
[talk about the two stemmers]<br />
<br />
== Classification ==<br />
=== Problem Statement ===<br />
The problem of classification is defined as follows. <br />
We have a training set <math>D = {d_1,d_2, . . . ,d_n }</math> of documents, such that each document di is labeled with a label ℓi from the set <math>L = {ℓ_1, ℓ_2, . . . , ℓ_k }</math>. <br />
The task is to find a classification model (classifier) <math>f</math> where <math>f : D → L, f (d) = ℓ</math>.<br />
<br />
To evaluate the performance of the classification model, we set aside a test set. After training the classifier with training set, we classify the test set and the portion of correctly classified documents to the total number of documents is called accuracy. <br />
<br />
The common evaluation metrics for text classification are precision, recall and F-1 scores. Charu C Aggarwal and others defines these metrics as follows:<br />
<br />
'''Precision:''' The fraction of the correct instances among the identified positive instances. <br />
<br />
'''Recall:''' The percentage of correct instances among all the positive instances.<br />
<br />
'''F-1 score:''' The geometric mean of precision and recall.<br />
<math>F1 = 2 × \frac{precision × recall}{precision + recall}</math><br />
<br />
=== Naive Bayes Classifier ===<br />
<br />
'''Definition:'''<br />
The Naive Bayes classifier is a simple yet widely used classifier. It makes assumptions about how the data (in our case words in documents) are generated and propose a probabilistic<br />
model based on these assumptions. Then use a set of training examples to estimate the parameters of the model. Bayes rule is used to classify new examples and select the class that is most likely has generated the example. There are two main models commonly used for naive Bayes classifications: Multi-variate Bernoulli Model and Multinomial Model. Both models try to find the posterior probability<br />
of a class, based on the distribution of the words in the document. However, Multinomial Model takes into account the frequency of the words whereas the Multi-variate Bernoulli Model does<br />
not.<br />
<br />
=== Nearest Neighbor Classifier ===<br />
The main idea is that documents which belong to the same class are more likely “similar” or close to each other based on the similarity measures. The classification of the test document is inferred from the class labels of the similar documents in the training set. If we consider the k-nearest neighbor in the training data set, the approach is called k-nearest neighbor classification and the most common class from these k neighbors is reported as the class label.<br />
<br />
=== Decision Tree Classifiers ===<br />
Decision tree is basically a hierarchical tree of the training instances, in which a condition on the attribute value is used to divide the data hierarchically.<br />
<br />
In case of text data, the conditions on the decision tree nodes are commonly defined in terms of terms in the text documents. For instance a node may be subdivided to its children relying on the presence or absence of a particular term in the document.<br />
<br />
=== Support Vector Machines ===<br />
Support Vector Machines (SVM) are a form of Linear Classifiers. In the context of text documents, Linear Classifiers are models that make a classification decision based on the value of the linear combinations of the documents features. The output of a linear predictor is defined to be <math>y = \vec{a} \cdot \vec{x} + b</math>, where <math>\vec{x} = (x1, x2, . . . , xn)</math> is the<br />
normalized document word frequency vector, <math>\vec{a} = (a1, a2, . . . , an)</math> is vector of coefficients and b is a scalar. We can interpret the predictor <math>y = \vec{a} \cdot \vec{x} + b</math> in the categorical class labels as a separating hyperplane between different classes.<br />
<br />
One advantage of the SVM method is that, it is quite robust to high dimensionality, i.e. learning is almost independent of the dimensionality of the feature space. It rarely needs feature selection<br />
since it selects data points (support vectors) required for the classification. According to Joachims and others, text data is an ideal choice for SVM classification due to sparse high dimensional<br />
nature of the text with few irrelevant features. SVM methods have been widely used in many application domains such as pattern recognition, face detection and spam filtering.<br />
<br />
== Clustering == <br />
<br />
Clustering is the task of grouping objects in a collection using a similarity function. Clustering in text mining can be in different levels such as documents, paragraphs, sentences or terms.<br />
Clustering helps enhance retrieval and browsing and is applied in many fields including classification, visualization and document organization. For example, clustering can be used to produce a table of contents or to construct a context-based retrieval systems. Various software programs such as Lemur and BOW have implementations of common clustering algorithms.<br />
<br />
A simple algorithm of clustering is representing text documents as binary vectors using the presence or absence of words. However, algorithms like this are insufficient for text representing due to documents having the following unique properties: Text representation usually has a large dimensionality but sparse underlying data; Word correlation in each text piece is generally very strong and thus should be taken into account in the clustering algorithm; Normalizing should also be included in the clustering step since documents are often of very different sizes.<br />
<br />
Common algorithms of clustering include:<br />
=== Hierarchical Clustering algorithms ===<br />
<br />
Top-down(divisive) hierarchical clustering begins with one cluster and recursively split it into sub-clusters. Bottom-up(agglomerative) hierarchical starts with each data point as a single cluster and successively merge clusters until all data points are in a single cluster.<br />
<br />
=== K-means Clustering ===<br />
<br />
Given a document set D and a similarity measure S, first select k clusters and randomly select their centroids, then recursively assign documents to clusters with the closest centroids and recalculate the centroids by taking the mean of all documents in each cluster.<br />
<br />
=== Probabilistic Clustering and Topic Models ===<br />
Topic modeling constructs a probabilistic generative model over text corpora. Latent Dirichlet Allocation is a state of the art unsupervised model that is a three-level hierarchical Bayesian model in which documents are represented as random mixtures over latent topics and each topic is characterized by a distribution over words.<br />
<br />
== Information Extraction == <br />
<br />
== Text Mining in Biomedical Domain ==<br />
The amount of biomedical literature is growing dramatically, which disturbs biomedical scientists with following up new discoveries and dealing with biomedical experimental data. For example, there is thousands of millions of academic terms in biomedical domain, and recognizing these long terms and relating them to real entity are necessary when considering using machine-learning method to facilitate biomedical researches. Therefore, applications of text mining in biomedical domain become inevitable. Recently, text mining methods have been utilized in a variety of biomedical domains such as protein structure prediction, gene clustering, biomedical hypothesis and clinical diagnosis, to name a few.<br />
<br />
Information Extraction, aforementioned, is a useful preprocessing step to extract structured information from scientific articles in biomedical literature. For instance, Named-Entity Recognition(NER) often used to link entities to formal terms in biomedical domain. Also, Relation Extraction makes it possible for several entities to set complex but clear relationships with each other. In particular, NER methods are usually grouped into several approaches: dictionary-based approach, rule-based approach and statistical approaches.<br />
<br />
Summarization is a common biomedical text mining task using information extraction method. It aims at identifying the significant aspects of one or more documents and represent them in a coherent fashion automatically. In addition, question answering is another biomedical text mining task where significantly exploits information extraction methods. It is defined as the process of producing accurate answers to questions posed by humans in a natural language.<br />
<br />
== Conclusion ==<br />
The paper gave a great overall introduction into the vast field of text mining. Each method has its own tradeoffs between effectiveness and efficiency, and choosing a particular method very much depends on the textual data that is being dealt with. The paper also discussed some of the important domains and applications of text mining, particularly in the biomedical domain. Information will continue to grow exponentially as data has become an integral part of our world. For scientists, having a growing amount of resources at their fingertips is very useful; however, the amount of papers being published online are growing significantly which makes it difficult for them to keep up.<br />
Text mining is also important for social media and data companies like Facebook and Google due to the large amounts of data that they possess. In addition, it is useful for other industries that deal with a lot of textual data, such as law and media.<br />
<br />
== References ==<br />
* [https://doi.org/10.1145/133160.133214] Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. 1992. Scatter/Gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '92), Nicholas Belkin, Peter Ingwersen, and Annelise Mark Pejtersen (Eds.). ACM, New York, NY, USA, 318-329. DOI: https://doi.org/10.1145/133160.133214<br />
* [https://doi.org/10.1002/(SICI)1097-4571(199009)41:6&#60;391::AID-ASI1&#62;3.0.CO;2-9] Deerwester, S. , Dumais, S. T., Furnas, G. W., Landauer, T. K. and Harshman, R. (1990), Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci., 41: 391-407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9<br />
* [https://dl.acm.org/citation.cfm?id=3001354] Ronen Feldman and Ido Dagan. 1995. Knowledge discovery in Textual Databases (KDT). In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD'95), Usama Fayyad and Ramasamy Uthurusamy (Eds.). AAAI Press 112-117.<br />
* [https://arxiv.org/abs/1707.02919] Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., & Kochut, K. (2017). A brief survey of text mining: Classification, clustering and extraction techniques. arXiv preprint arXiv:1707.02919.</div>
Ajylam
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Brief_Survey_of_Text_Mining:_Classification,_Clustering_and_Extraction_Techniques&diff=40382
A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques
2018-11-20T17:56:07Z
<p>Ajylam: /* Background */</p>
<hr />
<div>Text mining is the process of extracting meaningful information from text that is either structured (ie. databases), semi-structured (ie. XML and JSON files) or unstructured (ie. word documents, videos and images). <br />
This paper discusses the various methods essential for the text mining field, from preprocessing and classification to clustering and extraction techniques, and also touches on applications of text mining in the biomedical and health domains. <br />
<br />
Text preprocessing is a key component in various text mining algorithms and can affect the resulting accuracy of classification models. It encodes the text in a numerical way so that various classification models and clustering methods can be used on the data. The most common method of text representation is the vector space model, and while this is a simple model, it enables efficient analysis of large collections of documents. <br />
<br />
The classification models used in the context of text mining aims to assign predefined classes to text documents, and some of the various models used include Naive Bayes, Nearest Neighbour, Decision Tree and Support Vector Machines (SVM). Clustering also has a wide range of applications in the context of text mining, including classification, visualization and document organization. Naive clustering methods usually do not work well for text data because it has distinct characteristics which require algorithms designed specifically for text data. The most popular text clustering algorithms used are hierarchical clustering, k-means clustering and probabilistic clustering (ie. topic modelling), but there are always tradeoffs between effectiveness and efficiency. <br />
<br />
Information extraction (IE) is another critical aspect of text mining as it automatically extracts structured information from unstructured or semi-structured text. It is essentially a kind of "supervised" form of natural language processing where the information we are looking for is known beforehand. The first part of information extraction is named entity recognition (NER) which locates and classifies named entities in free text into predefined categories, and the second part is relation extraction which seeks and locates the semantic relations between entities in text documents. Common models used for NER include Hidden Markov model (HMM) and Conditional random fields (CRF).<br />
<br />
One of the domains where text mining is frequently used is biomedical sciences. Due to the exponential growth in biomedical literature, it is difficult for biomedical scientists to keep up with relevant publications in their own research area. Therefore, text mining methods and machine learning algorithms are widely used to overcome the information overload.<br />
<br />
<br />
== Presented by ==<br />
Qi Chu, Xiaoran Huang, Di Sang, Amanda Lam, Yan Jiao, Shuyue Wang, Yutong Wu<br />
== Background == <br />
There is a tremendous amount of text data in various forms created from social networks, patient records, healthcare insurance data, news outlets and more. Unstructured text is easily processed and analyzed by humans, but it is significantly harder for machines to understand. However, the large volume of text produced is still an invaluable source of information, so there is a desperate need for text mining algorithms in a wide variety of applications and domains that can effectively automate the processing of large amounts of text. Academics is a key domain where text mining has becoming increasingly important - the National Centre for Text Mining, operated by the University of Manchester, is the first publicly-funded centre that provides text mining services for the UK academic community [http://www.nactem.ac.uk/ (NaCTeM)]. Some of the earliest known papers of applying already known or new methods to textual data for text mining include research on a cluster-based approach to browsing large document collections (1992)[https://doi.org/10.1145/133160.133214], indexing by latent semantic analysis (1990) [https://doi.org/10.1002/(SICI)1097-4571(199009)41:6&#60;391::AID-ASI1&#62;3.0.CO;2-9], and knowledge discovery in textual databases (1995)[https://dl.acm.org/citation.cfm?id=3001354]. While text mining uses similar techniques as traditional data mining, the characteristics and nature of textual data requires some specific differences.<br />
<br />
== Text Representation And Encoding == <br />
<br />
=== Vector Space Model ===<br />
The most common way to represent documents is to convert them into numeric vectors. This representation is called "Vector Space Model" (VSM). VSM is broadly used in various text mining algorithms and IR systems and enables efficient analysis of large collection of documents. In order to allow for more formal descriptions of the algorithms, we first define some terms and variables that will be frequently used in the following: Given a collection of documents <math>D = ( d_1 ,d_2 , \dotsc ,d_D )</math>, let <math>V = ( w_1 ,w_2 , \dotsc ,w_v )</math> be the set of distinct words/terms in the collection. Then <math>V</math> is called the ''vocabulary''. The ''frequency'' of the term <math>w &isin;V</math> in document <math>d &isin;D</math> is shown by <math>f_D(w)</math>. The term vector for document <math>d</math> is denoted by <math> \vec t_d\ = (f_d(w_1), f_d(w_2), \dotsc, f_d(w_v))</math>.<br />
<br />
<br />
In VSM each word is represented by a variable having a numeric value indicating the ''weight'' (importance) of the word in the document. There are two main term weight models: ''1)Boolean model'' : In this model a weight <math>w_{ij}>0</math> is assigned to each term <math>w_i &isin;d_j</math>. For any term that does not appear in <math>d_j, w_{ij} = 0</math>. ''2)Term frequency-inverse document frequency'' (TF-IDF):The most popular term weighting schemes is TF-IDF. Let <math>q</math> be this term weighting scheme, then the weight of each word <math>w_i &isin;d</math> is computed as follows:<br />
<br />
<math>q(w) = f_d(w)*log\frac{\left\vert D \right\vert}{f_D(w)}</math><br />
<br />
where <math>\left\vert D \right\vert</math> is the number of documents in the collection <math>D</math>.<br />
<br />
<br />
In TF-IDF the term frequency is normalized by ''inverse document frequency'', IDF. This normalization decreases theweight of the terms occurring more frequently in the document collection, Making sure that the matching of documents be more effected by distinctive words which have relatively low frequencies in the collection.<br />
<br />
Based on the term weighting scheme, each document is represented by a vector of term weights <math>w(d) = (w(d,w_1), w(d,w_2), \dotsc, w(d,w_v))</math>. We can compute the similarity between two documents <math>d_1</math> and <math>d_2</math>. One of the most widely used similarity measures is cosine similarity and is computed as follows:<br />
<br />
<math>S(d_1,d_2) = \cos \theta = \frac{\displaystyle d_1 · d_2}{{\displaystyle \sqrt {\sum_{i=1}^v w_{1i}^2}} · {\displaystyle \sqrt {\sum_{i=1}^v w_{2i}^2}}}</math><br />
<br />
<br />
Preprocessing is one of the key components in many text mining algorithms. It usually consists of the tasks such as tokenization, filtering, lemmatization and stemming. These four techniques are widely used in NLP.<br />
<br />
Tokenization: Tokenization is the task of breaking a character sequence up into pieces (words/phrases) called tokens, and perhaps at the same time throw away certain characters such as punctuation marks. The list of tokens then is used to further processing.<br />
<br />
[insert example] <br />
<br />
Filtering: Filtering is usually done on documents to remove some of the words. A common filtering is stop-words removal. Stop words are the words frequently appear in the text without having much content information (e.g. prepositions, conjunctions, etc). Similarly words occurring quite often in the text said to have little information to distinguish different documents and also words occurring very rarely are also possibly of no significant relevance and can be removed from the documents.<br />
<br />
[insert example]<br />
<br />
Lemmatization: Lemmatization is the task that considers the morphological analysis of the words, i.e. grouping together the various inflected forms of a word so they can be analyzed as a single item. In other words lemmatization methods try to map verb forms to infinite tense and nouns to a single form. In order to lemmatize the documents we first must specify the POS of each word of the documents and because POS is tedious and error prone, in practice stemming methods are preferred.<br />
<br />
[insert example]<br />
<br />
Stemming: Stemming methods aim at obtaining stem (root) of derived words. Stemming algorithms are indeed language dependent. The first stemming algorithm introduced in 1968. The most widely stemming method used in English is introduced in 1980.<br />
<br />
[talk about the two stemmers]<br />
<br />
== Classification ==<br />
=== Problem Statement ===<br />
The problem of classification is defined as follows. <br />
We have a training set <math>D = {d_1,d_2, . . . ,d_n }</math> of documents, such that each document di is labeled with a label ℓi from the set <math>L = {ℓ_1, ℓ_2, . . . , ℓ_k }</math>. <br />
The task is to find a classification model (classifier) <math>f</math> where <math>f : D → L, f (d) = ℓ</math>.<br />
<br />
To evaluate the performance of the classification model, we set aside a test set. After training the classifier with training set, we classify the test set and the portion of correctly classified documents to the total number of documents is called accuracy. <br />
<br />
The common evaluation metrics for text classification are precision, recall and F-1 scores. Charu C Aggarwal and others defines these metrics as follows:<br />
<br />
'''Precision:''' The fraction of the correct instances among the identified positive instances. <br />
<br />
'''Recall:''' The percentage of correct instances among all the positive instances.<br />
<br />
'''F-1 score:''' The geometric mean of precision and recall.<br />
<math>F1 = 2 × \frac{precision × recall}{precision + recall}</math><br />
<br />
=== Naive Bayes Classifier ===<br />
<br />
'''Definition:'''<br />
The Naive Bayes classifier is a simple yet widely used classifier. It makes assumptions about how the data (in our case words in documents) are generated and propose a probabilistic<br />
model based on these assumptions. Then use a set of training examples to estimate the parameters of the model. Bayes rule is used to classify new examples and select the class that is most likely has generated the example. There are two main models commonly used for naive Bayes classifications: Multi-variate Bernoulli Model and Multinomial Model. Both models try to find the posterior probability<br />
of a class, based on the distribution of the words in the document. However, Multinomial Model takes into account the frequency of the words whereas the Multi-variate Bernoulli Model does<br />
not.<br />
<br />
=== Nearest Neighbor Classifier ===<br />
The main idea is that documents which belong to the same class are more likely “similar” or close to each other based on the similarity measures. The classification of the test document is inferred from the class labels of the similar documents in the training set. If we consider the k-nearest neighbor in the training data set, the approach is called k-nearest neighbor classification and the most common class from these k neighbors is reported as the class label.<br />
<br />
=== Decision Tree Classifiers ===<br />
Decision tree is basically a hierarchical tree of the training instances, in which a condition on the attribute value is used to divide the data hierarchically.<br />
<br />
In case of text data, the conditions on the decision tree nodes are commonly defined in terms of terms in the text documents. For instance a node may be subdivided to its children relying on the presence or absence of a particular term in the document.<br />
<br />
=== Support Vector Machines ===<br />
Support Vector Machines (SVM) are a form of Linear Classifiers. In the context of text documents, Linear Classifiers are models that make a classification decision based on the value of the linear combinations of the documents features. The output of a linear predictor is defined to be <math>y = \vec{a} \cdot \vec{x} + b</math>, where <math>\vec{x} = (x1, x2, . . . , xn)</math> is the<br />
normalized document word frequency vector, <math>\vec{a} = (a1, a2, . . . , an)</math> is vector of coefficients and b is a scalar. We can interpret the predictor <math>y = \vec{a} \cdot \vec{x} + b</math> in the categorical class labels as a separating hyperplane between different classes.<br />
<br />
One advantage of the SVM method is that, it is quite robust to high dimensionality, i.e. learning is almost independent of the dimensionality of the feature space. It rarely needs feature selection<br />
since it selects data points (support vectors) required for the classification. According to Joachims and others, text data is an ideal choice for SVM classification due to sparse high dimensional<br />
nature of the text with few irrelevant features. SVM methods have been widely used in many application domains such as pattern recognition, face detection and spam filtering.<br />
<br />
== Clustering == <br />
<br />
Clustering is the task of grouping objects in a collection using a similarity function. Clustering in text mining can be in different levels such as documents, paragraphs, sentences or terms.<br />
Clustering helps enhance retrieval and browsing and is applied in many fields including classification, visualization and document organization. For example, clustering can be used to produce a table of contents or to construct a context-based retrieval systems. Various software programs such as Lemur and BOW have implementations of common clustering algorithms.<br />
<br />
A simple algorithm of clustering is representing text documents as binary vectors using the presence or absence of words. However, algorithms like this are insufficient for text representing due to documents having the following unique properties: Text representation usually has a large dimensionality but sparse underlying data; Word correlation in each text piece is generally very strong and thus should be taken into account in the clustering algorithm; Normalizing should also be included in the clustering step since documents are often of very different sizes.<br />
<br />
Common algorithms of clustering include:<br />
=== Hierarchical Clustering algorithms ===<br />
<br />
Top-down(divisive) hierarchical clustering begins with one cluster and recursively split it into sub-clusters. Bottom-up(agglomerative) hierarchical starts with each data point as a single cluster and successively merge clusters until all data points are in a single cluster.<br />
<br />
=== K-means Clustering ===<br />
<br />
Given a document set D and a similarity measure S, first select k clusters and randomly select their centroids, then recursively assign documents to clusters with the closest centroids and recalculate the centroids by taking the mean of all documents in each cluster.<br />
<br />
=== Probabilistic Clustering and Topic Models ===<br />
Topic modeling constructs a probabilistic generative model over text corpora. Latent Dirichlet Allocation is a state of the art unsupervised model that is a three-level hierarchical Bayesian model in which documents are represented as random mixtures over latent topics and each topic is characterized by a distribution over words.<br />
<br />
== Information Extraction == <br />
<br />
== Text Mining in Biomedical Domain ==<br />
The amount of biomedical literature is growing dramatically, which makes it difficult for biomedical scientists to follow new discoveries and manage biomedical experimental data. For example, there are millions of technical terms in the biomedical domain, and recognizing these long terms and linking them to real entities is necessary before machine-learning methods can be used to facilitate biomedical research. Applications of text mining in the biomedical domain have therefore become indispensable. Recently, text mining methods have been applied in a variety of biomedical areas such as protein structure prediction, gene clustering, biomedical hypothesis generation and clinical diagnosis, to name a few.<br />
<br />
Information extraction, as described above, is a useful preprocessing step for extracting structured information from scientific articles in the biomedical literature. For instance, named-entity recognition (NER) is often used to link entity mentions to formal terms in the biomedical domain, and relation extraction makes it possible to identify complex but well-defined relationships between entities. NER methods are usually grouped into several approaches: dictionary-based, rule-based and statistical.<br />
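<br />
As an illustration of the dictionary-based approach, the following sketch links surface mentions to formal terms by exact lookup; the two lexicon entries are hypothetical placeholders, and real systems use curated biomedical terminologies:<br />
<pre>
import re

# Hypothetical lexicon mapping surface forms to formal terms.
lexicon = {"brca1": "Gene:BRCA1", "aspirin": "Drug:acetylsalicylic acid"}

def dictionary_ner(text):
    """Return (mention, linked formal term) pairs found by exact lookup."""
    hits = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        term = lexicon.get(token.lower())
        if term:
            hits.append((token, term))
    return hits

print(dictionary_ner("BRCA1 mutations may alter the response to aspirin."))
</pre>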
<br />
Summarization is a common biomedical text mining task that uses information extraction methods. It aims at automatically identifying the significant aspects of one or more documents and representing them in a coherent fashion. Question answering is another biomedical text mining task that significantly exploits information extraction methods; it is defined as the process of producing accurate answers to questions posed by humans in natural language.<br />
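<br />
A toy extractive summarizer in this spirit, scoring each sentence by the within-text frequency of its words and keeping the top-scoring ones; this is a simplistic stand-in for the information-extraction-based methods discussed in the survey:<br />
<pre>
import re
from collections import Counter

def summarize(text, n=1):
    """Keep the n sentences whose words are most frequent in the text."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    score = lambda s: sum(freq[w] for w in re.findall(r"[a-z]+", s.lower()))
    return " ".join(sorted(sentences, key=score, reverse=True)[:n])

doc = ("Text mining extracts meaningful information from text. "
       "Summarization identifies the significant aspects of documents. "
       "It then represents those aspects in a coherent fashion.")
print(summarize(doc))
</pre>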
<br />
== Conclusion ==<br />
<br />
== References ==<br />
* [https://doi.org/10.1145/133160.133214] Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. 1992. Scatter/Gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '92), Nicholas Belkin, Peter Ingwersen, and Annelise Mark Pejtersen (Eds.). ACM, New York, NY, USA, 318-329. DOI: https://doi.org/10.1145/133160.133214<br />
* [https://doi.org/10.1002/(SICI)1097-4571(199009)41:6&#60;391::AID-ASI1&#62;3.0.CO;2-9] Deerwester, S. , Dumais, S. T., Furnas, G. W., Landauer, T. K. and Harshman, R. (1990), Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci., 41: 391-407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9<br />
* [https://dl.acm.org/citation.cfm?id=3001354] Ronen Feldman and Ido Dagan. 1995. Knowledge discovery in Textual Databases (KDT). In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD'95), Usama Fayyad and Ramasamy Uthurusamy (Eds.). AAAI Press 112-117.<br />
* [https://arxiv.org/abs/1707.02919] Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., & Kochut, K. (2017). A brief survey of text mining: Classification, clustering and extraction techniques. arXiv preprint arXiv:1707.02919.</div>
Ajylam
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18&diff=39788
stat441F18
2018-11-18T20:37:38Z
<p>Ajylam: </p>
<hr />
<div><br />
<br />
== [[F18-STAT841-Proposal| Project Proposal ]] ==<br />
<br />
[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
<br />
<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Nov 13 || Jason Schneider, Jordyn Walton, Zahraa Abbas, Andrew Na || 1|| Memory-Based Parameter Adaptation || [https://arxiv.org/pdf/1802.10542.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Memory-Based_Parameter_Adaptation#Incremental_Learning Summary]<br />
|-<br />
|Nov 13 ||Sai Praneeth M, Xudong Peng, Alice Li, Shahrzad Hosseini Vajargah|| 2|| Going Deeper with Convolutions ||[https://arxiv.org/pdf/1409.4842.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary]<br />
|-<br />
|Nov 15 || Yan Yu Chen, Qisi Deng, Hengxin Li, Bochao Zhang|| 3|| Topic Compositional Neural Language Model|| [https://arxiv.org/pdf/1712.09783.pdf paper] || <br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM Summary]<br />
|-<br />
|Nov 15 || Zhaoran Hou, Pei Wei Wang, Chi Zhang, Yiming Li, Daoyi Chen, Ying Chi|| 4|| Extreme Learning Machine for regression and Multi-class Classification|| [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6035797 Paper] || <br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841F18/ Summary]<br />
|-<br />
|Nov 20 || Kristi Brewster, Isaac McLellan, Ahmad Nayar Hassan, Marina Medhat Rassmi Melek, Brendan Ross, Jon Barenboim, Junqiao Lin, James Bootsma || 5|| A Neural Representation of Sketch Drawings || [https://arxiv.org/pdf/1704.03477.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_-_A_Neural_Representation_of_Sketch_Drawings Summary] <br />
|-<br />
|Nov 20 || Maya(Mahdiyeh) Bayati, Saber Malekmohammadi, Vincent Loung || 6|| Convolutional Neural Networks for Sentence Classification || [https://arxiv.org/pdf/1408.5882.pdf paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Convolutional_Neural_Networks_for_Sentence_Classi%EF%AC%81cation Summary] <br />
|-<br />
|Nov 22 || Qingxi Huo, Yanmin Yang, Jiaqi Wang, Yuanjing Cai, Colin Stranc, Philomène Bobichon, Aditya Maheshwari, Zepeng An || 7|| Robust Probabilistic Modeling with Bayesian Data Reweighting || [http://proceedings.mlr.press/v70/wang17g/wang17g.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robust_Probabilistic_Modeling_with_Bayesian_Data_Reweighting Summary]<br />
|-<br />
|Nov 22 || Hanzhen Yang, Jing Pu Sun, Ganyuan Xuan, Yu Su, Jiacheng Weng, Keqi Li, Yi Qian, Bomeng Liu || 8|| Deep Residual Learning for Image Recognition || [http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Residual_Learning_for_Image_Recognition Summary]<br />
|-<br />
|Nov 27 || Mitchell Snaith || 9|| You Only Look Once: Unified, Real-Time Object Detection, V1 -> V3 || [https://arxiv.org/pdf/1506.02640.pdf Paper] || <br />
|-<br />
|Nov 27 || Qi Chu, Gloria Huang, Di Sang, Amanda Lam, Yan Jiao, Shuyue Wang, Yutong Wu, Shikun Cui || 10|| A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques || [https://arxiv.org/pdf/1707.02919.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Brief_Survey_of_Text_Mining:_Classification,_Clustering_and_Extraction_Techniques Summary]<br />
|-<br />
|Nov 29 || Jameson Ngo, Amy Xu, Aden Grant, Yu Hao Wang, Andrew McMurry, Baizhi Song, Yongqi Dong || 11|| Towards Deep Learning Models Resistant to Adversarial Attacks || [https://arxiv.org/pdf/1706.06083.pdf Paper] || <br />
|-<br />
|Nov 29 || Qianying Zhao, Hui Huang, Lingyun Yi, Jiayue Zhang, Siao Chen, Rongrong Su, Gezhou Zhang, Meiyu Zhou || 12|| XGBoost: A Scalable Tree Boosting System || [http://delivery.acm.org/10.1145/2940000/2939785/p785-chen.pdf?ip=129.97.124.2&id=2939785&acc=CHORUS&key=FD0067F557510FFB%2E9219CF56F73DCF78%2E4D4702B0C3E38B35%2E6D218144511F3437&__acm__=1542321481_ffea42f38a2b3325af4990280553c10f Paper] ||<br />
|-<br />
|Nov 28 || Hudson Ash, Stephen Kingston, Richard Zhang, Alexandre Xiao, Ziqiu Zhu || 13 || Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness || [https://arxiv.org/pdf/1608.05842.pdf Paper] ||<br />
|-<br />
|Nov 21 || Frank Jiang, Yuan Zhang, Jerry Hu || 14 || Distributed Representations of Words and Phrases and their Compositionality || [https://arxiv.org/pdf/1310.4546.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality Summary]<br />
|-<br />
|Nov 21 || Yu Xuan Lee, Tsen Yee Heng || 15 || Gradient Episodic Memory for Continual Learning || [http://papers.nips.cc/paper/7225-gradient-episodic-memory-for-continual-learning.pdf Paper] || <br />
|-<br />
|Nov 28 || Ben Zhang, Rees Simmons, Sunil Mall || 16 || Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift || [https://arxiv.org/pdf/1502.03167.pdf Paper] ||<br />
|}</div>
Ajylam
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Text_Mining_Classification_Clustering_Extraction_Techniques&diff=39786
Text Mining Classification Clustering Extraction Techniques
2018-11-18T20:35:53Z
<p>Ajylam: Ajylam moved page Text Mining Classification Clustering Extraction Techniques to A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques</p>
<hr />
<div>#REDIRECT [[A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques]]</div>
Ajylam
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Brief_Survey_of_Text_Mining:_Classification,_Clustering_and_Extraction_Techniques&diff=39785
A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques
2018-11-18T20:35:53Z
<p>Ajylam: Ajylam moved page Text Mining Classification Clustering Extraction Techniques to A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques</p>
<hr />
<div></div>
Ajylam
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Brief_Survey_of_Text_Mining:_Classification,_Clustering_and_Extraction_Techniques&diff=39784
A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques
2018-11-18T20:34:54Z
<p>Ajylam: Blanked the page</p>
<hr />
<div></div>
Ajylam
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18&diff=39783
stat441F18
2018-11-18T20:34:31Z
<p>Ajylam: </p>
<hr />
<div><br />
<br />
== [[F18-STAT841-Proposal| Project Proposal ]] ==<br />
<br />
[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
<br />
<br />
<br />
=Paper presentation=<br />
{| class="wikitable" border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Nov 13 || Jason Schneider, Jordyn Walton, Zahraa Abbas, Andrew Na || 1|| Memory-Based Parameter Adaptation || [https://arxiv.org/pdf/1802.10542.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Memory-Based_Parameter_Adaptation#Incremental_Learning Summary]<br />
|-<br />
|Nov 13 ||Sai Praneeth M, Xudong Peng, Alice Li, Shahrzad Hosseini Vajargah|| 2|| Going Deeper with Convolutions ||[https://arxiv.org/pdf/1409.4842.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary]<br />
|-<br />
|Nov 15 || Yan Yu Chen, Qisi Deng, Hengxin Li, Bochao Zhang|| 3|| Topic Compositional Neural Language Model|| [https://arxiv.org/pdf/1712.09783.pdf Paper] || <br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM Summary]<br />
|-<br />
|Nov 15 || Zhaoran Hou, Pei Wei Wang, Chi Zhang, Yiming Li, Daoyi Chen, Ying Chi|| 4|| Extreme Learning Machine for Regression and Multiclass Classification|| [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6035797 Paper] || <br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841F18/ Summary]<br />
|-<br />
|Nov 20 || Kristi Brewster, Isaac McLellan, Ahmad Nayar Hassan, Marina Medhat Rassmi Melek, Brendan Ross, Jon Barenboim, Junqiao Lin, James Bootsma || 5|| A Neural Representation of Sketch Drawings || [https://arxiv.org/pdf/1704.03477.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_-_A_Neural_Representation_of_Sketch_Drawings Summary] <br />
|-<br />
|Nov 20 || Maya(Mahdiyeh) Bayati, Saber Malekmohammadi, Vincent Loung || 6|| Convolutional Neural Networks for Sentence Classification || [https://arxiv.org/pdf/1408.5882.pdf paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Convolutional_Neural_Networks_for_Sentence_Classi%EF%AC%81cation Summary] <br />
|-<br />
|Nov 22 || Qingxi Huo, Yanmin Yang, Jiaqi Wang, Yuanjing Cai, Colin Stranc, Philomène Bobichon, Aditya Maheshwari, Zepeng An || 7|| Robust Probabilistic Modeling with Bayesian Data Reweighting || [http://proceedings.mlr.press/v70/wang17g/wang17g.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robust_Probabilistic_Modeling_with_Bayesian_Data_Reweighting Summary]<br />
|-<br />
|Nov 22 || Hanzhen Yang, Jing Pu Sun, Ganyuan Xuan, Yu Su, Jiacheng Weng, Keqi Li, Yi Qian, Bomeng Liu || 8|| Deep Residual Learning for Image Recognition || [http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Residual_Learning_for_Image_Recognition Summary]<br />
|-<br />
|Nov 27 || Mitchell Snaith || 9|| You Only Look Once: Unified, Real-Time Object Detection, V1 -> V3 || [https://arxiv.org/pdf/1506.02640.pdf Paper] || <br />
|-<br />
|Nov 27 || Qi Chu, Gloria Huang, Di Sang, Amanda Lam, Yan Jiao, Shuyue Wang, Yutong Wu, Shikun Cui || 10|| A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques || [https://arxiv.org/pdf/1707.02919.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Brief_Survey_of_Text_Mining:_Classification,_Clustering_and_Extraction_Techniques_(Summary) Summary]<br />
|-<br />
|Nov 29 || Jameson Ngo, Amy Xu, Aden Grant, Yu Hao Wang, Andrew McMurry, Baizhi Song, Yongqi Dong || 11|| Towards Deep Learning Models Resistant to Adversarial Attacks || [https://arxiv.org/pdf/1706.06083.pdf Paper] || <br />
|-<br />
|Nov 29 || Qianying Zhao, Hui Huang, Lingyun Yi, Jiayue Zhang, Siao Chen, Rongrong Su, Gezhou Zhang, Meiyu Zhou || 12|| XGBoost: A Scalable Tree Boosting System || [http://delivery.acm.org/10.1145/2940000/2939785/p785-chen.pdf?ip=129.97.124.2&id=2939785&acc=CHORUS&key=FD0067F557510FFB%2E9219CF56F73DCF78%2E4D4702B0C3E38B35%2E6D218144511F3437&__acm__=1542321481_ffea42f38a2b3325af4990280553c10f Paper] ||<br />
|-<br />
|Nov 28 || Hudson Ash, Stephen Kingston, Richard Zhang, Alexandre Xiao, Ziqiu Zhu || 13 || Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness || [https://arxiv.org/pdf/1608.05842.pdf Paper] ||<br />
|-<br />
|Nov 21 || Frank Jiang, Yuan Zhang, Jerry Hu || 14 || Distributed Representations of Words and Phrases and their Compositionality || [https://arxiv.org/pdf/1310.4546.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality Summary]<br />
|-<br />
|Nov 21 || Yu Xuan Lee, Tsen Yee Heng || 15 || Gradient Episodic Memory for Continual Learning || [http://papers.nips.cc/paper/7225-gradient-episodic-memory-for-continual-learning.pdf Paper] || <br />
|-<br />
|Nov 28 || Ben Zhang, Rees Simmons, Sunil Mall || 16 || Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift || [https://arxiv.org/pdf/1502.03167.pdf Paper] ||</div>
Ajylam
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18&diff=39782
stat441F18
2018-11-18T20:34:08Z
<p>Ajylam: </p>
<hr />
<div><br />
<br />
== [[F18-STAT841-Proposal| Project Proposal ]] ==<br />
<br />
[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
<br />
<br />
<br />
=Paper presentation=<br />
{| class="wikitable" border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Nov 13 || Jason Schneider, Jordyn Walton, Zahraa Abbas, Andrew Na || 1|| Memory-Based Parameter Adaptation || [https://arxiv.org/pdf/1802.10542.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Memory-Based_Parameter_Adaptation#Incremental_Learning Summary]<br />
|-<br />
|Nov 13 ||Sai Praneeth M, Xudong Peng, Alice Li, Shahrzad Hosseini Vajargah|| 2|| Going Deeper with Convolutions ||[https://arxiv.org/pdf/1409.4842.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary]<br />
|-<br />
|Nov 15 || Yan Yu Chen, Qisi Deng, Hengxin Li, Bochao Zhang|| 3|| Topic Compositional Neural Language Model|| [https://arxiv.org/pdf/1712.09783.pdf Paper] || <br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM Summary]<br />
|-<br />
|Nov 15 || Zhaoran Hou, Pei Wei Wang, Chi Zhang, Yiming Li, Daoyi Chen, Ying Chi|| 4|| Extreme Learning Machine for Regression and Multiclass Classification|| [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6035797 Paper] || <br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841F18/ Summary]<br />
|-<br />
|Nov 20 || Kristi Brewster, Isaac McLellan, Ahmad Nayar Hassan, Marina Medhat Rassmi Melek, Brendan Ross, Jon Barenboim, Junqiao Lin, James Bootsma || 5|| A Neural Representation of Sketch Drawings || [https://arxiv.org/pdf/1704.03477.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_-_A_Neural_Representation_of_Sketch_Drawings Summary] <br />
|-<br />
|Nov 20 || Maya(Mahdiyeh) Bayati, Saber Malekmohammadi, Vincent Loung || 6|| Convolutional Neural Networks for Sentence Classification || [https://arxiv.org/pdf/1408.5882.pdf paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Convolutional_Neural_Networks_for_Sentence_Classi%EF%AC%81cation Summary] <br />
|-<br />
|Nov 22 || Qingxi Huo, Yanmin Yang, Jiaqi Wang, Yuanjing Cai, Colin Stranc, Philomène Bobichon, Aditya Maheshwari, Zepeng An || 7|| Robust Probabilistic Modeling with Bayesian Data Reweighting || [http://proceedings.mlr.press/v70/wang17g/wang17g.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robust_Probabilistic_Modeling_with_Bayesian_Data_Reweighting Summary]<br />
|-<br />
|Nov 22 || Hanzhen Yang, Jing Pu Sun, Ganyuan Xuan, Yu Su, Jiacheng Weng, Keqi Li, Yi Qian, Bomeng Liu || 8|| Deep Residual Learning for Image Recognition || [http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Residual_Learning_for_Image_Recognition Summary]<br />
|-<br />
|Nov 27 || Mitchell Snaith || 9|| You Only Look Once: Unified, Real-Time Object Detection, V1 -> V3 || [https://arxiv.org/pdf/1506.02640.pdf Paper] || <br />
|-<br />
|Nov 27 || Qi Chu, Gloria Huang, Di Sang, Amanda Lam, Yan Jiao, Shuyue Wang, Yutong Wu, Shikun Cui || 10|| A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques || [https://arxiv.org/pdf/1707.02919.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Brief_Survey_of_Text_Mining:_Classification,_Clustering_and_Extraction_Techniques_(Summary)]<br />
|-<br />
|Nov 29 || Jameson Ngo, Amy Xu, Aden Grant, Yu Hao Wang, Andrew McMurry, Baizhi Song, Yongqi Dong || 11|| Towards Deep Learning Models Resistant to Adversarial Attacks || [https://arxiv.org/pdf/1706.06083.pdf Paper] || <br />
|-<br />
|Nov 29 || Qianying Zhao, Hui Huang, Lingyun Yi, Jiayue Zhang, Siao Chen, Rongrong Su, Gezhou Zhang, Meiyu Zhou || 12|| XGBoost: A Scalable Tree Boosting System || [http://delivery.acm.org/10.1145/2940000/2939785/p785-chen.pdf?ip=129.97.124.2&id=2939785&acc=CHORUS&key=FD0067F557510FFB%2E9219CF56F73DCF78%2E4D4702B0C3E38B35%2E6D218144511F3437&__acm__=1542321481_ffea42f38a2b3325af4990280553c10f Paper] ||<br />
|-<br />
|Nov 28 || Hudson Ash, Stephen Kingston, Richard Zhang, Alexandre Xiao, Ziqiu Zhu || 13 || Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness || [https://arxiv.org/pdf/1608.05842.pdf Paper] ||<br />
|-<br />
|Nov 21 || Frank Jiang, Yuan Zhang, Jerry Hu || 14 || Distributed Representations of Words and Phrases and their Compositionality || [https://arxiv.org/pdf/1310.4546.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality Summary]<br />
|-<br />
|Nov 21 || Yu Xuan Lee, Tsen Yee Heng || 15 || Gradient Episodic Memory for Continual Learning || [http://papers.nips.cc/paper/7225-gradient-episodic-memory-for-continual-learning.pdf Paper] || <br />
|-<br />
|Nov 28 || Ben Zhang, Rees Simmons, Sunil Mall || 16 || Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift || [https://arxiv.org/pdf/1502.03167.pdf Paper] ||</div>
Ajylam
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Brief_Survey_of_Text_Mining:_Classification,_Clustering_and_Extraction_Techniques&diff=39781
A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques
2018-11-18T20:32:55Z
<p>Ajylam: </p>
<hr />
<div>[ insert introduction here ]<br />
<br />
== Background == <br />
<br />
== Text Preprocessing == <br />
<br />
=== Vector Space Model === <br />
<br />
== Classification ==<br />
<br />
== Clustering == <br />
<br />
== Information Extraction == <br />
<br />
== Biomedical ontologies ==<br />
<br />
== Conclusion ==<br />
<br />
== References ==</div>
Ajylam
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Brief_Survey_of_Text_Mining:_Classification,_Clustering_and_Extraction_Techniques&diff=39780
A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques
2018-11-18T20:32:21Z
<p>Ajylam: Created page with "= Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques = ---- [ insert introduction here ] == Background == == Text Preprocessing == === Ve..."</p>
<hr />
<div>= Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques = <br />
----<br />
<br />
[ insert introduction here ]<br />
<br />
== Background == <br />
<br />
== Text Preprocessing == <br />
<br />
=== Vector Space Model === <br />
<br />
== Classification ==<br />
<br />
== Clustering == <br />
<br />
== Information Extraction == <br />
<br />
== Biomedical ontologies ==<br />
<br />
== Conclusion ==<br />
<br />
== References ==</div>
Ajylam
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18&diff=39779
stat441F18
2018-11-18T20:19:17Z
<p>Ajylam: </p>
<hr />
<div><br />
<br />
== [[F18-STAT841-Proposal| Project Proposal ]] ==<br />
<br />
[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
<br />
<br />
<br />
=Paper presentation=<br />
{| class="wikitable" border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Nov 13 || Jason Schneider, Jordyn Walton, Zahraa Abbas, Andrew Na || 1|| Memory-Based Parameter Adaptation || [https://arxiv.org/pdf/1802.10542.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Memory-Based_Parameter_Adaptation#Incremental_Learning Summary]<br />
|-<br />
|Nov 13 ||Sai Praneeth M, Xudong Peng, Alice Li, Shahrzad Hosseini Vajargah|| 2|| Going Deeper with Convolutions ||[https://arxiv.org/pdf/1409.4842.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary]<br />
|-<br />
|Nov 15 || Yan Yu Chen, Qisi Deng, Hengxin Li, Bochao Zhang|| 3|| Topic Compositional Neural Language Model|| [https://arxiv.org/pdf/1712.09783.pdf Paper] || <br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM Summary]<br />
|-<br />
|Nov 15 || Zhaoran Hou, Pei Wei Wang, Chi Zhang, Yiming Li, Daoyi Chen, Ying Chi|| 4|| Extreme Learning Machine for Regression and Multiclass Classification|| [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6035797 Paper] || <br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841F18/ Summary]<br />
|-<br />
|Nov 20 || Kristi Brewster, Isaac McLellan, Ahmad Nayar Hassan, Marina Medhat Rassmi Melek, Brendan Ross, Jon Barenboim, Junqiao Lin, James Bootsma || 5|| A Neural Representation of Sketch Drawings || [https://arxiv.org/pdf/1704.03477.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_-_A_Neural_Representation_of_Sketch_Drawings Summary] <br />
|-<br />
|Nov 20 || Maya(Mahdiyeh) Bayati, Saber Malekmohammadi, Vincent Loung || 6|| Convolutional Neural Networks for Sentence Classification || [https://arxiv.org/pdf/1408.5882.pdf paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Convolutional_Neural_Networks_for_Sentence_Classi%EF%AC%81cation Summary] <br />
|-<br />
|Nov 22 || Qingxi Huo, Yanmin Yang, Jiaqi Wang, Yuanjing Cai, Colin Stranc, Philomène Bobichon, Aditya Maheshwari, Zepeng An || 7|| Robust Probabilistic Modeling with Bayesian Data Reweighting || [http://proceedings.mlr.press/v70/wang17g/wang17g.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robust_Probabilistic_Modeling_with_Bayesian_Data_Reweighting Summary]<br />
|-<br />
|Nov 22 || Hanzhen Yang, Jing Pu Sun, Ganyuan Xuan, Yu Su, Jiacheng Weng, Keqi Li, Yi Qian, Bomeng Liu || 8|| Deep Residual Learning for Image Recognition || [http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Residual_Learning_for_Image_Recognition Summary]<br />
|-<br />
|Nov 27 || Mitchell Snaith || 9|| You Only Look Once: Unified, Real-Time Object Detection, V1 -> V3 || [https://arxiv.org/pdf/1506.02640.pdf Paper] || <br />
|-<br />
|Nov 27 || Qi Chu, Gloria Huang, Di Sang, Amanda Lam, Yan Jiao, Shuyue Wang, Yutong Wu, Shikun Cui || 10|| A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques || [https://arxiv.org/pdf/1707.02919.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Text_Mining_Classification_Clustering_Extraction_Techniques Summary]<br />
|-<br />
|Nov 29 || Jameson Ngo, Amy Xu, Aden Grant, Yu Hao Wang, Andrew McMurry, Baizhi Song, Yongqi Dong || 11|| Towards Deep Learning Models Resistant to Adversarial Attacks || [https://arxiv.org/pdf/1706.06083.pdf Paper] || <br />
|-<br />
|Nov 29 || Qianying Zhao, Hui Huang, Lingyun Yi, Jiayue Zhang, Siao Chen, Rongrong Su, Gezhou Zhang, Meiyu Zhou || 12|| XGBoost: A Scalable Tree Boosting System || [http://delivery.acm.org/10.1145/2940000/2939785/p785-chen.pdf?ip=129.97.124.2&id=2939785&acc=CHORUS&key=FD0067F557510FFB%2E9219CF56F73DCF78%2E4D4702B0C3E38B35%2E6D218144511F3437&__acm__=1542321481_ffea42f38a2b3325af4990280553c10f Paper] ||<br />
|-<br />
|Nov 28 || Hudson Ash, Stephen Kingston, Richard Zhang, Alexandre Xiao, Ziqiu Zhu || 13 || Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness || [https://arxiv.org/pdf/1608.05842.pdf Paper] ||<br />
|-<br />
|Nov 21 || Frank Jiang, Yuan Zhang, Jerry Hu || 14 || Distributed Representations of Words and Phrases and their Compositionality || [https://arxiv.org/pdf/1310.4546.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality Summary]<br />
|-<br />
|Nov 21 || Yu Xuan Lee, Tsen Yee Heng || 15 || Gradient Episodic Memory for Continual Learning || [http://papers.nips.cc/paper/7225-gradient-episodic-memory-for-continual-learning.pdf Paper] || <br />
|-<br />
|Nov 28 || Ben Zhang, Rees Simmons, Sunil Mall || 16 || Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift || [https://arxiv.org/pdf/1502.03167.pdf Paper] ||</div>
Ajylam
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=F18-STAT841-Proposal&diff=37663
F18-STAT841-Proposal
2018-11-03T15:48:08Z
<p>Ajylam: </p>
<hr />
<div><br />
'''Use this format (Don’t remove Project 0)'''<br />
<br />
'''Project # 0'''<br />
Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
'''Title:''' Making a String Telephone<br />
<br />
'''Description:''' We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 1'''<br />
Group members:<br />
<br />
Weng, Jiacheng<br />
<br />
Li, Keqi<br />
<br />
Qian, Yi<br />
<br />
Liu, Bomeng<br />
<br />
'''Title:''' RSNA Pneumonia Detection Challenge<br />
<br />
'''Description:''' <br />
<br />
Our team’s project is the RSNA Pneumonia Detection Challenge, a Kaggle competition. The primary goal of this project is to develop a machine learning tool to detect patients with pneumonia based on their chest radiographs (CXR). <br />
<br />
Pneumonia is an infection that inflames the air sacs in the lungs, with symptoms such as chest pain, cough, and fever [1]. Pneumonia can be very dangerous, especially to infants and the elderly. In 2015, 920,000 children under the age of 5 died from this disease [2]. Because of its fatality to children, diagnosing pneumonia is a high priority. A common diagnostic method is to obtain a patient’s chest radiograph (CXR), a gray-scale x-ray image of the chest. A region infected by pneumonia usually shows as an area or areas of increased opacity [3] on the CXR. However, many other factors can also increase opacity on a CXR, which makes the diagnosis very challenging. Diagnosis also requires highly skilled clinicians and a great deal of time spent screening CXRs. The Radiological Society of North America (RSNA®) sees an opportunity to use machine learning to accelerate the initial CXR screening process. <br />
<br />
For the scope of this project, our team plans to contribute to solving this problem by applying our machine learning knowledge of image processing and classification. Team members will apply techniques including, but not limited to, logistic regression, random forests, SVMs, kNN, and CNNs in order to detect CXRs showing pneumonia; a minimal baseline sketch is given below.<br />
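<br />
The following is a minimal baseline sketch of the classification step, not the team's final method. It assumes the CXRs have already been resized to 64x64 grayscale arrays; the data and variable names here are illustrative placeholders.<br />
<br />
<pre>
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder data: replace with real CXR arrays and pneumonia labels.
rng = np.random.default_rng(0)
X = rng.random((200, 64, 64))        # 200 "images" of 64x64 pixels
y = rng.integers(0, 2, size=200)     # 1 = pneumonia, 0 = healthy

X_flat = X.reshape(len(X), -1)       # flatten pixels into feature vectors
X_tr, X_te, y_tr, y_te = train_test_split(
    X_flat, y, test_size=0.25, stratify=y, random_state=0)

# Simple baseline before moving on to SVM, kNN, or a CNN.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("test AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
</pre>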
<br />
<br />
[1] (Accessed 2018, Oct. 4). Pneumonia [Online]. MAYO CLINIC. Available from: https://www.mayoclinic.org/diseases-conditions/pneumonia/symptoms-causes/syc-20354204<br />
[2] (Accessed 2018, Oct. 4). RSNA Pneumonia Detection Challenge [Online]. Kaggle. Available from: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge<br />
[3] Franquet T. Imaging of community-acquired pneumonia. J Thorac Imaging 2018 (epub ahead of print). PMID 30036297<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 2'''<br />
Group members:<br />
<br />
Hou, Zhaoran<br />
<br />
Zhang, Chi<br />
<br />
'''Title:''' <br />
<br />
'''Description:'''<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 3'''<br />
Group members:<br />
<br />
Hanzhen Yang<br />
<br />
Jing Pu Sun<br />
<br />
Ganyuan Xuan<br />
<br />
Yu Su<br />
<br />
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:'''<br />
<br />
Our team chose the [https://www.kaggle.com/c/quickdraw-doodle-recognition Quick, Draw! Doodle Recognition Challenge] from the Kaggle Competition. The goal of the competition is to build an image recognition tool that can classify hand-drawn doodles into one of the 340 categories.<br />
<br />
The main challenge of the project is that the training set is very noisy. Hand-drawn artwork may deviate substantially from the actual object, and almost certainly differs from person to person. Mislabeled images also present a problem, since they create outlier points when we train our models. <br />
<br />
We plan to study some of the currently mature image recognition algorithms as inspiration for developing our own model.<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 4'''<br />
Group members:<br />
<br />
Snaith, Mitchell<br />
<br />
'''Title:''' Reproducibility report: ''Fixing Variational Bayes: Deterministic Variational Inference for Bayesian Neural Networks''<br />
<br />
'''Description:''' <br />
<br />
The paper ''Fixing Variational Bayes: Deterministic Variational Inference for Bayesian Neural Networks'' [1] has been submitted to ICLR 2019. It aims to "fix" variational Bayes and turn it into a robust inference tool through two innovations. <br />
<br />
Goals are to: <br />
* reproduce the deterministic variational inference scheme described in the paper, without referencing the original authors' code, providing a third-party implementation<br />
* reproduce the experimental results with our own implementation, using the same NN framework for the reference implementations of the compared methods described in the paper<br />
* reproduce the experimental results with the authors' own implementation<br />
* explore other possible applications of variational Bayes besides heteroscedastic regression<br />
<br />
[1] OpenReview location: https://openreview.net/forum?id=B1l08oAct7<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 5'''<br />
Group members:<br />
<br />
Rebecca, Chen<br />
<br />
Susan,<br />
<br />
Mike, Li<br />
<br />
Ted, Wang<br />
<br />
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:''' <br />
<br />
Classification has become an increasingly eye-catching topic, especially with the rise of machine learning in recent years. Our team is particularly interested in machine learning algorithms that optimize classification for a specific type of image. <br />
<br />
In this project, we will dig into the base classifiers we learned in class and try to combine them to find an optimal solution for a certain type of image dataset. Currently, we are looking into a dataset from Kaggle: the Quick, Draw! Doodle Recognition Challenge. The dataset in this competition contains 50M drawings among 340 categories and is a subset of the world’s largest doodling dataset, which is continually updated by real players of the drawing game. Anyone can contribute by joining it! (quickdraw.withgoogle.com).<br />
<br />
As machine learning students, we are eager to help develop a better classification method. By “better”, we mean one that finds a balance between simplicity and accuracy. We will start with neural networks using different activation functions in each layer, and we will also combine base classifiers with bagging, random forests, and boosting for ensemble learning. We will also regularize our parameters to avoid overfitting the training dataset. Finally, we will summarize the features of this type of image dataset, formulate our solutions, and standardize our steps for solving this kind of problem; a rough sketch of the ensemble comparison appears below.<br />
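<br />
As a rough sketch of the ensemble comparison described above (an assumed setup, not final code), the snippet below scores bagging, a random forest, and gradient boosting with cross-validation on a synthetic stand-in for the doodle features; the real pipeline would first convert drawings into feature vectors.<br />
<br />
<pre>
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# Synthetic 3-class stand-in for extracted doodle features.
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

models = {
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold CV accuracy
    print(f"{name}: {scores.mean():.3f}")
</pre>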
<br />
Hopefully, we can not only finish our project successfully, but also make a small contribution to the machine learning research field.<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 6'''<br />
Group members:<br />
<br />
Ngo, Jameson<br />
<br />
Xu, Amy<br />
<br />
'''Title:''' Kaggle Challenge: [https://www.kaggle.com/c/quickdraw-doodle-recognition Quick, Draw! Doodle Recognition ]<br />
<br />
'''Description:''' <br />
<br />
We will participate in the Quick, Draw! Doodle Recognition competition featured on Kaggle. We will classify doodles based on images drawn in the game.<br />
<br />
These images may be incomplete or mislabeled, so we will need to handle these issues effectively in order to classify the drawings correctly.<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 7'''<br />
Group members:<br />
<br />
Qianying Zhao<br />
<br />
Hui Huang<br />
<br />
Meiyu Zhou<br />
<br />
Gezhou Zhang<br />
<br />
'''Title:''' Google Analytics Customer Revenue Prediction<br />
<br />
'''Description:''' <br />
Our group will participate in the featured Kaggle competition Google Analytics Customer Revenue Prediction. In this competition, we will analyze a customer dataset from the Google Merchandise Store, which sells Google swag, to predict revenue per customer using RStudio. Our presentation report will cover not only the conclusions we reach by classifying and analyzing the provided data with appropriate models, but also how we performed in the contest.<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 8'''<br />
Group members:<br />
<br />
Jiayue Zhang<br />
<br />
Lingyun Yi<br />
<br />
Rongrong Su<br />
<br />
Siao Chen<br />
<br />
<br />
'''Title:''' Kaggle--Two Sigma: Using News to Predict Stock Movements<br />
<br />
<br />
'''Description:''' <br />
Stock prices are affected by the news to some extent. What is the news's influence on stock prices, and what is its predictive power? <br />
We are going to use the content of news articles to predict the direction of stock prices. We will mine the data to find the useful information hidden in it, and as a result we will predict stock price performance when the market reacts to news.<br />
<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 9'''<br />
Group members:<br />
<br />
Hassan, Ahmad Nayar<br />
<br />
McLellan, Isaac<br />
<br />
Brewster, Kristi<br />
<br />
Melek, Marina Medhat Rassmi <br />
<br />
<br />
'''Title:''' Quick, Draw! Doodle Recognition<br />
<br />
'''Description:''' <br />
<br />
'''Background'''<br />
<br />
Google’s Quick, Draw! is an online game where a user is prompted to draw an image depicting a certain category in under 20 seconds. As the drawing is being completed, the game uses a model which attempts to correctly identify the image being drawn. With the aim of improving the underlying pattern recognition model this game uses, Google is hosting a Kaggle competition asking the public to build a model to correctly identify a given drawing. The model should classify the drawing into one of the 340 label categories within the Quick, Draw! game in 3 guesses or fewer.<br />
<br />
'''Proposed Approach'''<br />
<br />
Each image/doodle (input) is treated as a matrix of pixel values. To classify images, we apply convolution to each image's matrix of pixel values. This significantly reduces the dimensionality of the input, which in turn reduces the number of parameters of any proposed recognition model. Using filters, pooling layers and further convolutions, a final layer, called the fully connected layer, correlates images with categories, assigning probabilities (weights) and hence classifying the images. <br />
<br />
This approach to image classification is called a convolutional neural network (CNN), and we propose using it to classify the doodles within the Quick, Draw! dataset.<br />
<br />
To control overfitting and underfitting of our proposed model and minimize its error, we will try different architectures consisting of different types and dimensions of pooling layers and input filters; a minimal sketch of one such architecture is shown below.<br />
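<br />
To make the architecture concrete, here is a minimal sketch of such a network in PyTorch, assuming 28x28 grayscale doodles; it is one of many possible configurations rather than our final design.<br />
<br />
<pre>
import torch
import torch.nn as nn

class DoodleCNN(nn.Module):
    def __init__(self, n_classes=340):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # convolution filters
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling: 28 -> 14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # further convolution
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling: 14 -> 7
        )
        self.classifier = nn.Linear(64 * 7 * 7, n_classes)  # fully connected layer

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))   # logits over the 340 categories

model = DoodleCNN()
dummy = torch.randn(8, 1, 28, 28)   # a batch of 8 placeholder doodles
print(model(dummy).shape)           # torch.Size([8, 340])
</pre>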
<br />
'''Challenges'''<br />
<br />
This project presents a number of interesting challenges:<br />
* The data given for training is noisy in that it contains drawings that are incomplete or simply poorly drawn. Dealing with this noise will be a significant part of our work. <br />
* There are 340 label categories within the Quick, Draw! dataset, which means that the model we create must be able to classify drawings based on a large pool of information while making effective use of powerful computational resources.<br />
<br />
'''Tools & Resources'''<br />
<br />
* We will use Python & MATLAB.<br />
* We will use the Quick, Draw! Dataset available on the Kaggle competition website. <https://www.kaggle.com/c/quickdraw-doodle-recognition/data><br />
<br />
--------------------------------------------------------------------<br />
'''Project # 10'''<br />
Group members:<br />
<br />
Lam, Amanda<br />
<br />
Huang, Xiaoran<br />
<br />
Chu, Qi<br />
<br />
Sang, Di<br />
<br />
'''Title:''' Kaggle Competition: Human Protein Atlas Image Classification<br />
<br />
'''Description:'''<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 11'''<br />
Group members:<br />
<br />
Bobichon, Philomene<br />
<br />
Maheshwari, Aditya<br />
<br />
An, Zepeng<br />
<br />
Stranc, Colin<br />
<br />
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:''' <br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 12'''<br />
Group members:<br />
<br />
Huo, Qingxi<br />
<br />
Yang, Yanmin<br />
<br />
Cai, Yuanjing<br />
<br />
Wang, Jiaqi<br />
<br />
'''Title:''' <br />
<br />
'''Description:''' <br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 13'''<br />
Group members:<br />
<br />
Ross, Brendan<br />
<br />
Barenboim, Jon<br />
<br />
Lin, Junqiao<br />
<br />
Bootsma, James<br />
<br />
'''Title:''' Expanding Neural Network<br />
<br />
'''Description:''' The goal of our project is to create an expanding neural network algorithm which starts off by training a small neural network and then expands it into a larger one. We hypothesize that, with a proper expansion method, we could decrease training time and prevent overfitting. The method we wish to explore is to link input dimensions together based on covariance; then, when the small neural network reaches convergence, we create a larger neural network without the links between dimensions, using starting values taken from the smaller network. A toy sketch of the expansion step appears below. <br />
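<br />
The toy sketch below illustrates only the expansion step as we currently interpret it (the covariance-based linking is omitted): the wider network's first hidden units start from the small network's trained weights, and the new units are initialized so they contribute nothing at first, so the expanded network begins at exactly the small network's function.<br />
<br />
<pre>
import torch
import torch.nn as nn

small = nn.Sequential(nn.Linear(10, 4), nn.ReLU(), nn.Linear(4, 2))
# ... suppose `small` has been trained to convergence here ...

big = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 2))
with torch.no_grad():
    # Copy trained weights into the first 4 hidden units of the wider net.
    big[0].weight[:4] = small[0].weight
    big[0].bias[:4] = small[0].bias
    big[2].weight[:, :4] = small[2].weight
    big[2].bias.copy_(small[2].bias)
    # New units: tiny input weights, zero output weights, so initially
    # they contribute nothing and big(x) == small(x).
    big[0].weight[4:] *= 0.01
    big[2].weight[:, 4:] = 0.0

x = torch.randn(3, 10)
print(torch.allclose(small(x), big(x), atol=1e-6))   # True initially
</pre>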
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 14'''<br />
Group members:<br />
<br />
Schneider, Jason <br />
<br />
Walton, Jordyn <br />
<br />
Abbas, Zahraa<br />
<br />
Na, Andrew<br />
<br />
'''Title:''' Application of ML Classification to Cancer Identification<br />
<br />
'''Description:''' The application of machine learning to cancer classification based on gene expression is a topic of great interest to physicians and biostatisticians alike. We would like to work on this for our final project to encourage the application of proven ML techniques to improving the accuracy of cancer classification and diagnosis. In this project, we will use the dataset from Golub et al. [1], which contains gene expression data from tumour biopsies, to train a model that classifies healthy individuals and individuals who have cancer.<br />
<br />
One challenge we may face pertains to the way the data was collected. Some parts of the dataset have thousands of features (each representing a quantitative measure of the expression of a certain gene) but as few as twenty samples. We propose some ways to mitigate the impact of this, including the use of PCA, leave-one-out cross-validation, and regularization; one possible pipeline is sketched below. <br />
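<br />
A minimal sketch of one such pipeline is shown below, combining PCA with an L2-regularized classifier under leave-one-out cross-validation. The data here is a synthetic placeholder with the same shape as the Golub et al. training set (38 samples, 7129 genes); the real expression matrix and labels would take its place.<br />
<br />
<pre>
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(38, 7129))    # placeholder: 38 biopsies x 7129 genes
y = rng.integers(0, 2, size=38)    # placeholder tumour-class labels

model = make_pipeline(
    PCA(n_components=10),                     # compress genes to 10 components
    LogisticRegression(penalty="l2", C=1.0),  # L2 regularization
)
acc = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", acc.mean())
</pre>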
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 15'''<br />
Group members:<br />
<br />
Praneeth, Sai<br />
<br />
Peng, Xudong <br />
<br />
Li, Alice<br />
<br />
Vajargah, Shahrzad<br />
<br />
'''Title:''' Google Analytics Customer Revenue Prediction [1] - A Kaggle Competition<br />
<br />
'''Description:''' Guess which airline cabin class is the most profitable? One might guess economy - but in reality, it's the premium classes that show higher returns. According to research by Wendover Productions [2], despite having fewer than 50 seats and taking up more space than the economy class, premium classes end up driving more revenue than the other classes.<br />
<br />
In fact, just like airlines, many companies adopt a business model in which the vast majority of revenue is derived from a minority of customers. As a result, data-intensive promotional strategies are attracting more and more attention from marketing teams seeking to further improve company returns.<br />
<br />
In this Kaggle competition, we are challenged to analyze a Google Merchandise Store customer dataset to predict revenue per customer. We will implement a series of data analytics methods including pre-processing, data augmentation, and parameter tuning. Different classification algorithms will be compared and optimized in order to achieve the best results; a parameter-tuning sketch is given below.<br />
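<br />
As a sketch of the "compare and optimize classifiers" step (an assumed setup, with synthetic data standing in for the cleaned customer table), the snippet below tunes a random forest with a cross-validated grid search.<br />
<br />
<pre>
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder for the preprocessed customer features and a binary target
# (e.g., whether a customer generated any revenue).
X, y = make_classification(n_samples=1000, n_features=15, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [5, 10, None]},
    cv=3, scoring="roc_auc",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
</pre>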
<br />
'''Reference:'''<br />
<br />
[1] Kaggle. (2018, Sep 18). Google Analytics Customer Revenue Prediction. Retrieved from https://www.kaggle.com/c/ga-customer-revenue-prediction<br />
<br />
[2] Kottke, J (2017, Mar 17). The economics of airline classes. Retrieved from https://kottke.org/17/03/the-economics-of-airline-classes<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 16'''<br />
Group members:<br />
<br />
Wang, Yu Hao<br />
<br />
Grant, Aden <br />
<br />
McMurray, Andrew<br />
<br />
Song, Baizhi<br />
<br />
'''Title:''' Two Sigma: Using News to Predict Stock Movements - A Kaggle Competition<br />
<br />
'''Description:''' By analyzing news data to predict stock prices, Kagglers have a unique opportunity to advance the state of research in understanding the predictive power of the news. This power, if harnessed, could help predict financial outcomes and generate significant economic impact all over the world.<br />
<br />
Data for this competition comes from the following sources:<br />
<br />
Market data provided by Intrinio.<br />
News data provided by Thomson Reuters. Copyright ©, Thomson Reuters, 2017. All Rights Reserved. Use, duplication, or sale of this service, or data contained herein, except as described in the Competition Rules, is strictly prohibited.<br />
<br />
We will test a variety of classification algorithms to determine an appropriate model.<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 17'''<br />
Group Members:<br />
<br />
Jiang, Ya Fan<br />
<br />
Zhang, Yuan<br />
<br />
Hu, Jerry Jie<br />
<br />
'''Title:''' Kaggle Competition: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:''' Construction of a classifier that can learn from noisy training data and generalize to a clean test set. The training data comes from the Google game "Quick, Draw!".<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 18'''<br />
Group Members:<br />
<br />
Zhang, Ben<br />
<br />
'''Title:''' Two Sigma: Using News to Predict Stock Movements<br />
<br />
'''Description:''' Use news analytics to predict stock price performance. This is subject to change.<br />
<br />
----------------------------------------------------------------------<br />
'''Project # 19'''<br />
Group Members:<br />
<br />
Yan Yu Chen<br />
<br />
Qisi Deng<br />
<br />
Hengxin Li<br />
<br />
Bochao Zhang<br />
<br />
Our team currently has two topics of interest at hand, and we have summarized the objective of each topic below. Please note that we will narrow down our choice after further discussion with the instructor.<br />
<br />
'''Description 1:''' With 14 percent of Americans claiming that social media is their most dominant news source, fake news shared on Facebook and Twitter is invading people’s information-gathering experience. Concomitantly, the quality and nature of online news have been gradually diluted by fake news that is sometimes imperceptible. With the aim of creating an unalloyed Internet surfing experience, we seek to develop a tool that performs fake news detection and classification. <br />
<br />
'''Description 2:''' Statistics Canada has recently reported an increasing trend in Toronto’s violent crime score. Though the Royal Canadian Mounted Police has put great effort into tracking crimes, the ambiguous snapshots captured by outdated cameras often hamper investigations. Motivated by these circumstances, our second interest focuses on accurate numeral and letter identification within variable-resolution images.<br />
<br />
----------------------------------------------------------------------<br />
'''Project # 20'''<br />
Group Members:<br />
<br />
Dong, Yongqi (Michael)<br />
<br />
Kingston, Stephen<br />
<br />
'''Title:''' Kaggle--Two Sigma: Using News to Predict Stock Movements <br />
<br />
'''Description:''' The movement in the price of a tradeable security, or stock, on any given day is an aggregation of each individual market participant’s appraisal of the intrinsic value of the underlying company or assets. These values are primarily driven by investors’ expectations of the company’s ability to generate future free cash flow. A steady stream of information on the state of the macro- and micro-economic variables affecting a company’s operations informs these market actors, primarily through news articles and alerts. We would like to take a universe of news headlines and parse the information into features that allow us to classify the direction and ‘intensity’ of a stock’s price move on any given day. Strategies may include various classification methods to determine the most effective solution; a bare-bones sketch of the feature-extraction idea follows.<br />
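<br />
The snippet below is a bare-bones, illustrative-only sketch of the headline-to-features idea, with toy headlines and labels in place of the competition data: TF-IDF features from the headline text feed a linear classifier for the direction of the next move.<br />
<br />
<pre>
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

headlines = [
    "Company beats earnings expectations",
    "Regulator opens probe into company",
    "Company announces record dividend",
    "Company misses revenue forecast",
]
direction = [1, 0, 1, 0]   # 1 = price up next day, 0 = down (toy labels)

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(headlines, direction)
print(model.predict(["Company raises full-year guidance"]))
</pre>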
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 21'''<br />
Group members:<br />
<br />
Xiao, Alex<br />
<br />
Zhang, Richard<br />
<br />
Ash, Hudson<br />
<br />
Zhu, Ziqiu<br />
<br />
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge [Subject to Change]<br />
<br />
'''Description:''' <br />
<br />
"Quick, Draw! was released as an experimental game to educate the public in a playful way about how AI works. The game prompts users to draw an image depicting a certain category, such as ”banana,” “table,” etc. The game generated more than 1B drawings, of which a subset was publicly released as the basis for this competition’s training set. That subset contains 50M drawings encompassing 340 label categories."<br />
<br />
Our goal as students is to build a classification tool that classifies hand-drawn doodles into one of the 340 label categories.</div>
Ajylam
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18&diff=36563
stat441F18
2018-10-04T17:48:16Z
<p>Ajylam: </p>
<hr />
<div><br />
<br />
== [[F18-STAT841-Proposal| Project Proposal ]] ==<br />
<br />
[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]<br />
<br />
=Paper presentation=<br />
{| class="wikitable" border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Nov 13 || || 1|| || || <br />
|-<br />
|Nov 13 || || 2|| || || <br />
|-<br />
|Nov 15 || || 3|| || || <br />
|-<br />
|Nov 15 || || 4|| || || <br />
|-<br />
|Nov 20 || || 5|| || || <br />
|-<br />
|Nov 20 || Maya(Mahdiyeh) Bayati, Saber Malekmohammadi, Vincent || 6|| Will be added soon || || <br />
|-<br />
|Nov 22 || Qingxi Huo, Yanmin Yang, Jiaqi Wang, Yuanjing Cai || 7|| Will be added soon || || <br />
|-<br />
|Nov 22 || Hanzhen Yang, Jing Pu Sun, Ganyuan Xuan, Yu Su|| 8|| Deep Residual Learning for Image Recognition || [http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf Paper] || <br />
|-<br />
|Nov 27 || Mitchell Snaith || 9|| You Only Look Once: Unified, Real-Time Object Detection, V1 -> V3 || [https://arxiv.org/pdf/1506.02640.pdf Paper] || <br />
|-<br />
|Nov 27 || Qi Chu, Gloria Huang, Dylan Sang, Amanda Lam|| 10|| tba || || <br />
|-<br />
|Nov 29 || || 11|| || || <br />
|-<br />
|Nov 29 || || 12|| || ||</div>
Ajylam