Research Papers Classification System


Revision as of 18:11, 24 November 2020

Please Do NOT Edit This Summary

Presented by

Jill Wang, Junyi Yang, Yu Min Wu, Chun Kit (Calvin) Li

Introduction

This paper introduces a paper classification system that uses term frequency-inverse document frequency (TF-IDF), Latent Dirichlet Allocation (LDA), and K-means clustering. The system relies on the Hadoop Distributed File System (HDFS) as its main technology for processing big data. The system can handle quantitatively complex research paper classification problems efficiently and accurately.

General Framework


Data Preprocessing

Crawling of Abstract Data

The authors assume that readers tend to read a paper's abstract first to gain an understanding of the paper. As a result, the abstract of any paper is likely to include “core words” that can be used to effectively classify the paper's subject.

Each abstract is crawled and its stop words are removed. Stop words are words that are usually ignored by search engines, such as “the” and “a”. Afterwards, nouns are extracted as a more condensed representation for efficient analysis.
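The stop-word removal step can be sketched as follows. This is a minimal illustration, assuming a tiny hand-written stop-word list (`STOP_WORDS`) and a simple regular-expression tokenizer; both are hypothetical choices, not the authors' pipeline. Noun extraction would additionally need a part-of-speech tagger and is left out.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

def remove_stop_words(abstract):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", abstract.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = remove_stop_words("The system is a classifier for the abstracts of papers.")
```

Here `tokens` keeps only the content-bearing words of the sentence, in their original order.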

Managing Paper Data

To construct an effective keyword dictionary from the abstract data and keyword data of all crawled papers, the authors grouped keywords with similar meanings under a single representative keyword. This approach is called stemming and is common in data cleaning. It yields 1394 keyword categories, which is still too many to compute with. Hence, only the top 30 keyword categories are used.
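A rough sketch of grouping keywords and keeping only the most frequent categories is shown below. The mapping `REPRESENTATIVE` and the function `top_keyword_categories` are hypothetical names introduced for illustration; a real system would derive the mapping with a stemmer rather than hard-code it, and ranking categories by frequency is an assumption about what "top" means here.

```python
from collections import Counter

# Hypothetical mapping from keyword variants to one representative keyword.
# A real system would derive this with a stemmer; it is hard-coded here.
REPRESENTATIVE = {
    "clustering": "cluster",
    "clusters": "cluster",
    "networks": "network",
    "networking": "network",
}

def top_keyword_categories(keywords, n):
    """Map each keyword to its representative and return the n most frequent."""
    counts = Counter(REPRESENTATIVE.get(kw, kw) for kw in keywords)
    return [category for category, _ in counts.most_common(n)]

keywords = ["clustering", "clusters", "cluster", "networks", "network", "hadoop"]
top = top_keyword_categories(keywords, 2)
```

With `n = 30` the same function would select a dictionary of the size used in the paper.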

Topic Modeling Using LDA

Latent Dirichlet allocation (LDA) is a generative probabilistic model that views documents as random mixtures over latent topics. Each topic is a distribution over words, and the goal is to extract these topics from documents.

Given a document, LDA estimates its topic distribution, placing Dirichlet priors on the distributions and fixing the number of topics k in advance. For each document d, this yields a feature vector of topic probabilities:

[math]\displaystyle{ F = \left( P\left(z_1 | d\right), P\left(z_2 | d\right), \cdots, P\left(z_k | d\right) \right) }[/math]
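To make the feature vector concrete, the sketch below estimates the topic probabilities of one document from per-word topic assignments. It assumes the assignments have already been inferred (for example by a Gibbs sampler) and uses a symmetric Dirichlet prior `alpha`; this summary does not state which inference method the authors use, so the function name, the prior value, and the sample data are all hypothetical.

```python
def topic_feature_vector(assignments, num_topics, alpha=0.1):
    """Estimate (P(z_1|d), ..., P(z_k|d)) from per-word topic assignments.

    Uses the smoothed estimate (n_k + alpha) / (N + k * alpha), where n_k is
    the number of words in the document assigned to topic k and N is the
    document length.
    """
    counts = [0] * num_topics
    for z in assignments:
        counts[z] += 1
    total = len(assignments) + num_topics * alpha
    return [(c + alpha) / total for c in counts]

# Hypothetical assignments for a 6-word document and 3 topics (0-indexed).
F = topic_feature_vector([0, 0, 2, 0, 1, 0], num_topics=3)
```

The entries of `F` are non-negative and sum to 1, so each document is represented as a point on the topic simplex.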

In the paper, the authors extract topics from the preprocessed papers to generate three topic sets, with 10, 20, and 30 topics respectively. The following table lists the highest-frequency keywords for the set of 10 topics.

Term Frequency Inverse Document Frequency (TF-IDF) Calculation

TF-IDF is widely used in information retrieval and text mining to evaluate the importance of words. It is a combination of term frequency (TF) and inverse document frequency (IDF). The idea behind this combination is that TF evaluates the importance of a word within a document, while IDF evaluates the importance of the word across the collection of all documents.

The TF-IDF formula has the following form:

\[\text{TF-IDF}_{i,j} = TF_{i,j} \times IDF_{i}\]

where i stands for the [math]\displaystyle{ i^{th} }[/math] word and j stands for the [math]\displaystyle{ j^{th} }[/math] document.
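The product can be illustrated with a short sketch. The IDF definition is not given at this point in the summary, so the code assumes the common form log(N / df), where N is the number of documents and df is the number of documents containing the word; the authors may use a different variant. The documents are hypothetical token lists.

```python
import math

def tf(term, doc):
    """Fraction of the tokens in `doc` that equal `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Assumed IDF: log(N / df), with df = number of documents containing term."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, doc, docs):
    """TF-IDF of `term` in `doc`, relative to the collection `docs`."""
    return tf(term, doc) * idf(term, docs)

docs = [
    ["hadoop", "cluster", "hadoop"],
    ["cluster", "topic"],
    ["topic", "model", "cluster"],
]
score_hadoop = tf_idf("hadoop", docs[0], docs)    # term unique to document 0
score_cluster = tf_idf("cluster", docs[0], docs)  # term present in every document
```

Under this assumed IDF, a word that occurs in every document scores zero, while a word concentrated in one document scores highest there.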

Term Frequency (TF)

TF measures the proportion of the words in a document that are a given word. The higher the TF value, the more important the word is within that document.

In this paper, TF is calculated only for words in the keyword dictionary obtained by LDA. For a given keyword i, [math]\displaystyle{ TF_{i,j} }[/math] is the number of times word i appears in document j divided by the total number of words in document j.

The formula for TF has the following form:

\[TF_{i,j} = \frac{n_{i,j} }{\sum_k n_{k,j} }\]

where i stands for the [math]\displaystyle{ i^{th} }[/math] word, j stands for the [math]\displaystyle{ j^{th} }[/math] document, and [math]\displaystyle{ n_{i,j} }[/math] stands for the number of times word i appears in document j.

Note that the denominator is the total number of words remaining in document j after crawling.
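The TF formula restricted to a keyword dictionary can be sketched as below. The function name `tf_vector`, the sample document, and the dictionary are hypothetical; the sketch assumes the document is non-empty and has already been preprocessed into a list of tokens.

```python
from collections import Counter

def tf_vector(doc_tokens, keyword_dictionary):
    """Return TF_{i,j} for each keyword i in one document j.

    The denominator is the total number of tokens in the document, not just
    the tokens that appear in the keyword dictionary.
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {kw: counts[kw] / total for kw in keyword_dictionary}

# Hypothetical preprocessed document and keyword dictionary.
doc = ["cluster", "data", "cluster", "topic", "hadoop"]
tf_values = tf_vector(doc, ["cluster", "topic", "network"])
```

Here "cluster" appears 2 times out of 5 tokens, "topic" once, and "network" never, so keywords absent from the document get a TF of zero.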


Document Frequency (DF)

Inverse Document Frequency (IDF)

Paper Classification Using K-means Clustering

System Testing Results

Conclusion

Critique

Reference