is Multinomial PCA Multi-faceted Clustering or Dimensionality Reduction: Difference between revisions

From statwiki
Jump to navigation Jump to search
Line 2: Line 2:


A now standard method for analyzing discrete data such as documents is [http://en.wikipedia.org/wiki/Cluster_analysis clustering] or [http://en.wikipedia.org/wiki/Unsupervised_learning unsupervised learning]. A rich variety of methods exist borrowing theory and algorithm from a board spectrum of computer science:[http://en.wikipedia.org/wiki/Spectral_method spectral method], [http://en.wikipedia.org/wiki/Kd-tree kd-trees], data merging algorithm and so on. All these methods, however, have one significant drawback for typical application in areas such as document or image analysis: each item/document is to be classified exclusively to one class. In practice documents invariable mix a few topics, readily seen by inspection of the human-classified Reuters newswire, so the automated construction of topic hierarchies need to be reflect this. One alternative is to make clusters multifaceted whereby a document can be assigned using a convex combination to a number of clusters rather than uniquely to one cluster. This is an unsupervised version of the so-called multi-class classification task.
A now standard method for analyzing discrete data such as documents is [http://en.wikipedia.org/wiki/Cluster_analysis clustering] or [http://en.wikipedia.org/wiki/Unsupervised_learning unsupervised learning]. A rich variety of methods exist borrowing theory and algorithm from a board spectrum of computer science:[http://en.wikipedia.org/wiki/Spectral_method spectral method], [http://en.wikipedia.org/wiki/Kd-tree kd-trees], data merging algorithm and so on. All these methods, however, have one significant drawback for typical application in areas such as document or image analysis: each item/document is to be classified exclusively to one class. In practice documents invariable mix a few topics, readily seen by inspection of the human-classified Reuters newswire, so the automated construction of topic hierarchies need to be reflect this. One alternative is to make clusters multifaceted whereby a document can be assigned using a convex combination to a number of clusters rather than uniquely to one cluster. This is an unsupervised version of the so-called multi-class classification task.
A body of techniques with completely goals is known as [http://en.wikipedia.org/wiki/Dimensionality_reduction dimensionality reduction]: they seek to reduce the dimensions of an item/document. The state of the art here is [http://en.wikipedia.org/wiki/Principle_components_analysis Principle Components Analysis(PCA)]. In text applications it is a PCA variant variant called latent semantic indexing LSI. A rich body of practical experience indicates LSI is not ideal for the task and theoretical justification use unrealistic assumptions.
 
A body of techniques with completely goals is known as [http://en.wikipedia.org/wiki/Dimensionality_reduction dimensionality reduction]: they seek to reduce the dimensions of an item/document. The state of the art here is [http://en.wikipedia.org/wiki/Principle_components_analysis Principle Components Analysis(PCA)]. In text applications it is a PCA variant variant called latent semantic indexing LSI. A rich body of practical experience indicates LSI is not ideal for the task and theoretical justification use unrealistic assumptions. As a substitute to PCA on discrete data, authors have recently proposed discrete analogues to PCA. We refer to the method as multinomial PCA(mPCA) because it is a precise multinomial analogue formulation of PCA as a Gaussian mixture of Gaussians.
 
This paper describes our experiments intended to understand mPCA and whether it should be called multi-faceted clustering algorithm or a dimensionality reduction algorithm.
 
==Multinomial PCA===
===The Model===
====A Gaussian model====

Revision as of 00:57, 15 November 2010

Introduction

A now standard method for analyzing discrete data such as documents is clustering or unsupervised learning. A rich variety of methods exist borrowing theory and algorithm from a board spectrum of computer science:spectral method, kd-trees, data merging algorithm and so on. All these methods, however, have one significant drawback for typical application in areas such as document or image analysis: each item/document is to be classified exclusively to one class. In practice documents invariable mix a few topics, readily seen by inspection of the human-classified Reuters newswire, so the automated construction of topic hierarchies need to be reflect this. One alternative is to make clusters multifaceted whereby a document can be assigned using a convex combination to a number of clusters rather than uniquely to one cluster. This is an unsupervised version of the so-called multi-class classification task.

A body of techniques with completely goals is known as dimensionality reduction: they seek to reduce the dimensions of an item/document. The state of the art here is Principle Components Analysis(PCA). In text applications it is a PCA variant variant called latent semantic indexing LSI. A rich body of practical experience indicates LSI is not ideal for the task and theoretical justification use unrealistic assumptions. As a substitute to PCA on discrete data, authors have recently proposed discrete analogues to PCA. We refer to the method as multinomial PCA(mPCA) because it is a precise multinomial analogue formulation of PCA as a Gaussian mixture of Gaussians.

This paper describes our experiments intended to understand mPCA and whether it should be called multi-faceted clustering algorithm or a dimensionality reduction algorithm.

Multinomial PCA=

The Model

A Gaussian model