Is Multinomial PCA Multi-faceted Clustering or Dimensionality Reduction?


Introduction

A now-standard method for analyzing discrete data such as documents is clustering, or unsupervised learning. A rich variety of methods exists, borrowing theory and algorithms from a broad spectrum of computer science: spectral methods, kd-trees, data-merging algorithms, and so on. All these methods, however, have one significant drawback for typical applications in areas such as document or image analysis: each item or document is classified exclusively into one class. In practice, documents invariably mix a few topics, as is readily seen by inspecting the human-classified Reuters newswire, so the automated construction of topic hierarchies needs to reflect this. One alternative is to make clusters multifaceted, whereby a document is assigned to a number of clusters through a convex combination rather than uniquely to one cluster. This is an unsupervised version of the so-called multi-class classification task.
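To make the idea of a convex combination concrete, the following is a minimal sketch rather than the method from the paper. It assumes a toy setting in which each cluster is a word distribution over a small vocabulary; the names `TOPICS`, `mix_word_distribution`, and `sample_counts`, as well as all numeric values, are hypothetical and chosen only for illustration. A hard clustering corresponds to a membership vector with a single 1, while a multifaceted assignment spreads non-negative weights that sum to 1 across several clusters.

```python
import random

# Hypothetical toy data: 3 clusters (topics) over a 5-word vocabulary.
# Each row is one cluster's word distribution and sums to 1.
TOPICS = [
    [0.6, 0.3, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.6, 0.3, 0.0],
    [0.1, 0.0, 0.0, 0.3, 0.6],
]


def mix_word_distribution(membership, topics):
    """Return the document's word distribution as a convex combination
    of the cluster word distributions."""
    if any(w < 0 for w in membership) or abs(sum(membership) - 1.0) > 1e-9:
        raise ValueError("membership weights must be non-negative and sum to 1")
    vocab_size = len(topics[0])
    return [
        sum(w * topic[j] for w, topic in zip(membership, topics))
        for j in range(vocab_size)
    ]


def sample_counts(word_probs, n_words, seed=0):
    """Draw a bag of n_words words from word_probs and return word counts."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(word_probs)), weights=word_probs, k=n_words)
    counts = [0] * len(word_probs)
    for j in draws:
        counts[j] += 1
    return counts


# Hard clustering: the document belongs to exactly one cluster.
hard = mix_word_distribution([1.0, 0.0, 0.0], TOPICS)

# Multifaceted clustering: the document mixes all three clusters.
soft = mix_word_distribution([0.5, 0.3, 0.2], TOPICS)

# A 100-word document generated from the mixed distribution.
counts = sample_counts(soft, 100)
```

With the hard assignment, the document can only use words that its single cluster supports. With the mixed assignment, every word in the vocabulary receives some probability, which is closer to how real documents blend topics.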

A body of techniques with completely different goals is known as dimensionality reduction: they seek to reduce the dimensions of an item or document. The state of the art here is Principal Component Analysis (PCA). In text applications, a PCA variant called latent semantic indexing (LSI) is used. A rich body of practical experience indicates that LSI is not ideal for the task, and its theoretical justifications use unrealistic assumptions. As a substitute for PCA on discrete data, authors have recently proposed discrete analogues to PCA. We refer to the method as multinomial PCA (mPCA) because it is a precise multinomial analogue to the formulation of PCA as a Gaussian mixture of Gaussians.

This paper describes our experiments, which are intended to understand mPCA and to determine whether it should be called a multi-faceted clustering algorithm or a dimensionality reduction algorithm.

Multinomial PCA

The Model

A Gaussian model