is Multinomial PCA Multi-faceted Clustering or Dimensionality Reduction

From statwiki
Jump to: navigation, search

Introduction

A now standard method for analyzing discrete data such as documents is clustering or unsupervised learning. A rich variety of methods exist borrowing theory and algorithm from a board spectrum of computer science:spectral method, kd-trees, data merging algorithm and so on. All these methods, however, have one significant drawback for typical application in areas such as document or image analysis: each item/document is to be classified exclusively to one class. In practice documents invariable mix a few topics, readily seen by inspection of the human-classified Reuters newswire, so the automated construction of topic hierarchies need to be reflect this. One alternative is to make clusters multifaceted whereby a document can be assigned using a convex combination to a number of clusters rather than uniquely to one cluster. This is an unsupervised version of the so-called multi-class classification task.

A body of techniques with completely goals is known as dimensionality reduction: they seek to reduce the dimensions of an item/document. The state of the art here is Principle Components Analysis(PCA). In text applications it is a PCA variant variant called latent semantic indexing LSI. A rich body of practical experience indicates LSI is not ideal for the task and theoretical justification use unrealistic assumptions. As a substitute to PCA on discrete data, authors have recently proposed discrete analogues to PCA. We refer to the method as multinomial PCA(mPCA) because it is a precise multinomial analogue formulation of PCA as a Gaussian mixture of Gaussians.

This paper describes our experiments intended to understand mPCA and whether it should be called multi-faceted clustering algorithm or a dimensionality reduction algorithm.

Multinomial PCA=

The Model

A Gaussian model