DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification


Introduction

Dimensionality reduction is a common and often necessary step in most machine learning applications and high-dimensional data analyses. There are linear methods for dimensionality reduction, such as principal component analysis (PCA) and Fisher discriminant analysis (FDA), as well as nonlinear procedures, such as kernelized versions of PCA and FDA and manifold learning algorithms.

A recent trend in dimensionality reduction is to focus on probabilistic models. These models, which include generative topographic mapping, factor analysis, independent component analysis and probabilistic latent semantic analysis (pLSA), are generally specified in terms of an underlying independence assumption or low-rank assumption. The models are generally fit with maximum likelihood, although Bayesian methods are sometimes used.

LDA

Latent Dirichlet Allocation (LDA) is a Bayesian model in the spirit of probabilistic latent semantic analysis (pLSA) that models each data point (e.g., a document) as a collection of draws from a mixture model in which each mixture component is known as a topic. LDA is a generative probabilistic model for collections of discrete data such as text corpora: a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics, and each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document.


Figure 1 shows the generative process for the vector [math]\displaystyle{ w_d }[/math], the bag-of-words representation of document d. It consists of three steps:

1) [math]\displaystyle{ \theta_d \sim \mbox{Dir}(\alpha) }[/math]

2) [math]\displaystyle{ z_{dn} \sim \mbox{Multi}(\theta_d) }[/math]

3) [math]\displaystyle{ w_{dn} \sim \mbox{Multi}(\phi_{z_{dn}}) }[/math]

Given a set of documents [math]\displaystyle{ \{w_d\}_{d=1}^D }[/math], the principal task is to estimate the parameters [math]\displaystyle{ \{\phi_k\}_{k=1}^K }[/math]. This is done by maximum likelihood: [math]\displaystyle{ \Phi^* = \arg\max_{\Phi}\; p(\{w_d\}; \Phi) }[/math].
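As an illustration, the three steps can be written as a short sampling routine. The following is a minimal sketch in Python/NumPy; the sizes and hyperparameter values are made up for illustration and are not from the paper.

<pre>
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed): K topics, V vocabulary words.
K, V = 5, 30
alpha = np.full(K, 0.1)                    # Dirichlet hyperparameter alpha
Phi = rng.dirichlet(np.ones(V), size=K)    # Phi[k] = phi_k, a distribution over the vocabulary

def generate_lda_document(n_words=100):
    """Sample the bag of words w_d of one document from the LDA generative process."""
    theta = rng.dirichlet(alpha)                    # 1) theta_d ~ Dir(alpha)
    zs = rng.choice(K, size=n_words, p=theta)       # 2) z_dn ~ Multi(theta_d)
    return [rng.choice(V, p=Phi[z]) for z in zs]    # 3) w_dn ~ Multi(phi_{z_dn})

words = generate_lda_document()
</pre>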


Fig.1 LDA model
DiscLDA

This paper presents a discriminative variation on Latent Dirichlet Allocation (LDA) in which a class-dependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood. By using the transformed topic mixture proportions as a new representation of documents, a supervised dimensionality reduction algorithm is obtained which uncovers the latent structure in a document collection while preserving predictive power for the task of classification. DiscLDA is thus a supervised form of LDA: it assumes that supervised side information is present and takes that side information into account in finding a reduced-dimensionality representation. In particular, side information such as class labels is incorporated into LDA, while retaining LDA's favorable unsupervised dimensionality reduction abilities. The goal is to develop parameter estimation procedures that yield LDA topics that characterize the corpus and maximally exploit the predictive power of the side information.

In DiscLDA, each document is additionally associated with a categorical variable (class label) [math]\displaystyle{ y_d \in \{1,2,\dots,C\} }[/math]. For each class label y, a linear transformation [math]\displaystyle{ T^y : \mathbb{R}_+^K \rightarrow \mathbb{R}_+^L }[/math] is introduced, which transforms a K-dimensional Dirichlet variable [math]\displaystyle{ \theta_d }[/math] into a mixture of Dirichlet distributions [math]\displaystyle{ T^y\theta_d \in \mathbb{R}_+^L }[/math]. To generate a word [math]\displaystyle{ w_{dn} }[/math], its topic [math]\displaystyle{ z_{dn} }[/math] is drawn from [math]\displaystyle{ T^{y_d}\theta_d }[/math].
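To make the class-dependent step concrete, here is a minimal sketch of the DiscLDA generative process in Python/NumPy. All sizes and parameter values are made up for illustration, and [math]\displaystyle{ \Phi }[/math] is taken to be shared across classes, so the class dependence enters only through [math]\displaystyle{ T^y }[/math].

<pre>
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed): K document-level topics, L transformed topics,
# V vocabulary words, C classes.
K, L, V, C = 4, 6, 20, 2
alpha = np.full(K, 0.5)                    # Dirichlet hyperparameter

# T[y] is an L x K matrix whose columns lie on the L-simplex, so T[y] @ theta is
# again a probability vector; Phi[l] is the word distribution of transformed topic l.
T = np.transpose(rng.dirichlet(np.ones(L), size=(C, K)), (0, 2, 1))
Phi = rng.dirichlet(np.ones(V), size=L)

def generate_disclda_document(y, n_words=50):
    """Sample the words of one document with class label y."""
    theta = rng.dirichlet(alpha)                   # theta_d ~ Dir(alpha)
    topic_probs = T[y] @ theta                     # class-dependent mixture T^{y_d} theta_d
    words = []
    for _ in range(n_words):
        z = rng.choice(L, p=topic_probs)           # z_dn ~ Multi(T^{y_d} theta_d)
        words.append(rng.choice(V, p=Phi[z]))      # w_dn ~ Multi(phi_{z_dn})
    return words

doc = generate_disclda_document(y=1)
</pre>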

The following figure shows the steps in DiscLDA.

Fig.2 DiscLDA model
In DiscLDA there are also class-dependent topic parameters [math]\displaystyle{ \phi_k^y }[/math] which determine the conditional distribution of the words:

[math]\displaystyle{ w_{dn} \mid z_{dn}, y_d, \Phi \sim \mbox{Multi}(\phi_{z_{dn}}^{y_d}). }[/math]

By marginalizing out the hidden topic vector z, we get the following distribution for the word [math]\displaystyle{ w_{dn} }[/math] given [math]\displaystyle{ \theta_d }[/math]:

[math]\displaystyle{ w_{dn} \mid y_d, \theta_d, \Phi, T \sim \mbox{Multi}(\Phi\, T^{y_d} \theta_d). }[/math]
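Concretely, the marginal word distribution is just a matrix-vector product. A small NumPy check, with made-up sizes and with [math]\displaystyle{ \Phi }[/math] shared across classes, might look like this:

<pre>
import numpy as np

rng = np.random.default_rng(1)
K, L, V = 4, 6, 20                              # assumed sizes, not from the paper
theta = rng.dirichlet(np.full(K, 0.5))          # theta_d
T_y = rng.dirichlet(np.ones(L), size=K).T       # L x K, columns on the L-simplex
Phi = rng.dirichlet(np.ones(V), size=L).T       # V x L, column l is phi_l

word_probs = Phi @ (T_y @ theta)                # p(w_dn = v | y_d, theta_d) = (Phi T^y theta_d)_v
assert np.isclose(word_probs.sum(), 1.0)        # a valid distribution over the vocabulary
</pre>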


Inference and Learning

Parameter Estimation

If we are given a corpus of documents and their labels, then we estimate the transformation parameters [math]\displaystyle{ \{T^y\} }[/math] by maximizing the conditional likelihood [math]\displaystyle{ \sum_d \log p(y_d \mid w_d ; \{T^y\}, \Phi) }[/math] while [math]\displaystyle{ \Phi }[/math] is held fixed.

To maximize the conditional likelihood objective with respect to T, we use gradient ascent for a fixed [math]\displaystyle{ \Phi }[/math].

The gradient can be estimated by Monte Carlo EM, with samples from the Gibbs sampler. Therefore the gradient can be written as:

[math]\displaystyle{ \frac{\partial}{\partial T} \log p(y \mid w, T, \Phi) = E_{q_T^y(z)}\!\left[ \frac{\partial}{\partial T} \log p(w, z \mid y, T, \Phi) \right] - E_{r_T(z)}\!\left[ \frac{\partial}{\partial T} \log p(w, z \mid T, \Phi) \right], }[/math]

where [math]\displaystyle{ q_T^y(z) = p(z \mid w, y, T, \Phi) }[/math] is the posterior over topic assignments given the label and [math]\displaystyle{ r_T(z) = p(z \mid w, T, \Phi) }[/math] is the posterior with the label marginalized out; both can be sampled with a Gibbs sampler.
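The following sketch illustrates the shape of this Monte Carlo estimate for a single document: each expectation is replaced by an average of per-sample gradients, and one ascent step is taken on [math]\displaystyle{ T^{y_d} }[/math]. It is only a rough sketch, not the paper's algorithm: it conditions on sampled [math]\displaystyle{ (\theta_d, z_d) }[/math] pairs instead of using a collapsed Gibbs sampler, assumes [math]\displaystyle{ \Phi }[/math] is shared across classes, updates only the matrix [math]\displaystyle{ T^{y_d} }[/math], and uses random stand-ins in place of actual posterior samples so that it runs on its own.

<pre>
import numpy as np

rng = np.random.default_rng(2)

# Assumed sizes, not from the paper: K original topics, L transformed topics,
# C classes, N words in the document.
K, L, C, N = 4, 6, 2, 40
T = np.transpose(rng.dirichlet(np.ones(L), size=(C, K)), (0, 2, 1))  # T[y]: L x K, columns on the simplex
class_prior = np.full(C, 1.0 / C)

def grad_T_y(T_y, theta, z):
    """d/dT^y of sum_n log (T^y theta)_{z_n} for one sample (theta, z)."""
    probs = T_y @ theta                      # (T^y theta)_l for each transformed topic l
    G = np.zeros_like(T_y)
    for zn in z:
        G[zn, :] += theta / probs[zn]        # only row z_n of T^y enters the n-th term
    return G

def class_posterior(theta, z):
    """p(y | theta, z, T): the word term p(w | z, Phi) is shared across y and cancels."""
    logliks = np.array([np.log(T[y] @ theta)[z].sum() for y in range(C)])
    w = class_prior * np.exp(logliks - logliks.max())
    return w / w.sum()

# Stand-ins for posterior samples of (theta_d, z_d): the first list plays the role
# of samples from q_T^y(z) (label observed), the second of samples from r_T(z).
with_label = [(rng.dirichlet(np.ones(K)), rng.integers(L, size=N)) for _ in range(20)]
without_label = [(rng.dirichlet(np.ones(K)), rng.integers(L, size=N)) for _ in range(20)]

y_d, step = 1, 1e-3
pos = np.mean([grad_T_y(T[y_d], th, z) for th, z in with_label], axis=0)
neg = np.mean([class_posterior(th, z)[y_d] * grad_T_y(T[y_d], th, z)
               for th, z in without_label], axis=0)
# One ascent step on T^{y_d}; a projection back onto column-stochastic matrices
# (columns summing to one) would follow in a full implementation.
T[y_d] += step * (pos - neg)
</pre>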

Also, in order to estimate the parameter [math]\displaystyle{ \Phi }[/math], we hold the transformation matrices fixed and maximize the posterior of the model.

Dimensionality Reduction

In order to obtain a supervised dimensionality reduction method, we use the average transformed topic vector as the reduced representation of a test document. We estimate it using

[math]\displaystyle{ E[T^y\theta \mid \Phi, w, T] = \sum_y p(y \mid \Phi, w, T)\, E[T^y\theta \mid y, \Phi, w, T], }[/math]

where [math]\displaystyle{ p(y \mid \Phi, w, T) }[/math] can be estimated using the harmonic mean estimator and [math]\displaystyle{ E[T^y\theta \mid y, \Phi, w, T] }[/math] can be estimated from MCMC samples of z.
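For instance, once [math]\displaystyle{ p(y \mid \Phi, w, T) }[/math] and the per-class posterior means of [math]\displaystyle{ \theta }[/math] have been estimated, the reduced representation is a simple weighted sum. In the sketch below the posterior quantities are made-up stand-ins for those estimates.

<pre>
import numpy as np

rng = np.random.default_rng(3)
C, K, L = 2, 4, 6                                                     # assumed sizes
T = np.transpose(rng.dirichlet(np.ones(L), size=(C, K)), (0, 2, 1))   # stand-in T^y matrices

class_post = np.array([0.7, 0.3])                  # p(y | Phi, w, T), e.g. via the harmonic mean estimator
theta_given_y = rng.dirichlet(np.ones(K), size=C)  # E[theta | y, Phi, w, T], e.g. from MCMC samples of z

# E[T^y theta | Phi, w, T] = sum_y p(y | Phi, w, T) * T^y E[theta | y, Phi, w, T]
reduced_repr = sum(class_post[y] * (T[y] @ theta_given_y[y]) for y in range(C))
</pre>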

Comparison of LDA and DiscLDA

As mentioned in the previous sections, DiscLDA is a variation on LDA in which the LDA parametrization is augmented to include a transformation matrix and in which this matrix is learned via a conditional likelihood criterion. This approach allows DiscLDA to:


· Retain the ability of the LDA approach to find useful low-dimensional representations of documents

· Make use of discriminative side information (labels) in forming these representations


Fig.3 t-SNE 2D embedding of the [math]\displaystyle{ E[T^y\theta \mid \Phi, w, T] }[/math] representation of Newsgroups documents, after fitting to the DiscLDA model (T is fixed).
Fig.4 t-SNE 2D embedding of the [math]\displaystyle{ E[\theta \mid \Phi, w] }[/math] representation of Newsgroups documents, after fitting to the standard unsupervised LDA model.