hierarchical Dirichlet Processes: Difference between revisions
Yorkerjaster (talk | contribs) |
m (Conversion script moved page Hierarchical Dirichlet Processes to hierarchical Dirichlet Processes: Converting page titles to lowercase) |
||
(27 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
If we can put a prior on random partition and use likelihood distribution to model data points, we can use the Bayesian framework to learn the latent dimension, which is the main idea of Dirichlet process mixture model. When it comes to clustering grouped data, we usually assume some information is shared between groups | If we can put a prior on random partition and use likelihood distribution to model data points, we can use the Bayesian framework to learn the latent dimension, which is the main idea of Dirichlet process mixture model. When it comes to clustering grouped data, we usually assume some information is shared between groups, which means our model should be able to capture this information. One natural proposal about hierarchical clustering problem is each group j is modeled by a Dirichlet process mixture model <math> G_j </math> ~ <math>DP(\alpha, G_0)</math>. However, if the <math>G_0</math> is continuous, this proposal generally cannot model shared information between groups. One idea is to make <math>G_0</math> become discrete by limiting the choice of <math> G_0</math>. | ||
However, if the G_0 | The main idea of this paper is to assume <math>G_0</math> drawn from other Dirichlet process <math>DP(\lambda, H)</math>, where H is any base measure. Note that <math>G_0</math> is discrete with probability one due to the fact of Dirichlet process. | ||
The main idea of this paper is to | |||
==Introduction== | ==Introduction== | ||
It is a common practice to tune the latent dimension K in order to get the best performance of a model. One weakness of this practice is that the corpus is static and unchanged, which means it is generally difficult to do inference given new unseen data points. In that case, we may either re-train the model | It is a common practice to tune the latent dimension K in order to get the best performance of a model. One weakness of this practice is that the corpus is static and unchanged, which means it is generally difficult to do inference given new unseen data points. In that case, we may either re-train the model or use some algebraic/heuristic fold-in technique to do inference. | ||
If we can come out some prior on the latent dimension and likelihood distribution on data points, we learn the latent dimension K on the fly from the corpus based on the Bayesian framework. This is a important property when it comes to online stream mining. | If we can come out some prior on the latent dimension and likelihood distribution on data points, we can learn the latent dimension K on-the-fly from the corpus based on the Bayesian framework. This is a important property when it comes to online data stream mining. | ||
===Hierarchical clustering=== | ===Hierarchical clustering=== | ||
A recurring theme in statistics is the need to separate observations into groups, and yet allow the groups to remain liked-to "share statistical strength". In the Bayesian formalism such sharing is achieved naturally via hierarchical modeling; parameters are shared among groups, and the randomness of the parameters induces dependencies among the groups. Estimates based on the posterior distribution exhibit "shrinkage" | A recurring theme in statistics is the need to separate observations into groups, and yet allow the groups to remain liked-to "share statistical strength". In the Bayesian formalism such sharing is achieved naturally via hierarchical modeling; parameters are shared among groups, and the randomness of the parameters induces dependencies among the groups. Estimates based on the posterior distribution exhibit "shrinkage". | ||
==Dirichlet process== | ==Dirichlet process== | ||
<math> G </math> ~ <math> DP(\alpha,G_0) </math> | |||
* for each data point <math> x_i </math> | |||
** <math>\theta_i</math> ~ <math> G </math> | |||
** <math>x_i</math> ~ <math> \theta_i </math> | |||
===stick-breaking construction=== | ===stick-breaking construction=== | ||
<math> G </math> ~ <math> DP(\alpha,G_0) </math> is equivalent to the following construction. | |||
For each latent model <math>k</math> | |||
*<math> \theta_k </math> ~ <math> G_0 </math> | |||
*<math> \pi_k' </math> ~ <math> Beta(1,\alpha) </math> | |||
*<math> \pi_k = \pi_k'\prod_{l=1}^{k-1}(1-\pi_k') </math> | |||
For each data point <math> x_i </math> | |||
** <math>\theta_i</math> ~ <math> G </math> | |||
** <math>x_i</math> ~ <math> \theta_i </math> | |||
===Chinese restaurant process=== | ===Chinese restaurant process=== | ||
=== | |||
for each data point <math> x_i </math> | |||
* <math>\theta_i</math> ~ <math> \sum_{n=1}^{i-1}\frac{1}{i-1+\alpha}\delta_{\theta_n} + \frac{\alpha}{i-1+\alpha}G_0 </math> | |||
* <math>x_i</math> ~ <math> \theta_i </math> | |||
Another representation based on Chinese restaurant process can be found in below. | |||
The table assignment <math>t_i</math> is followed by | |||
<math>p(t_i=k)=\frac{m_k}{i-1+\alpha} </math> | |||
<math>p(t_i=k_{new})=\frac{\alpha}{i-1+\alpha} </math> | |||
<math>\theta_{k_{new}}</math> ~ <math>G_0</math> | |||
where <math>m_k=\sum_{n=1}^{i-1}I(t_n=k)</math> and <math>I()</math> is the indicator function. | |||
Note that <math>\sum_{k}m_k=i-1</math> | |||
===the infinite limit of finite mixture models=== | ===the infinite limit of finite mixture models=== | ||
<math>\pi</math> ~ <math>Dir(\frac{\alpha}{K},\dots,\frac{\alpha}{K})</math> | |||
For each latent model <math>k</math> | |||
*<math>\theta_{k}</math> ~ <math>G_0</math> | |||
For each data point <math>x_i</math> | |||
*<math>z_i</math> ~ <math>Multi(\pi)</math> | |||
*<math>x_i</math> ~ <math>\theta_{z_i}</math> | |||
Where Dir() is a K-dimension Dirichlet distribution and Multi() is a K-dimension Multinomial distribution. | |||
When K goes into infinite, we can show that such construction become a Dirichlet process. | |||
==Hierarchical Dirichlet process== | ==Hierarchical Dirichlet process== | ||
<math> G_0 </math> ~ <math> DP(\lambda,H) </math> | |||
For each group j | |||
*<math> G_j </math> ~ <math> DP(\alpha,G_0) </math> | |||
** for each data point <math> x_{ij} </math> in group j | |||
*** <math>\theta_{ij}</math> ~ <math> G_j </math> | |||
*** <math>x_{ij}</math> ~ <math> \theta_{ij} </math> | |||
===stick-breaking construction=== | ===stick-breaking construction=== | ||
For each latent model <math>\theta_{k}</math> | |||
*<math>\theta_{k}</math> ~ <math>H</math> | |||
*<math>\beta_k'</math> ~ <math>Beta(1,\lambda)</math> | |||
*<math>\beta_{k}=\beta_k'\prod_{l=1}^{k-1}(1-\beta_k') </math> | |||
For each group j | |||
*<math>\pi_k'</math> ~ <math>Beta(\alpha\beta_k,\alpha(1-\sum_{l=1}^{k}\beta_l))</math> | |||
*<math>\pi_{k}=\pi_k'\prod_{l=1}^{k-1}(1-\pi_k') </math> | |||
*for each data point <math>x_{ij}</math> in group j | |||
**<math>z_{ij}</math> ~ <math>Multi(\pi_j)</math> | |||
**<math>x_{ij}</math> ~ <math>\theta_{z_{ij}}</math> | |||
===Chinese restaurant franchise=== | ===Chinese restaurant franchise=== | ||
For each restaurant/group j | |||
*each table t in restaurant j | |||
**<math>\theta_{jt}</math> ~ <math> \sum_{k=1}^{K}\frac{m_{.k}}{m_{..}+\lambda}\delta_{\theta_{k}} + \frac{\lambda}{m_{..}+\lambda}H </math> | |||
*for each data point <math> x_{ij} </math> in restaurant j | |||
** <math>\varphi_{ij}</math> ~ <math> \sum_{t=1}^{m_{j.}}\frac{n_{jt.}}{i-1+\alpha}\delta_{\theta_{jt}} + \frac{\alpha}{i-1+\alpha}G_0 </math> | |||
** <math>x_i</math> ~ <math> \varphi_{ij} </math> | |||
Another representation based on Chinese restaurant process can be found in below. | |||
The table assignment <math>t_{ij}</math> is followed by | |||
<math>p(t_{ij}=h)=\frac{n_{jt.}}{i-1+\alpha} </math> | |||
<math>p(t_{ij}=h_{new})=\frac{\alpha}{i-1+\alpha} </math> | |||
<math>\theta_{h_{new}}</math> ~ <math>G_0</math> | |||
If we integrate out G_0, h_{new} will correspond to the dish label. | |||
The dish assignment <math>k_{jt}</math> is followed by | |||
<math>p(k_{jt}=k)=\frac{m_{.k}}{m_{..}+\lambda} </math> | |||
<math>p(k_{jt}=k_{new})=\frac{\lambda}{m_{..}+\lambda} </math> | |||
<math>\theta_{t_{new}}</math> ~ <math>H</math> | |||
Where: | |||
<math>m_{..}</math> is the total number of table in all restaurants, | |||
<math>m_{.k}</math> is the number of table severing dish k in all restaurants, | |||
<math>m_{j.}</math> is the total number of table in restaurant j , | |||
<math>n_{jt.}</math> is the total number of customer sitting in table t in restaurant j. | |||
===the infinite limit of finite mixture models=== | ===the infinite limit of finite mixture models=== | ||
<math>\beta</math> ~ <math>Dir(\frac{\lambda}{K},\dots,\frac{\lambda}{K})</math> | |||
For each latent model <math>k</math> | |||
*<math>\theta_{k}</math> ~ <math>H</math> | |||
For each group j | |||
*<math>\pi_{j}</math> ~ <math>Dir(\alpha\beta)</math> | |||
*for each data point <math>x_{ij}</math> in group j | |||
**<math>z_{ij}</math> ~ <math>Multi(\pi_j)</math> | |||
**<math>x_{ij}</math> ~ <math>\theta_{z_{ij}}</math> | |||
Where Dir() is a K-dimension Dirichlet distribution and Multi() is a K-dimension Multinomial distribution. | |||
When K goes into infinite, we can show that such construction become a Hierarchical Dirichlet process. | |||
==Inference== | ==Inference== | ||
===Markov chain Monte Carlo method=== | ===Markov chain Monte Carlo method (MCMC)=== | ||
We can use MCMC sampling framework based on the different representations of Hierarchical Dirichlet Process (HDP) mentioned above to estimate parameters in HDP and do inference for new unseen data points. | |||
==Relationship with Latent Dirichlet allocation== | |||
The HDP mixture model is a natural nonparametric generalization of Latent Dirichlet allocation, where the number of topics can be unbounded and learnt from data. Here each group is a document consisting of a bag of words, each cluster is a topic, and each document is a mixture of topics. The HDP is also a core component of the infinite hidden Markov model, which is a nonparametric generalization of the hidden Markov model allowing the number of states to be unbounded and learnt from data. | |||
==Conclusions== | |||
The paper described the use of HDP for clustering. HDP is a non-parametric hierarchical model of Dirichlet processes. HDPs can automatically determine the number of mixture components due to the non-parametric nature of it. Furthermore, in HDP, components are shared between the groups and therefore, statistical strength is shared across the groups. | |||
The | |||
Latest revision as of 08:46, 30 August 2017
If we can put a prior on random partition and use likelihood distribution to model data points, we can use the Bayesian framework to learn the latent dimension, which is the main idea of Dirichlet process mixture model. When it comes to clustering grouped data, we usually assume some information is shared between groups, which means our model should be able to capture this information. One natural proposal about hierarchical clustering problem is each group j is modeled by a Dirichlet process mixture model [math]\displaystyle{ G_j }[/math] ~ [math]\displaystyle{ DP(\alpha, G_0) }[/math]. However, if the [math]\displaystyle{ G_0 }[/math] is continuous, this proposal generally cannot model shared information between groups. One idea is to make [math]\displaystyle{ G_0 }[/math] become discrete by limiting the choice of [math]\displaystyle{ G_0 }[/math]. The main idea of this paper is to assume [math]\displaystyle{ G_0 }[/math] drawn from other Dirichlet process [math]\displaystyle{ DP(\lambda, H) }[/math], where H is any base measure. Note that [math]\displaystyle{ G_0 }[/math] is discrete with probability one due to the fact of Dirichlet process.
Introduction
It is a common practice to tune the latent dimension K in order to get the best performance of a model. One weakness of this practice is that the corpus is static and unchanged, which means it is generally difficult to do inference given new unseen data points. In that case, we may either re-train the model or use some algebraic/heuristic fold-in technique to do inference. If we can come out some prior on the latent dimension and likelihood distribution on data points, we can learn the latent dimension K on-the-fly from the corpus based on the Bayesian framework. This is a important property when it comes to online data stream mining.
Hierarchical clustering
A recurring theme in statistics is the need to separate observations into groups, and yet allow the groups to remain liked-to "share statistical strength". In the Bayesian formalism such sharing is achieved naturally via hierarchical modeling; parameters are shared among groups, and the randomness of the parameters induces dependencies among the groups. Estimates based on the posterior distribution exhibit "shrinkage".
Dirichlet process
[math]\displaystyle{ G }[/math] ~ [math]\displaystyle{ DP(\alpha,G_0) }[/math]
- for each data point [math]\displaystyle{ x_i }[/math]
- [math]\displaystyle{ \theta_i }[/math] ~ [math]\displaystyle{ G }[/math]
- [math]\displaystyle{ x_i }[/math] ~ [math]\displaystyle{ \theta_i }[/math]
stick-breaking construction
[math]\displaystyle{ G }[/math] ~ [math]\displaystyle{ DP(\alpha,G_0) }[/math] is equivalent to the following construction.
For each latent model [math]\displaystyle{ k }[/math]
- [math]\displaystyle{ \theta_k }[/math] ~ [math]\displaystyle{ G_0 }[/math]
- [math]\displaystyle{ \pi_k' }[/math] ~ [math]\displaystyle{ Beta(1,\alpha) }[/math]
- [math]\displaystyle{ \pi_k = \pi_k'\prod_{l=1}^{k-1}(1-\pi_k') }[/math]
For each data point [math]\displaystyle{ x_i }[/math]
- [math]\displaystyle{ \theta_i }[/math] ~ [math]\displaystyle{ G }[/math]
- [math]\displaystyle{ x_i }[/math] ~ [math]\displaystyle{ \theta_i }[/math]
Chinese restaurant process
for each data point [math]\displaystyle{ x_i }[/math]
- [math]\displaystyle{ \theta_i }[/math] ~ [math]\displaystyle{ \sum_{n=1}^{i-1}\frac{1}{i-1+\alpha}\delta_{\theta_n} + \frac{\alpha}{i-1+\alpha}G_0 }[/math]
- [math]\displaystyle{ x_i }[/math] ~ [math]\displaystyle{ \theta_i }[/math]
Another representation based on Chinese restaurant process can be found in below.
The table assignment [math]\displaystyle{ t_i }[/math] is followed by
[math]\displaystyle{ p(t_i=k)=\frac{m_k}{i-1+\alpha} }[/math]
[math]\displaystyle{ p(t_i=k_{new})=\frac{\alpha}{i-1+\alpha} }[/math]
[math]\displaystyle{ \theta_{k_{new}} }[/math] ~ [math]\displaystyle{ G_0 }[/math]
where [math]\displaystyle{ m_k=\sum_{n=1}^{i-1}I(t_n=k) }[/math] and [math]\displaystyle{ I() }[/math] is the indicator function.
Note that [math]\displaystyle{ \sum_{k}m_k=i-1 }[/math]
the infinite limit of finite mixture models
[math]\displaystyle{ \pi }[/math] ~ [math]\displaystyle{ Dir(\frac{\alpha}{K},\dots,\frac{\alpha}{K}) }[/math]
For each latent model [math]\displaystyle{ k }[/math]
- [math]\displaystyle{ \theta_{k} }[/math] ~ [math]\displaystyle{ G_0 }[/math]
For each data point [math]\displaystyle{ x_i }[/math]
- [math]\displaystyle{ z_i }[/math] ~ [math]\displaystyle{ Multi(\pi) }[/math]
- [math]\displaystyle{ x_i }[/math] ~ [math]\displaystyle{ \theta_{z_i} }[/math]
Where Dir() is a K-dimension Dirichlet distribution and Multi() is a K-dimension Multinomial distribution.
When K goes into infinite, we can show that such construction become a Dirichlet process.
Hierarchical Dirichlet process
[math]\displaystyle{ G_0 }[/math] ~ [math]\displaystyle{ DP(\lambda,H) }[/math]
For each group j
- [math]\displaystyle{ G_j }[/math] ~ [math]\displaystyle{ DP(\alpha,G_0) }[/math]
- for each data point [math]\displaystyle{ x_{ij} }[/math] in group j
- [math]\displaystyle{ \theta_{ij} }[/math] ~ [math]\displaystyle{ G_j }[/math]
- [math]\displaystyle{ x_{ij} }[/math] ~ [math]\displaystyle{ \theta_{ij} }[/math]
stick-breaking construction
For each latent model [math]\displaystyle{ \theta_{k} }[/math]
- [math]\displaystyle{ \theta_{k} }[/math] ~ [math]\displaystyle{ H }[/math]
- [math]\displaystyle{ \beta_k' }[/math] ~ [math]\displaystyle{ Beta(1,\lambda) }[/math]
- [math]\displaystyle{ \beta_{k}=\beta_k'\prod_{l=1}^{k-1}(1-\beta_k') }[/math]
For each group j
- [math]\displaystyle{ \pi_k' }[/math] ~ [math]\displaystyle{ Beta(\alpha\beta_k,\alpha(1-\sum_{l=1}^{k}\beta_l)) }[/math]
- [math]\displaystyle{ \pi_{k}=\pi_k'\prod_{l=1}^{k-1}(1-\pi_k') }[/math]
- for each data point [math]\displaystyle{ x_{ij} }[/math] in group j
- [math]\displaystyle{ z_{ij} }[/math] ~ [math]\displaystyle{ Multi(\pi_j) }[/math]
- [math]\displaystyle{ x_{ij} }[/math] ~ [math]\displaystyle{ \theta_{z_{ij}} }[/math]
Chinese restaurant franchise
For each restaurant/group j
- each table t in restaurant j
- [math]\displaystyle{ \theta_{jt} }[/math] ~ [math]\displaystyle{ \sum_{k=1}^{K}\frac{m_{.k}}{m_{..}+\lambda}\delta_{\theta_{k}} + \frac{\lambda}{m_{..}+\lambda}H }[/math]
- for each data point [math]\displaystyle{ x_{ij} }[/math] in restaurant j
- [math]\displaystyle{ \varphi_{ij} }[/math] ~ [math]\displaystyle{ \sum_{t=1}^{m_{j.}}\frac{n_{jt.}}{i-1+\alpha}\delta_{\theta_{jt}} + \frac{\alpha}{i-1+\alpha}G_0 }[/math]
- [math]\displaystyle{ x_i }[/math] ~ [math]\displaystyle{ \varphi_{ij} }[/math]
Another representation based on Chinese restaurant process can be found in below.
The table assignment [math]\displaystyle{ t_{ij} }[/math] is followed by
[math]\displaystyle{ p(t_{ij}=h)=\frac{n_{jt.}}{i-1+\alpha} }[/math]
[math]\displaystyle{ p(t_{ij}=h_{new})=\frac{\alpha}{i-1+\alpha} }[/math]
[math]\displaystyle{ \theta_{h_{new}} }[/math] ~ [math]\displaystyle{ G_0 }[/math]
If we integrate out G_0, h_{new} will correspond to the dish label.
The dish assignment [math]\displaystyle{ k_{jt} }[/math] is followed by
[math]\displaystyle{ p(k_{jt}=k)=\frac{m_{.k}}{m_{..}+\lambda} }[/math]
[math]\displaystyle{ p(k_{jt}=k_{new})=\frac{\lambda}{m_{..}+\lambda} }[/math]
[math]\displaystyle{ \theta_{t_{new}} }[/math] ~ [math]\displaystyle{ H }[/math]
Where:
[math]\displaystyle{ m_{..} }[/math] is the total number of table in all restaurants,
[math]\displaystyle{ m_{.k} }[/math] is the number of table severing dish k in all restaurants,
[math]\displaystyle{ m_{j.} }[/math] is the total number of table in restaurant j ,
[math]\displaystyle{ n_{jt.} }[/math] is the total number of customer sitting in table t in restaurant j.
the infinite limit of finite mixture models
[math]\displaystyle{ \beta }[/math] ~ [math]\displaystyle{ Dir(\frac{\lambda}{K},\dots,\frac{\lambda}{K}) }[/math]
For each latent model [math]\displaystyle{ k }[/math]
- [math]\displaystyle{ \theta_{k} }[/math] ~ [math]\displaystyle{ H }[/math]
For each group j
- [math]\displaystyle{ \pi_{j} }[/math] ~ [math]\displaystyle{ Dir(\alpha\beta) }[/math]
- for each data point [math]\displaystyle{ x_{ij} }[/math] in group j
- [math]\displaystyle{ z_{ij} }[/math] ~ [math]\displaystyle{ Multi(\pi_j) }[/math]
- [math]\displaystyle{ x_{ij} }[/math] ~ [math]\displaystyle{ \theta_{z_{ij}} }[/math]
Where Dir() is a K-dimension Dirichlet distribution and Multi() is a K-dimension Multinomial distribution.
When K goes into infinite, we can show that such construction become a Hierarchical Dirichlet process.
Inference
Markov chain Monte Carlo method (MCMC)
We can use MCMC sampling framework based on the different representations of Hierarchical Dirichlet Process (HDP) mentioned above to estimate parameters in HDP and do inference for new unseen data points.
Relationship with Latent Dirichlet allocation
The HDP mixture model is a natural nonparametric generalization of Latent Dirichlet allocation, where the number of topics can be unbounded and learnt from data. Here each group is a document consisting of a bag of words, each cluster is a topic, and each document is a mixture of topics. The HDP is also a core component of the infinite hidden Markov model, which is a nonparametric generalization of the hidden Markov model allowing the number of states to be unbounded and learnt from data.
Conclusions
The paper described the use of HDP for clustering. HDP is a non-parametric hierarchical model of Dirichlet processes. HDPs can automatically determine the number of mixture components due to the non-parametric nature of it. Furthermore, in HDP, components are shared between the groups and therefore, statistical strength is shared across the groups.