hierarchical Dirichlet Processes
If we can put a prior on random partition and use likelihood distribution to model data points, we can use the Bayesian framework to learn the latent dimension, which is the main idea of Dirichlet process mixture model. When it comes to clustering grouped data, we usually assume some information is shared between groups, which means our model should be able to capture this information. One natural proposal about hierarchical clustering problem is each group j is modeled by a Dirichlet process mixture model [math]\displaystyle{ G_j }[/math][math]\displaystyle{ ~ \lt math\gt DP(\alpha, G_0) }[/math]. However, if the [math]\displaystyle{ G_0 }[/math] is continuous, this proposal generally cannot model shared information between groups. One idea is to make [math]\displaystyle{ G_0 }[/math] become discrete by limiting the choice of[math]\displaystyle{ G_0 }[/math]. The main idea of this paper is to use any base measure H and assume [math]\displaystyle{ G_0 }[/math] drawn from other Dirichlet process [math]\displaystyle{ DP(\lambda, H) }[/math]. Note that [math]\displaystyle{ G_0 }[/math] is discrete with probability one due to the fact of Dirichlet process.
Introduction
It is a common practice to tune the latent dimension K in order to get the best performance of a model. One weakness of this practice is that the corpus is static and unchanged, which means it is generally difficult to do inference given new unseen data points. In that case, we may either re-train the model in the whole corpus including these unseen data points or use some algebraic/heuristic fold-in technique to do inference. If we can come out some prior on the latent dimension and likelihood distribution on data points, we learn the latent dimension K on the fly from the corpus based on the Bayesian framework. This is a important property when it comes to online stream mining.
Hierarchical clustering
A recurring theme in statistics is the need to separate observations into groups, and yet allow the groups to remain liked-to "share statistical strength". In the Bayesian formalism such sharing is achieved naturally via hierarchical modeling; parameters are shared among groups, and the randomness of the parameters induces dependencies among the groups. Estimates based on the posterior distribution exhibit "shrinkage"