contributions on Context Adaptive Training with Factorized Decision Trees for HMM-Based Speech Synthesis

From statwiki
Revision as of 13:13, 7 November 2011 by Tameem (talk | contribs) (State clustering)

Speech synthesis vs. speech recognition

As mentioned in the original paper, speech synthesis requires a much larger and more complex set of contexts in order to achieve high quality synthesised speech. Examples of such contexts are the following:

  • Identity of neighbouring phones to the central phone. Two phones to the left and the right of the centre phone are usually considered as phonetic neighbouring contexts
  • Position of phones, syllables, words and phrases w.r.t. higher level units
  • Number of phones, syllables, words and phrases w.r.t. higher level units
  • Syllable stress and accent status
  • Linguistic role, e.g. part-of-speech tag
  • Emotion and emphasis


Many factors can affect the acoustic realisation of phones. Prior knowledge of these factors forms the questions used in the decision tree based state clustering procedure. Some questions are highly correlated, e.g. the phonetic broad class questions and the syllable questions. Others are not, such as the example mentioned in the paper (phonetic broad class questions and emphasis questions).
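As a rough illustration of how contexts turn into clustering questions, the sketch below represents a full context as a small record and each question as a yes/no predicate on it. The field names, the question names and the nasal phone set are hypothetical choices made for this example; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """A heavily simplified full context (hypothetical fields)."""
    left_phone: str
    phone: str
    right_phone: str
    stressed: bool
    emphasised: bool


# Assumed broad class, used only for illustration.
NASALS = {"m", "n", "ng"}

# Each question is a yes/no predicate over a context.
QUESTIONS = {
    "C-Nasal": lambda c: c.phone in NASALS,        # phonetic broad class
    "R-Nasal": lambda c: c.right_phone in NASALS,  # neighbouring phone
    "Syl-Stressed": lambda c: c.stressed,          # syllable stress
    "Emphasis": lambda c: c.emphasised,            # emphasis context
}


def answer_all(context):
    """Return the answer of every question for one context."""
    return {name: bool(q(context)) for name, q in QUESTIONS.items()}
```

In this toy setting the first three questions would belong to the normal contexts and the last one to the emphasis contexts.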

MLLR based approach

Let us rewrite the first equation in (4) of the original paper as:

[math] \begin{matrix} \hat \mu_{r_{c}} = \hat \mu_m = A_{r_{e}(m)}\mu_{r_{p}(m)} + b_{r_{e}(m)} = W_{r_{e}(m)}\xi_{r_{p}(m)}\\ \hat \Sigma_{r_{c}} = \hat \Sigma_{m} = \Sigma_{r_{p}(m)} \end{matrix}[/math]

Here [math]m[/math] is used instead of [math]r_c[/math] to denote the index of the atomic state cluster, [math]W_{r_{e}} = [A_{r_{e}} \; b_{r_{e}}][/math] is the extended transform associated with leaf node [math]r_e[/math], [math]\xi_{r_{p}(m)} = [\mu_{r_{p}(m)}^T \; 1]^T[/math] is the corresponding extended mean vector, and all other parameters are as previously defined. As the equation shows, the parameters of the combined leaf node cannot be estimated directly. Instead, they are constructed from two sets of parameters with different state clustering structures. The detailed procedure is as follows:

1. Construct factorized decision trees for normal contexts [math](r_p)[/math] and emphasis contexts [math](r_e)[/math]. Let [math] m = r_e(m) \cap r_p(m)[/math] be the atomic state cluster (atomic Gaussian in the single Gaussian case)

2. Get initial parameters of the atomic Gaussians from state clustering using normal decision tree and let [math]\hat \mu_m = \mu_{r_{p}(m)} [/math]

3. Estimate [math]W_{r_{e}}[/math] given the current model parameters [math]\mu_{r_{p}(m)}[/math] and [math]\Sigma_{r_{p}(m)}[/math]. The [math]d^{th}[/math] row of [math]W_{r_{e}}[/math], [math]w_{r_{e},d}^T[/math], is estimated as

[math] \begin{matrix} w_{r_{e},d} = G_{r_{e},d}^{-1}k_{r_{e},d} \end{matrix}[/math]

where the sufficient statistics for the [math]d^{th}[/math] row are given by

[math] \begin{matrix} G_{r_{e},d} = \sum_t\sum_{m\in{r_e}}\frac {\gamma_m(t)}{\sigma_{dd}^{r_{p}(m)}}\xi_{r_{p}(m)}\xi_{r_{p}(m)}^T\\ k_{r_{e},d} = \sum_t\sum_{m\in{r_e}}\frac {\gamma_m(t)o_{t,d}}{\sigma_{dd}^{r_{p}(m)}}\xi_{r_{p}(m)} \end{matrix}[/math]

where [math]o_{t,d}[/math] is the [math]d^{th}[/math] element of observation vector [math]o_t[/math], and [math]\sigma_{dd}^{r_{p}(m)}[/math] is the [math]d^{th}[/math] diagonal element of [math]\Sigma_{r_p(m)}[/math]. [math]r_{p}(m)[/math] is the leaf node of the normal decision tree to which Gaussian component [math]m[/math] belongs. [math]\gamma_m(t)[/math] is the posterior of Gaussian component [math]m[/math] at time [math]t[/math], calculated using the forward-backward algorithm with the parameters obtained from the first equation above.

4. Estimate [math]\mu_{r_p}[/math] given the emphasis transform parameters [math]W_{r_e}[/math]. The sufficient statistics are:

[math] \begin{matrix} G_{r_p} = \sum_t\sum_{m\in{r_p}}{\gamma_m(t)}A_{r_e(m)}^T \Sigma_m^{-1}A_{r_e(m)}\\ k_{r_p} = \sum_t\sum_{m\in{r_p}}{\gamma_m(t)}A_{r_e(m)}^T \Sigma_m^{-1}(o_t - b_{r_e(m)}) \end{matrix}[/math]

and the new mean is then estimated by

[math] \begin{matrix} \mu_{r_p} = G_{r_p}^{-1}k_{r_p} \end{matrix}[/math]

5. Given the updated mean [math]\mu_{r_p}[/math] and transform [math]W_{r_e}[/math], perform context adaptation to get [math]\hat \mu_m[/math] using the first equation above.

6. Re-estimate [math]\Sigma_{r_p}[/math] using the standard covariance update formula with the adapted [math]\hat \mu_m[/math]. Here, the statistics are accumulated for each leaf node [math]r_p[/math] rather than for each individual component [math]m[/math]:

[math] \begin{matrix} \Sigma_{r_p} = diag\left(\frac{\sum_{t,m\in {r_p}}\gamma_m(t)(o_t-\hat \mu_m)(o_t-\hat \mu_m)^T}{\sum_{t,m\in {r_p}}\gamma_m(t)}\right) \end{matrix}[/math]

where [math]\gamma_m(t)[/math] is calculated using [math]\hat \mu_m[/math] constructed from the new estimates of [math]\mu_{r_p}[/math] and [math]W_{r_e}[/math].

7. Go to step (3) until convergence
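Steps 3 to 6 can be sketched in NumPy as follows. This is a minimal sketch under several assumptions that are not part of the paper: covariances are diagonal, the posteriors [math]\gamma_m(t)[/math] are passed in as a fixed array `gamma` instead of being recomputed with the forward-backward algorithm, and every leaf has enough data for its accumulator matrix to be invertible. All function and argument names (`rp_leaf`, `re_leaf`, `estimate_transforms`, ...) are hypothetical.

```python
import numpy as np


def estimate_transforms(gamma, obs, mu, var, rp_leaf, re_leaf, n_e):
    """Step 3: row-wise update of W_{r_e} = [A b], means and variances fixed.

    gamma: (T, M) posteriors, obs: (T, D), mu/var: (P, D) per normal leaf,
    rp_leaf/re_leaf: (M,) leaf index of each atomic component.
    """
    D = obs.shape[1]
    occ = gamma.sum(axis=0)                       # sum_t gamma_m(t)
    first = gamma.T @ obs                         # sum_t gamma_m(t) o_t
    xi = np.hstack([mu[rp_leaf], np.ones((len(rp_leaf), 1))])  # extended means
    W = np.zeros((n_e, D, D + 1))
    for e in range(n_e):
        members = np.flatnonzero(re_leaf == e)
        X = xi[members]
        for d in range(D):
            inv_var = 1.0 / var[rp_leaf[members], d]
            G = (X * (occ[members] * inv_var)[:, None]).T @ X
            k = X.T @ (first[members, d] * inv_var)
            W[e, d] = np.linalg.solve(G, k)       # w = G^{-1} k
    return W


def estimate_means(gamma, obs, W, var, rp_leaf, re_leaf, n_p):
    """Step 4: update mu_{r_p} with the emphasis transforms fixed."""
    D = obs.shape[1]
    occ = gamma.sum(axis=0)
    first = gamma.T @ obs
    mu = np.zeros((n_p, D))
    for p in range(n_p):
        G = np.zeros((D, D))
        k = np.zeros(D)
        for m in np.flatnonzero(rp_leaf == p):
            We = W[re_leaf[m]]
            A, b = We[:, :D], We[:, D]
            At_prec = A.T / var[p]                # A^T Sigma^{-1}, diagonal Sigma
            G += occ[m] * (At_prec @ A)
            k += At_prec @ (first[m] - occ[m] * b)
        mu[p] = np.linalg.solve(G, k)
    return mu


def adapt_means(mu, W, rp_leaf, re_leaf):
    """Step 5: mu_hat_m = A_{r_e(m)} mu_{r_p(m)} + b_{r_e(m)}."""
    D = mu.shape[1]
    We = W[re_leaf]                               # (M, D, D + 1)
    return np.einsum("mij,mj->mi", We[:, :, :D], mu[rp_leaf]) + We[:, :, D]


def estimate_variances(gamma, obs, mu_hat, rp_leaf, n_p):
    """Step 6: diagonal covariance per normal leaf, using the adapted means."""
    D = obs.shape[1]
    num = np.zeros((n_p, D))
    den = np.zeros(n_p)
    for m in range(len(rp_leaf)):
        diff = obs - mu_hat[m]
        num[rp_leaf[m]] += gamma[:, m] @ diff**2
        den[rp_leaf[m]] += gamma[:, m].sum()
    return num / den[:, None]
```

With the posteriors held fixed, each of the two mean-related updates minimises the same variance-weighted squared error, so that error should not increase from one step to the next. In a full implementation the posteriors would be recomputed between iterations, as described in the procedure.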

State clustering

The idea of decision tree based state clustering is to use a binary decision tree, in which a question is attached to each non-leaf node, to assign the state distribution of every possible full context HMM to a state cluster. When a single Gaussian is used as the state output distribution, and the Gaussian parameters [math]\mu(\Theta)[/math] and [math]\Sigma(\Theta)[/math] are ML estimates, the log likelihood of a set of states [math]\Theta[/math] can be written as

[math] \begin{matrix} l(\Theta) = \sum_t\sum_{\theta\in\Theta}\gamma_\theta(o_t)\log \mathcal{N}(o_t;\mu(\Theta), \Sigma(\Theta))\\ = -\frac{\gamma(\Theta)}{2}(\log |\Sigma(\Theta)|+ D \log(2\pi) + D) \end{matrix}[/math]

where [math]D[/math] is the data dimension, and [math]\gamma(\Theta)[/math], [math]\mu(\Theta)[/math] and [math]\Sigma(\Theta)[/math] are the total occupancy, the mean and the covariance matrix of the pooled state respectively:

[math] \begin{matrix} \gamma(\Theta) = \sum_{\theta\in\Theta}\sum_t\gamma_\theta(o_t)\\ \mu(\Theta) = \frac{\sum_{\theta\in\Theta}(\sum_t\gamma_\theta(o_t))\mu_\theta}{\gamma(\Theta)}\\ \Sigma(\Theta) = \frac{\sum_{\theta\in\Theta}(\sum_t\gamma_\theta(o_t))(\mu_\theta\mu_\theta^T + \Sigma_\theta)}{\gamma(\Theta)} - \mu(\Theta)\mu(\Theta)^T \end{matrix}[/math]
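The closed-form log likelihood can be checked numerically. The sketch below pools per-state occupancies, means and covariances into a single Gaussian using the standard pooled-covariance form (occupancy-weighted second moments minus the outer product of the pooled mean) and evaluates [math]l(\Theta)[/math] from the pooled statistics alone. The helper `split_gain` illustrates the usual greedy criterion of comparing the likelihood before and after a yes/no question; it is included as an assumption about how the likelihood is used, and all function names are hypothetical.

```python
import numpy as np


def pool_states(occ, means, covs):
    """Pool per-state statistics (occ: (S,), means: (S, D), covs: (S, D, D))."""
    total = occ.sum()
    mu = (occ[:, None] * means).sum(axis=0) / total
    # Occupancy-weighted second moments: Sigma_theta + mu_theta mu_theta^T.
    second = np.einsum("s,sij->ij", occ,
                       covs + np.einsum("si,sj->sij", means, means))
    sigma = second / total - np.outer(mu, mu)
    return total, mu, sigma


def cluster_log_likelihood(occ, means, covs):
    """l(Theta) = -gamma/2 * (log|Sigma| + D log(2 pi) + D)."""
    total, _, sigma = pool_states(occ, means, covs)
    D = means.shape[1]
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * total * (logdet + D * np.log(2 * np.pi) + D)


def split_gain(occ, means, covs, answers):
    """Log-likelihood gain of splitting the states by a boolean answer mask."""
    yes, no = answers, ~answers
    return (cluster_log_likelihood(occ[yes], means[yes], covs[yes])
            + cluster_log_likelihood(occ[no], means[no], covs[no])
            - cluster_log_likelihood(occ, means, covs))
```

Because only occupancies, means and covariances are needed, candidate questions can be scored without revisiting the training data.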

When using a structured context adaptive training representation, there are two sets of parameters to be clustered: transform and Gaussian parameters, resulting in two or more decision trees. There are three ways to build such trees:

  • Independent construction: it assumes that the factorized decision trees are independent of each other and therefore built separately. This approximation results in a factorization that is purely dependent on the different sets of context questions used during the decision tree construction
  • Dependent construction: it builds factorized decision trees one by one. Each is built assuming that the remaining parameter sets along with the sharing structure are fixed. An iterative process is used with all parameters being re-estimated after every split
  • Simultaneous construction: it builds all factorized decision trees at once. At each split, all trees are optimized inter-dependently until the stopping criterion is met.

The paper under discussion employs independent construction.
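When the trees are built independently, every full-context state receives one leaf from the normal tree and one from the emphasis tree, and the atomic state clusters are the non-empty intersections [math]m = r_e(m) \cap r_p(m)[/math]. The sketch below derives those intersections from two leaf assignments. It is only an illustration: the function name and the encoding of leaves as integers are assumptions made for this example.

```python
def atomic_clusters(rp_leaf, re_leaf):
    """Map each state to an atomic cluster index.

    rp_leaf[i] and re_leaf[i] are the leaves that state i reaches in the
    normal and the emphasis tree. Returns a dict from (r_p, r_e) pairs to
    atomic indices (numbered in order of first appearance) and the
    per-state atomic index list.
    """
    index = {}
    assignment = []
    for pair in zip(rp_leaf, re_leaf):
        # A new (r_p, r_e) pair opens a new atomic cluster.
        assignment.append(index.setdefault(pair, len(index)))
    return index, assignment
```

Only the pairs that actually occur become atomic clusters, so their number can be smaller than the product of the two leaf counts.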

In HMM-based speech synthesis, decision tree based clustering is usually performed twice to obtain a better clustering structure. The general procedure is as follows:

1. Train monophone HMMs and construct untied full context dependent HMMs.

2. Perform one EM re-estimation of the untied full context dependent HMMs.

3. Perform state clustering given the parameters of the untied model from step 2.

4. Perform several iterations of EM re-estimation of the clustered HMMs.

5. Untie the clustered HMMs and perform one more EM re-estimation to get updated parameters of the untied full context dependent HMMs.

6. Perform state clustering given the parameters of the untied model from step 5.

7. Perform several iterations of EM re-estimation of the clustered HMMs.
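The control flow of the seven steps can be summarised as below. This is only an orchestration sketch: the four callables (`train_monophones`, `untie`, `reestimate`, `cluster`) and the `n_iters` parameter are hypothetical placeholders for whatever toolkit performs the actual training, and the model object is treated as opaque.

```python
def two_pass_clustering(train_monophones, untie, reestimate, cluster, n_iters=5):
    """Run the two-pass clustering recipe with user-supplied training steps."""
    model = untie(train_monophones())   # step 1: monophones -> untied full context
    model = reestimate(model)           # step 2: one EM pass on the untied model
    model = cluster(model)              # step 3: first state clustering
    for _ in range(n_iters):            # step 4: several EM passes, clustered model
        model = reestimate(model)
    model = reestimate(untie(model))    # step 5: untie, one more EM pass
    model = cluster(model)              # step 6: second state clustering
    for _ in range(n_iters):            # step 7: several EM passes, clustered model
        model = reestimate(model)
    return model
```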