# Context Adaptive Training with Factorized Decision Trees for HMM-Based Speech Synthesis

### Introduction

Automatic speech synthesis is the problem of generating human-like speech by machine from a given text; such systems are also known as text-to-speech systems. It has useful applications in smartphones and in assistive tools for blind people. The quality of a speech synthesizer is judged by its similarity to the human voice and by how easily it can be understood. There are two major approaches to this problem: unit-selection synthesis and statistical parametric synthesis.

The basic idea behind unit-selection synthesis is that, given a large enough database of recorded speech, one can take a cut-and-sew approach: look up the right units (for example, phones) in the database and concatenate them as needed. The most common formulation uses two cost functions. The target cost measures how closely a candidate unit from the database matches the intended target unit. The concatenation cost measures how naturally two consecutive candidate units join. For a target sequence $t_{1:n}$ and a candidate unit sequence $u_{1:n}$, the overall cost can be formulated as follows.

$C(t_{1:n},u_{1:n})=\sum_{i=1}^{n}C^{(t)}(t_i,u_i) + \sum_{i=2}^{n}C^{(c)}(u_{i-1},u_i).$

Here $C^{(t)}(t_i,u_i)$ and $C^{(c)}(u_{i-1},u_i)$ are the target and concatenation costs, defined as follows.

\begin{align} C^{(t)}(t_i,u_i)& = \sum_{j=1}^{p}w_j^{(t)}C_j^{(t)}(t_i,u_i),\\ C^{(c)}(u_{i-1},u_i)& = \sum_{k=1}^{q}w_k^{(c)}C_k^{(c)}(u_{i-1},u_i). \end{align}

Here $w_j^{(t)}$ and $w_k^{(c)}$ are the weights of the individual sub-costs $C_j^{(t)}$ and $C_k^{(c)}$. The objective of unit-selection synthesis is then to minimize the overall cost.

$\hat{u}_{1:n}=\arg \min_{u_{1:n}}\{C(t_{1:n},u_{1:n})\}$
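Because the overall cost only couples adjacent units, the minimization above can be carried out with a Viterbi-style dynamic program rather than by enumerating every unit sequence. The following is a minimal sketch under stated assumptions, not the implementation of any particular synthesizer: the function name `select_units`, the toy numeric "units", and the two cost functions are hypothetical, and the weighted sub-costs are assumed to be already folded into `target_cost` and `concat_cost`.

```python
from typing import Callable, List, Sequence, Tuple


def select_units(
    targets: Sequence[float],
    candidates: Sequence[Sequence[float]],
    target_cost: Callable[[float, float], float],
    concat_cost: Callable[[float, float], float],
) -> Tuple[float, List[float]]:
    """Return (minimum total cost, best unit sequence) by dynamic programming."""
    n = len(targets)
    if n == 0:
        return 0.0, []
    # best[i][k]: lowest cost of any unit sequence ending in candidates[i][k]
    best = [[target_cost(targets[0], u) for u in candidates[0]]]
    # back[i][k]: index of the predecessor unit on that lowest-cost sequence
    back = [[-1] * len(candidates[0])]
    for i in range(1, n):
        row, ptr = [], []
        for u in candidates[i]:
            costs = [
                best[i - 1][j] + concat_cost(prev, u)
                for j, prev in enumerate(candidates[i - 1])
            ]
            j_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[j_best] + target_cost(targets[i], u))
            ptr.append(j_best)
        best.append(row)
        back.append(ptr)
    # Trace back from the cheapest final unit.
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][k]
    path = []
    for i in range(n - 1, -1, -1):
        path.append(candidates[i][k])
        k = back[i][k]
    path.reverse()
    return total, path


# Toy example (hypothetical): each unit is described by a single number.
targets = [1.0, 2.0, 3.0]
candidates = [[0.0, 1.0], [2.0, 5.0], [3.0, 4.0]]
total, path = select_units(
    targets,
    candidates,
    target_cost=lambda t, u: abs(t - u),          # mismatch with the target
    concat_cost=lambda a, b: 0.5 * abs(a - b),    # roughness of the join
)
print(total, path)
```

The search costs time proportional to the number of positions times the square of the number of candidates per position, which is why practical systems usually prune the candidate lists first.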

As one can see, the two major challenges of this approach are how to define the cost functions for the target and concatenation objectives, and how to set the weights of the individual cost terms.

Statistical parametric synthesis <ref>The HMM-based Speech Synthesis System, http://hts.sp.nitech.ac.jp/</ref>, on the other hand, approaches the problem quite differently. It is a synthesis method based on hidden Markov models (HMMs). In this approach, the frequency spectrum (vocal tract), fundamental frequency (vocal source), and duration (prosody) of speech are modeled simultaneously by HMMs. Speech waveforms are generated from the HMMs themselves based on the maximum likelihood criterion. The approach has a training phase, after which the speech segments in the database are no longer used directly, although the learned parameters are directly shaped by those speech samples. The training objective of statistical parametric synthesis can be stated as follows.

$\hat{\lambda}=\arg \max_{\lambda} \{p(O|W,\lambda) \}$

Here $\lambda$ is the set of model parameters, $O$ is the training data, and $W$ is the set of word sequences corresponding to $O$. This is the objective of the training stage; to synthesize speech, the trained model is then used in a second optimization stage, whose objective is formulated as follows.

$\hat{o}=\arg \max_{o} \{p(o|w,\hat{\lambda}) \}$

Here $o$ is the utterance to be synthesized, $w$ is the word sequence to be spoken in $o$, and $\hat{\lambda}$ denotes the model parameters learned in the training stage.

As with any two methods for the same task, each has advantages and disadvantages relative to the other. The following are the strengths and weaknesses of statistical parametric (SP) synthesis compared with the unit-selection approach. Some capabilities are simply not available in unit selection, and these count as advantages of the statistical parametric approach. The first is the ability to transform voice characteristics, speaking styles, and emotions. Such transformations include adaptation (mimicking a voice), interpolation (mixing different voices), and eigenvoices (representing individual voices). Another advantage is coverage of the acoustic space, which is not confined to the samples in the training database. Multilingual support has also been listed as an advantage. Other strong points of SP synthesis are a smaller footprint, greater robustness, joint optimization of text analysis and waveform generation, fewer tuning parameters, and separate control of spectrum, excitation, and duration features.

The disadvantages of SP approaches are the noise that the vocoder introduces into the generated signal, the limited accuracy of acoustic modelling, and over-smoothing. The structured HMM representation used in context adaptive training has two sets of parameters, and because of the context-adaptive transforms the decision-tree clustering process has to be changed. Factorized decision trees address this requirement: two independent decision trees are constructed, one for weak contexts and one for normal contexts, and the two are then combined.

### The idea of the main paper

Besides the text to be converted to speech, other factors affect the training of a speech synthesis system. These factors are known as contexts. Common types of context include: neighboring phones; the position of phones, syllables, words, and phrases with respect to higher-level units; the number of phones, syllables, words, and phrases with respect to higher-level units; syllable stress and accent status; linguistic roles; and the intended emotional state of the generated speech. The traditional approach to handling these contexts is to train a distinct hidden Markov model for every possible combination of contexts. This clearly requires a huge training set that covers all possible context combinations. Such training sets are hard to acquire and usually not available. The problem is exacerbated with richer contexts, where even more samples are required to learn the patterns.

Top-down decision-tree-based context clustering is commonly used to deal with this problem. However, weak contexts, such as word-level emphasis in neutral speech, still suffer from a lack of training instances, and because they have less influence on the likelihood they are poorly represented in the resulting tree.
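To make the remark about likelihood concrete, the sketch below computes the log-likelihood gain that a top-down clustering procedure would obtain from splitting a cluster with a yes/no context question, assuming each cluster is modeled by a single one-dimensional Gaussian fitted by maximum likelihood. It is an illustration only: the function names, the variance floor, and the toy data are assumptions and are not taken from the paper.

```python
import math
from typing import List, Sequence


def gaussian_loglik(xs: Sequence[float], var_floor: float = 1e-6) -> float:
    """Log-likelihood of xs under a single Gaussian with ML mean and variance.

    With the ML estimates this reduces to -n/2 * (log(2*pi*var) + 1);
    the variance floor only guards against degenerate clusters.
    """
    n = len(xs)
    mean = sum(xs) / n
    var = max(sum((x - mean) ** 2 for x in xs) / n, var_floor)
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)


def split_gain(xs: Sequence[float], answers: Sequence[bool]) -> float:
    """Gain in log-likelihood from splitting xs by a yes/no context question."""
    yes: List[float] = [x for x, a in zip(xs, answers) if a]
    no: List[float] = [x for x, a in zip(xs, answers) if not a]
    if not yes or not no:
        return 0.0  # the question does not actually split the data
    return gaussian_loglik(yes) + gaussian_loglik(no) - gaussian_loglik(xs)


# Toy data (hypothetical): one acoustic feature observed in eight frames.
xs = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
# A "normal" context question that separates the two groups of frames.
strong_question = [False, False, False, False, True, True, True, True]
# A "weak" context question whose answer barely changes the feature.
weak_question = [True, False, True, False, True, False, True, False]

strong_gain = split_gain(xs, strong_question)
weak_gain = split_gain(xs, weak_question)
print(strong_gain, weak_gain)
```

In this toy setting the weak question yields a far smaller gain than the strong one, so a greedy procedure that always picks the question with the largest gain would ask it late, if at all.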

One approach to this problem is to split decision-tree construction into two stages, one for the weak-context questions and one for the remaining context questions. However, this does not work well, because the training data become fragmented and too little data remains in the clustering stage. Context-adaptive training is proposed in this work as a better solution to the problem.

To make the training procedure context-adaptive, a set of transforms is defined alongside the context-dependent HMMs. In this structure, the HMMs model the variability of speech, and the transforms adapt the HMMs to different types of context. Assuming that each state is modeled by a single Gaussian distribution, we have:

$\hat{\Lambda}_{r_c} = F_{r_t}(\Lambda_{r_c})~\textrm{s.t.}~r_c\subseteq r_t$

Here $r_c$ is the state cluster of interest, $\Lambda_{r_c}$ denotes the Gaussian parameters of that state cluster, and $\hat{\Lambda}_{r_c}$ denotes the adapted Gaussian parameters. $F_{r_t}(\cdot)$ is the transform associated with the regression base class $r_t$. As the atomic cluster for adaptation, $r_c$ must be a subset of its regression base class $r_t$.

Contexts are grouped into normal and weak contexts, based on how strong and how common they are. Word-level emphasis is the weak context considered in this work. Two decision trees are constructed, one for each of these two groups. Let $r_p$ be the cluster given by the normal-context tree and $r_e$ the cluster given by the emphasis-context tree. Then we have:

$\hat{\Lambda}_{r_c} = F_{r_e}(\Lambda_{r_p})~\textrm{s.t.}~r_c=r_p\cap r_e$

Three approaches to the transforms are considered in this work: maximum likelihood linear regression (MLLR), constrained maximum likelihood linear regression (CMLLR), and cluster adaptive training (CAT). Each of these approaches can be used to determine the adapted parameters of the states. In other words, the choice of approach determines the form of the transform function $F_{r_e}(\cdot)$.
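As a rough illustration of the factorized form $\hat{\Lambda}_{r_c} = F_{r_e}(\Lambda_{r_p})$, the sketch below uses an MLLR-style affine mean transform, $\hat{\mu} = A\mu + b$, as $F_{r_e}$. Each atomic cluster $r_c = r_p \cap r_e$ is represented by a pair of leaf names, one from each tree. Everything concrete here is an assumption made for the example: the leaf names, the two-dimensional means, the transform values, and the choice to adapt only the means. How the transforms are estimated in the paper is not shown.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def mllr_mean(A: Matrix, b: Vector, mu: Vector) -> Vector:
    """Apply an affine mean transform: mu_hat = A @ mu + b."""
    return [sum(a * m for a, m in zip(row, mu)) + bi for row, bi in zip(A, b)]


def adapted_means(
    base_means: Dict[str, Vector],
    transforms: Dict[str, Tuple[Matrix, Vector]],
) -> Dict[Tuple[str, str], Vector]:
    """Combine the two trees: one adapted mean per (r_p, r_e) intersection."""
    return {
        (r_p, r_e): mllr_mean(A, b, mu)
        for r_p, mu in base_means.items()
        for r_e, (A, b) in transforms.items()
    }


# Leaves of the normal-context tree (hypothetical), each with a Gaussian mean.
base_means = {"p1": [1.0, 2.0], "p2": [0.0, -1.0]}
# Leaves of the emphasis tree (hypothetical), each with its own transform.
transforms = {
    "neutral": ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),   # identity transform
    "emphasis": ([[1.1, 0.0], [0.0, 1.0]], [0.5, 0.0]),
}

adapted = adapted_means(base_means, transforms)
for cluster, mean in sorted(adapted.items()):
    print(cluster, mean)
```

With 2 normal-context leaves and 2 emphasis leaves, only 2 means and 2 transforms are stored, yet 4 distinct adapted Gaussians are available. This sharing is what lets the weak context be modeled without fragmenting the training data.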

<references />