Context Adaptive Training with Factorized Decision Trees for HMM-Based Speech Synthesis


Introduction

Automatic speech synthesis refers to the problem of generating human-like speech with a machine, typically a computer, from given text. It has many useful applications, for instance in smartphones and in tools for the blind. The quality of a speech synthesizer is judged by its similarity to the human voice and by how easily it can be understood. Two major approaches have been introduced for this problem: unit-selection synthesis and statistical parametric synthesis.

The basic idea behind unit-selection synthesis is that, given a large enough database of recorded speech, one can take a cut-and-sew approach: look up the right phones in the database and reuse them as needed. The most common formulation of unit-selection synthesis uses two objective functions. The first is to select units, namely phones, that are as close as possible to the intended units; this is the target cost. The second is to select units so that the joins between consecutive units sound natural; this is the concatenation cost. The overall cost of this approach can be formulated as follows.

[math]\displaystyle{ C(t_{1:n},u_{1:n})=\sum_{i=1}^{n}C^{(t)}(t_i,u_i) + \sum_{i=2}^{n}C^{(c)}(u_{i-1},u_i), }[/math]

where [math]\displaystyle{ C^{(t)}(t_i,u_i) }[/math] and [math]\displaystyle{ C^{(c)}(u_{i-1},u_i) }[/math] are the target and concatenation costs, defined as follows.

[math]\displaystyle{ \begin{align} C^{(t)}(t_i,u_i)& = \sum_{j=1}^{p}w_j^{(t)}C_j^{(t)}(t_i,u_i),\\ C^{(c)}(u_{i-1},u_i)& = \sum_{k=1}^{q}w_k^{(c)}C_k^{(c)}(u_{i-1},u_i). \end{align} }[/math]

Here [math]\displaystyle{ w^{(t)} }[/math] and [math]\displaystyle{ w^{(c)} }[/math] are the weights of the individual cost terms. The objective of unit-selection synthesis is then to minimize the overall cost.

[math]\displaystyle{ \hat{u}_{1:n}=\arg \min_{u_{1:n}}\{C(t_{1:n},u_{1:n})\} }[/math]

As one can see, the two major challenges of this approach are how to define the cost functions for the target and concatenation objectives, and how to set the weights of the individual cost terms.
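To make the optimization concrete, here is a minimal sketch of the dynamic-programming search implied by the equations above, assuming each target position comes with a list of candidate units; <code>target_cost</code> and <code>concat_cost</code> stand in for the weighted sums [math]\displaystyle{ C^{(t)} }[/math] and [math]\displaystyle{ C^{(c)} }[/math], and all names are illustrative rather than taken from any particular system.

<pre>
def select_units(targets, candidates, target_cost, concat_cost):
    """Pick one candidate unit per target position so that the summed
    target and concatenation costs are minimal (Viterbi-style DP)."""
    n = len(targets)
    # best[i][u] = minimal cost over selections for positions 0..i ending in u
    best = [{u: target_cost(targets[0], u) for u in candidates[0]}]
    back = [dict()]
    for i in range(1, n):
        best.append({})
        back.append({})
        for u in candidates[i]:
            # choose the predecessor unit that makes the join cheapest
            prev, cost = min(
                ((v, best[i - 1][v] + concat_cost(v, u)) for v in candidates[i - 1]),
                key=lambda x: x[1],
            )
            best[i][u] = cost + target_cost(targets[i], u)
            back[i][u] = prev
    # trace back the optimal unit sequence
    u = min(best[-1], key=best[-1].get)
    path = [u]
    for i in range(n - 1, 0, -1):
        u = back[i][u]
        path.append(u)
    return list(reversed(path))
</pre>

With [math]\displaystyle{ m }[/math] candidates per position this search costs [math]\displaystyle{ O(nm^2) }[/math], which is why practical unit-selection systems prune the candidate lists aggressively.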

Statistical parametric synthesis <ref>The HMM-based Speech Synthesis System, http://hts.sp.nitech.ac.jp/</ref>, on the other hand, deals with the problem in a quite different way. It is a synthesis method based on hidden Markov models (HMMs). In this approach, the frequency spectrum (vocal tract), fundamental frequency (vocal source), and duration (prosody) of speech are modeled simultaneously by HMMs, and speech waveforms are generated from the HMMs themselves based on the maximum likelihood criterion. The approach has a training phase, after which the chunks of speech in the database are no longer used directly; instead, the learned parameters capture the information in those speech samples. The objective of statistical parametric synthesis can be stated as follows.

[math]\displaystyle{ \hat{\lambda}=\arg \max_{\lambda} \{p(O|W,\lambda) \} }[/math]

where [math]\displaystyle{ \lambda }[/math] is the set of model parameters, [math]\displaystyle{ O }[/math] is the set of training data, and [math]\displaystyle{ W }[/math] is the set of word sequences corresponding to [math]\displaystyle{ O }[/math]. This is the objective of the training stage.
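Evaluating the likelihood [math]\displaystyle{ p(O|W,\lambda) }[/math] for a candidate [math]\displaystyle{ \lambda }[/math] relies on the forward algorithm. The following is a minimal log-domain sketch for a single observation sequence, assuming the per-frame log output probabilities (from the Gaussian state output distributions) have already been computed; the function name and interface are illustrative.

<pre>
import numpy as np

def log_likelihood(log_A, log_pi, log_b):
    """Forward algorithm in the log domain: returns log p(O | lambda) given
    the log transition matrix log_A (S x S), log initial state probabilities
    log_pi (S,), and per-frame log output probabilities log_b (T x S)."""
    log_alpha = log_pi + log_b[0]
    for t in range(1, len(log_b)):
        # log-sum-exp over predecessor states, then add the emission term
        log_alpha = log_b[t] + np.logaddexp.reduce(
            log_alpha[:, None] + log_A, axis=0
        )
    return np.logaddexp.reduce(log_alpha)
</pre>

Synthesizing speech with the trained model then takes one more stage of optimization, with the following objective function.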

[math]\displaystyle{ \hat{o}=\arg \max_{o} \{p(o|w,\hat{\lambda}) \} }[/math]

where [math]\displaystyle{ o }[/math] is the utterance to be synthesized, [math]\displaystyle{ w }[/math] is its word sequence, and [math]\displaystyle{ \hat{\lambda} }[/math] contains the model parameters learned in the previous stage.
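Without dynamic features, the maximizer of [math]\displaystyle{ p(o|w,\hat{\lambda}) }[/math] for a fixed state sequence is simply the sequence of state means, which sounds discontinuous at state boundaries; HMM-based synthesizers therefore include delta features, which turns generation into a weighted least-squares problem. Below is an illustrative single-dimension sketch of this maximum-likelihood parameter generation idea, not the actual implementation of any toolkit.

<pre>
import numpy as np

def generate_trajectory(mu, var, delta_window=(-0.5, 0.0, 0.5)):
    """mu, var: (T, 2) arrays of [static, delta] means and variances taken
    from the Gaussians along the chosen state sequence. Returns the static
    trajectory o* maximizing the likelihood of its stacked features."""
    T = mu.shape[0]
    # W maps a static trajectory o to its stacked [statics; deltas] features
    W = np.zeros((2 * T, T))
    W[:T, :] = np.eye(T)
    for t in range(T):
        for k, w in zip((-1, 0, 1), delta_window):
            if 0 <= t + k < T:
                W[T + t, t + k] = w
    P = np.diag(1.0 / var.T.reshape(-1))  # stacked inverse variances
    m = mu.T.reshape(-1)                  # stacked mean targets
    # Normal equations of the ML problem: (W' P W) o = W' P m
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ m)
</pre>

The delta constraints couple neighbouring frames, so the solution varies smoothly instead of jumping between state means.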

As with any two methods for the same task, each approach has its pros and cons. Compared to unit selection, statistical parametric speech synthesis offers several capabilities that unit selection lacks. The first is the ability to transform voice characteristics, speaking styles, and emotions; such transformations include adaptation (mimicking a voice), interpolation (mixing different voices), and eigenvoices (representing individual voices). Another advantage is coverage of the acoustic space, which is not confined to the samples in the training database. Multilingual support is also listed among the advantages. Further strong points of the statistical parametric approach are the smaller footprint of the synthesizer, greater robustness, joint optimization of text analysis and waveform generation, fewer tuning parameters, and separate control of spectrum, excitation, and duration.

Disadvantages of the statistical parametric approach are the artificial quality of the generated signal (so-called vocoded speech), the limited accuracy of the acoustic models, and over-smoothing of the generated speech parameters.

Context Adaptive Training with Factorized Decision Trees

Besides the text that is to be converted to speech, there are other factors that affect the training of a speech synthesis system; these factors are known as context. Common types of context include: neighbouring phones; the position of phones, syllables, words, and phrases within the higher-level units; the number of phones, syllables, words, and phrases within the higher-level units; syllable stress and accent status; linguistic roles; the intended emotional state of the generated speech; and so on. The common way of training a statistical parametric model for speech synthesis is to train an HMM for every combination of contexts. However, there may not be enough training samples available: the richer the context, the more samples are needed to learn the pattern, while most context combinations occur rarely or never in the training data.
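A back-of-the-envelope count shows the scale of the problem. The factor inventory below is hypothetical, but even these modest per-factor cardinalities multiply into far more full-context combinations than any training corpus can cover:

<pre>
# Hypothetical context factors and cardinalities (for illustration only)
factors = {
    "left phone": 45, "centre phone": 45, "right phone": 45,
    "position of phone in syllable": 7, "syllables in word": 7,
    "stress": 2, "accent": 2, "emotion": 4,
}
combinations = 1
for size in factors.values():
    combinations *= size
print(f"{combinations:,} full-context combinations")   # ~71 million
print(f"{3600 * 200:,} frames in one hour of speech")  # 5 ms shift -> 720,000
</pre>

Most combinations therefore never occur even once in the training data, so parameters must be shared across contexts.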

Top-down decision-tree-based context clustering is the standard way of dealing with this problem, but weak contexts (e.g. at the word level) still suffer from the lack of instances, as they have less influence on the likelihood and are therefore rarely selected as split questions.
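As a rough illustration of how such a tree is grown, each node greedily picks the yes/no context question with the largest likelihood gain, scoring each child with a single Gaussian fitted to the data reaching it. The sketch below operates on raw frames for clarity; real systems accumulate state occupancy statistics instead, and all names here are illustrative.

<pre>
import numpy as np

def node_log_lik(frames):
    """Log-likelihood of frames under one diagonal Gaussian fit to them:
    each dimension contributes -n/2 * (log(2*pi*var) + 1)."""
    n = frames.shape[0]
    var = frames.var(axis=0) + 1e-6  # floor to avoid log(0)
    return -0.5 * n * np.sum(np.log(2 * np.pi * var) + 1)

def best_question(frames, contexts, questions):
    """Greedy split selection: return the question whose yes/no partition
    of the node's data gives the largest total log-likelihood gain."""
    parent = node_log_lik(frames)
    best_q, best_gain = None, 0.0
    for q in questions:  # each q maps a context description to True/False
        mask = np.array([q(c) for c in contexts])
        if mask.all() or not mask.any():
            continue  # the question does not actually split this node
        gain = node_log_lik(frames[mask]) + node_log_lik(frames[~mask]) - parent
        if gain > best_gain:
            best_q, best_gain = q, gain
    return best_q, best_gain
</pre>

Because weak-context questions change the likelihood only slightly, their gains rarely win this comparison, which is exactly why weak contexts end up under-modelled.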

One approach to address this problem is to split the decision tree construction into two stages: first the weak-context questions, then the remaining context questions. However, this does not work well, because the first stage fragments the training data, leaving an insufficient amount of data for the second clustering stage. Context adaptive training is proposed in this work as a better solution to the problem at hand.
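Although the details are developed later, the rough shape of the proposal can be sketched: a primary decision tree built on the regular contexts defines canonical model parameters, while a separate, factorized tree over the weak contexts ties a set of transforms that are estimated jointly with the canonical model, in the spirit of adaptive training. As one plausible linear-transform form (an assumption for illustration, not necessarily the exact parameterization used in the paper), a weak-context cluster [math]\displaystyle{ w }[/math] would modify the canonical mean of regular-context cluster [math]\displaystyle{ r }[/math] as

[math]\displaystyle{ \mu_{(r,w)}=A_{w}\mu_{r}+b_{w}, }[/math]

where [math]\displaystyle{ A_w }[/math] and [math]\displaystyle{ b_w }[/math] are estimated from all data sharing weak-context cluster [math]\displaystyle{ w }[/math]. In this way the weak contexts pool data across the whole tree instead of fragmenting it.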