contributions on Context Adaptive Training with Factorized Decision Trees for HMM-Based Speech Synthesis


Speech synthesis vs. speech recognition

As mentioned in the original paper, speech synthesis requires a much larger and more complex set of contexts than speech recognition in order to achieve high-quality synthesised speech. Examples of such contexts include the following (a sketch of such a context record follows the list):

  • Identity of neighbouring phones: two phones to the left and two to the right of the centre phone are usually considered as phonetic neighbouring contexts
  • Position of phones, syllables, words and phrases within higher-level units
  • Number of phones, syllables, words and phrases within higher-level units
  • Syllable stress and accent status
  • Linguistic role, e.g. part-of-speech tag
  • Emotion and emphasis
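
Below is a minimal sketch, in Python, of how the context features listed above could be attached to one phone. All field names are hypothetical, not from the paper; real systems typically encode these features in full-context label strings (e.g. HTS-style labels).

<pre>
# Hypothetical context record mirroring the feature list above.
from dataclasses import dataclass

@dataclass
class PhoneContext:
    # Quinphone identity: two phones on each side of the centre phone
    ll_phone: str   # second phone to the left
    l_phone: str    # first phone to the left
    phone: str      # centre phone
    r_phone: str    # first phone to the right
    rr_phone: str   # second phone to the right
    # Positional contexts w.r.t. higher-level units
    pos_in_syllable: int
    pos_in_word: int
    pos_in_phrase: int
    # Counts of lower-level units within higher-level units
    phones_in_syllable: int
    syllables_in_word: int
    words_in_phrase: int
    # Prosodic and linguistic contexts
    stressed: bool
    accented: bool
    pos_tag: str        # part-of-speech of the containing word
    emphasised: bool    # emphasis context used in the factorized trees

# Example: the centre phone 'ae' in a stressed, emphasised word
ctx = PhoneContext('sil', 'dh', 'ae', 't', 's',
                   pos_in_syllable=1, pos_in_word=1, pos_in_phrase=2,
                   phones_in_syllable=3, syllables_in_word=1, words_in_phrase=4,
                   stressed=True, accented=True, pos_tag='DT', emphasised=True)
</pre>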

Notes

There are many factors that can affect the acoustic realisation of phones. Prior knowledge of such factors forms the questions used in the decision-tree-based state clustering procedure. Some questions are highly correlated, e.g. the phonetic broad-class questions and the syllable questions; others are largely independent, e.g. the phonetic broad-class questions and the emphasis questions mentioned in the paper.
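
The following is a minimal sketch of how such questions act in decision-tree state clustering: each question is a yes/no membership test on a context feature, and splitting a node partitions its pooled states by the answer. The question names and phone sets are illustrative assumptions, and the states are assumed to carry a context record like the one sketched earlier.

<pre>
# Hypothetical context questions: each maps a context record to True/False.
QUESTIONS = {
    # Phonetic broad-class questions (mutually correlated)
    "C-Vowel":      lambda ctx: ctx.phone in {"aa", "ae", "ah", "iy", "uw"},
    "C-Nasal":      lambda ctx: ctx.phone in {"m", "n", "ng"},
    # Syllable question, highly correlated with the broad-class ones
    "C-Stressed":   lambda ctx: ctx.stressed,
    # Emphasis question, largely independent of phonetic identity
    "C-Emphasised": lambda ctx: ctx.emphasised,
}

def split(states, question):
    """Partition a node's states by the yes/no answer to one question.

    In real clustering, the tree greedily picks the question whose split
    maximises the likelihood gain over the node's pooled statistics.
    """
    yes = [s for s in states if question(s.context)]
    no = [s for s in states if not question(s.context)]
    return yes, no
</pre>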

MLLR based approach

Let m be used instead of [math]\displaystyle{ r_c }[/math] to denote the index of the atomic state cluster. The first equation in (4) can then be rewritten as:


[math]\displaystyle{ \begin{matrix} \hat \mu_{m} = A_{r_{e}(m)}\mu_{r_{p}(m)} + b_{r_{e}(m)} = W_{r_{e}(m)}\xi_{r_{p}(m)}\\ \hat \Sigma_{m} = \Sigma_{r_{p}(m)} \end{matrix} }[/math]

where [math]\displaystyle{ W_{r_{e}(m)} = [A_{r_{e}(m)} \; b_{r_{e}(m)}] }[/math] is the extended MLLR transform associated with the emphasis cluster [math]\displaystyle{ r_e(m) }[/math], and [math]\displaystyle{ \xi_{r_{p}(m)} = [\mu_{r_{p}(m)}^{T} \; 1]^{T} }[/math] is the extended mean vector of the phonetic cluster [math]\displaystyle{ r_p(m) }[/math].
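
The following is a minimal numeric sketch of this equation: the adapted mean of atomic cluster m is an MLLR transform of its phonetic-cluster mean, computed both in the direct form and via the extended transform. The dimensions and values are illustrative assumptions only.

<pre>
import numpy as np

d = 3                                  # feature dimension (illustrative)
mu_rp = np.array([0.5, -1.2, 0.8])     # mean of the phonetic cluster r_p(m)
A_re = np.eye(d) * 1.1                 # MLLR matrix tied to emphasis cluster r_e(m)
b_re = np.array([0.1, 0.0, -0.2])      # MLLR bias tied to emphasis cluster r_e(m)

# Direct form: mu_hat_m = A mu + b
mu_hat = A_re @ mu_rp + b_re

# Extended form: mu_hat_m = W xi, with W = [A b] and xi = [mu^T, 1]^T
W_re = np.hstack([A_re, b_re[:, None]])    # d x (d+1) extended transform
xi_rp = np.append(mu_rp, 1.0)              # (d+1,) extended mean vector
assert np.allclose(mu_hat, W_re @ xi_rp)   # both forms agree

# The covariance is simply inherited from the phonetic cluster:
# Sigma_hat_m = Sigma_{r_p(m)}, with no transform applied.
</pre>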