Contributions on Context Adaptive Training with Factorized Decision Trees for HMM-Based Speech Synthesis

Speech synthesis vs. speech recognition

As mentioned in the original paper, speech synthesis requires a much larger and more complex set of contexts than speech recognition in order to achieve high-quality synthesised speech. Examples of such contexts are the following (a sketch of how they might be represented appears after the list):

  • Identity of the phones neighbouring the centre phone. The two phones to the left and the two phones to the right of the centre phone are usually considered as phonetic neighbouring contexts
  • Position of phones, syllables, words and phrases within higher-level units
  • Number of phones, syllables, words and phrases within higher-level units
  • Syllable stress and accent status
  • Linguistic role, e.g. part-of-speech tag
  • Emotion and emphasis
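
In HMM-based synthesis these contexts are typically gathered into a per-phone "full-context" description. The following is a minimal sketch of such a record in Python; all field names are illustrative assumptions, not the paper's notation or the HTS label format.

```python
# A sketch (not from the paper) of a per-phone context record covering the
# contexts listed above. Field names are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class PhoneContext:
    # Quinphone identity: two phones to the left and right of the centre phone
    ll_phone: Optional[str]   # second phone to the left (None at utterance edge)
    l_phone: Optional[str]    # phone immediately to the left
    c_phone: str              # centre phone
    r_phone: Optional[str]    # phone immediately to the right
    rr_phone: Optional[str]   # second phone to the right
    # Positional contexts within higher-level units
    phone_pos_in_syllable: int
    syllable_pos_in_word: int
    word_pos_in_phrase: int
    # Counts of lower-level units within higher-level units
    n_phones_in_syllable: int
    n_syllables_in_word: int
    n_words_in_phrase: int
    # Prosodic and linguistic contexts
    syllable_stressed: bool
    syllable_accented: bool
    pos_tag: str              # part-of-speech tag of the containing word
    emphasis: bool
    emotion: str


# Example: the centre phone "ae" in the word "cat" at the start of a phrase.
ctx = PhoneContext(
    ll_phone=None, l_phone="k", c_phone="ae", r_phone="t", rr_phone=None,
    phone_pos_in_syllable=2, syllable_pos_in_word=1, word_pos_in_phrase=1,
    n_phones_in_syllable=3, n_syllables_in_word=1, n_words_in_phrase=4,
    syllable_stressed=True, syllable_accented=True,
    pos_tag="NN", emphasis=False, emotion="neutral",
)
```

Each distinct combination of these fields defines one context-dependent model, which is why the context set for synthesis is so much larger than the simple phonetic contexts used in recognition and why decision-tree clustering of these contexts is needed.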