Contributions on Context Adaptive Training with Factorized Decision Trees for HMM-Based Speech Synthesis

Speech synthesis vs. speech recognition

As mentioned in the original paper, speech synthesis requires a much larger and more complex set of contexts than speech recognition in order to achieve high-quality synthesised speech. Examples of such contexts include the following (a sketch of how such a context set might be represented appears after this list):

  • Identity of the phones neighbouring the centre phone; typically the two phones to the left and the two to the right of the centre phone are taken as the phonetic context
  • Position of phones, syllables, words and phrases w.r.t. higher-level units
  • Number of phones, syllables, words and phrases within higher-level units
  • Syllable stress and accent status
  • Linguistic role, e.g. part-of-speech tag
  • Emotion and emphasis
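To make this context set concrete, below is a minimal Python sketch of one plausible way to bundle these features into a full-context label for a single phone. The class and field names are purely illustrative assumptions and do not correspond to the label format of any particular toolkit (e.g. HTS):

<pre>
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FullContextLabel:
    """Hypothetical container for the rich context set used in
    HMM-based speech synthesis (field names are illustrative)."""
    # Phonetic context: quinphone window [ll, l, centre, r, rr];
    # None marks a sentence boundary or unknown neighbour
    phones: List[Optional[str]]
    # Positional contexts w.r.t. higher-level units
    phone_pos_in_syllable: int
    syllable_pos_in_word: int
    word_pos_in_phrase: int
    # Counts of lower-level units within higher-level units
    num_phones_in_syllable: int
    num_syllables_in_word: int
    num_words_in_phrase: int
    # Prosodic and linguistic contexts
    syllable_stressed: bool
    syllable_accented: bool
    pos_tag: str          # part-of-speech tag, e.g. "NN"
    emphasized: bool      # emphasis marker for the current word

# Example: the centre phone "ae" in the stressed syllable of "cat"
label = FullContextLabel(
    phones=[None, "k", "ae", "t", None],
    phone_pos_in_syllable=2,
    syllable_pos_in_word=1,
    word_pos_in_phrase=3,
    num_phones_in_syllable=3,
    num_syllables_in_word=1,
    num_words_in_phrase=5,
    syllable_stressed=True,
    syllable_accented=True,
    pos_tag="NN",
    emphasized=False,
)
</pre>

In practice each such label indexes the model parameters used for that phone, which is why the context set in synthesis grows so much larger than the triphone contexts typical of speech recognition.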