Deep Neural Networks for Acoustic Modeling in Speech Recognition

From statwiki
Revision as of 14:51, 3 November 2015 by Arashwan (talk | contribs)

Introduction

Classical speech recognition systems use hidden Markov models (HMMs) to model temporal variation and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits an observation taken from a small window of the speech signal. In these systems the speech signal is represented by a series of Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) extracted from overlapping short windows of the raw signal. Although GMMs are quite flexible and fairly easy to train with the Expectation Maximization (EM) algorithm, they are statistically inefficient at modelling data that lie on or near a nonlinear manifold, which is the case for speech data. Deep neural networks (DNNs) do not suffer from this shortcoming, so they can learn much better models than GMMs. Over the past few years, advances in machine learning and computer hardware have made it practical to train DNNs, and therefore to replace GMMs with DNNs in speech recognition systems. DNNs have been shown to outperform GMMs on both small- and large-vocabulary speech recognition tasks.

Training Deep Neural Networks

Interfacing a DNN with an HMM

An HMM requires the likelihoods of the observations [math]\displaystyle{ p(AcousticInput|HMMstate) }[/math] to run the forward-backward algorithm or to compute a Viterbi alignment. A DNN outputs the posteriors [math]\displaystyle{ p(HMMstate|AcousticInput) }[/math], which can be converted to scaled likelihoods by dividing them by [math]\displaystyle{ p(HMMstate) }[/math], where [math]\displaystyle{ p(HMMstate) }[/math] is the relative frequency of each HMM state in the training data. By Bayes' rule, the result differs from the true likelihood only by the factor [math]\displaystyle{ p(AcousticInput) }[/math], which is the same for every state and therefore does not affect the alignment. The conversion from posteriors to likelihoods is important when the training labels are highly unbalanced.
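The conversion described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the implementation used in the paper: the function name <code>posteriors_to_scaled_log_likelihoods</code>, the <code>floor</code> parameter and the toy numbers are all hypothetical. It works in the log domain, where the division by [math]\displaystyle{ p(HMMstate) }[/math] becomes a subtraction.

```python
import numpy as np

def posteriors_to_scaled_log_likelihoods(log_posteriors, state_counts, floor=1e-10):
    """Convert DNN log posteriors log p(state | input) into scaled
    log likelihoods log p(input | state) + const.

    log_posteriors: array of shape (num_frames, num_states)
    state_counts:   array of shape (num_states,), how often each HMM state
                    occurs in the training alignment (hypothetical input)
    floor:          assumed safeguard against log(0) for unseen states
    """
    counts = np.asarray(state_counts, dtype=np.float64)
    priors = counts / counts.sum()                  # p(state) as relative frequency
    log_priors = np.log(np.maximum(priors, floor))
    # Dividing by the prior is a subtraction in the log domain.
    return np.asarray(log_posteriors, dtype=np.float64) - log_priors

# Toy example with 2 frames and 3 states; state 0 dominates the training data.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.5, 0.4, 0.1]])
state_counts = np.array([800, 150, 50])

scaled = posteriors_to_scaled_log_likelihoods(np.log(posteriors), state_counts)
print(np.exp(scaled))            # posterior / prior for each frame and state
print(posteriors.argmax(axis=1)) # best state per frame before the conversion
print(scaled.argmax(axis=1))     # best state per frame after the conversion
```

In this toy case the frequent state 0 has the highest posterior in both frames, but after dividing by the priors a rarer state scores highest, which illustrates why the conversion matters when the labels are unbalanced.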

Phonetic Classification and Recognition on TIMIT

TIMIT is an acoustic-phonetic continuous speech corpus that has been widely used as a benchmark data set for speech recognition systems. On this benchmark, DNN-HMM systems have outperformed classical GMM-HMM systems. The first successful attempt to build a DNN-HMM speech recognition system was published in 2009 by Mohamed et al.<ref name=firstDBN> A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009. </ref>

Convolutional DNNs for Phone Classification and Recognition

DNNs and GMMs

DNN for Large-Vocabulary Speech Recognition

Bing-Voice-Search Speech Recognition Task

Switchboard Speech Recognition Task

Google Voice Input Speech Recognition Task

YouTube Speech Recognition Task

English Broadcast News Speech Recognition Task

Alternative Pretraining Methods for DNNs

Conclusions and Discussions

References

<references />