Deep Neural Networks for Acoustic Modeling in Speech Recognition
Introduction
Classical speech recognition systems use hidden Markov models (HMMs) to model temporal variation and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits an acoustic observation. In these systems the speech signal is represented by a series of Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) extracted from overlapping short windows of the raw signal. Although GMMs are quite flexible and fairly easy to train with the Expectation Maximization (EM) algorithm, they are inefficient at modelling data that lie on or near a nonlinear manifold, as speech data do. Deep neural networks (DNNs) do not suffer from this shortcoming, so they can learn much better models than GMMs. Over the past few years, advances in machine learning and computer hardware have made training DNNs practical, which makes it possible to replace GMMs with DNNs in speech recognition systems. DNNs have been shown to outperform GMMs on both small- and large-vocabulary speech recognition tasks.
Training Deep Neural Networks
Interfacing a DNN with an HMM
An HMM requires the observation likelihoods [math]\displaystyle{ p(AcousticInput|HMMstate) }[/math] to run the forward-backward algorithm or to compute a Viterbi alignment. DNNs output the posteriors [math]\displaystyle{ p(HMMstate|AcousticInput) }[/math], which can be converted to scaled likelihoods by dividing them by [math]\displaystyle{ p(HMMstate) }[/math], where [math]\displaystyle{ p(HMMstate) }[/math] is the frequency of each HMM state in the training data. The remaining scale factor [math]\displaystyle{ p(AcousticInput) }[/math] is the same for every state, so it does not affect the alignment. The conversion from posteriors to likelihoods matters most when the training labels are highly unbalanced.
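The conversion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `scaled_log_likelihoods`, the three-state setup, and all numbers are hypothetical, and every state is assumed to occur at least once in the training alignment. The computation is done in the log domain, where the division becomes a subtraction.

```python
import math

def scaled_log_likelihoods(log_posteriors, state_counts):
    """Turn per-frame log p(state | input) into log p(input | state) + const
    by subtracting the log prior of each state (its training frequency)."""
    total = sum(state_counts)
    # Assumption: every count is positive, so the log is defined.
    log_priors = [math.log(c / total) for c in state_counts]
    return [
        [lp - prior for lp, prior in zip(frame, log_priors)]
        for frame in log_posteriors
    ]

# Hypothetical toy data: 3 HMM states with highly unbalanced frequencies.
state_counts = [800, 150, 50]
posteriors = [
    [0.70, 0.20, 0.10],  # frame 1
    [0.50, 0.40, 0.10],  # frame 2
]
log_post = [[math.log(p) for p in frame] for frame in posteriors]
scaled = scaled_log_likelihoods(log_post, state_counts)

# Index of the best-scoring state per frame, before and after the conversion.
best_by_posterior = [max(range(3), key=lambda s: f[s]) for f in log_post]
best_by_likelihood = [max(range(3), key=lambda s: f[s]) for f in scaled]
print(best_by_posterior, best_by_likelihood)
```

With these toy numbers the frequent state wins on the raw posteriors in both frames, while the rarer states win once the priors are divided out. This is why the conversion matters for unbalanced labels.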
Phonetic Classification and Recognition on TIMIT
TIMIT is an acoustic-phonetic continuous speech corpus that has been widely used as a benchmark data set for speech recognition systems. On this task, DNN-HMM systems outperformed the classical GMM-HMM systems. The first successful attempt at building a DNN-HMM speech recognition system was published in 2009 by Mohamed et al.<ref name=firstDBN> A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009. </ref>, who reported a significant improvement in accuracy over the state-of-the-art systems at that time. It was found that the structure of the DNN (i.e. the number of hidden layers and the number of hidden units per layer) has little effect on the accuracy, which made it possible to focus on tuning the other metaparameters of the DNN. Details of the learning rates, stopping criteria, momentum, L2 weight penalties, minibatch size, pretraining, and fine-tuning can be found in <ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22, Jan. 2012.</ref>.
Using Filter-Bank Features
MFCC features are commonly used in GMM-HMM systems because they are approximately uncorrelated, which avoids the need for full-covariance GMMs. The price is that some of the acoustic information is lost when MFCCs are computed. DNNs, on the other hand, can work with correlated features, which opened the door to using filter-bank features. Using filter-bank features with DNNs was found to reduce the phone error rate by 1.7% absolute <ref name=tuning_fb_DBN></ref>.
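The information loss can be illustrated with a small sketch. It relies on the standard construction of MFCCs (not spelled out in this summary) as a discrete cosine transform (DCT) of the log filter-bank energies, truncated to the first few coefficients. The helper names `dct2` and `idct2`, the eight-channel filter bank, and the numbers are hypothetical; real front ends typically use more channels.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct2(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out

# Hypothetical log filter-bank energies for one frame (8 channels).
log_fbank = [2.1, 2.8, 3.5, 3.0, 2.2, 1.9, 2.6, 1.4]

cepstrum = dct2(log_fbank)
num_kept = 4  # keep only the first few cepstral coefficients, as MFCCs do
truncated = cepstrum[:num_kept] + [0.0] * (len(cepstrum) - num_kept)

full_error = max(abs(a - b) for a, b in zip(idct2(cepstrum), log_fbank))
truncation_error = max(abs(a - b) for a, b in zip(idct2(truncated), log_fbank))
print(full_error, truncation_error)
```

Inverting the full cepstrum recovers the filter-bank energies up to rounding, while inverting the truncated one does not. A model that consumes the filter-bank values directly sees the detail that truncation discards.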
Fine-Tuning DNNs To Optimize Mutual Information
In the experiments mentioned earlier in this section, the systems were tuned to optimize the per-frame cross entropy, or equivalently the log posterior probability [math]\displaystyle{ p(l_t|v_t) }[/math], where [math]\displaystyle{ l_t }[/math] is the label at time [math]\displaystyle{ t }[/math] and [math]\displaystyle{ v_t }[/math] is the feature vector at the same time step. The transition probabilities and the language models were tuned independently using the HMM framework. The DNN can instead be tuned to optimize the sequence-level conditional probability [math]\displaystyle{ p(l_{1:T}|v_{1:T}) }[/math]. This is done for the softmax layer only, keeping the parameters of the hidden layers fixed so that the hidden activations [math]\displaystyle{ h }[/math] act as fixed features:
[math]\displaystyle{ p(l_{1:T}|v_{1:T}) = p(l_{1:T}|h_{1:T}) = \frac{\exp(\sum_{t=1}^T\gamma_{ij} \phi_{ij}(l_{t-1},l_t) + \sum_{t=1}^T\sum_{d=1}^D \lambda_{l_t,d} h_{td})}{Z(h_{1:T})} }[/math]
where [math]\displaystyle{ \phi_{ij}(l_{t-1},l_t) }[/math] is the transition feature, which takes the value one if [math]\displaystyle{ l_{t-1} = i }[/math] and [math]\displaystyle{ l_{t} = j }[/math] and zero otherwise, [math]\displaystyle{ \gamma_{ij} }[/math] is the parameter associated with the transition feature, and [math]\displaystyle{ \lambda }[/math] are the weights of the softmax layer. [math]\displaystyle{ \gamma }[/math] and [math]\displaystyle{ \lambda }[/math] are tuned using gradient descent. Experiments show that fine-tuning DNNs to optimize the mutual information improved the accuracy by 5% relative on the TIMIT data set <ref name=finetuningDNN> A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp. 2846–2849. </ref>.
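As a rough illustration of the sequence-level probability defined above, the following sketch evaluates its logarithm for a toy problem, computing the normalizer Z with a forward recursion over label sequences. It is a minimal sketch, not the authors' implementation. The function names, the two-label setup, and all parameter values are hypothetical. Since the label before the first frame is not defined in the text, the sketch assumes that no transition term is applied at the first frame.

```python
import math
from itertools import product

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def frame_scores(h, lam):
    """emit[t][j] = sum_d lam[j][d] * h[t][d], the softmax-layer term."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in lam] for frame in h]

def sequence_log_prob(labels, h, gamma, lam):
    """log p(l_{1:T} | h_{1:T}) for one label sequence."""
    emit = frame_scores(h, lam)
    n = len(lam)
    # Numerator: transition terms gamma[i][j] plus per-frame label terms.
    # Assumption: the first frame has no transition term.
    score = emit[0][labels[0]]
    for t in range(1, len(labels)):
        score += gamma[labels[t - 1]][labels[t]] + emit[t][labels[t]]
    # Denominator: log Z(h_{1:T}) by a forward recursion over all sequences.
    alpha = list(emit[0])
    for t in range(1, len(h)):
        alpha = [
            emit[t][j] + logsumexp([alpha[i] + gamma[i][j] for i in range(n)])
            for j in range(n)
        ]
    return score - logsumexp(alpha)

# Hypothetical toy problem: T = 3 frames, D = 2 hidden units, 2 labels.
h = [[0.2, 0.9], [0.7, 0.1], [0.5, 0.5]]   # top-layer hidden activations
lam = [[1.0, -0.5], [-0.3, 0.8]]           # softmax weights, one row per label
gamma = [[0.4, -0.2], [0.1, 0.3]]          # transition parameters gamma[i][j]

# Sanity check: the probabilities of all 2**3 label sequences sum to one.
total = sum(
    math.exp(sequence_log_prob(seq, h, gamma, lam))
    for seq in product(range(2), repeat=3)
)
print(total)
```

The forward recursion keeps the cost of the normalizer linear in the number of frames, rather than exponential as in the brute-force sum used for the check. In actual training, the gradient of this log probability with respect to the transition and softmax parameters would drive the gradient-descent updates.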
Convolutional DNNs for Phone Classification and Recognition
DNNs and GMMs
As the following table shows, monophone DNNs can outperform the best triphone GMM-HMM by 1.7% absolute. Several factors may explain this success:
- DNNs are an instance of a product of experts, in which each parameter is constrained by a large amount of the data, while GMMs are a sum of experts, in which each parameter applies to only a small amount of the data.
- A GMM assumes that each data point is generated by a single component, which makes it inefficient at modelling multiple simultaneous events. DNNs are flexible enough to model multiple simultaneous events.
- GMMs are restricted to uncorrelated features, while DNNs can work with correlated features. This allows DNNs to use correlated features such as filter-bank outputs, and it also allows them to analyse a larger window of the signal at each time step.
Method | PER (phone error rate) |
---|---|
CD-HMM <ref name=cdhmm>Y. Hifny and S. Renals, “Speech recognition using augmented conditional random fields,” IEEE Trans. Audio Speech Lang. Processing, vol. 17, no. 2, pp. 354–365, 2009.</ref> | 27.3% |
Augmented Conditional Random Fields <ref name=cdhmm></ref> | 26.6% |
Randomly Initialized Recurrent Neural Nets <ref name=rirnn>A. Robinson, “An application of recurrent nets to phone probability estimation,” IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.</ref> | 26.1% |
Bayesian Triphone GMM-HMM <ref name=btgmmhmm>J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone models,” in Proc. ICASSP, 1998, pp. 409–412.</ref> | 25.6% |
Monophone HTMs <ref name=mhtms>L. Deng and D. Yu, “Use of differential cepstra as acoustic features in hidden trajectory modelling for phonetic recognition,” in Proc. ICASSP, 2007, pp. 445–448.</ref> | 24.8% |
Heterogeneous Classifiers <ref name=hclass>A. Halberstadt and J. Glass, “Heterogeneous measurements and multiple classifiers for speech recognition,” in Proc. ICSLP, 1998.</ref> | 24.4% |
Monophone Randomly Initialized DNNs (Six Layers) <ref name=tuning_fb_DBN></ref> | 23.4% |
Monophone DBN-DNNs (Six Layers) <ref name=tuning_fb_DBN></ref> | 22.4% |
Monophone DBN-DNNs with MMI Training <ref name=finetuningDNN></ref> | 22.1% |
Triphone GMM-HMMs DT w/ BMMI <ref name=tgmmhmmbmmi>T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky, “Exemplar-based sparse representation features: From TIMIT to LVCSR,” IEEE Trans. Audio Speech Lang. Processing, vol. 19, no. 8, pp. 2598–2613, Nov. 2011.</ref> | 21.7% |
Monophone DBN-DNNs on FBank (Eight Layers) <ref name=tuning_fb_DBN></ref> | 20.7% |
Monophone MCRBM-DBN-DNNs on FBank (Five Layers) <ref name=mmcrbmdbndnn>G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, “Phone recognition with the mean-covariance restricted Boltzmann machine,” in Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2010, pp. 469–477.</ref> | 20.5% |
Monophone Convolutional DNNs on FBank (Three Layers) <ref name=cdnnfb>O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition,” in Proc. ICASSP, 2012, pp. 4277–4280.</ref> | 20.0% |
DNN for Large-Vocabulary Speech Recognition
Bing-Voice-Search Speech Recognition Task
Switchboard Speech Recognition Task
Google Voice Input Speech Recognition Task
YouTube Speech Recognition Task
English Broadcast News Speech Recognition Task
Alternative Pretraining Methods for DNNs
Conclusions and Discussions
References
<references />