# Introduction

Classical speech recognition systems use hidden Markov models (HMMs) to model temporal variation and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits an acoustic observation. In these systems the speech signal is represented by a series of Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) extracted from short overlapping windows of the raw signal. Although GMMs are quite flexible and fairly easy to train with the Expectation Maximization (EM) algorithm, they are inefficient at modelling data that lie close to a nonlinear manifold, which is the case for speech data. Deep Neural Networks (DNNs) do not suffer from this shortcoming, so they can learn much better models than GMMs. Over the past few years, advances in machine learning and computer hardware have made it practical to train DNNs, and therefore to replace GMMs with DNNs in speech recognition systems. DNNs have been shown to outperform GMMs in both small- and large-vocabulary speech recognition tasks.

# Training Deep Neural Networks

DNNs are feed-forward neural networks with multiple hidden layers. The last layer is a softmax layer that gives the class probabilities. The weights are learnt using the backpropagation algorithm; empirically, computing the gradient on small random mini-batches was found to be more efficient. To avoid overfitting, early stopping is used: training stops when the accuracy on a validation set starts to decrease. Pretraining is essential when the amount of training data is small. Restricted Boltzmann Machines (RBMs) are used for pretraining, except for the first layer, which uses a Gaussian-Bernoulli RBM (GRBM) because the input is real-valued.
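The early-stopping rule described above can be sketched as follows. This is a minimal illustration rather than the procedure used in the paper: the function name, the `patience` parameter, and the accuracy values are all assumptions made for the example.

```python
def early_stopping_epoch(val_accuracies, patience=1):
    """Return the index of the epoch whose weights would be kept.

    Training stops once validation accuracy has failed to improve for
    `patience` consecutive epochs (`patience` is an assumed knob).
    """
    best_epoch, best_acc, bad_epochs = 0, float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            # New best validation accuracy: remember it and reset the counter.
            best_epoch, best_acc, bad_epochs = epoch, acc, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch


# Made-up validation accuracies, one per epoch.
history = [0.61, 0.68, 0.72, 0.71, 0.74]
stop_at = early_stopping_epoch(history)
```

With `patience=1` the sketch stops at the first drop in validation accuracy, which is the literal reading of the rule; a larger value tolerates short dips.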

## Generative Pretraining

The goal is a method that uses the information in the training set to build multiple layers of nonlinear feature detectors. "Generative pretraining" was proposed for this purpose. The idea is that a good feature detector should model the structure in the input data, rather than discriminate between classes. Thus, we learn one layer of features at a time and feed the learned features to the next stage as its training data. This stacked structure can produce features that are much more useful than the raw data and can help against overfitting.

The generative model chosen can be either a directed or undirected graph, with undirected being the choice in this paper. An undirected model is chosen because inference is easy as long as each hidden layer only contains connections to other layers, and no connections to itself. A Restricted Boltzmann Machine (RBM) is chosen in this case.

### Learning Procedure for RBMs

The energy function for the RBM is given by:

$E\left(\mathbf{v}, \mathbf{h}; \mathbf{W}\right) = - \sum_{i \in visible}a_iv_i - \sum_{j \in hidden}b_j h_j - \sum_{i, j} v_i h_j w_{ij}$, where

• $\mathbf{v}$ is the vector of visible units, with components $v_i$ and associated biases $a_i$
• $\mathbf{h}$ is the vector of hidden units, with components $h_j$ and associated biases $b_j$
• $\mathbf{W}$ is the weight matrix between the visible units and hidden units, with components $w_{ij}$

Then, the joint distribution function is given by:

$p\left(\mathbf{v}, \mathbf{h}; \mathbf{W} \right) = \frac{1}{Z} \mbox{ exp}\left[-E\left(\mathbf{v},\mathbf{h};\mathbf{W}\right)\right]$

where $Z$ is a normalization factor.
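For a very small RBM, the energy and the joint distribution can be evaluated directly, which makes the role of $Z$ concrete. The sketch below assumes binary units and computes $Z$ by enumerating every state, which is only feasible for toy sizes; the function names and parameter values are illustrative.

```python
import itertools
import math


def energy(v, h, W, a, b):
    """E(v, h; W) for binary vectors v and h; W[i][j] couples v_i and h_j."""
    visible_term = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden_term = sum(b_j * h_j for b_j, h_j in zip(b, h))
    pair_term = sum(v[i] * h[j] * W[i][j]
                    for i in range(len(v)) for j in range(len(h)))
    return -visible_term - hidden_term - pair_term


def joint_probability(v, h, W, a, b):
    """p(v, h; W), with Z found by summing over all binary states."""
    n_v, n_h = len(a), len(b)
    Z = sum(math.exp(-energy(vv, hh, W, a, b))
            for vv in itertools.product([0, 1], repeat=n_v)
            for hh in itertools.product([0, 1], repeat=n_h))
    return math.exp(-energy(v, h, W, a, b)) / Z


# Toy parameters (made up): 2 visible units, 2 hidden units.
W = [[0.5, -0.2], [0.1, 0.3]]
a = [0.0, 0.1]
b = [-0.1, 0.2]
p = joint_probability([1, 0], [1, 0], W, a, b)
```

The sum defining $Z$ has $2^{n_v + n_h}$ terms, so it cannot be computed this way for realistic layer sizes.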

Using the law of total probability, we can obtain $p\left(\mathbf{v}\right) = \frac{1}{Z} \sum_{\mathbf{h}}\mbox{exp}\left[-E\left(\mathbf{v}, \mathbf{h}\right)\right]$. Now, we can obtain the derivative of the log probability of a training set with respect to a weight as: $\frac{1}{N} \sum_{n=1}^N \frac{\partial \mbox{ log } p\left(\mathbf{v}^n\right)}{\partial w_{ij}} = \lt v_ih_j\gt _{data} - \lt v_i h_j\gt _{model}$, where $\lt \gt$ denotes expectation.

We can easily obtain an unbiased sample of $\lt v_i h_j\gt _{data}$ since the conditional probabilities are as follows:

$p\left(h_j = 1 | \mathbf{v}\right) = \mbox{logistic}\left(b_j + \sum_{i} v_i w_{ij}\right)$

Obtaining an unbiased sample of $\lt v_i h_j\gt _{model}$ is much more difficult, though. Alternating Gibbs sampling is the ideal choice, but it can be slow. A faster procedure called "Contrastive Divergence" (CD) is used here instead; it is similar to Gibbs sampling but terminates after only one full step of alternating Gibbs sampling. Even though CD only crudely approximates the gradient, it seems to perform well in practice. Also, since we are only pretraining the model, additional Gibbs sampling steps are not necessary, and the randomness produced by using CD may further help prevent overfitting.
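One step of CD for a binary RBM might look like the sketch below. It uses the hidden-unit conditional given above together with its symmetric counterpart for the visible units, $p(v_i = 1 | \mathbf{h}) = \mbox{logistic}(a_i + \sum_j h_j w_{ij})$. Using hidden probabilities rather than sampled states in the statistics, the learning rate, and all names and values are assumptions of this example, not details taken from the paper.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, b):
    """p(h_j = 1 | v) for every hidden unit j."""
    return [logistic(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]


def visible_probs(h, W, a):
    """p(v_i = 1 | h) for every visible unit i."""
    return [logistic(a[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(a))]


def sample(probs, rng):
    """Draw one binary state per unit from its Bernoulli probability."""
    return [1 if rng.random() < p else 0 for p in probs]


def cd1_weight_gradient(v_data, W, a, b, rng):
    """Approximate <v_i h_j>_data - <v_i h_j>_model with one Gibbs step."""
    h_data_p = hidden_probs(v_data, W, b)            # positive phase
    h_sample = sample(h_data_p, rng)
    v_recon = sample(visible_probs(h_sample, W, a), rng)
    h_recon_p = hidden_probs(v_recon, W, b)          # negative phase
    return [[v_data[i] * h_data_p[j] - v_recon[i] * h_recon_p[j]
             for j in range(len(b))] for i in range(len(a))]


rng = random.Random(0)
W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # 3 visible x 2 hidden units
a = [0.0, 0.0, 0.0]
b = [0.0, 0.0]
grad = cd1_weight_gradient([1, 0, 1], W, a, b, rng)
learning_rate = 0.1  # assumed value
W = [[W[i][j] + learning_rate * grad[i][j] for j in range(2)]
     for i in range(3)]
```

Running more Gibbs steps before collecting the negative statistics would give a better gradient estimate at higher cost, which is the trade-off the paragraph above describes.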

# Interfacing a DNN with an HMM

The HMM requires the likelihoods of the observations, $p(AcousticInput|HMMstate)$, to run the forward-backward algorithm or to compute a Viterbi alignment. DNNs output the posteriors $p(HMMstate|AcousticInput)$, which can be converted to a scaled version of the likelihood by dividing them by $p(HMMstate)$, where $p(HMMstate)$ is the frequency of each HMM state in the training data. The scaling factor $p(AcousticInput)$ is the same for every state, so it does not affect the alignment. The conversion from posteriors to likelihoods is important when the training labels are highly unbalanced.
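The conversion can be sketched in a few lines. The state labels and posteriors below are made-up toy values, and the function names are illustrative; the example is chosen so that the state priors are unbalanced.

```python
def state_priors_from_alignment(state_labels, num_states):
    """Estimate p(HMMstate) as relative frequencies in the training labels."""
    counts = [0] * num_states
    for s in state_labels:
        counts[s] += 1
    total = len(state_labels)
    return [c / total for c in counts]


def scaled_likelihoods(posteriors, state_priors):
    """Divide p(HMMstate | AcousticInput) by p(HMMstate), state by state."""
    return [post / prior for post, prior in zip(posteriors, state_priors)]


alignment = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]          # toy frame-level labels
priors = state_priors_from_alignment(alignment, 3)  # [0.6, 0.2, 0.2]
posteriors = [0.5, 0.3, 0.2]                        # DNN output for one frame
scaled = scaled_likelihoods(posteriors, priors)
```

In this toy case state 0 has the largest posterior, but after dividing by the priors state 1 has the largest scaled likelihood, which illustrates why the conversion matters for unbalanced labels.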

# Phonetic Classification and Recognition on TIMIT

TIMIT is an acoustic-phonetic continuous speech corpus that has been widely used as a benchmark data set for speech recognition systems. On this benchmark, DNN-HMM systems outperformed the classical GMM-HMM systems. The first successful attempt at building a DNN-HMM speech recognition system was published in 2009 by Mohamed et al. <ref name=firstDBN> A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009. </ref>, who reported a significant improvement in accuracy over the state-of-the-art systems of that time. It was found that the structure of the DNN (i.e. the number of hidden layers and the number of hidden units per layer) has little effect on the accuracy, which made it possible to focus more on tuning the metaparameters of the DNN. Details of the learning rates, stopping criteria, momentum, L2 weight penalties, minibatch size, pretraining, and fine-tuning can be found in <ref name=tuning_fb_DBN>A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22, Jan. 2012.</ref>.

## Using Filter-Bank Features

MFCC features are commonly used in GMM-HMM systems because they are largely uncorrelated, which avoids the need for full-covariance GMMs. However, some acoustic information is lost when MFCCs are used. DNNs, on the other hand, can work with correlated features, which opened the way to using filter-bank features. Using filter-bank features with DNNs was found to improve the accuracy by 1.7% (absolute) <ref name=tuning_fb_DBN></ref>.
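To make the relationship between the two feature types concrete: MFCCs are conventionally obtained by taking the logarithm of Mel filter-bank energies, applying a discrete cosine transform (DCT), and keeping only the low-order coefficients. The sketch below shows that last stage on made-up energies; it uses an unnormalised DCT-II and omits the filter bank itself, so it illustrates the idea rather than reproducing a full MFCC front end.

```python
import math


def dct_ii(x):
    """Unnormalised DCT-II of a sequence."""
    n = len(x)
    return [sum(x[m] * math.cos(math.pi / n * (m + 0.5) * k) for m in range(n))
            for k in range(n)]


def cepstral_coefficients(filterbank_energies, num_coeffs):
    """Log-compress filter-bank energies, apply a DCT, and truncate."""
    log_energies = [math.log(e) for e in filterbank_energies]
    return dct_ii(log_energies)[:num_coeffs]


energies = [1.0, 2.0, 4.0, 8.0, 4.0, 2.0, 1.0, 1.0]  # toy filter-bank outputs
mfcc_like = cepstral_coefficients(energies, num_coeffs=4)
```

The truncation to `num_coeffs` values is where information is discarded; a DNN fed the log filter-bank energies directly sees all eight values in this toy case.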

## Fine-Tuning DNNs To Optimize Mutual Information

In the experiments mentioned earlier in this section, the systems were tuned to optimize the per-frame cross entropy, or the log posterior probability $p(l_t|v_t)$, where $l_t$ is the label at time $t$ and $v_t$ is the feature vector at the same time step. The transition probabilities and the language models were tuned independently using the HMM framework. The DNN can instead be tuned to optimize the conditional probability of the whole label sequence, $p(l_{1:T}|v_{1:T})$; this is done for the softmax layer only, keeping the parameters of the hidden layers $h$ fixed.

$p(l_{1:T}|v_{1:T}) = p(l_{1:T}|h_{1:T}) = \frac{\exp(\sum_{t=1}^T\gamma_{ij} \phi_{ij}(l_{t-1},l_t) + \sum_{t=1}^T\sum_{d=1}^D \lambda_{l_t,d} h_{td})}{Z(h_{1:T})}$

Here $\phi_{ij}(l_{t-1},l_t)$ is the transition feature, which takes the value one if $l_{t-1} = i$ and $l_{t} = j$ and zero otherwise, $\gamma_{ij}$ is the parameter associated with that transition feature, and $\lambda$ are the weights of the softmax layer. The parameters $\gamma$ and $\lambda$ are tuned using gradient descent, and the experiments show that fine-tuning DNNs to optimize the mutual information improved the accuracy by 5% relative on the TIMIT data set <ref name=finetuningDNN> A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp. 2846–2849. </ref>.
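For a toy problem the sequence-level probability can be evaluated by brute force, which shows how the transition and softmax terms combine. In this sketch $Z(h_{1:T})$ is computed by enumerating every label sequence, and transitions are counted from the second frame onward because $l_0$ is not defined in the text; both choices, and all parameter values, are assumptions made for illustration.

```python
import itertools
import math


def sequence_score(labels, h, gamma, lam):
    """Exponent of the numerator for one label sequence.

    gamma[i][j] weights the transition i -> j, lam[l][d] is the softmax
    weight from hidden unit d to label l, and h[t][d] is the top hidden
    layer at frame t.
    """
    transition = sum(gamma[labels[t - 1]][labels[t]]
                     for t in range(1, len(labels)))
    emission = sum(lam[labels[t]][d] * h[t][d]
                   for t in range(len(labels)) for d in range(len(h[t])))
    return transition + emission


def sequence_probability(labels, h, gamma, lam):
    """p(l_{1:T} | h_{1:T}), normalised over all possible label sequences."""
    num_labels = len(lam)
    Z = sum(math.exp(sequence_score(list(seq), h, gamma, lam))
            for seq in itertools.product(range(num_labels), repeat=len(h)))
    return math.exp(sequence_score(labels, h, gamma, lam)) / Z


# Toy setup (made up): 2 labels, 2 hidden units, 3 frames.
gamma = [[0.5, -0.5], [-0.5, 0.5]]
lam = [[1.0, 0.0], [0.0, 1.0]]
h = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
p_smooth = sequence_probability([0, 0, 1], h, gamma, lam)
p_jumpy = sequence_probability([1, 0, 1], h, gamma, lam)
```

A practical system would compute $Z(h_{1:T})$ with a forward recursion rather than by enumeration, since the number of label sequences grows exponentially with $T$.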

## Convolutional DNNs for Phone Classification and Recognition

Convolutional DNNs were introduced in 2009 and were applied to various audio tasks, including the TIMIT data set <ref> H. Lee, P. Pham, Y. Largman, and A. Ng, “Unsupervised feature learning for audio classification using convolutional deep belief networks,” in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2009, pp. 1096–1104. </ref>. In that work, convolutional DNNs were applied along the temporal dimension in order to extract the same features at different times. Since temporal variation is already handled by the HMM, Abdel-Hamid et al. proposed applying convolutional DNNs along the frequency dimension instead <ref name=convDNN> O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition,” in Proc. ICASSP, 2012, pp. 4277–4280. </ref>. Weight sharing and max-pooling were applied only across nearby frequencies, because acoustic features at widely separated frequencies are very different. They achieved a phone error rate of 20% on the TIMIT data set, the lowest reported at that time.
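Weight sharing and max-pooling along the frequency axis can be sketched on a single frame of filter-bank energies. The kernel, pool size, and energies below are made-up values, and the sketch shares one kernel across the whole frequency range for simplicity, whereas the text above describes sharing only across nearby frequencies.

```python
def conv_along_frequency(frame, kernel):
    """Slide one shared kernel over adjacent frequency bands (no padding).

    As is usual for neural networks, the kernel is not flipped.
    """
    width = len(kernel)
    return [sum(kernel[k] * frame[start + k] for k in range(width))
            for start in range(len(frame) - width + 1)]


def max_pool(values, pool_size):
    """Keep the largest activation in each non-overlapping group of bands."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values), pool_size)]


frame = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 2.0, 0.0]  # toy filter-bank energies
kernel = [-1.0, 2.0, -1.0]                        # responds to local peaks
activations = conv_along_frequency(frame, kernel)
pooled = max_pool(activations, 2)
```

Pooling over neighbouring bands means a spectral peak that shifts slightly in frequency, for example between speakers, still produces the same pooled activation.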

## DNNs and GMMs

As the following table shows, monophone DNNs can outperform the best triphone GMM-HMM by 1.7%. Possible reasons for this success include the following:

1. DNNs are an instance of a product of experts, in which each parameter is constrained by a large amount of the data, while GMMs are a sum of experts, in which each parameter applies to only a small amount of the data.
2. A GMM assumes that each data point is generated by a single component, which makes it inefficient at modelling multiple simultaneous events. DNNs are flexible enough to model multiple simultaneous events.
3. GMMs are restricted to uncorrelated features, while DNNs can work with correlated features. This allows DNNs to use correlated features such as filter-banks, and also to analyse a larger window of the signal at each time step.

Comparisons among the reported speaker-independent (SI) phonetic error rate (PER) results on the TIMIT core test set with 192 sentences.

| Method | PER |
| --- | --- |
| CD-HMM <ref name=cdhmm> Y. Hifny and S. Renals, “Speech recognition using augmented conditional random fields,” IEEE Trans. Audio Speech Lang. Processing, vol. 17, no. 2, pp. 354–365, 2009. </ref> | 27.3% |
| Augmented Conditional Random Fields <ref name=cdhmm></ref> | 26.6% |
| Randomly Initialized Recurrent Neural Nets <ref name=rirnn> A. Robinson, “An application of recurrent nets to phone probability estimation,” IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 298–305, 1994. </ref> | 26.1% |
| Bayesian Triphone GMM-HMM <ref name=btgmmhmm> J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone models,” in Proc. ICASSP, 1998, pp. 409–412. </ref> | 25.6% |
| Monophone HTMs <ref name=mhtms> L. Deng and D. Yu, “Use of differential cepstra as acoustic features in hidden trajectory modelling for phonetic recognition,” in Proc. ICASSP, 2007, pp. 445–448. </ref> | 24.8% |
| Heterogeneous Classifiers <ref name=hclass> A. Halberstadt and J. Glass, “Heterogeneous measurements and multiple classifiers for speech recognition,” in Proc. ICSLP, 1998. </ref> | 24.4% |
| Monophone Randomly Initialized DNNs (Six Layers) <ref name=tuning_fb_DBN></ref> | 23.4% |
| Monophone DBN-DNNs (Six Layers) <ref name=tuning_fb_DBN></ref> | 22.4% |
| Monophone DBN-DNNs with MMI Training <ref name=finetuningDNN></ref> | 22.1% |
| Triphone GMM-HMMs DT W/BMMI <ref name=tgmmhmmbmmi> T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky, “Exemplar-based sparse representation features: From TIMIT to LVCSR,” IEEE Trans. Audio Speech Lang. Processing, vol. 19, no. 8, pp. 2598–2613, Nov. 2011. </ref> | 21.7% |
| Monophone DBN-DNNs on FBank (Eight Layers) <ref name=tuning_fb_DBN></ref> | 20.7% |
| Monophone MCRBM-DBN-DNNs on FBank (Five Layers) <ref name=mmcrbmdbndnn> G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, “Phone recognition with the mean-covariance restricted Boltzmann machine,” in Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2010, pp. 469–477. </ref> | 20.5% |
| Monophone Convolutional DNNs on FBank (Three Layers) <ref name=cdnnfb> O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition,” in Proc. ICASSP, 2012, pp. 4277–4280. </ref> | 20.0% |

# DNN for Large-Vocabulary Speech Recognition

The success of the DNN-HMM system on the TIMIT data set opened the door to trying the same technique on larger data sets. It was found that using context-dependent HMM states is essential for large data sets. DNN-HMM systems were tested on five different large tasks and outperformed GMM-HMM systems on every task.

The Bing Voice Search data set consists of 24 hrs of speech data with different sources of acoustic variation, such as noise, music, side-speech, accents, sloppy pronunciation, interruptions, and mobile phone differences. The DNN-HMM system was based on the DNN that worked well for TIMIT. It contained 5 hidden layers of 2048 units each. A window of 11 frames was used to classify the middle frame into the corresponding HMM state, and tri-phone states were used instead of monophones. The DNN-HMM system achieved a sentence accuracy of 69.6%, compared to 63.8% for the GMM-HMM system. The accuracy of the DNN-HMM system was further improved by increasing the amount of training data to 48 hrs, achieving a sentence accuracy of 71.7% <ref name=bing> G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012. </ref>.
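Building the 11-frame input window amounts to stacking each frame with its five neighbours on either side. The sketch below pads the edges of the utterance by repeating the first and last frame; that padding rule, the function name, and the toy features are assumptions, since the text does not say how edges are handled.

```python
def context_windows(frames, context=5):
    """Stack each frame with `context` neighbours on either side.

    Edges are padded by repeating the first or last frame (an assumption
    of this sketch).
    """
    last = len(frames) - 1
    windows = []
    for t in range(len(frames)):
        stacked = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), last)  # clamp to a valid frame
            stacked.extend(frames[idx])
        windows.append(stacked)
    return windows


# 20 toy frames with 2 features each (made-up values).
frames = [[float(t), float(t) + 0.5] for t in range(20)]
windows = context_windows(frames, context=5)
```

Each window is the DNN input for one time step, and the training target is the HMM state of the middle frame.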

The Switchboard data set contains over 300 hrs of speech data, and the test set is 6.3 hrs of speech. The same DNN-HMM system developed for the Bing data set was applied to this data set. The DNN used contains 7 hidden layers of 2048 units each. A trigram language model was used, trained on a 2000 hr speech corpus. This data set is publicly available, which allows rigorous comparisons among different techniques. As shown in the table below, the DNN reduced the word error rate from 27.4% to 18.5% <ref name=switchboard> F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Proc. Interspeech, 2011, pp. 437–440. </ref>. The DNN system performed as well as a GMM system that combines several speaker-adaptive multipass systems and uses nearly 7 times as much acoustic data for training as the DNN system.

Comparing five different DBN-DNN acoustic models with two strong GMM-HMM baseline systems that are discriminatively trained. "40 Mix" means a mixture of 40 Gaussians per HMM state. Word error rates in % are shown for two separate test sets, HUB5'00-SWB and RT03S-FSH.

| Technique | HUB5'00-SWB | RT03S-FSH |
| --- | --- | --- |
| GMM, 40 MIX DT 309H SI | 23.6 | 27.4 |
| NN 1 HIDDEN-LAYER x 4,634 UNITS | 26.0 | 29.4 |
| + 2 x 5 NEIGHBORING FRAMES | 22.4 | 25.7 |
| DBN-DNN 7 HIDDEN LAYERS x 2,048 UNITS | 17.1 | 19.6 |
| + UPDATED STATE ALIGNMENT | 16.4 | 18.6 |
| + SPARSIFICATION | 16.1 | 18.5 |
| GMM 72 MIX DT 2000H SA | 17.1 | 18.6 |

The Google Voice Input data set contains search engine queries, short messages, events, and user actions from mobile devices. A well-trained GMM-HMM system was used to align 5870 hrs of speech for the DNN-HMM system. The DNN used has 4 hidden layers with 2560 hidden units per layer. A window of 11 frames was used to classify the middle frame, and 40 log filter-bank features were used to represent each frame. The network was fine-tuned to maximize the mutual information. The DNN was sparsified by setting weights that are below a certain threshold to zero. To further improve the accuracy, the DNN-HMM and GMM-HMM models were combined using the segmental conditional random field framework <ref name=scrf> N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained deep neural networks to large vocabulary speech recognition,” submitted for publication. </ref>. Using the DNN-HMM reduced the word error rate by 23% relative, achieving a word error rate of 12.3%. Combining both models (GMM and DNN) further reduced the word error rate to 11.8%.
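The sparsification step can be sketched as a simple thresholding pass over a weight matrix. Comparing the magnitude of each weight against the threshold, as well as the threshold and weight values themselves, are assumptions made for this example.

```python
def sparsify(weights, threshold):
    """Set to zero every weight whose magnitude is below `threshold`."""
    return [[w if abs(w) >= threshold else 0.0 for w in row]
            for row in weights]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total


W = [[0.8, -0.02, 0.3], [0.01, -0.6, 0.04]]  # toy weight matrix
W_sparse = sparsify(W, threshold=0.05)
fraction_zero = sparsity(W_sparse)
```

Zeroed weights can be skipped at run time, so a higher threshold trades some accuracy for a smaller and faster model.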

The goal of this task is to transcribe YouTube data. This type of data does not have a strong language model to constrain the interpretation of the speech information, so a strong acoustic model is essential. A well-trained GMM-HMM system was used to align 1400 hrs of data for the DNN model. A decision tree clustering algorithm was used to cluster the HMM states into 17552 context-dependent tri-phone states. Feature space maximum likelihood linear regression (fMLLR)-transformed features were used. Due to the large number of HMM states, only 4 hidden layers were used, to save computational resources for the large softmax layer. After training, sequence-level fine-tuning was performed. Also, the DNN-HMM and GMM-HMM models were combined to improve the accuracy. The DNN-HMM system reduced the word error rate by 4.7% absolute, fine-tuning reduced it by a further 0.5%, and combining both models gave another reduction of 0.9%.

A GMM-HMM baseline system was used to align 50 hrs of speech from the 1996 and 1997 English Broadcast News Speech Corpora. SAT and DT features were used to train the DNN. The network consists of six hidden layers of 1024 units each. The number of HMM states is 2220 triphone states. A window of 9 frames was used to classify the middle frame. Fine-tuning was done after training to optimize the mutual information. The DNN-HMM system achieved a word error rate of 17.5%, compared to 18.8% for the best GMM-HMM system <ref name=broadcastDNN> T. N. Sainath, B. Kingsbury, and B. Ramabhadran, “Improvements in using deep belief networks for large vocabulary continuous speech recognition,” Speech and Language Algorithm Group, IBM, Yorktown Heights, NY, Tech. Rep. UTML TR 2010-003, Feb. 2011. </ref>.

## Summary for the Main Results for DNN Acoustic Models on Large Data Sets

The following table compares the performance of DNN-HMMs and GMM-HMMs. The DNN-HMMs achieve lower error rates in every row, including those where the GMM-HMM was trained on more data.

A comparison of the percentage WERs using DNN-HMMs and GMM-HMMs on five different large vocabulary tasks.

| Task | Hours of training data | DNN-HMM | GMM-HMM | GMM-HMM using larger training data |
| --- | --- | --- | --- | --- |
| SWITCHBOARD (TEST SET 1) | 309 | 18.5 | 27.4 | 18.6 (2000h) |
| SWITCHBOARD (TEST SET 2) | 309 | 16.1 | 23.6 | 17.1 (2000h) |
| ENGLISH BROADCAST NEWS | 50 | 17.5 | 18.8 | |
| BING VOICE SEARCH (sentence error rates) | 24 | 30.4 | 36.2 | |
| GOOGLE VOICE INPUT | 5870 | 12.3 | | 16.0 (>> 5870h) |
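The absolute error rates in the table can also be read as relative reductions, which makes tasks with different baseline difficulty easier to compare. The sketch below applies the usual definition, (baseline - new) / baseline, to three rows where both systems were trained on the same data; the function and variable names are illustrative.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative error-rate reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# (GMM-HMM WER, DNN-HMM WER) pairs taken from the table above.
results = {
    "Switchboard (test set 1)": (27.4, 18.5),
    "Switchboard (test set 2)": (23.6, 16.1),
    "English Broadcast News": (18.8, 17.5),
}
reductions = {task: round(relative_reduction(gmm, dnn), 1)
              for task, (gmm, dnn) in results.items()}
```

On these rows the relative reduction ranges from roughly 7% on English Broadcast News to over 30% on Switchboard.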

# Alternative Pretraining Methods for DNNs

Pretraining DNNs was reported to improve the results on TIMIT and on large data set tasks. The pretraining was done generatively, using a stack of RBMs. An alternative is discriminative pretraining. In discriminative pretraining, we start from a shallow network whose weights are trained discriminatively. After that, another hidden layer is added between the last hidden layer and the softmax layer, the weights of the newly added layer are again learned discriminatively, and so on. Finally, backpropagation fine-tuning is applied to the whole network. This way of training was reported to achieve the same results as generative pretraining <ref name=discrDNN> F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent deep neural networks for conversational speech transcription,” in Proc. IEEE ASRU, 2011, pp. 24–29. </ref>.
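The control flow of discriminative pretraining can be sketched as a loop that grows the network one hidden layer at a time. Only the layer bookkeeping is shown: `train_fn` is a hypothetical callback standing in for a real backpropagation routine, and the layer sizes are made-up values.

```python
def discriminative_pretraining(input_dim, hidden_dim, num_classes,
                               num_hidden_layers, train_fn):
    """Grow a network one hidden layer at a time.

    `train_fn(layer_sizes, stage)` is a hypothetical callback; it is called
    after each layer is added and once more for whole-network fine-tuning.
    """
    layer_sizes = [input_dim, hidden_dim, num_classes]  # shallow start
    train_fn(layer_sizes, "pretrain")
    while len(layer_sizes) - 2 < num_hidden_layers:
        # Insert the new hidden layer just below the softmax layer.
        layer_sizes.insert(len(layer_sizes) - 1, hidden_dim)
        train_fn(layer_sizes, "pretrain")
    train_fn(layer_sizes, "fine-tune")
    return layer_sizes


log = []  # records each training call instead of training anything
final = discriminative_pretraining(
    input_dim=440, hidden_dim=2048, num_classes=100, num_hidden_layers=3,
    train_fn=lambda sizes, stage: log.append((stage, list(sizes))))
```

Whether each pretraining stage updates only the new layer or all layers is left to `train_fn` here, since the description above does not settle it.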

# Conclusions and Discussions

GMMs have been widely used for acoustic modelling because they are easy to train and quite flexible. Since 2009, DNNs have been proposed as a replacement for GMMs, and they have proven superior to GMMs in many speech recognition tasks. Pretraining DNNs is essential when the amount of data is small; it reduces both overfitting and convergence time. Fine-tuning the network to optimize the mutual information can improve the results. The authors think that much can still be done in pretraining, fine-tuning, and using different types of hidden units to further increase the performance of DNNs.

This paper summarizes recent research by three research groups in the area of speech recognition. They have shown that DNNs are superior to GMMs in both small and large data set speech recognition tasks. The authors claim that the reason for this superiority is that speech data lie on a manifold, but I am not sure whether there is any scientific or empirical proof for this claim. The paper is an excellent source for anyone interested in deep learning for speech recognition. The authors assume that readers are familiar with the HMM framework for speech recognition; otherwise the paper will be difficult to follow. Also, the DNN area is moving fast, and the information in the paper is not up to date.

<references />