Graves et al., Speech recognition with deep recurrent neural networks



This document is a summary of the paper Speech recognition with deep recurrent neural networks by A. Graves, A.-R. Mohamed, and G. Hinton, which appeared in the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). The first and third authors are Artificial Neural Network (ANN) researchers, while Mohamed works in the field of automatic speech recognition.

The paper presents the application of bidirectional multilayer Long Short-Term Memory (LSTM) ANNs with 1–5 layers to phoneme recognition on the TIMIT acoustic-phonetic corpus, which is the standard benchmark in the field of acoustic recognition, extending the previous work by Mohamed and Hinton on this topic using Deep Belief Networks. The TIMIT corpus contains audio recordings of 6300 sentences spoken by 630 (American) English speakers from 8 regions with distinct dialects. Each recording is accompanied by manually labelled transcriptions of the phonemes in the audio clip, together with timestamp information. The empirical classification accuracies reported in the literature before the publication of this paper are shown in the timeline below (note that in this figure, the accuracy metric is 100% - PER, where PER is the phoneme classification error rate).

The deep LSTM networks presented with 3 or more layers obtain phoneme classification error rates of 19.6% or less, with one model obtaining 17.7%, which was the best result reported in the literature at the time, outperforming the previous record of 20.7% achieved by Mohamed et al. Furthermore, the error rate decreases monotonically with LSTM network depth for 1–5 layers. While the bidirectional LSTM model performs well on the TIMIT corpus, any potential advantage of bidirectional over unidirectional LSTM network models cannot be determined from this paper, since the performance comparison is across different numbers of iterations of the optimization algorithm used to train the models, and multiple trials for statistical validity were not performed.

Timeline of percentage phoneme recognition accuracy achieved on the core TIMIT corpus, from Lopes and Perdigao, 2011.


Neural networks have been trained for speech recognition problems, but usually in combination with hidden Markov models. The authors of this paper argue that, since speech is an inherently dynamic process, RNNs should be the ideal choice for such a problem. There have been attempts to train RNNs for speech recognition <ref>A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks” in ICML, Pittsburgh, USA, 2006.</ref> <ref> A. Graves, Supervised sequence labelling with recurrent neural networks, vol. 385, Springer, 2012.</ref> <ref> A. Graves, “Sequence transduction with recurrent neural networks” in ICML Representation Learning Workshop, 2012.</ref> and RNNs with LSTM for recognizing cursive handwriting <ref> A. Graves, S. Fernandez, M. Liwicki, H. Bunke, and J. Schmidhuber, “Unconstrained Online Handwriting Recognition with Recurrent Neural Networks” in NIPS, 2008.</ref>, but neither has made an impact on speech recognition. The authors drew inspiration from Convolutional Neural Networks, where multiple layers are stacked on top of each other, and applied the same idea of depth to RNNs with LSTM.

However, instead of using a conventional RNN, which only considers previous context, a Bidirectional RNN <ref> M. Schuster and K. K. Paliwal, “Bidirectional Recurrent Neural Networks” IEEE Transactions on Signal Processing, vol. 45, pp. 2673–2681, 1997.</ref> was used to consider both forward and backward contexts. This is in part because the authors saw no reason not to exploit future context, since whole speech utterances are transcribed at once. Additionally, a BRNN has the added benefit of being able to consider the entire forward and backward context, not just some predefined window of forward and backward contexts.

Deep RNN models considered by Graves et al.

In this paper, Graves et al. use deep LSTM network models. We briefly review recurrent neural networks, which form the basis of the more complicated LSTM network that has composite [math]\mathcal{H}[/math] functions instead of sigmoids and additional parameter vectors associated with the state of each neuron. Finally, a description of bidirectional ANNs, which are used throughout the numerical experiments, is given.

Recurrent Neural Networks

Recall that a standard 1-layer recurrent neural network (RNN) computes the hidden vector sequence [math]{\boldsymbol h} = ({{\mathbf{h}}}_1,\ldots,{{\mathbf{h}}}_T)[/math] and output vector sequence [math]{{\boldsymbol {{\mathbf{y}}}}}= ({{\mathbf{y}}}_1,\ldots,{{\mathbf{y}}}_T)[/math] from an input vector sequence [math]{{\boldsymbol {{\mathbf{x}}}}}= ({{\mathbf{x}}}_1,\ldots,{{\mathbf{x}}}_T)[/math] through the following equation where the index is from [math]t=1[/math] to [math]T[/math]:

[math]{{\mathbf{h}}}_t = \begin{cases} {\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{\mathbf{b_{h}}}}}\right) &\quad t = 1\\ {\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{{\mathbf{W}}}_{h h}}}{{\mathbf{h}}}_{t-1} + {{{\mathbf{b_{h}}}}}\right) &\quad \text{else} \end{cases}[/math]


[math]{{\mathbf{y}}}_t = {{{{\mathbf{W}}}_{h y}}}{{\mathbf{h}}}_t + {{{\mathbf{b_{y}}}}}.[/math]

The [math]{{\mathbf{W}}}[/math] terms are the parameter matrices with subscripts denoting the layer location (e.g. [math]{{{{\mathbf{W}}}_{x h}}}[/math] is the input-hidden weight matrix), and the offset [math]b[/math] terms are bias vectors with appropriate subscripts (e.g. [math]{{{\mathbf{b_{h}}}}}[/math] is the hidden bias vector). The function [math]{\mathcal{H}}[/math] is an elementwise vector function with a range of [math][0,1][/math] for each component in the hidden layer.

This paper considers multilayer RNN architectures, with the same hidden layer function used for all [math]N[/math] layers. In this model, the hidden vector in the [math]n[/math]th layer, [math]{\boldsymbol h}^n[/math], is generated by the rule

[math]{{\mathbf{h}}}^n_t = {\mathcal{H}}\left({{\mathbf{W}}}_{h^{n-1}h^{n}} {{\mathbf{h}}}^{n-1}_t + {{\mathbf{W}}}_{h^{n}h^{n}} {{\mathbf{h}}}^n_{t-1} + {{{\mathbf{b_{h}}}}}^n \right),[/math]

where [math]{\boldsymbol h}^0 = {{\boldsymbol {{\mathbf{x}}}}}[/math]. The final network output vector in the [math]t[/math]th step of the output sequence, [math]{{\mathbf{y}}}_t[/math], is

[math]{{\mathbf{y}}}_t = {{{\mathbf{W}}}_{h^N y}} {{\mathbf{h}}}^N_t + {{{\mathbf{b_{y}}}}}.[/math]

This is pictured in the figure below for an arbitrary layer and time step.

Fig 1. Schematic of a Recurrent Neural Network at an arbitrary layer and time step.
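
As a rough illustration of the stacked recursion above, the following sketch runs a small multilayer RNN forward over a toy sequence. It is only a sketch under stated assumptions: the logistic sigmoid is chosen for [math]{\mathcal{H}}[/math], the layer sizes and the helper names (matvec, rand_matrix, deep_rnn_forward) are hypothetical, and the initial hidden state is taken to be zero so that the first step reduces to the [math]t = 1[/math] case.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def rand_matrix(rows, cols, rng):
    # Small uniform random weights (toy initialisation).
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def deep_rnn_forward(xs, layers, W_hy, b_y):
    """Run a stack of recurrent layers over the sequence xs.

    layers: one (W_in, W_rec, b) triple per layer; layer 0 reads the input x.
    Returns the output sequence y_1, ..., y_T.
    """
    seq = xs                                  # h^0 = x
    for W_in, W_rec, b in layers:
        h_prev = [0.0] * len(b)               # zero state before t = 1
        out = []
        for v in seq:
            a = [p + q + r for p, q, r in
                 zip(matvec(W_in, v), matvec(W_rec, h_prev), b)]
            h_prev = [sigmoid(z) for z in a]  # assumed choice of H
            out.append(h_prev)
        seq = out                             # h^n feeds layer n + 1
    return [[p + q for p, q in zip(matvec(W_hy, h), b_y)] for h in seq]

# Toy sizes (hypothetical): 3 inputs, two hidden layers of 4 units, 2 outputs.
rng = random.Random(0)
sizes = [3, 4, 4]
layers = [(rand_matrix(n, m, rng), rand_matrix(n, n, rng), [0.0] * n)
          for m, n in zip(sizes[:-1], sizes[1:])]
W_hy, b_y = rand_matrix(2, sizes[-1], rng), [0.0, 0.0]
xs = [[rng.uniform(-1, 1) for _ in range(sizes[0])] for _ in range(5)]
ys = deep_rnn_forward(xs, layers, W_hy, b_y)
print(len(ys), len(ys[0]))
```

Each layer processes the whole sequence before passing its hidden states upward, which mirrors the rule that layer [math]n[/math] at step [math]t[/math] depends on layer [math]n-1[/math] at step [math]t[/math] and on itself at step [math]t-1[/math].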

Long Short-term Memory Architecture

Graves et al. consider a Long Short-Term Memory (LSTM) architecture from Gers et al. This model replaces [math]\mathcal{H}(\cdot)[/math] by a composite function that introduces additional parameter matrices, and hence a higher dimensional model. Each neuron in the network (i.e. each row of a parameter matrix [math]{{\mathbf{W}}}[/math]) has an associated state, collected in the vector [math]{{\mathbf{c}}}_t[/math] at step [math]t[/math], which is a function of the previous [math]{{\mathbf{c}}}_{t-1}[/math], the input [math]{{\mathbf{x}}}_t[/math] at step [math]t[/math], and the previous step’s hidden state [math]{{\mathbf{h}}}_{t-1}[/math] as

[math]{{\mathbf{c}}}_t = {{\mathbf{f}}}_t \circ {{\mathbf{c}}}_{t-1} + {{\mathbf{i}}}_t \circ \tanh \left({{{\mathbf{W}}}_{x {{\mathbf{c}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{c}}}}} {{\mathbf{h}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{c}}}\right)[/math]

where [math]\circ[/math] denotes the Hadamard product (elementwise vector multiplication) and the vector [math]{{\mathbf{i}}}_t[/math] denotes the input gate vector of the cell, which is generated by the rule

[math]{{\mathbf{i}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{i}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{i}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}} {{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{i}}}\right),[/math]

and [math]{{\mathbf{f}}}_t[/math] is the forget gate vector, which is given by

[math]{{\mathbf{f}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{f}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{f}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{f}}}}} {{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{f}}}\right)[/math]

Each [math]{{\mathbf{W}}}[/math] matrix and bias vector [math]{{\mathbf{b}}}[/math] is a free parameter in the model and must be trained. Since [math]{{\mathbf{f}}}_t[/math] multiplies the previous state [math]{{\mathbf{c}}}_{t-1}[/math] in a Hadamard product with each element in the range [math][0,1][/math], it can be understood to reduce or dampen the effect of [math]{{\mathbf{c}}}_{t-1}[/math] relative to the new input [math]{{\mathbf{i}}}_t[/math]. The final hidden output state is then

[math]{{\mathbf{h}}}_t = \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{o}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{o}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{o}}}}} {{\mathbf{c}}}_{t} + {{\mathbf{b}}}_{{\mathbf{o}}}\right)\circ \tanh({{\mathbf{c}}}_t)[/math]

In all of these equations, [math]\sigma[/math] denotes the logistic sigmoid function. Note furthermore that [math]{{\mathbf{i}}}[/math], [math]{{\mathbf{f}}}[/math], [math]{{\mathbf{o}}}[/math] (the output gate, i.e. the sigmoid factor in the expression for [math]{{\mathbf{h}}}_t[/math]) and [math]{{\mathbf{c}}}[/math] are all of the same dimension as the hidden vector [math]h[/math]. In addition, the weight matrices from the cell to gate vectors (e.g. [math]{{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}[/math]) are diagonal, such that each of these parameter matrices is merely a scaling matrix.
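
The LSTM equations above can be sketched as a single time-step function. This is a minimal illustration rather than the authors' implementation: the parameter dictionary P, its key names and the helper functions are hypothetical, and the diagonal cell-to-gate matrices are stored as plain vectors that multiply the cell state elementwise.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def preact(W_x, W_h, w_peep, b, x, h_prev, c):
    """W_x x + W_h h_prev + w_peep * c + b, with an elementwise (diagonal) peephole."""
    wx, wh = matvec(W_x, x), matvec(W_h, h_prev)
    return [a + r + p * s + t for a, r, p, s, t in zip(wx, wh, w_peep, c, b)]

def lstm_step(x, h_prev, c_prev, P):
    """One LSTM step; returns the new hidden vector h_t and cell state c_t."""
    zeros = [0.0] * len(c_prev)
    i = [sigmoid(z) for z in preact(P["W_xi"], P["W_hi"], P["w_ci"], P["b_i"], x, h_prev, c_prev)]
    f = [sigmoid(z) for z in preact(P["W_xf"], P["W_hf"], P["w_cf"], P["b_f"], x, h_prev, c_prev)]
    # The candidate cell input has no peephole term.
    g = [math.tanh(z) for z in preact(P["W_xc"], P["W_hc"], zeros, P["b_c"], x, h_prev, zeros)]
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    # The output gate looks at the *new* cell state c_t.
    o = [sigmoid(z) for z in preact(P["W_xo"], P["W_ho"], P["w_co"], P["b_o"], x, h_prev, c)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

def zero_params(n_in, n_hid):
    """All-zero parameters, handy for checking the gating arithmetic by hand."""
    P = {}
    for g in "ifco":
        P["W_x" + g] = [[0.0] * n_in for _ in range(n_hid)]
        P["W_h" + g] = [[0.0] * n_hid for _ in range(n_hid)]
        P["b_" + g] = [0.0] * n_hid
    for g in "ifo":
        P["w_c" + g] = [0.0] * n_hid
    return P

# With zero parameters every gate equals sigmoid(0) = 0.5 and the candidate is tanh(0) = 0.
h, c = lstm_step([1.0, 2.0], [0.0], [1.0], zero_params(2, 1))
print(h, c)
```

With all parameters set to zero, the forget gate halves the previous cell state and the input contributes nothing, so a previous state of 1 becomes 0.5, which makes the dampening role of [math]{{\mathbf{f}}}_t[/math] easy to verify by hand.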

Bidirectional RNNs

A bidirectional RNN adds another layer of complexity by computing 2 hidden vectors per layer. Neglecting the [math]n[/math] superscripts for the layer index, the forward hidden vector is determined through the conventional recursion as

[math]{\overrightarrow{{{\mathbf{h}}}}}_t = {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overrightarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}{\overrightarrow{{{\mathbf{h}}}}}}} {\overrightarrow{{{\mathbf{h}}}}}_{t-1} + {{\mathbf{b_{{\overrightarrow{{{\mathbf{h}}}}}}}}} \right),[/math]

while the backward hidden state is determined recursively from the reversed sequence [math]({{\mathbf{x}}}_T,\ldots,{{\mathbf{x}}}_1)[/math] as

[math]{\overleftarrow{{{\mathbf{h}}}}}_t = {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overleftarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}{\overleftarrow{{{\mathbf{h}}}}}}} {\overleftarrow{{{\mathbf{h}}}}}_{t+1} + {{\mathbf{b_{{\overleftarrow{{{\mathbf{h}}}}}}}}}\right).[/math]

The final output for the single layer state is then an affine transformation of [math]{\overrightarrow{{{\mathbf{h}}}}}_t[/math] and [math]{\overleftarrow{{{\mathbf{h}}}}}_t[/math] as [math]{{\mathbf{y}}}_t = {{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}y}} {\overrightarrow{{{\mathbf{h}}}}}_t + {{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}y}} {\overleftarrow{{{\mathbf{h}}}}}_t + {{{\mathbf{b_{y}}}}}.[/math]

The combination of LSTM with bidirectional dependence has been previously used by Graves and Schmidhuber, and is extended to multilayer networks in this paper. The motivation for this is to use dependencies on both prior and posterior vectors in the sequence to predict a given output at any time step. In other words, a forward and backward context is used.
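
To make the forward and backward recursions concrete, here is a small sketch of a bidirectional pass. It assumes the per-direction update is supplied as a function, and the running-sum step used in the example is a deliberately trivial stand-in for [math]{\mathcal{H}}[/math] (it is not the paper's hidden function); the function name bidirectional_pass is hypothetical.

```python
def bidirectional_pass(xs, step_fwd, step_bwd, h0_fwd, h0_bwd):
    """Return one (forward state, backward state) pair per time step."""
    fwd, h = [], h0_fwd
    for x in xs:                      # t = 1, ..., T
        h = step_fwd(x, h)
        fwd.append(h)
    bwd, h = [], h0_bwd
    for x in reversed(xs):            # t = T, ..., 1
        h = step_bwd(x, h)
        bwd.append(h)
    bwd.reverse()                     # re-align so bwd[t] pairs with xs[t]
    return list(zip(fwd, bwd))

# Toy step: a running sum, so each state summarises everything seen so far.
running_sum = lambda x, h: x + h
states = bidirectional_pass([1, 2, 3], running_sum, running_sum, 0, 0)
print(states)
```

In this toy case the forward state at each step holds the sum of all inputs up to that step and the backward state holds the sum of all inputs from that step onward, so together they cover the whole sequence, which is the sense in which both contexts are available at every output.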

Network Training for Phoneme Recognition

This section describes the phoneme classification experiments performed with the TIMIT corpus. An overview of the input timeseries audio data preprocessing into frequency domain vectors is given, and the optimization techniques are described.

Frequency Domain Processing

Recall that for a real, periodic signal [math]{f(t)}[/math], the Fourier transform

[math]{F(\omega)}= \int_{-\infty}^{\infty}e^{-i\omega t} {f(t)}dt[/math]

can be represented for discrete samples [math]{f_0, f_1, \cdots f_{N-1}}[/math] as

[math]F_k = \sum_{n=0}^{N-1} f_n \cdot e^{-i \frac{2 \pi k n}{N}}[/math],

where [math]{F_k}[/math] are the discrete coefficients of the (amplitude) spectral distribution of the signal [math]f[/math] in the frequency domain. This is a particularly powerful representation of audio data, since the modulation produced by the tongue and lips when shifting position while the larynx vibrates induces the frequency changes that make up a phonetic alphabet. A particular example is the spectrogram below. A spectrogram is a heat map representation of a matrix of data in which each pixel has intensity proportional to the magnitude of the matrix entry at that location. In this case, the matrix is a windowed Fourier transform of an audio signal spectrum such that the frequency of the audio signal as a function of time can be observed. This spectrogram shows the frequencies of my voice while singing the first bar of Hey Jude; the bright pixels below 200 Hz show the base note, while the fainter lines at integer multiples of the base notes show resonant harmonics.

Spectrogram of the first bar of Hey Jude, showing the frequency amplitude coefficients changing over time as the intensity of the pixels in the heat map.
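
As a rough illustration of the discrete transform, the sketch below evaluates the DFT directly, using the standard convention with [math]2\pi[/math] in the exponent, on a hypothetical test signal (a 1 kHz cosine sampled at 16 kHz). The signal, the window length of 80 samples and the variable names are assumptions made for the example. In practice a fast Fourier transform would be used, but the direct sum keeps the correspondence with the formula visible.

```python
import cmath
import math

def dft(samples):
    """Direct O(N^2) evaluation of F_k = sum_n f_n * exp(-2*pi*i*k*n/N)."""
    N = len(samples)
    return [sum(f * cmath.exp(-2j * math.pi * k * n / N)
                for n, f in enumerate(samples))
            for k in range(N)]

# Hypothetical test signal: a 1 kHz cosine sampled at 16 kHz, 80 samples long.
fs, N = 16000, 80
signal = [math.cos(2 * math.pi * 1000 * n / fs) for n in range(N)]

magnitudes = [abs(F) for F in dft(signal)]
# For a real signal only the first N/2 bins carry independent information.
peak_bin = max(range(N // 2), key=lambda k: magnitudes[k])
peak_hz = peak_bin * fs / N
print(peak_bin, peak_hz)
```

Bin [math]k[/math] corresponds to the frequency [math]k f_s / N[/math], so the cosine shows up as a single peak at the bin matching 1 kHz. One column of a spectrogram is the set of magnitudes from one such window.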

Intuition behind Input Feature Vector

The Input Feature Vectors are Mel Frequency Cepstrum Coefficients (MFCCs).

Mel: a scale that relates perceived pitch to frequency. The formula to convert a frequency [math]f[/math] in Hz to mels is [math]m = 2595 \log_{10}(1 + f/700)[/math].

Cepstrum: the (inverse) Fourier transform (FT) of the log spectrum of a signal. The Mel filterbank is simply a set of triangular filters in the frequency domain. The first filter is very narrow and gives an indication of how much energy exists near 0 Hz. As the frequencies get higher the filters get wider, since we become less concerned about small variations. The Mel filterbank therefore gives an idea of how much energy is present in each frequency region. This is found by multiplying the power spectrum of each frame with each filter of the Mel filterbank and summing the result.
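
The mel conversion and the widening of the filters with frequency can be sketched as follows. The inverse formula follows directly from the conversion formula; the helper mel_spaced_centres, the 0 to 8000 Hz range and the choice of 40 filters are assumptions made for the example, not values taken from the paper.

```python
import math

def hz_to_mel(f_hz):
    """m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spaced_centres(f_low, f_high, n_filters):
    """Centre frequencies (Hz) of filters spaced equally on the mel scale."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (n_filters + 1)
    return [mel_to_hz(m_low + step * (k + 1)) for k in range(n_filters)]

# Hypothetical bank: 40 filters between 0 Hz and 8 kHz.
centres = mel_spaced_centres(0.0, 8000.0, 40)
gaps = [b - a for a, b in zip(centres, centres[1:])]
print(round(centres[0], 1), round(centres[-1], 1))
```

Because equal steps in mel correspond to ever larger steps in Hz, the gaps between neighbouring centre frequencies grow with frequency, which is why the triangular filters are narrow near 0 Hz and wide at the top of the band.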


The logarithmic scale in the MFCC is motivated by human hearing. We do not hear loudness on a linear scale. Generally, to double the perceived volume of a sound we need to put 8 times as much energy into it. This means that large variations in energy may not sound all that different if the sound is loud to begin with. The logarithm also enables cepstral mean subtraction, which is a channel normalization technique: a multiplicative channel effect on the spectrum becomes an additive offset after the logarithm, and can be removed by subtracting the mean of the cepstral signal. It is usually applied when the channel varies too much.

Then the discrete cosine transform (DCT) is taken, the coefficients of which are called mel frequency cepstral coefficients.

Input Vector Format

For each audio waveform in the TIMIT dataset, the Fourier coefficients were computed with a sliding Discrete Fourier Transform (DFT). The window duration used was 10 ms, corresponding to [math]n_s = 80[/math] samples per DFT since each waveform in the corpus was digitally registered with a sampling frequency of [math]f_s = 16[/math] kHz, producing 40 unique coefficients at each timestep [math]t[/math], [math]\{c^{[t]}_k\}_{k=1}^{40}[/math]. In addition, the first and second time derivatives of the coefficients between adjacent DFT windows were computed (the methodology is unspecified; most likely this was performed with a numerical central difference technique). Thus, the input vector to the network at step [math]t[/math] was the concatenated vector

[math]{{\mathbf{x}}}_t = [c^{[t]}_1, \frac{d}{dt}c^{[t]}_1, \frac{d^2}{dt^2}c^{[t]}_1, c_2^{[t]}, \frac{d}{dt}c^{[t]}_2, \frac{d^2}{dt^2}c^{[t]}_2 \ldots]^T.[/math]

Finally, an additional preprocessing step was performed: each input vector was normalized such that the dataset had zero mean and unit variance.
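
A possible way to assemble such input vectors is sketched below. Since the derivative methodology is unspecified, the central difference with repeated edge frames is an assumption, as is the reading of the normalization as a per-component standardisation over all frames; the function names are hypothetical.

```python
def central_difference(frames):
    """Time derivative of each coefficient via central differences.

    frames: list of equal-length coefficient vectors, one per time step.
    The first and last frames are repeated so the output keeps the same length.
    """
    padded = [frames[0]] + list(frames) + [frames[-1]]
    return [[(nxt - prv) / 2.0 for prv, nxt in zip(padded[t], padded[t + 2])]
            for t in range(len(frames))]

def build_inputs(frames):
    """Interleave coefficients with first and second derivatives: [c_1, c_1', c_1'', c_2, ...]."""
    d1 = central_difference(frames)
    d2 = central_difference(d1)
    return [[v for triple in zip(c, a, b) for v in triple]
            for c, a, b in zip(frames, d1, d2)]

def standardise(vectors):
    """Shift and scale every component to zero mean and unit variance over the data."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(dim)]
    # Constant components have zero spread; leave them unscaled.
    std = [(sum((v[j] - mean[j]) ** 2 for v in vectors) / n) ** 0.5 or 1.0
           for j in range(dim)]
    return [[(v[j] - mean[j]) / std[j] for j in range(dim)] for v in vectors]

# Toy data: 4 frames with 2 coefficients each (the second one is constant).
frames = [[0.0, 1.0], [1.0, 1.0], [4.0, 1.0], [9.0, 1.0]]
inputs = build_inputs(frames)
normalised = standardise(inputs)
print(inputs[1])
```

With 40 coefficients per frame this construction yields 120 components per input vector, three for each coefficient.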

RNN Transducer

When building a speech recognition classifier, it is important to note that the input and output sequences have different lengths (sound data versus phonemes). Additionally, conventional RNN training requires segmented input data, i.e. a target label for every input frame. One approach to solving both problems is to align the output (label) with the input (sound data), but more often than not an aligned dataset is not available. In this paper, the Connectionist Temporal Classification (CTC) method is used to define a probability distribution over output sequences given an input sequence. This is augmented with an RNN that predicts each phoneme given the previous phonemes. The two predictions are then combined by a feed-forward network. The authors call this approach an RNN Transducer. From the resulting distribution, a maximum likelihood decoding for a given input can be computed to find the corresponding output label sequence.

[math]h(x) = \arg \max_{l \in L^{\leq T}} P(l | x)[/math]


  • [math]h(x)[/math]: classifier
  • [math]x[/math]: input sequence
  • [math]l[/math]: label
  • [math]L[/math]: alphabet
  • [math]T[/math]: maximum sequence length
  • [math]P(l | x)[/math]: probability distribution of [math]l[/math] given [math]x[/math]

The value of [math]h(x)[/math] cannot be computed directly; it is approximated with methods such as Best Path and Prefix Search Decoding. The authors chose to use a graph search algorithm called Beam Search.
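
Of the approximations mentioned, Best Path decoding is the simplest to sketch: take the most probable symbol in every frame, merge consecutive repeats, and drop the null symbol. The example below is a minimal illustration of that idea only; it is not the beam search used by the authors, and the function name, the toy probabilities and the choice of index 0 for the null symbol are assumptions.

```python
from itertools import groupby

def best_path_decode(frame_probs, blank=0):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    return [k for k, _ in groupby(path) if k != blank]

# Toy per-frame distributions over {0: blank, 1: 'a', 2: 'b'}.
frame_probs = [
    [0.1, 0.8, 0.1],   # a
    [0.2, 0.7, 0.1],   # a (repeat, merged with the previous frame)
    [0.9, 0.05, 0.05], # blank separates the two a's
    [0.1, 0.8, 0.1],   # a
    [0.1, 0.2, 0.7],   # b
    [0.1, 0.1, 0.8],   # b (repeat)
    [0.8, 0.1, 0.1],   # blank
]
decoded = best_path_decode(frame_probs)
print(decoded)
```

The blank between the two runs of 'a' is what allows the same label to appear twice in a row in the output; without it the repeats would be merged into a single label.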

Network Output Layer

Two different network output layers were used; however, most experimental results were reported for a simple softmax probability distribution vector over the set of [math]K = 62[/math] symbols, corresponding to the 61 phonemes in the corpus and an additional null symbol indicating that no phoneme distinct from the previous one was detected. This model is referred to as a Connectionist Temporal Classification (CTC) output function. The other (more complicated) output layer was not rigorously compared with a softmax output, and had nearly identical performance; a detailed description of this method, the so-called RNN transducer, is deferred to the original paper.
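
For reference, a softmax over [math]K = 62[/math] output activations can be sketched as below. The activations used here are made up for the example, and placing the null symbol at the last index is an arbitrary assumption.

```python
import math

def softmax(logits):
    """Map real-valued activations to a probability distribution."""
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

K = 62                                     # 61 phonemes plus the null symbol
logits = [0.0] * (K - 1) + [2.0]           # hypothetical activations for one frame
probs = softmax(logits)
most_likely = max(range(K), key=probs.__getitem__)
print(most_likely, round(sum(probs), 6))
```

One such distribution is produced per time step, and a decoder then turns the sequence of distributions into a phoneme sequence.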

Network Training Procedure

The parameters in all ANNs were determined using Stochastic Gradient Descent with a fixed update step size (learning rate) of [math]10^{-4}[/math] and a Nesterov momentum term of 0.9. The initial parameters were drawn uniformly at random from [math][-0.1,0.1][/math]. The optimization procedure was initially run with data instances from the standard 462 speaker training set of the TIMIT corpus. As a stopping criterion for the training, a secondary testing subset of 50 speakers was used, on which the phoneme error rate (PER) was computed in each iteration of the optimization algorithm. The initial training phase for each network was halted once the PER stopped decreasing on this secondary subset; using the parameters at this point as the initial weights, the optimization procedure was then re-run with Gaussian noise with zero mean and [math]\sigma = 0.075[/math] added element-wise to the parameters for each input sequence instance [math]({{\mathbf{x}}}_1,\ldots, {{\mathbf{x}}}_T)[/math] as a form of regularization. The second optimization procedure was again halted once the PER stopped decreasing on the secondary testing subset. Multiple trials in each of these numerical experiments were not performed, and as such, the variability in performance due to the initial values of the parameters in the optimization routine is unknown.
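
The combination of momentum and weight noise can be sketched on a toy problem. The example below is not the authors' training code: it uses one common formulation of Nesterov momentum (gradient evaluated at the look-ahead point), draws fresh noise at every step instead of once per sequence, and minimises a hypothetical quadratic with a larger learning rate than the paper's so that it converges in a few hundred steps.

```python
import random

def sgd_nesterov_weight_noise(grad_fn, w, lr=1e-4, momentum=0.9, sigma=0.075,
                              steps=1000, rng=None):
    """Nesterov-momentum SGD with Gaussian noise added to the weights
    at which the gradient is evaluated (the stored weights stay noise-free)."""
    rng = rng or random.Random(0)
    v = [0.0] * len(w)
    for _ in range(steps):
        # Look-ahead point, perturbed by zero-mean Gaussian weight noise.
        probe = [wi + momentum * vi + rng.gauss(0.0, sigma)
                 for wi, vi in zip(w, v)]
        g = grad_fn(probe)
        v = [momentum * vi - lr * gi for vi, gi in zip(v, g)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return w

# Toy objective (hypothetical): 0.5 * ||w - target||^2, whose gradient is w - target.
target = [1.0, -2.0]
grad = lambda w: [wi - ti for wi, ti in zip(w, target)]

w_clean = sgd_nesterov_weight_noise(grad, [0.0, 0.0], lr=0.1, sigma=0.0, steps=200)
w_noisy = sgd_nesterov_weight_noise(grad, [0.0, 0.0], lr=0.1, sigma=0.075, steps=200)
print(w_clean, w_noisy)
```

Without noise the iterate settles on the minimum; with noise it fluctuates around it, which gives a feel for why the noisy phase acts as a regulariser and not as a way to reach a sharper optimum.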

TIMIT Corpus Experiments & Results

Numerical Experiments

To investigate the performance of the Bidirectional LSTM architecture as a function of depth, numerical experiments were conducted with networks with [math]N \in \{1,2,3,5\}[/math] layers and 250 hidden units per layer. These are denoted in the paper by the network names CTC-[math]N[/math]L-250H (where [math]N[/math] is the layer depth), and are summarized with the number of free model parameters in the table below.

Network Name    # of parameters
CTC-1l-250h     0.8M
CTC-2l-250h     2.3M
CTC-3l-250h     3.8M
CTC-5l-250h     6.8M
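
The parameter counts in this table can be roughly reproduced from the LSTM equations. The sketch below rests on several assumptions that are not stated explicitly in this summary: an input vector of 120 components (40 coefficients with first and second derivatives), diagonal cell-to-gate weights stored as vectors, every layer above the first receiving the concatenated forward and backward hidden vectors of the layer below, and a 62-way output layer fed by both directions. The function names are hypothetical.

```python
def lstm_layer_params(n_in, n_hid):
    """Free parameters of one LSTM layer in one direction."""
    per_block = n_in * n_hid + n_hid * n_hid + n_hid   # W_x*, W_h* and a bias
    return 4 * per_block + 3 * n_hid                   # i, f, c, o blocks + 3 peephole vectors

def network_params(n_layers, n_hid, n_in=120, n_out=62, bidirectional=True):
    """Total parameter count of a stacked (bidirectional) LSTM with a linear output layer."""
    dirs = 2 if bidirectional else 1
    total, layer_in = 0, n_in
    for _ in range(n_layers):
        total += dirs * lstm_layer_params(layer_in, n_hid)
        layer_in = dirs * n_hid            # the next layer sees every direction
    return total + layer_in * n_out + n_out

counts = {n: network_params(n, 250) for n in (1, 2, 3, 5)}
for n, count in counts.items():
    print(f"CTC-{n}l-250h: {count / 1e6:.1f}M")
```

Under these assumptions the counts round to 0.8M, 2.3M, 3.8M and 6.8M for 1, 2, 3 and 5 layers, in agreement with the table, which suggests the assumed wiring is at least close to the one used in the paper.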

Additional experiments included: a 1-layer model with 3.8M weights, a 3-layer bidirectional ANN with [math]\tanh[/math] activation functions rather than LSTM, and a 3-layer unidirectional LSTM model with 3.8M weights (the same number of free parameters as the bidirectional 3-layer LSTM model). Finally, two experiments were performed with a bidirectional LSTM model with 3 hidden layers, each with 250 hidden units, and an RNN transducer output function. One of these experiments used uniformly randomly initialized parameters, and the other used the final (hidden) parameter weights from the CTC-3L-250H model as the initial parameter values in the optimization algorithm. The names of these experiments are summarized below, where TRANS and PRETRANS denote the RNN transducer experiments initialized randomly, and using (pretrained) parameters from the CTC-3L-250H model, respectively. The suffixes UNI and TANH denote the unidirectional and [math]\tanh[/math] networks, respectively.

Network Name        # of parameters
CTC-1l-622h         3.8M
CTC-3l-421h-uni     3.8M
CTC-3l-500h-tanh    3.7M
Trans-3l-250h       4.3M
PreTrans-3l-250h    4.3M


The percentage phoneme error rates and number of epochs in the SGD optimization procedure for the LSTM experiments on the TIMIT dataset with varying network depth are shown below. The PER can be seen to decrease monotonically with depth; however, there is negligible difference between 3 and 5 layers, and it is possible that the 0.2% difference is within statistical fluctuations induced by the SGD optimization routine and initial parameter values. Note that the allocation of the epochs into either the initial training without noise or the second optimization routine with Gaussian noise added (or both) is unspecified in the paper.

Network         # of Parameters    Epochs    PER
CTC-1l-250h     0.8M               82        23.9%
CTC-2l-250h     2.3M               55        21.0%
CTC-3l-250h     3.8M               124       18.6%
CTC-5l-250h     6.8M               150       18.4%
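
This summary does not spell out how the PER is computed. A common convention for sequence labelling, assumed in the sketch below, is the total edit (Levenshtein) distance between the predicted and reference phoneme sequences divided by the total number of reference phonemes. The function names and the toy sequences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def phoneme_error_rate(refs, hyps):
    """Total edit distance divided by the total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)

# Toy example: one substitution and one deletion against a 3-phoneme reference.
per = phoneme_error_rate([["sh", "iy", "hh"]], [["sh", "ih"]])
print(f"{100 * per:.1f}%")
```

Under this definition, a difference of 0.2 percentage points between two models corresponds to only a handful of phoneme edits on a small test set, which is consistent with the caution expressed above about statistical fluctuations.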

The second set of PER results is shown below. The unidirectional LSTM architecture CTC-3L-421H-UNI achieves an error rate that is greater than that of the CTC-3L-250H model by 1 percentage point. No further comparative experiments between unidirectional and bidirectional models were given, however, and the margin of statistical uncertainty is unknown; thus the 1% (absolute) difference may or may not be significant. The TRANS-3L-250H model achieves a nearly identical PER to the CTC softmax model (a 0.3% difference); however, note that it has 0.5M more parameters due to the additional classification network at the output, so this is not an entirely fair comparison since it has a greater dimensionality. The pretrained model PRETRANS-3L-250H also has 4.3M parameters and sees the best performance with a 17.7% error rate. Note that the difference in training of these two RNN transducer models is primarily in their initialization: the PRETRANS model was initialized using the trained weights of the CTC-3L-250H model (for the hidden layers). Thus, this difference in error rate of 0.6% is the direct result of different starting iterates in the optimization procedure, which must be kept in mind when comparing between models.

Network             # of Parameters    Epochs    PER
CTC-1l-622h         3.8M               87        23.0%
CTC-3l-500h-tanh    3.7M               107       37.6%
CTC-3l-421h-uni     3.8M               115       19.6%
Trans-3l-250h       4.3M               112       18.3%
PreTrans-3l-250h    4.3M               144       17.7%

Further works

The first two authors, together with N. Jaitly, later developed the method so that it can readily be integrated with word-level language models <ref> Graves, A.; Jaitly, N.; Mohamed, A.-R, “Hybrid speech recognition with Deep Bidirectional LSTM,” [1]</ref>. They used a hybrid approach in which frame-level acoustic targets are produced by a forced alignment given by a GMM-HMM system.


<references />