Graves et al., Speech recognition with deep recurrent neural networks
= Overview =


This document is a summary of the paper ''Speech recognition with deep recurrent neural networks'' by A. Graves, A.-r. Mohamed, and G. Hinton, which appeared in the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). The first and third authors are Artificial Neural Network (ANN) researchers, while Mohamed works in the field of automatic speech recognition.


The paper presents the application of bidirectional multilayer Long Short-term Memory (LSTM) ANNs with 1–5 layers to phoneme recognition on the TIMIT acoustic-phonetic corpus, which is the standard benchmark in the field of phoneme recognition, extending the previous work by Mohamed and Hinton on this topic using Deep Belief Networks. The TIMIT corpus contains audio recordings of 6300 sentences spoken by 630 (American) English speakers from 8 regions with distinct dialects, where each recording has manually labelled transcriptions of the phonemes spoken alongside timestamp information.


The deep LSTM networks presented with 3 or more layers obtain phoneme classification error rates of 19.6% or less, with one model obtaining 17.7%, which was the best result reported in the literature at the time, outperforming the previous record of 20.7% achieved by Mohamed et al. Furthermore, the error rate decreases monotonically with LSTM network depth for 1–5 layers. While the bidirectional LSTM model performs well on the TIMIT corpus, any potential advantage of bidirectional over unidirectional LSTM models cannot be determined from this paper: the comparison is made across different numbers of iterations of the optimization algorithm used to train the models, and multiple trials for statistical validity were not performed.


= Deep RNN models considered by Graves et al. =


In this paper, Graves et al. use deep LSTM network models. We briefly review recurrent neural networks, which form the basis of the more complicated LSTM network that has composite <math>\mathcal{H}</math> functions instead of sigmoids and additional parameter vectors associated with the ''state'' of each neuron. Finally, a description of ''bidirectional'' ANNs is given, which are used throughout the numerical experiments. However, since a rigorous comparison between unidirectional and bidirectional ANNs is not performed in the paper, the bidirectional model is not elaborated upon.


== Recurrent Neural Networks ==


Recall that a standard 1-layer recurrent neural network (RNN) computes the hidden vector sequence <math>{\boldsymbol h} = ({{\mathbf{h}}}_1,\ldots,{{\mathbf{h}}}_T)</math> and output vector sequence <math>{{\boldsymbol {{\mathbf{y}}}}}= ({{\mathbf{y}}}_1,\ldots,{{\mathbf{y}}}_T)</math> from an input vector sequence <math>{{\boldsymbol {{\mathbf{x}}}}}= ({{\mathbf{x}}}_1,\ldots,{{\mathbf{x}}}_T)</math> by iterating the following two equations from <math>t=1</math> to <math>T</math>:


<math>\begin{aligned}
{{\mathbf{h}}}_t &= \begin{cases}
        {\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{\mathbf{b_{h}}}}}\right) &\quad t = 1\\
        {\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{{\mathbf{W}}}_{h h}}}{{\mathbf{h}}}_{t-1} + {{{\mathbf{b_{h}}}}}\right) &\quad \text{else}
    \end{cases}\\
{{\mathbf{y}}}_t &= {{{{\mathbf{W}}}_{h y}}}{{\mathbf{h}}}_t + {{{\mathbf{b_{y}}}}}\end{aligned}</math>


The <math>{{\mathbf{W}}}</math> terms are the parameter matrices with subscripts denoting the layer location (e.g. <math>{{{{\mathbf{W}}}_{x h}}}</math> is the input-hidden weight matrix), and the offset <math>b</math> terms are bias vectors with appropriate subscripts (e.g. <math>{{{\mathbf{b_{h}}}}}</math> is the hidden bias vector). The function <math>{\mathcal{H}}</math> is an elementwise vector function with a range of <math>[0,1]</math> for each component in the hidden layer.
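To make the recursion concrete, here is a minimal NumPy sketch of the single-layer forward pass defined above; the sigmoid choice for <math>\mathcal{H}</math>, the array shapes, and all variable names are illustrative assumptions rather than details taken from the paper.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    # elementwise logistic function, playing the role of H with range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """Single-layer RNN recursion over an input sequence.

    x_seq has shape (T, d_in); returns the stacked outputs y_1, ..., y_T.
    Taking h_0 = 0 makes the t = 1 case above a special case of the general rule.
    """
    h = np.zeros(b_h.shape[0])
    ys = []
    for t in range(x_seq.shape[0]):
        # h_t = H(W_xh x_t + W_hh h_{t-1} + b_h)
        h = sigmoid(W_xh @ x_seq[t] + W_hh @ h + b_h)
        # y_t = W_hy h_t + b_y
        ys.append(W_hy @ h + b_y)
    return np.stack(ys)
</syntaxhighlight>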


This paper considers multilayer RNN architectures, with the same hidden layer function used for all <math>N</math> layers. In this model, the hidden vector in the <math>n</math>th layer, <math>{\boldsymbol h}^n</math>, is generated by the rule


<math>\begin{aligned}
{{\mathbf{h}}}^n_t &= {\mathcal{H}}\left({{\mathbf{W}}}_{h^{n-1}h^{n}} {{\mathbf{h}}}^{n-1}_t +
{{\mathbf{W}}}_{h^{n}h^{n}} {{\mathbf{h}}}^n_{t-1} + {{{\mathbf{b_{h}}}}}^n \right)\end{aligned}</math>


where <math>{\boldsymbol h}^0 = {{\boldsymbol {{\mathbf{x}}}}}</math>. The final network output vector in the <math>t</math>th step of the output sequence, <math>{{\mathbf{y}}}_t</math>, is


<math>\begin{aligned}
{{\mathbf{y}}}_t &= {{{\mathbf{W}}}_{h^N y}} {{\mathbf{h}}}^N_t + {{{\mathbf{b_{y}}}}}.\end{aligned}</math>
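Stacking layers then amounts to feeding each layer's hidden sequence to the layer above and reading the output off the top layer only. The sketch below follows the two rules above under the same illustrative assumptions (parameter names and shapes are hypothetical, not from the paper).

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deep_rnn_forward(x_seq, layers, W_hNy, b_y):
    """layers: list of (W_in, W_hh, b_h) tuples, one per hidden layer.

    W_in maps layer n-1's hidden vector (or the input, for n = 1) into layer n;
    the output sequence is read from the top layer only.
    """
    h_seq = np.asarray(x_seq)                       # h^0 = x
    for W_in, W_hh, b_h in layers:
        h = np.zeros(b_h.shape[0])
        new_seq = []
        for t in range(h_seq.shape[0]):
            # h^n_t = H(W_{h^{n-1} h^n} h^{n-1}_t + W_{h^n h^n} h^n_{t-1} + b_h^n)
            h = sigmoid(W_in @ h_seq[t] + W_hh @ h + b_h)
            new_seq.append(h)
        h_seq = np.stack(new_seq)
    # y_t = W_{h^N y} h^N_t + b_y, applied to every time step at once
    return h_seq @ W_hNy.T + b_y
</syntaxhighlight>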


== Long Short-term Memory Architecture ==


Graves et al. consider a Long Short-Term Memory (LSTM) architecture from Gers et al. This model replaces <math>\mathcal{H}(\cdot)</math> by a composite function that introduces additional parameter matrices, and hence a higher-dimensional model. Each neuron in the network (i.e. a row of a parameter matrix <math>{{\mathbf{W}}}</math>) has an associated state vector <math>{{\mathbf{c}}}_t</math> at step <math>t</math>, which is a function of the previous <math>{{\mathbf{c}}}_{t-1}</math>, the input <math>{{\mathbf{x}}}_t</math> at step <math>t</math>, and the previous step’s hidden state <math>{{\mathbf{h}}}_{t-1}</math> as


<math>\begin{aligned}
{{\mathbf{c}}}_t &= {{\mathbf{f}}}_t \circ {{\mathbf{c}}}_{t-1} + {{\mathbf{i}}}_t \circ \tanh
    \left({{{\mathbf{W}}}_{x {{\mathbf{c}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{c}}}}} {{\mathbf{h}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{c}}}\right)\\\end{aligned}</math>


where <math>\circ</math> denotes the Hadamard product (elementwise vector multiplication), and the vector <math>{{\mathbf{i}}}_t</math> denotes the so-called ''input'' vector to the cell, which is generated by the rule


<math>\begin{aligned}
{{\mathbf{i}}}_t &= \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{i}}}}} {{\mathbf{x}}}_t +
    {{{\mathbf{W}}}_{h {{\mathbf{i}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}
    {{\mathbf{c}}}_{t-1}  + {{\mathbf{b}}}_{{\mathbf{i}}}\right),\end{aligned}</math>


and <math>{{\mathbf{f}}}_t</math> is the ''forget gate'' vector, which is given by


<math>\begin{aligned}
{{\mathbf{f}}}_t &= \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{f}}}}} {{\mathbf{x}}}_t +
    {{{\mathbf{W}}}_{h {{\mathbf{f}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{f}}}}}
    {{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{f}}}\right)\\\end{aligned}</math>


Each <math>{{\mathbf{W}}}</math> matrix and bias vector <math>{{\mathbf{b}}}</math> is a free parameter in the model and must be trained. Since <math>{{\mathbf{f}}}_t</math> multiplies the previous state <math>{{\mathbf{c}}}_{t-1}</math> in a Hadamard product with each element in the range <math>[0,1]</math>, it can be understood to reduce or dampen the effect of <math>{{\mathbf{c}}}_{t-1}</math> relative to the new input <math>{{\mathbf{i}}}_t</math>. The final hidden output state is then


<math>\begin{aligned}
{{\mathbf{h}}}_t &= \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{o}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{o}}}}} {{\mathbf{h}}}_{t-1}
    + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{o}}}}} {{\mathbf{c}}}_{t} + {{\mathbf{b}}}_{{\mathbf{o}}}\right)\circ \tanh({{\mathbf{c}}}_t) \\\end{aligned}</math>


In all of these equations, <math>\sigma</math> denotes the logistic sigmoid function. Note furthermore that <math>{{\mathbf{i}}}</math>, <math>{{\mathbf{f}}}</math>, <math>{{\mathbf{o}}}</math> and <math>{{\mathbf{c}}}</math> are all of the same dimension as the hidden vector <math>{{\mathbf{h}}}</math>. In addition, the weight matrices from the cell to gate vectors (e.g. <math>{{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}}</math>) are ''diagonal'', such that each parameter matrix is merely a scaling matrix.
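Collecting the gate equations into a single update gives the cell step sketched below. It is a direct transcription of the formulas above under assumed shapes and parameter names, with the diagonal cell-to-gate matrices stored as vectors so that their products become elementwise.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p is a dict of parameters (illustrative names).

    The peephole weights p['w_ci'], p['w_cf'], p['w_co'] are vectors,
    reflecting the diagonal cell-to-gate matrices described above.
    """
    # input gate: i_t = sigma(W_xi x_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i)
    i_t = sigmoid(p['W_xi'] @ x_t + p['W_hi'] @ h_prev + p['w_ci'] * c_prev + p['b_i'])
    # forget gate: f_t = sigma(W_xf x_t + W_hf h_{t-1} + W_cf c_{t-1} + b_f)
    f_t = sigmoid(p['W_xf'] @ x_t + p['W_hf'] @ h_prev + p['w_cf'] * c_prev + p['b_f'])
    # cell state: c_t = f_t o c_{t-1} + i_t o tanh(W_xc x_t + W_hc h_{t-1} + b_c)
    c_t = f_t * c_prev + i_t * np.tanh(p['W_xc'] @ x_t + p['W_hc'] @ h_prev + p['b_c'])
    # output gate and hidden state:
    # h_t = sigma(W_xo x_t + W_ho h_{t-1} + W_co c_t + b_o) o tanh(c_t)
    o_t = sigmoid(p['W_xo'] @ x_t + p['W_ho'] @ h_prev + p['w_co'] * c_t + p['b_o'])
    return o_t * np.tanh(c_t), c_t
</syntaxhighlight>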


== Bidirectional RNNs ==


A bidirectional RNN adds another layer of complexity by computing 2 hidden vectors per layer. Neglecting the <math>n</math> superscripts for the layer index, the ''forward'' hidden vector is determined through the conventional recursion as


<math>\begin{aligned}
{\overrightarrow{{{\mathbf{h}}}}}_t &= {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overrightarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t +
{{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}{\overrightarrow{{{\mathbf{h}}}}}}} {\overrightarrow{{{\mathbf{h}}}}}_{t-1} + {{\mathbf{b_{{\overrightarrow{{{\mathbf{h}}}}}}}}} \right),\\\end{aligned}</math>


while the ''backward'' hidden state is determined recursively from the ''reversed'' sequence <math>({{\mathbf{x}}}_T,\ldots,{{\mathbf{x}}}_1)</math> as


<math>\begin{aligned}
{\overleftarrow{{{\mathbf{h}}}}}_t &= {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overleftarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}{\overleftarrow{{{\mathbf{h}}}}}}} {\overleftarrow{{{\mathbf{h}}}}}_{t+1} + {{\mathbf{b_{{\overleftarrow{{{\mathbf{h}}}}}}}}}\right)\\\end{aligned}</math>


The final output for the single layer state is then an affine transformation of <math>{\overrightarrow{{{\mathbf{h}}}}}_t</math> and <math>{\overleftarrow{{{\mathbf{h}}}}}_t</math> as <math>\begin{aligned}
{{\mathbf{y}}}_t &= {{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}y}} {\overrightarrow{{{\mathbf{h}}}}}_t + {{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}y}} {\overleftarrow{{{\mathbf{h}}}}}_t + {{{\mathbf{b_{y}}}}}\end{aligned}</math>
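In code, a bidirectional layer simply runs one recursion over the sequence and another over its reversal, then sums the two affine read-outs. The sketch below is a generic illustration with hypothetical helper names, into which any step function (such as the RNN or LSTM steps sketched earlier) could be plugged.

<syntaxhighlight lang="python">
import numpy as np

def bidirectional_layer(x_seq, step_fwd, step_bwd, W_fy, W_by, b_y):
    """step_fwd/step_bwd: functions (x_t, h_prev) -> h_t with their own parameters."""
    T = x_seq.shape[0]
    # forward pass over x_1, ..., x_T
    h_fwd, h = [], np.zeros(W_fy.shape[1])
    for t in range(T):
        h = step_fwd(x_seq[t], h)
        h_fwd.append(h)
    # backward pass over the reversed sequence x_T, ..., x_1
    h_bwd, h = [None] * T, np.zeros(W_by.shape[1])
    for t in reversed(range(T)):
        h = step_bwd(x_seq[t], h)
        h_bwd[t] = h
    # y_t = W_{fwd y} h_fwd_t + W_{bwd y} h_bwd_t + b_y
    return np.stack([W_fy @ h_fwd[t] + W_by @ h_bwd[t] + b_y for t in range(T)])
</syntaxhighlight>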


The combination of LSTM with bidirectional dependence has been previously used by Graves and Schmidhuber, and is extended to multilayer networks in this paper.


= TIMIT Experiments =


This section describes the phoneme classification experiments performed with the TIMIT corpus. An overview is given of how the input time-series audio data is preprocessed into frequency-domain vectors, and the optimization techniques used to train the networks are described.


== Network Inputs: frequency domain processing ==


Recall that for a real, periodic signal <math>{f(t)}</math>, the Fourier transform
 
<math>{F(\omega)}= \int_{-\infty}^{\infty}e^{-i\omega t} {f(t)}dt</math>
 
can be represented for <math>N</math> discrete samples <math>f_0, f_1, \ldots, f_{N-1}</math> by the discrete Fourier transform
 
<math>F_k = \sum_{n=0}^{N-1} f_n \, e^{-2\pi i k n / N}</math>,
 
where <math>{F_k}</math> are the discrete coefficients of the (amplitude) spectral distribution of the signal <math>f</math> in the frequency domain. This is a particularly powerful representation of audio data, since the modulation produced by the tongue and lips shifting position while the larynx vibrates induces the frequency changes that make up a phonetic alphabet. A particular example is the spectrogram below. A spectrogram is a heat map representation of a matrix of data in which each pixel has intensity proportional to the magnitude of the matrix entry at that location. In this case, the matrix is a windowed Fourier transform of an audio signal, such that the frequency content of the signal as a function of time can be observed. This spectrogram shows the frequencies of my voice while singing the first bar of ''Hey Jude''; the bright pixels below 200 Hz show the fundamental of each note, while the fainter lines at integer multiples of the fundamental show resonant harmonics.
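As a rough illustration of this kind of preprocessing (not the paper's exact filter-bank front end), a magnitude spectrogram can be computed from windowed DFTs in a few lines; the frame length, hop size, and window choice below are common speech-processing defaults assumed purely for the example.

<syntaxhighlight lang="python">
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Magnitude spectrogram: |DFT| of overlapping Hamming-windowed frames.

    With 16 kHz audio, frame_len=400 and hop=160 give 25 ms windows
    every 10 ms (typical speech settings, assumed here for illustration).
    """
    window = np.hamming(frame_len)
    frames = [signal[start:start + frame_len] * window
              for start in range(0, len(signal) - frame_len + 1, hop)]
    # rfft returns F_k = sum_n f_n exp(-2*pi*i*k*n/N) for the non-negative frequencies
    return np.abs(np.array([np.fft.rfft(f) for f in frames]))  # shape: (num_frames, frame_len//2 + 1)
</syntaxhighlight>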
