Graves et al., Speech recognition with deep recurrent neural networks


Overview

This document is a summary of the paper Speech recognition with deep recurrent neural networks by A. Graves, A.-r. Mohamed, and G. Hinton, which appeared in the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). The first and third authors are Artificial Neural Network (ANN) researchers, while Mohamed works in the field of automatic speech recognition.

The paper presents the application of bidirectional multilayer Long Short-term Memory (LSTM) ANNs with 1–5 layers to phoneme recognition on the TIMIT acoustic phoneme corpus, the standard benchmark in the field of acoustic speech recognition, extending the previous work by Mohamed and Hinton on this topic using Deep Belief Networks. The TIMIT corpus contains audio recordings of 6300 sentences spoken by 630 (American) English speakers from 8 regions with distinct dialects, where each recording has manually labelled transcriptions of the phonemes spoken alongside timestamp information.

The deep LSTM networks presented with 3 or more layers obtain phoneme classification error rates of 19.6% or less, with one model obtaining 17.7%, which was the best result reported in the literature at the time, outperforming the previous record of 20.7% achieved by Mohamed et al. Furthermore, the error rate decreases monotonically with LSTM network depth for 1–5 layers. While the bidirectional LSTM model performs well on the TIMIT corpus, any potential advantage of bidirectional over unidirectional LSTM network models cannot be determined from this paper, since the performance comparison is made across different numbers of iterations of the optimization algorithm used to train the models, and multiple trials for statistical validity were not performed.

Deep RNN models considered by Graves et al.

In this paper, Graves et al. use deep LSTM network models. We briefly review recurrent neural networks, which form the basis of the more complicated LSTM network, which has composite [math]\displaystyle{ \mathcal{H} }[/math] functions instead of sigmoids and additional parameter vectors associated with the state of each neuron. Finally, a description is given of bidirectional ANNs, which are used throughout the numerical experiments. However, since a rigorous comparison between unidirectional and bidirectional ANNs is not performed in the paper, the bidirectional model is not elaborated upon.

Recurrent Neural Networks

Recall that a standard 1-layer recurrent neural network (RNN) computes the hidden vector sequence [math]\displaystyle{ {\boldsymbol h} = ({{\mathbf{h}}}_1,\ldots,{{\mathbf{h}}}_T) }[/math] and output vector sequence [math]\displaystyle{ {{\boldsymbol {{\mathbf{y}}}}}= ({{\mathbf{y}}}_1,\ldots,{{\mathbf{y}}}_T) }[/math] from an input vector sequence [math]\displaystyle{ {{\boldsymbol {{\mathbf{x}}}}}= ({{\mathbf{x}}}_1,\ldots,{{\mathbf{x}}}_T) }[/math] by iterating the following two equations from [math]\displaystyle{ t=1 }[/math] to [math]\displaystyle{ T }[/math]:

[math]\displaystyle{ \begin{aligned} {\label{eq:rnn_hidden}} {{\mathbf{h}}}_t &= \begin{cases} {\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{\mathbf{b_{h}}}}}\right) &\quad t = 1\\ {\mathcal{H}}\left({{{{\mathbf{W}}}_{x h}}}{{\mathbf{x}}}_t + {{{{\mathbf{W}}}_{h h}}}{{\mathbf{h}}}_{t-1} + {{{\mathbf{b_{h}}}}}\right) &\quad \text{else} \end{cases}\\ {{\mathbf{y}}}_t &= {{{{\mathbf{W}}}_{h y}}}{{\mathbf{h}}}_t + {{{\mathbf{b_{y}}}}}\end{aligned} }[/math]

The [math]\displaystyle{ {{\mathbf{W}}} }[/math] terms are the parameter matrices with subscripts denoting the layer location (e.g. [math]\displaystyle{ {{{{\mathbf{W}}}_{x h}}} }[/math] is the input-hidden weight matrix), and the offset [math]\displaystyle{ b }[/math] terms are bias vectors with appropriate subscripts (e.g. [math]\displaystyle{ {{{\mathbf{b_{h}}}}} }[/math] is the hidden bias vector). The function [math]\displaystyle{ {\mathcal{H}} }[/math] is an elementwise vector function with a range of [math]\displaystyle{ [0,1] }[/math] for each component in the hidden layer.
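As a concrete illustration (a minimal NumPy sketch, not code from the paper), the recursion above can be iterated over an input sequence as follows. Initializing the hidden state to zero reproduces the separate [math]\displaystyle{ t=1 }[/math] case, since the recurrent term then vanishes; [math]\displaystyle{ \tanh }[/math] is used for [math]\displaystyle{ \mathcal{H} }[/math] purely as an example choice, and all variable names are illustrative.

<pre>
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y, H=np.tanh):
    """Single-layer RNN: iterate the hidden/output equations for t = 1..T."""
    h_prev = np.zeros_like(b_h)   # zero initial state; W_hh @ 0 = 0 recovers the t = 1 case
    hs, ys = [], []
    for x_t in x_seq:             # x_seq has shape (T, input_dim)
        h_t = H(W_xh @ x_t + W_hh @ h_prev + b_h)
        y_t = W_hy @ h_t + b_y    # linear output layer
        hs.append(h_t)
        ys.append(y_t)
        h_prev = h_t
    return np.array(hs), np.array(ys)
</pre>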

This paper considers multilayer RNN architectures, with the same hidden layer function used for all [math]\displaystyle{ N }[/math] layers. In this model, the hidden vector in the [math]\displaystyle{ n }[/math]th layer, [math]\displaystyle{ {\boldsymbol h}^n }[/math], is generated by the rule

[math]\displaystyle{ \begin{aligned} {\label{eq:deep_rnn_hidden}} {{\mathbf{h}}}^n_t &= {\mathcal{H}}\left({{\mathbf{W}}}_{h^{n-1}h^{n}} {{\mathbf{h}}}^{n-1}_t + {{\mathbf{W}}}_{h^{n}h^{n}} {{\mathbf{h}}}^n_{t-1} + {{{\mathbf{b_{h}}}}}^n \right)\end{aligned} }[/math]

where [math]\displaystyle{ {\boldsymbol h}^0 = {{\boldsymbol {{\mathbf{x}}}}} }[/math]. The final network output vector in the [math]\displaystyle{ t }[/math]th step of the output sequence, [math]\displaystyle{ {{\mathbf{y}}}_t }[/math], is

[math]\displaystyle{ \begin{aligned} {{\mathbf{y}}}_t &= {{{\mathbf{W}}}_{h^N y}} {{\mathbf{h}}}^N_t + {{{\mathbf{b_{y}}}}}.\end{aligned} }[/math]
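A deep RNN therefore simply feeds each layer's hidden sequence to the layer above. The sketch below (again illustrative NumPy, not the authors' code) stacks the recursion over [math]\displaystyle{ N }[/math] layers with [math]\displaystyle{ {\boldsymbol h}^0 = {{\boldsymbol {{\mathbf{x}}}}} }[/math] and reads the output from the top layer; the parameter names are my own.

<pre>
import numpy as np

def deep_rnn_forward(x_seq, Ws_in, Ws_rec, bs, W_hy, b_y, H=np.tanh):
    """Stacked RNN: Ws_in[n], Ws_rec[n], bs[n] play the roles of
    W_{h^{n-1} h^n}, W_{h^n h^n} and b_h^n for layer n."""
    seq = list(x_seq)                          # h^0 = x
    for W_in, W_rec, b in zip(Ws_in, Ws_rec, bs):
        h_prev = np.zeros_like(b)
        layer_seq = []
        for h_below in seq:
            h_prev = H(W_in @ h_below + W_rec @ h_prev + b)
            layer_seq.append(h_prev)
        seq = layer_seq                        # hidden sequence fed to the next layer
    ys = [W_hy @ h_t + b_y for h_t in seq]     # output computed from the top layer h^N
    return np.array(seq), np.array(ys)
</pre>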

Long Short-term Memory Architecture

Graves et al. consider a Long Short-Term Memory (LSTM) architecture from Gers et al. This model replaces [math]\displaystyle{ \mathcal{H}(\cdot) }[/math] by a composite function that incurs additional parameter matrices, and hence a higher-dimensional model. Each neuron in the network (i.e. each row of a parameter matrix [math]\displaystyle{ {{\mathbf{W}}} }[/math]) has an associated state vector [math]\displaystyle{ {{\mathbf{c}}}_t }[/math] at step [math]\displaystyle{ t }[/math], which is a function of the previous [math]\displaystyle{ {{\mathbf{c}}}_{t-1} }[/math], the input [math]\displaystyle{ {{\mathbf{x}}}_t }[/math] at step [math]\displaystyle{ t }[/math], and the previous step’s hidden state [math]\displaystyle{ {{\mathbf{h}}}_{t-1} }[/math] as

[math]\displaystyle{ \begin{aligned} {{\mathbf{c}}}_t &= {{\mathbf{f}}}_t \circ {{\mathbf{c}}}_{t-1} + {{\mathbf{i}}}_t \circ \tanh \left({{{\mathbf{W}}}_{x {{\mathbf{c}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{c}}}}} {{\mathbf{h}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{c}}}\right)\\\end{aligned} }[/math]

where [math]\displaystyle{ \circ }[/math] denotes the Hadamard product (elementwise vector multiplication), the vector [math]\displaystyle{ {{\mathbf{i}}}_t }[/math] denotes the so-called input gate vector of the cell, which is generated by the rule

[math]\displaystyle{ \begin{aligned} {{\mathbf{i}}}_t &= \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{i}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{i}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}} {{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{i}}}\right),\end{aligned} }[/math]

and [math]\displaystyle{ {{\mathbf{f}}}_t }[/math] is the forget gate vector, which is given by

[math]\displaystyle{ \begin{aligned} {{\mathbf{f}}}_t &= \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{f}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{f}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{f}}}}} {{\mathbf{c}}}_{t-1} + {{\mathbf{b}}}_{{\mathbf{f}}}\right)\\\end{aligned} }[/math]

Each [math]\displaystyle{ {{\mathbf{W}}} }[/math] matrix and bias vector [math]\displaystyle{ {{\mathbf{b}}} }[/math] is a free parameter in the model and must be trained. Since [math]\displaystyle{ {{\mathbf{f}}}_t }[/math] multiplies the previous state [math]\displaystyle{ {{\mathbf{c}}}_{t-1} }[/math] in a Hadamard product with each element in the range [math]\displaystyle{ [0,1] }[/math], it can be understood to reduce or dampen the effect of [math]\displaystyle{ {{\mathbf{c}}}_{t-1} }[/math] relative to the new input [math]\displaystyle{ {{\mathbf{i}}}_t }[/math]. The final hidden output state is then

[math]\displaystyle{ \begin{aligned} {{\mathbf{h}}}_t &= \sigma\left({{{\mathbf{W}}}_{x {{\mathbf{o}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{h {{\mathbf{o}}}}} {{\mathbf{h}}}_{t-1} + {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{o}}}}} {{\mathbf{c}}}_{t} + {{\mathbf{b}}}_{{\mathbf{o}}}\right)\circ \tanh({{\mathbf{c}}}_t) \\\end{aligned} }[/math]

In all of these equations, [math]\displaystyle{ \sigma }[/math] denotes the logistic sigmoid function. Note furthermore that [math]\displaystyle{ {{\mathbf{i}}} }[/math], [math]\displaystyle{ {{\mathbf{f}}} }[/math], [math]\displaystyle{ {{\mathbf{o}}} }[/math] and [math]\displaystyle{ {{\mathbf{c}}} }[/math] are all of the same dimension as the hidden vector [math]\displaystyle{ {{\mathbf{h}}} }[/math]. In addition, the weight matrices from the cell to the gate vectors (e.g. [math]\displaystyle{ {{{\mathbf{W}}}_{{{\mathbf{c}}}{{\mathbf{i}}}}} }[/math]) are diagonal, such that each such parameter matrix is merely a scaling matrix.
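Putting the gate equations together, one step of this LSTM cell can be sketched in NumPy as below (an illustrative reconstruction, not the authors' implementation). The diagonal cell-to-gate weight matrices are represented as vectors w_ci, w_cf, w_co applied elementwise, and all parameter names are my own.

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p is a dict holding the weight matrices and bias vectors."""
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])  # input gate
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])  # forget gate
    c_t = f_t * c_prev + i_t * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])  # cell state
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["w_co"] * c_t + p["b_o"])     # output gate
    h_t = o_t * np.tanh(c_t)                                                             # hidden output
    return h_t, c_t
</pre>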

Bidirectional RNNs

A bidirectional RNN adds another layer of complexity by computing 2 hidden vectors per layer. Neglecting the [math]\displaystyle{ n }[/math] superscripts for the layer index, the forward hidden vector is determined through the conventional recursion as

[math]\displaystyle{ \begin{aligned} {\overrightarrow{{{\mathbf{h}}}}}_t &= {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overrightarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}{\overrightarrow{{{\mathbf{h}}}}}}} {\overrightarrow{{{\mathbf{h}}}}}_{t-1} + {{\mathbf{b_{{\overrightarrow{{{\mathbf{h}}}}}}}}} \right),\\\end{aligned} }[/math]

while the backward hidden state is determined recursively from the reversed sequence [math]\displaystyle{ ({{\mathbf{x}}}_T,\ldots,{{\mathbf{x}}}_1) }[/math] as

[math]\displaystyle{ \begin{aligned} {\overleftarrow{{{\mathbf{h}}}}}_t &= {\mathcal{H}}\left({{{\mathbf{W}}}_{x {\overleftarrow{{{\mathbf{h}}}}}}} {{\mathbf{x}}}_t + {{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}{\overleftarrow{{{\mathbf{h}}}}}}} {\overleftarrow{{{\mathbf{h}}}}}_{t+1} + {{\mathbf{b_{{\overleftarrow{{{\mathbf{h}}}}}}}}}\right)\\\end{aligned} }[/math]

The final output for the single layer state is then an affine transformation of [math]\displaystyle{ {\overrightarrow{{{\mathbf{h}}}}}_t }[/math] and [math]\displaystyle{ {\overleftarrow{{{\mathbf{h}}}}}_t }[/math] as [math]\displaystyle{ \begin{aligned} {{\mathbf{y}}}_t &= {{{\mathbf{W}}}_{{\overrightarrow{{{\mathbf{h}}}}}y}} {\overrightarrow{{{\mathbf{h}}}}}_t + {{{\mathbf{W}}}_{{\overleftarrow{{{\mathbf{h}}}}}y}} {\overleftarrow{{{\mathbf{h}}}}}_t + {{{\mathbf{b_{y}}}}}\end{aligned} }[/math]
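In other words, the layer runs the same recursion twice, once over the input sequence and once over its reversal, and combines the two hidden sequences at each step. A minimal sketch follows (illustrative NumPy with a plain [math]\displaystyle{ \mathcal{H} }[/math] rather than the LSTM composite used in the paper; parameter names are my own):

<pre>
import numpy as np

def bidirectional_layer(x_seq, fwd, bwd, W_fy, W_by, b_y, H=np.tanh):
    """fwd and bwd are (W_xh, W_hh, b_h) parameter triples for the two directions."""
    def recurrent_pass(seq, W_xh, W_hh, b_h):
        h = np.zeros_like(b_h)
        out = []
        for x_t in seq:
            h = H(W_xh @ x_t + W_hh @ h + b_h)
            out.append(h)
        return out

    h_fwd = recurrent_pass(x_seq, *fwd)                # forward pass over t = 1, ..., T
    h_bwd = recurrent_pass(x_seq[::-1], *bwd)[::-1]    # run on the reversed sequence, then re-reverse
    ys = [W_fy @ hf + W_by @ hb + b_y for hf, hb in zip(h_fwd, h_bwd)]
    return np.array(ys)
</pre>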

The combination of LSTM with bidirectional dependence has been previously used by Graves and Schmidhuber, and is extended to multilayer networks in this paper.

TIMIT Experiments

This section describes the phoneme classification experiments performed with the TIMIT corpus. An overview of the preprocessing of the input time-series audio data into frequency-domain vectors is given, and the optimization techniques are described.

Network Inputs: frequency domain processing

Recall that for a real, periodic signal [math]\displaystyle{ {f(t)} }[/math], the Fourier transform

[math]\displaystyle{ {F(\omega)}= \int_{-\infty}^{\infty}e^{-i\omega t} {f(t)}dt }[/math]

can be represented for discrete samples [math]\displaystyle{ {f_0, f_1, \cdots f_{N-1}} }[/math] as

[math]\displaystyle{ F_k = \sum_{n=0}^{N-1} f_n \, e^{-i \frac{2 \pi k n}{N}} }[/math],

where [math]\displaystyle{ {F_k} }[/math] are the discrete coefficients of the (amplitude) spectral distribution of the signal [math]\displaystyle{ f }[/math] in the frequency domain. This is a particularly powerful representation of audio data, since the modulation produced by the tongue and lips shifting position while the larynx vibrates induces the frequency changes that make up a phonetic alphabet. A particular example is the spectrogram below. A spectrogram is a heat map representation of a matrix of data in which each pixel has intensity proportional to the magnitude of the matrix entry at that location. In this case, the matrix is a windowed Fourier transform of an audio signal, such that the frequency content of the signal as a function of time can be observed. This spectrogram shows the frequencies of my voice while singing the first bar of Hey Jude; the bright pixels below 200 Hz show the base (fundamental) note, while the fainter lines at integer multiples of the base frequency show resonant harmonics.
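As an illustrative sketch of how such a windowed-transform spectrogram can be computed (this is not necessarily the exact preprocessing pipeline used in the paper), the function below takes the DFT magnitude of short overlapping frames of the signal; the 25 ms / 10 ms frame settings at 16 kHz are an assumed, typical speech-processing choice.

<pre>
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Magnitude spectrogram via a windowed DFT: one row per frame, one column per frequency bin.
    frame_len = 400 and hop = 160 correspond to 25 ms windows with a 10 ms step
    at a 16 kHz sampling rate (an assumed, typical speech setting)."""
    window = np.hamming(frame_len)
    frames = [signal[s:s + frame_len] * window
              for s in range(0, len(signal) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))   # rfft: DFT of a real-valued signal
</pre>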